Latest from Embedded

144516710_Vladimir_Timofeev_Dreamstime
promo__id_144516710__vladimir_timofeev__dreamstime
ID 84308884 © Andy Chisholm - Dreamstime.com
promo_id_84308884__andy_chisholm__dreamstime
Dreamstime_Monsit-Jangariyawong_117103442
dreamstime_monsitjangariyawong_117103442
Tony Vitolo/Electronic Design
promo1920x1080
ID 83317721 © Igor Zakharevich | Dreamstime.com
supplychain_dreamstime_l_83317721

What’s the Difference Between Fixed-Point, Floating-Point, and Numerical Formats? (.PDF Download)

Aug. 31, 2017
What’s the Difference Between Fixed-Point, Floating-Point, and Numerical Formats? (.PDF Download)

Embedded C and C++ programmers are familiar with signed and unsigned integers and floating-point values of various sizes, but a number of numerical formats can be used in embedded applications. Here we take a look at all of these formats and where they might be found.

One reason for examining different formats is to understand how they work and where they can be applied. For example, fixed-point values can often be used when floating-point support isn’t available. Fixed point may be preferable in some instances, while floating-point support is available for other reasons, such as precision or representation.

Developers may be using single- and double-precision IEEE 754 standard formats, but what about 16-bit half precision or even 8-bit floating point? The latter is being used in deep neural networks (DNNs), where small values are useful. Small integers and fixed point can be used with DNN weights as well, depending on the application and hardware.

There are a variety number of ways to represent numbers. However, the layouts tend to vary only in the number of bits involved (see the figure). The use of the sign bit in binary-encoded values differs depending on whether 1’s or 2’s complement encoding is used. The 1’s complement approach uses the same encoding for the integer portion, which means there is actually a positive and negative zero value. A 2’s complement number has a single zero value, but there’s one more negative value than positive value. For example, an 8-bit signed integer includes values −128 to −1, 0 and 1 to 127.