What’s the Difference Between Fixed-Point, Floating-Point, and Numerical Formats? (.PDF Download)

Aug. 31, 2017
Embedded C and C++ programmers are familiar with signed and unsigned integers and floating-point values of various sizes, but a number of other numerical formats can be used in embedded applications. Here we take a look at these formats and where they might be found.

One reason for examining different formats is to understand how they work and where they can be applied. For example, fixed-point values can often be used when floating-point support isn’t available. Fixed point may also be preferable in some instances even when floating-point support is available, for reasons such as precision or representation.
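As a rough illustration of how fixed point can stand in for floating point, the sketch below stores values in a 32-bit integer with 16 fractional bits. The Q16.16 layout, the type name `q16_16_t`, and the helper names are assumptions chosen for this example, not something the article prescribes, and overflow is not handled.

```c
#include <stdint.h>

/* Hypothetical Q16.16 type: 16 integer bits and 16 fractional bits
   packed into a signed 32-bit integer. */
typedef int32_t q16_16_t;

#define Q16_16_ONE 65536 /* the value 1.0 in Q16.16 (2^16) */

/* Convert a small integer to Q16.16. */
static inline q16_16_t q16_16_from_int(int16_t n)
{
    return (q16_16_t)n * Q16_16_ONE;
}

/* Addition needs no rescaling: both operands share the same scale. */
static inline q16_16_t q16_16_add(q16_16_t a, q16_16_t b)
{
    return a + b;
}

/* Multiplication yields 32 fractional bits, so the 64-bit product is
   divided by 2^16 to return to Q16.16 (truncating toward zero).
   Results outside the Q16.16 range are not detected in this sketch. */
static inline q16_16_t q16_16_mul(q16_16_t a, q16_16_t b)
{
    return (q16_16_t)(((int64_t)a * (int64_t)b) / Q16_16_ONE);
}
```

In this format 1.5 is stored as 98304 and 2.25 as 147456, and `q16_16_mul` of the two gives 221184, which is 3.375 scaled by 65536. Only integer instructions are involved, which is why the approach suits processors without a floating-point unit.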

Developers may be using single- and double-precision IEEE 754 standard formats, but what about 16-bit half precision or even 8-bit floating point? The latter is being used in deep neural networks (DNNs), where small values are useful. Small integers and fixed point can be used with DNN weights as well, depending on the application and hardware.
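To make the 16-bit half-precision format more concrete, here is a minimal sketch that decodes an IEEE 754 binary16 bit pattern (1 sign bit, 5 exponent bits with a bias of 15, 10 fraction bits) into a C `float`. The function names `half_to_float` and `scale_pow2` are hypothetical, and the code favors readability over speed.

```c
#include <stdint.h>
#include <math.h> /* only for the INFINITY and NAN macros */

/* Multiply v by 2^e using exact power-of-two steps. */
static float scale_pow2(float v, int e)
{
    while (e > 0) { v *= 2.0f; e--; }
    while (e < 0) { v *= 0.5f; e++; }
    return v;
}

/* Decode an IEEE 754 half-precision (binary16) bit pattern. */
static float half_to_float(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;
    int fraction = h & 0x3FF;
    float value;

    if (exponent == 0) {
        /* Zero or subnormal: fraction * 2^-24, no implicit leading 1. */
        value = scale_pow2((float)fraction, -24);
    } else if (exponent == 0x1F) {
        /* All exponent bits set: infinity or NaN. */
        value = fraction ? NAN : INFINITY;
    } else {
        /* Normal: (1 + fraction/1024) * 2^(exponent - 15). */
        value = scale_pow2((float)(fraction + 1024), exponent - 25);
    }
    return sign ? -value : value;
}
```

For example, the pattern 0x3C00 decodes to 1.0 and 0x7BFF decodes to 65504, the largest finite half-precision value.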

There are a variety of ways to represent numbers. However, the layouts tend to vary only in the number of bits involved (see the figure). The use of the sign bit in binary-encoded values differs depending on whether 1’s or 2’s complement encoding is used. In 1’s complement, a negative value is the bitwise inverse of the corresponding positive value, which means there is both a positive and a negative zero. A 2’s complement number has a single zero value, but there’s one more negative value than positive value. For example, an 8-bit 2’s complement signed integer covers −128 to −1, 0, and 1 to 127.
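The difference between the two encodings can be sketched by interpreting the same 8-bit pattern both ways. The function names below are hypothetical and the code only illustrates the encodings; C compilers for current hardware use 2’s complement for their built-in signed types.

```c
#include <stdint.h>

/* Interpret an 8-bit pattern as a 2's complement value.
   Patterns with the sign bit set represent bits - 256. */
static inline int twos_complement_value(uint8_t bits)
{
    return (bits & 0x80u) ? (int)bits - 256 : (int)bits;
}

/* Interpret an 8-bit pattern as a 1's complement value.
   A negative value is the bitwise inverse of its magnitude,
   so 0xFF (the inverse of 0x00) is a second, negative zero. */
static inline int ones_complement_value(uint8_t bits)
{
    return (bits & 0x80u) ? -(int)(uint8_t)~bits : (int)bits;
}
```

With these helpers, the pattern 0x80 reads as −128 in 2’s complement but −127 in 1’s complement, while 0xFF reads as −1 in 2’s complement and as zero in 1’s complement.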