Which Registers Are Floating Point Registers In Arm

My mantra is *not* to use any floating point data types in embedded applications, or at least to avoid them whenever possible: for most applications, they are not necessary and can be replaced by fixed-point operations. Not only do floating point operations have numerical problems, but they can also lead to performance problems, as in the following (simplified) example:

          #include <stdint.h>

          #define NOF  64
          static uint32_t samples[NOF];
          static float Fsamples[NOF];
          float fZeroCurrent = 8.0;

          static void ProcessSamples(void) {
            int i;

            for (i=0; i<NOF; i++) {
              Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;
            }
          }

ARM designed the Cortex-M4 architecture so that an FPU can be added. For example, the NXP ARM Cortex-M4 on the FRDM-K64F board has an FPU present.

MK64FN1M0VLL12 on FRDM-K64F

The question is: how long will that function need to perform the operations?

Looking at the loop, it does:

          Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

That is: load a 32-bit value, perform a floating point multiplication followed by a floating point division and a floating point subtraction, then store the result back in the result array.

The NXP MCUXpresso IDE has a cool feature showing the number of CPU cycles spent (see Measuring ARM Cortex-M CPU Cycles Spent with the MCUXpresso Eclipse Registers View). So, running that function (without any special optimization settings in the compiler) takes:

Cycle Delta

0x4b9d or 19'357 CPU cycles for the whole loop. Measuring only one iteration of the loop takes 0x12f or 303 cycles. One might wonder why it takes such a long time, as we do have an FPU?

The answer is in the assembly code:

This shows that it does not use the FPU but, instead, uses software floating point operations from the standard library.

The answer is the way the operation is written in C:

          Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

We have here a uint32_t multiplied with a floating point number:

          samples[i]*3.3

The thing is that a constant such as '3.3' in C is of type *double*. As such, the operation will first convert the uint32_t to a double, and then perform the multiplication as a double operation.
The same applies to the division, which is performed as a double operation:

          samples[i]*3.3/4096.0

The same applies to the subtraction with the float variable: because the left operand is a double, the subtraction has to be performed as a double operation.

          samples[i]*3.3/4096.0 - fZeroCurrent

Finally, the result is converted from a double to a float to store it in the array:

          Fsamples[i] = samples[i]*3.3/4096.0 - fZeroCurrent;

Now the library routines called in the above assembly code should be clear:

__aeabi_ui2d: convert unsigned int to double
__aeabi_dmul: double multiplication
__aeabi_ddiv: double division
__aeabi_f2d: float to double conversion
__aeabi_dsub: double subtraction
__aeabi_d2f: double to float conversion
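To make the hidden work visible, the same expression can be written with every implicit conversion spelled out as a cast. This is only a sketch: the function names `convert_implicit` and `convert_explicit` are hypothetical, and the routine named in each comment follows the list above for a build without double precision hardware. A different compiler or option set may emit other calls.

```c
#include <stdint.h>

/* The expression as written in the article: the double constants
   force the whole computation into double precision. */
float convert_implicit(uint32_t sample, float zeroCurrent) {
  return sample*3.3/4096.0 - zeroCurrent;
}

/* The same computation with each implicit conversion made explicit. */
float convert_explicit(uint32_t sample, float zeroCurrent) {
  double d = (double)sample;     /* __aeabi_ui2d: unsigned int to double */
  d = d * 3.3;                   /* __aeabi_dmul: double multiplication */
  d = d / 4096.0;                /* __aeabi_ddiv: double division */
  d = d - (double)zeroCurrent;   /* __aeabi_f2d, then __aeabi_dsub */
  return (float)d;               /* __aeabi_d2f: double to float */
}
```

Both functions should produce the same value; the second one just shows where the six library calls come from.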

But why is this done in software and not in hardware, as we have an FPU?

The answer is that the ARM Cortex-M4F has only a *single precision* (float) FPU, and not a double precision (double) FPU. As such, it can only do float operations in hardware, but not double operations.
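Whether a build targets single or double precision hardware can also be checked at compile time. The sketch below assumes the `__ARM_FP` predefined macro from the ARM C Language Extensions (ACLE), where, as far as I know, bit 2 (0x04) indicates hardware single precision and bit 3 (0x08) hardware double precision; please verify this against your toolchain documentation. The function name `fpu_supports_double` is made up for this example.

```c
/* Returns 1 if the build targets hardware double precision, 0 if it
   targets an FPU without double precision, and -1 if the compiler does
   not define __ARM_FP (for example a non-ARM host or a soft-float build).
   Assumption: bit 3 (0x08) of __ARM_FP means double precision support. */
int fpu_supports_double(void) {
#if defined(__ARM_FP)
  return (__ARM_FP & 0x08) ? 1 : 0;
#else
  return -1;
#endif
}
```

On a Cortex-M4F build this would be expected to return 0, which is a hint to look for double constants in time-critical code.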

The solution, in this case, is to use float (and not double) constants. In C, the 'f' suffix can be used to mark constants as float:

          Fsamples[i] = samples[i]*3.3f/4096.0f - fZeroCurrent;

With this, the code changes to this:

Using Single Precision FPU Instructions

So now it is using the single precision instructions of the FPU. This only takes 0x30 (48) cycles for a single iteration or 0xc5a (3162) for the whole thing: 6 times faster.
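The change in expression type can be confirmed on any C11 compiler without looking at the assembly. The following sketch uses `_Generic` to name the type of both expressions; the macro `TYPE_NAME` and the two function names are hypothetical helpers, not part of the original project.

```c
#include <stdint.h>

/* Maps the type of an expression to a string. _Generic only inspects
   the type, the expression itself is not evaluated. */
#define TYPE_NAME(expr) _Generic((expr), \
    float:   "float",                    \
    double:  "double",                   \
    default: "other")

/* Type of the expression with double constants (3.3 and 4096.0). */
const char *type_with_double_constants(uint32_t sample, float zero) {
  return TYPE_NAME(sample*3.3/4096.0 - zero);
}

/* Type of the expression with float constants (3.3f and 4096.0f). */
const char *type_with_float_constants(uint32_t sample, float zero) {
  return TYPE_NAME(sample*3.3f/4096.0f - zero);
}
```

The first function reports "double" and the second "float", which matches the switch from library calls to FPU instructions.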

The example can be optimized even further with:

          Fsamples[i] = samples[i]*(3.3f/4096.0f) - fZeroCurrent;
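Grouping the constants lets the compiler fold `3.3f/4096.0f` into a single value at compile time, which leaves one multiplication and one subtraction per sample. Below is a minimal sketch of the whole loop in that form; `ADC_SCALE` and the parameterized function `ProcessSamplesScaled` are names introduced here for illustration. Note that regrouping changes the rounding slightly, so results can differ from the ungrouped version in the last bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale factor per ADC count, folded into one float constant. */
#define ADC_SCALE (3.3f/4096.0f)

/* Converts n raw samples using one float multiplication and one
   float subtraction per sample. */
void ProcessSamplesScaled(const uint32_t *in, float *out, size_t n, float zero) {
  for (size_t i = 0; i < n; i++) {
    out[i] = (float)in[i]*ADC_SCALE - zero;
  }
}
```

Passing the arrays as parameters is a design choice of this sketch so it can be tried on a host; the original uses static arrays.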

Other Considerations

Using float or double is not bad per se: it all depends on how they are used and whether they are really necessary. Fixed-point arithmetic is not without problems either, and the standard sin/cos functions use double, so you don't want to re-invent the wheel.

CENTIVALUES

One way is to use a float type, say for a temperature value:

          float temperature; /* e.g. -37.512 */

Instead, it might be a better idea to use a 'centi-temperature' or 'milli' integer variable type:

          int32_t milliTemperature; /* -37512 corresponds to -37.512 */

That way, normal integer operations can be used.
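Applied to the ADC example from the beginning, the same idea avoids floating point entirely. The sketch below assumes a 12-bit sample (0..4095) and scales 3.3 to 3300 'milli' units, so the intermediate product fits comfortably in 32 bits. The name `sample_to_milli` and its parameters are hypothetical.

```c
#include <stdint.h>

/* Integer equivalent of sample*3.3/4096.0 - zero, expressed in milli
   units (1000 corresponds to 1.000).
   Assumption: sample <= 4095, so sample*3300 cannot overflow 32 bits.
   The division truncates, so up to one milli unit is lost. */
int32_t sample_to_milli(uint32_t sample, int32_t zeroMilli) {
  return (int32_t)((sample * 3300u) / 4096u) - zeroMilli;
}
```

For a sample of 2048 and a zero offset of 8000, this yields -6350, which corresponds to -6.350.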

GCC SINGLE PRECISION CONSTANTS

The GNU GCC compiler offers to treat double constants such as 3.0 as single precision constants (3.0f) using the following option:

-fsingle-precision-constant causes floating-point constants to be loaded in single precision even when this is not exact. This avoids promoting operations on single precision variables to double precision like in x + 1.0/3.0. Note that this also uses single precision constants in operations on double precision variables. This can improve performance due to less memory traffic.

See https://gcc.gnu.org/wiki/FloatingPointMath

RTOS

The other consideration is: if using the FPU, it means potentially stacking more registers. This is a possible performance problem for an RTOS like FreeRTOS (see https://www.freertos.org/Using-FreeRTOS-on-Cortex-A-Embedded-Processors.html). The ARM Cortex-M4 supports 'lazy stacking' (see https://stackoverflow.com/questions/38614776/cortex-m4f-lazy-fpu-stacking). So, if the FPU is used, it means more stacked registers. If no FPU is used, then it is better to select the M4 port in FreeRTOS:

M4 and M4F in FreeRTOS

Summary

I recommend not using float and double data types unless necessary. And if you have an FPU, pay attention to whether it is only a single precision FPU or whether the hardware supports both single and double precision. With a single precision FPU only, using the 'f' suffix for constants and casting things to (float) can make a big difference. But keep in mind that float and double have different precision, so this might not solve every problem.

Happy Floating!

PS: If in need of a double precision FPU, have a look at the ARM Cortex-M7 (e.g. First steps: ARM Cortex-M7 and FreeRTOS on NXP TWR-KV58F220M or First Steps with the NXP i.MX RT1064-EVK Board).