## What are your criteria for when to use fixed-point versus floating-point arithmetic in your embedded design?

Wednesday, September 22nd, 2010 by Robert Cravotta
The trade-offs between using fixed-point versus floating-point arithmetic in embedded designs continue to evolve. One set of trade-offs between the two types of arithmetic involves system cost, processing performance, and ease of use. Implementing fixed-point arithmetic is more complicated than using floating-point arithmetic on a processor with a floating-point unit. The extra complexity of determining scaling factors for fixed-point arithmetic and accommodating precision loss and overflow has historically been offset by allowing the system to run on a cheaper processor and, depending on the application, at lower energy consumption and with more accuracy than on a processor with an integrated floating-point unit.

However, the cost of on-chip floating-point units has been dropping for years and they crossed a cost threshold over the last few years as signaled by the growing number of processors that include an integrated floating-point unit (more than 20% of the processors listed in the Embedded Processing Directory device tables now include or support floating-point units). In conversations with processor vendors, they have shared with me that they have experienced more success with new floating-point devices than they anticipated, and this larger than expected success has spurred them to plan even more devices with floating-point support.

Please share your decision criteria for when to use fixed-point and/or floating-point arithmetic in your embedded designs. What are the largest drivers for your decision? When does the application volume make the cost difference between two processors drive the decision? Does the energy consumption between the two implementations ever drive a decision one way or the other? Do you use floating-point devices to help you get to market quickly and then migrate to a fixed-point implementation as you ramp up your volumes? As you can see, the characteristics of your application can drive the decision in many ways, so please share what you can about the applications with which you have experience performing this type of trade.

Tags: Fixed-point, Floating-point

This entry was posted on Wednesday, September 22nd, 2010 at 10:18 am and is filed under Question of the Week.

Criteria =

1. Data Accuracy

2. Processing Speed

3. Price

4. Delivery time-frame

Just nail down these requirements before doing the system design, and then to FPU or not to FPU is no longer the question.

I have found that FPU is not always better. I designed a geolocation library using FP, and ran into boundary conditions which got so bad it became a show stopper. Switching to 32-bit fixed point dramatically simplified the code, and in that case it ran faster as well.

-D.

Our company is centered around DSP applications, so my comments are focused toward this space. Most of our work is also built around Analog Devices’ DSP processors so this also colors my response. I have worked extensively with both fixed point and floating point DSPs.

Many of the fixed versus float points have already been raised in Robert’s initial post.

I think many users are too obsessed with component pricing. The cost of the processor is often a small cost when compared to the overall product and development. It probably only starts to become important when volumes are high. If you are comparing device costs, then I think you need to at least look at the supporting circuits at the same time. Most devices need a core supply (1.0 – 1.8V), maybe external RAM or flash, etc. Obviously, the I/O support is a factor as well.

In the past, fixed point DSPs were much less expensive than floating point devices. Today you can buy very good floating point DSPs for under $10 in not extremely high volumes and certainly under $15 in small volumes.

If you were to amortize firmware costs into the DSP cost, my guess is for many applications, floating point DSPs are less expensive than fixed point DSPs. The reason for this is two-fold.

First, most floating point DSPs are also very good fixed point processors. For example, a SHARC does 32 bit fixed point and floating point in a single cycle. Some algorithms are very easy to code in floating point and much harder in fixed point. FFTs and AGCs are examples. System scaling is also easy with floating point. Other algorithms are best implemented in fixed point. With most floating point devices, you use whatever format makes sense with no real performance cost. This quickly translates into less design time.

It is also possible to emulate floating point operations in a fixed point processor, but the cost in instructions is VERY high. An interesting exercise for C programmers is to look at the assembly code generated with a float operation for a 16 bit fixed point DSP.

A second tradeoff is that a typical fixed point DSP is 16 bits. This often means that double precision math is needed which takes multiple instructions (usually 4). Floating point DSPs usually have longer word lengths. This simplifies coding in assembly and certainly reduces execution time in C or assembly. This is less an issue with FPGA implementations which are almost always fixed point. In this case, the word width is arbitrary, just resource dependent. I realize that this is not specifically a float vs fixed issue, just tends to be true of commercial offerings.
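To give a feel for why double precision costs several instructions on a narrow multiplier, here is a minimal sketch in C. It builds a 32x32 multiply out of four 16x16 partial products, which is one plausible reading of the "usually 4" above. The function name `mul32x32_from_16` is made up for illustration, and a real DSP would do this in assembly with its own accumulator.

```c
#include <stdint.h>

/* Unsigned 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit
   partial products, roughly what a 16-bit multiplier has to do.
   Hypothetical helper for illustration only. */
uint64_t mul32x32_from_16(uint32_t a, uint32_t b)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFFu;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFFu;

    /* Each operand is at most 0xFFFF, so each product fits in 32 bits. */
    uint32_t lo_lo = a_lo * b_lo;   /* multiply 1 */
    uint32_t hi_lo = a_hi * b_lo;   /* multiply 2 */
    uint32_t lo_hi = a_lo * b_hi;   /* multiply 3 */
    uint32_t hi_hi = a_hi * b_hi;   /* multiply 4 */

    /* Shift each partial product to its weight and sum in 64 bits. */
    return ((uint64_t)hi_hi << 32)
         + ((uint64_t)hi_lo << 16)
         + ((uint64_t)lo_hi << 16)
         + (uint64_t)lo_lo;
}
```

On a device with a native 32-bit multiplier the same result is a single multiply, which is the word-length advantage described above.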

Power consumption is usually less for fixed point. This is changing. We recently started a project with a Blackfin and moved to a new SHARC ADSP-21479 since the power consumption was reasonable. We reduced the core clock to further reduce the power consumption. We estimate that the SHARC will still be as powerful at 100MHz core as a Blackfin running full speed — and the programming is much easier for our application.

I think the main reasons we sometimes use fixed point devices have less to do with signal processing and much more to do with peripheral support. The distinction between GP MCU and DSP is becoming increasingly blurred. This is especially true for fixed point devices. We tend to use floating point DSPs for signal processing intensive applications (in combination with FPGAs for high speed applications like SDR) and fixed point when we need external communications like USB or Ethernet.

My conclusion is that it’s worthwhile to recheck your assumptions and reconsider floating point devices.

A. C.

http://www.danvillesignal.com

IMO you should almost ALWAYS use an FPU if it’s available. The reason is neither cost nor power, but reliability. For many years I dealt with software written for integer-only processors. It always seemed to me that the most likely source of error was, by far, scaling errors. The programmers would usually (but not always) get the arithmetic right when all the input parameters were within the expected ranges, but strange things would happen when an unexpected value showed up in the computations.

The one exception is the case D. mentions. If you’re measuring angles like latitude and longitude, there can be no overflow problems. If you use “pirad” scaling, then the behavior of the fixed point arithmetic matches that of the angle itself: It wraps around at 180 degrees. Implementation of the trig functions can be tricky, but you only have to do it once, then reuse the software on the next job.
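A rough sketch of the angle scaling idea in C, assuming a signed 32-bit word that spans one full turn (often called binary angular measurement). The type and function names are hypothetical, and the conversion back from unsigned to signed assumes a two's complement machine, which is true of essentially every embedded target.

```c
#include <assert.h>
#include <stdint.h>

/* Angle stored as a signed 32-bit binary fraction of a half turn:
   INT32_MIN is -180 degrees, one LSB is 180 / 2^31 degrees. */
typedef int32_t bam32_t;

bam32_t bam_from_degrees(double deg)
{
    assert(deg >= -180.0 && deg < 180.0);   /* caller keeps input in range */
    return (bam32_t)(deg * (2147483648.0 / 180.0));
}

double bam_to_degrees(bam32_t a)
{
    return (double)a * (180.0 / 2147483648.0);
}

bam32_t bam_add(bam32_t a, bam32_t b)
{
    /* Add as unsigned so the wraparound is well defined in C. The word
       overflows exactly where the angle itself wraps, at +/-180 degrees. */
    return (bam32_t)((uint32_t)a + (uint32_t)b);
}
```

Adding 170 degrees and 20 degrees this way lands at about -170 degrees, with no range check anywhere in the arithmetic.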

For many applications, hardware floating point support is indispensable. But this thread reminds me of when I was asked to speed up some 2-D line graphing code with gobs of floating point math. The result was code that did one 16 bit unsigned integer multiply to get the pixel offset for each y value. One of the 16-bit factors was a constant with an implied decimal point before the most significant bit. It required a complex floating point expression to calculate this constant, but it only had to be done once per graph. This is an extreme case, but the theme comes up fairly often. Hardware floating point can be a panacea that improves a bad solution, thus discouraging the more careful design that leads to a good solution.
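The graphing trick can be sketched roughly as follows. This is not the original code: the function names, the `y_max` and `height_px` parameters, and the simple ratio used for the constant are all assumptions for illustration. The point is only that the floating point work happens once per graph and each sample costs one integer multiply.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 16-bit scale constant once per graph. The binary point sits
   just above the most significant bit, so the constant means scale / 65536. */
uint16_t make_scale_q16(double y_max, uint16_t height_px)
{
    assert(y_max > (double)height_px);   /* ratio must stay below 1.0 to fit */
    return (uint16_t)((double)height_px / y_max * 65536.0);
}

/* Per-sample work: one 16x16 -> 32 multiply, keep the upper 16 bits. */
uint16_t y_to_pixel(uint16_t y, uint16_t scale_q16)
{
    return (uint16_t)(((uint32_t)y * scale_q16) >> 16);
}
```

With a 240 pixel tall plot and data running up to 1000 counts, the constant comes out as 15728, and a sample of 500 maps to pixel row 119.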

I think the decision has to depend, among other things, on what the numbers represent. In a nav system, for example, the raw inputs are clearly integers from the gyros and accels. But the state variables are real-world variables like position and velocity. In the real world, such parameters vary continuously, so the natural computer representation is floating point.

Likewise, if your system is running a chemical plant, the state variables are things like temperature, pressure, pH, etc., all inherently continuously variable things.

Sure, it’s possible to assign a fixed-point representation for these things. We did it, for years, when we had no alternative. But you’re introducing unnecessary quantization errors in a variable that is inherently continuous. Better, IMO, to use FP internally, and only quantize when the data goes back out to the outside world.

Just one caution when using FP internally: make sure you’re not sweeping your data accuracy requirements under the rug. It’s all too easy to do so inadvertently, and then it can give you difficult-to-debug problems somewhere further down the data stream. Ouch!

Even worse, if all your test scenarios happen to pass, you may not even realize you’ve wandered outside the required accuracy somewhere mid stream.

Such doesn’t happen with fixed point, because you have to pick your minimum quanta, which then serves as your de-facto data accuracy.

-D.

Floating point’s benefit over fixed point is that it has a movable radix (i.e. decimal point), not that it is ‘inherently continuous’, or has more innate accuracy.

Matter of fact, given a set range and an equal number of bits, fixed point representation is able to be MORE accurate (and thus more ‘continuous’) than floating point since none of its bits are spent on specifying the exponent and all are spent quantizing the value in question.
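A small C check makes the bit-for-bit point concrete. It assumes `float` is an IEEE 754 single precision value with a 24-bit significand, which is the usual case but not guaranteed by the C language. The function name is made up for this sketch.

```c
#include <stdint.h>

/* Returns 1 if v survives a trip through a 32-bit float unchanged.
   Hypothetical helper; assumes IEEE 754 single precision. */
int survives_float_round_trip(int32_t v)
{
    /* volatile forces a real 32-bit store, so extra register
       precision cannot hide the rounding. */
    volatile float f = (float)v;

    /* Compare in 64 bits: the float may have rounded up to 2^31,
       which would not fit back into an int32_t. */
    return (int64_t)f == (int64_t)v;
}
```

Every value up to 16777216 (2^24) survives, but 16777217 does not, whereas a 32-bit integer over the same range represents both exactly. The flip side is that the float keeps its 24 significant bits for very small and very large values as well, which is the dynamic range argument.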

In response to the initial question, I’d currently only consider a processor with floating point if my application demanded it (i.e. floating point calculation was integral to its proper operation and was in the critical calculation path), or it otherwise came ‘for free’ for other design reasons. I personally see this as a very narrow band of applications (until you get to huge APs with 3D cores, and even then, FPUs are in the minority).

As for the assertion that more than 20% of the processors now have, or can support, floating point in hardware, or that we’ve reached some sort of price point where it’s viable in general embedded, I’m unsure where that comes from. The list of processors compiled by your site surely doesn’t have FPU support for 20% (even allowing all of the flavors of Intel and AMD chips to count as separate). Also, just about every ‘FPU’ enabled device on your list starts at $15 or more and/or requires external SDRAM or other considerable cost drivers. There may be a few outliers in that list, of course, but I didn’t see them with a quick glance through the 193 pages…

The recently announced ARM Cortex-M4 might make a difference, but its floating point unit (when you look at the specs on the ARM site) is larger in 65nm than the entire M3 core is in 90nm. Depending on how much ‘other’ stuff (like RAMs or NVM) is on an SoC with an M4, it might approach the ‘free’ stage, but I’m guessing it’ll still add a cost that isn’t negligible when compared to other offerings.

“Just one caution when using an FP internally, make sure you’re not sweeping your data accuracy requirements under the rug. It’s all too easy to do so inadvertently, and then it can give you difficult to debug problems somewhere further down the data stream. Ouch!”

D., with respect, I think the situation is exactly the opposite of what you describe. IMX it’s the fixed-point implementation which has by far the greatest risk of accuracy loss, as well as (by far!) the greatest risk of difficult-to-find errors. I’ve paid my dues, spending many hours converting hex numbers to equivalent values in the “real world,” all in an attempt to find a subtle error in scaling. It’s not a pastime I want to repeat.

“Even worse, if all your test scenarios happen to pass, you may not even realize you’ve wandered outside the required accuracy somewhere mid stream.”

That’s right. When you select a scaling for scaled fixed-point, you’d better get it right. If a data element comes in that’s outside your expected range, BAAADDD things can happen. I’ve often wondered that so many life-critical applications (e.g., ICBM’s and fighter planes) are controlled by software so sensitive to scaling errors.

“Floating point’s benefit over fixed point is that it has a movable radix (i.e. decimal point), not that it is ‘inherently continuous’, or has more innate accuracy. ”

Oh, please. Are we going to go there again? Granted, floating point numbers may be an imperfect solution to the representation of an inherently continuous measurement in a digital computer, but they’re still less imperfect than any other. Unless of course you’re recommending going back to analog computers or bizarre variable-length integers in rational fractions.

“Matter of fact, given a set range and an equal number of bits, fixed point representation is able to be MORE accurate (and thus more ‘continuous’) than floating point since none of its bits are spent on specifying the exponent and all are spent quantizing the value in question.”

That’s true, but only in this sense: A 32-bit integer has more resolution than a 32-bit floating point number, _IFF_ the number you’re working with has a very narrow dynamic range (say, at most, factor of 2). But these days FPUs tend to have FP representations using as many as 80 bits, so the FP number is likely to have much better resolution.

But there’s a much more important point: When designing software to represent real-world numbers, you have to consider the dynamic range of the number, and allocate enough bits for the worst case. This means that, under most conditions where the values are well away from the dynamic limits, you’re bound to have several high-order bits that are all zeroes. Let that happen more than a few times, for every intermediate result in a long series of calculations, and your resolution goes out the window.

We haven’t talked yet about numbers that are being accumulated, as in numerical integrations. This includes the state variables of _ANY_ digital filter in the system.

As the system runs and the filter is operating, its state variable is going to get larger and larger. So bits have to be allocated for that.

Naturally, such problems can be overcome with judicious programming. In the case of the digital filter, for example, you can periodically normalize to get the state variable back in range.

Which means that you’ve just converted an integer computation into a floating point one. Floating point is characterized by a variable exponent, which is exactly the same result you get with normalized fixed point.

Much more to the point is the issue of dynamic range. When choosing a representation for a number, you have to look at the largest range it’s likely to have, and allow enough bits for it.

But if it has wider dynamic range, you must allocate bits to hold it — bits that may often be zero.

If we assume the same word length, then with fixed point you get precision and with floating point you get dynamic range. This creates tradeoffs.

One of the practical benefits of floating point processors is that they generally are all good fixed point processors at the same time. The converse is not true since floating point emulation is generally horrible from an instruction point of view.

I would never claim that floating point operations are intrinsically better or more reliable than fixed point. I think that half the signal processing algorithms I have written for SHARC are fixed point implementations.

I am not implying that fixed point is better either. It just depends….

When you are trading off fixed versus floating point operations, you also need to consider implementation of your algorithm.

For example, let’s say you are implementing IIR filters using biquads. Fixed point implementations are very well served by direct form 1 using either 1st order error correction or double precision. This takes advantage of the typical double wide accumulator found in most MACs. A floating point implementation does not benefit in the same way. It is better implemented in direct form 2 or direct form 2 transpose.
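As a rough illustration of the fixed point case, here is a minimal direct form 1 biquad in C with first-order error feedback. The formats are assumptions chosen for this sketch: Q15 samples, Q14 coefficients, and a 64-bit accumulator standing in for a DSP accumulator with guard bits. The struct and function names are hypothetical, and the code relies on right shifts of negative values being arithmetic, which holds on mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] */
typedef struct {
    int16_t b0, b1, b2;   /* feed-forward coefficients, Q14 */
    int16_t a1, a2;       /* feedback coefficients, Q14 */
    int16_t x1, x2;       /* previous inputs, Q15 */
    int16_t y1, y2;       /* previous outputs, Q15 */
    int32_t err;          /* bits dropped when the last output was quantized */
} biquad_q15;

int16_t biquad_df1_q15(biquad_q15 *s, int16_t x)
{
    /* Start from the saved quantization error (first-order error feedback). */
    int64_t acc = s->err;

    /* Q15 * Q14 products are Q29 and are summed at full width. */
    acc += (int32_t)s->b0 * x;
    acc += (int32_t)s->b1 * s->x1;
    acc += (int32_t)s->b2 * s->x2;
    acc -= (int32_t)s->a1 * s->y1;
    acc -= (int32_t)s->a2 * s->y2;

    /* Q29 -> Q15, and remember the 14 bits that were thrown away. */
    int64_t y = acc >> 14;
    s->err = (int32_t)(acc - y * 16384);

    /* Saturate instead of wrapping if the output exceeds 16 bits. */
    if (y > INT16_MAX) y = INT16_MAX;
    if (y < INT16_MIN) y = INT16_MIN;

    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = (int16_t)y;
    return (int16_t)y;
}
```

In direct form 1 the only quantization point is the output, so a single saved error term is enough. Without it, a slow low-pass with poles close to the unit circle tends to stall short of its final value on small inputs.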

If you have sufficient dynamic range, scaling, etc it may not matter how you implement your algorithm. If you have low frequency poles near the unit circle, you may be in trouble.

I absolutely agree with Jack that scaling is a big deal. This is maybe the biggest advantage to floating point systems. It tends to take care of scaling issues naturally.

Obviously good design considers these kinds of things early. Perhaps too many people pick the processor and code before they think.

This discussion started in part because to some extent, the rules have changed. For a large number of applications, floating point devices are now very cost effective. This makes it much easier to consider technical tradeoffs without automatically deciding that a floating point implementation would be too expensive.

“Oh, please. Are we going to go there again? Granted, floating point numbers may be an imperfect solution to the representation of an inherently continuous measurement in a digital computer, but they’re still less imperfect than any other. Unless of course you’re recommending going back to analog computers or bizarre variable-length integers in rational fractions.”

No, I’m recommending understanding what is going on underneath the hood and not simply believing that there’s something magically analog, continuous, and precise about floating point representation.

“That’s true, but only in this sense: A 32-bit integer has more resolution than a 32-bit floating point number, _IFF_ the number you’re working with has a very narrow dynamic range (say, at most, factor of 2). But these days FPUs tend to have FP representations using as many as 80 bits, so the FP number is likely to have much better resolution.”

Again, bit for bit, fixed point is more precise in a set range (the range being approximately 2^(number of exponent bits)) because it has more bits available for representation. A 32 bit float typically spends 8 bits on the exponent; a 64 bit float spends 11.

As for your state variable problem, I hate to break it to you, but your accumulator state variable suffers the exact same problem in floating point or fixed point (and again, the problem is less pronounced in fixed point, assuming you’ve chosen your scale properly). In both cases, if you’re accumulating enough of the 2^0 bits, your accumulator eventually exceeds 2^max. Floating point would automatically scoot the radix over by 1, but then addition of more 2^0 bits (now 2^-1, since the radix moved) would fail to be accumulated because they simply can’t be represented. Fixed point would roll over (or saturate, depending on the implementation)…but again, later because you have more bits.
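The stalled accumulator is easy to reproduce. The sketch below assumes IEEE 754 single precision with the default round-to-nearest-even mode; the function names are invented for the example.

```c
#include <stdint.h>

/* Add 1.0f to a float accumulator n times. */
float accumulate_ones_float(float start, uint32_t n)
{
    volatile float acc = start;   /* volatile keeps each sum at 32-bit precision */
    for (uint32_t i = 0; i < n; i++) {
        acc = acc + 1.0f;
    }
    return acc;
}

/* The same accumulation in a 32-bit unsigned integer. */
uint32_t accumulate_ones_u32(uint32_t start, uint32_t n)
{
    uint32_t acc = start;
    for (uint32_t i = 0; i < n; i++) {
        acc += 1u;
    }
    return acc;
}
```

Starting from 16777216 (2^24), a hundred additions leave the float exactly where it began, because 2^24 + 1 is not representable and each sum rounds back down. The integer accumulator reaches 16777316 and keeps counting until it rolls over at 2^32.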

The real issue with your accumulator is you don’t have enough bits for what you’re trying to accumulate. If saying FLOAT gives you a 48-ish bit value and that solves the problem over a 32 bit fixed point representation, it isn’t FLOAT that’s fixing the problem. Either that or your filter is unstable and the larger bit bucket of FLOAT simply gives you more time before it diverges.

Floating point has its uses, but it isn’t some cure-all, and it has its own set of limitations imposed while ‘solving’ the problems of fixed point representation.

J.,

Re: “I think the situation is exactly the opposite of what you describe. IMX it’s the fixed-point implementation which has by far the greatest risk of accuracy loss, as well as (by far!) the greatest risk of difficult-to-find errors.”

I believe you are talking about full, fixed-point library solutions per this thread. Sorry I didn’t make clear that I was actually talking about the super-simplified case of using integer math with a simple conversion scalar coming in and going out. Easy to design, easy to verify, easy to define actions in over/under-flow cases… what’s not to like? It’s the only way to go when your requirements don’t absolutely force you into a fixed-point library solution.
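The simple integer case might look something like this in C. The 0.01 degree quantum, the temperature example, and all the names are made up for illustration; the idea is only that there is one scale factor on the way in, one on the way out, and an explicit, testable action on overflow.

```c
#include <stdint.h>

#define COUNTS_PER_DEGC 100   /* hypothetical quantum: 0.01 degC per count */

/* Scale on the way in, saturating instead of overflowing. */
int32_t degc_to_counts(double degc)
{
    double scaled = degc * COUNTS_PER_DEGC;
    if (scaled >= (double)INT32_MAX) return INT32_MAX;
    if (scaled <= (double)INT32_MIN) return INT32_MIN;
    /* Round to the nearest count; the cast truncates toward zero. */
    return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* All internal math stays in integer counts. */
int32_t counts_sat_add(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Scale on the way out. */
double counts_to_degc(int32_t counts)
{
    return (double)counts / COUNTS_PER_DEGC;
}
```

The quantum is visible in one place, so the data accuracy is whatever `COUNTS_PER_DEGC` says it is, and the saturation limits can be exercised directly in a unit test.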

-D.

Fixed point or floating point? I guess it would depend on what I was trying to accomplish, and the resolution and accuracy of my inputs and outputs.

To me, floating-point and fixed-point are exchangeable algorithm-wise. The choice is between development time and execution speed.

I tend to use floating-point for complex algorithms that are not time-critical. The development time is shorter when you don’t need to do the scaling manually.

But for time-critical pieces, fixed-point always comes out ahead. There are more optimization opportunities when dealing with fixed-point.

In both cases you need to know your data range to avoid precision issues.