Last update March 7, 2012

Doc Comments / Floating Point

More Information



Put your comments about the official/non-official page here.

Rounding Control

IEEE 754 floating point arithmetic includes the ability to set 4 different rounding modes. D adds syntax to access them: [blah, blah, blah] [NOTE: this is perhaps better done with a standard library call]
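Picking up the note that this is perhaps better done with a library call: here is a minimal sketch of what such a call can look like, written in C++ rather than D and using the standard <cfenv> header. The helper name divide_rounded is made up for illustration. fegetround, fesetround and the FE_* macros are standard, but an implementation only defines the macros for the modes it supports, so treat this as a sketch for typical IEEE 754 hardware.

```cpp
#include <cfenv>

// Hypothetical helper: perform one division under the requested rounding
// mode (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD or FE_TOWARDZERO), then
// restore whichever mode was active before the call.
double divide_rounded(double numerator, double denominator, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    // volatile keeps the compiler from folding the division at compile
    // time or moving it across the fesetround calls.
    volatile double a = numerator;
    volatile double b = denominator;
    volatile double result = a / b;
    std::fesetround(saved);
    return result;
}
```

Since 1/3 has no exact binary representation, divide_rounded(1.0, 3.0, FE_UPWARD) and divide_rounded(1.0, 3.0, FE_DOWNWARD) should come out as two adjacent doubles, one on each side of the true quotient.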

Exception Flags

IEEE 754 floating point arithmetic can set several flags based on what happened with a computation: [blah, blah, blah]. These flags can be set/reset with the syntax: [blah, blah, blah] [NOTE: this is perhaps better done with a standard library call]
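As with rounding control, a library-call version of flag access is easy to sketch. The example below is C++ (not D) and uses the standard <cfenv> functions feclearexcept and fetestexcept. The helper name division_raises_divbyzero is invented for illustration, and the sketch assumes the platform actually records the FE_DIVBYZERO flag.

```cpp
#include <cfenv>

// Hypothetical helper: clear all exception flags, divide, and report
// whether the divide-by-zero flag was raised by that division.
bool division_raises_divbyzero(double numerator, double denominator) {
    std::feclearexcept(FE_ALL_EXCEPT);
    // volatile forces the division to happen at run time, after the
    // flags have been cleared.
    volatile double a = numerator;
    volatile double b = denominator;
    volatile double quotient = a / b;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

A finite nonzero value divided by zero should raise the flag, while an exact division such as 1.0 / 2.0 should leave it clear. No trap is taken in either case; the flag is just a sticky status bit until it is cleared.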

How about finishing this page?


You may add a link to this page on the DocumentationAmendments page to draw extra attention to your suggestion.

On many computers, greater precision operations do not take any longer than lesser precision operations, so it makes numerical sense to use the greatest precision available for internal temporaries... On Intel x86 machines, for example, it is expected (but not required) that the intermediate calculations be done to the full 80 bits of precision implemented by the hardware.

This rationale is incorrect. Quoting the IA-32 Intel® Architecture Optimization Reference Manual:

Do not use double precision unless necessary. Set the precision control (PC) field in the x87 FPU control word to "Single Precision". This allows single precision (32-bit) computation to complete faster on some operations...

Single precision operations allow the use of longer SIMD vectors, since more single precision data elements can fit in a register...

x87 supports 80-bit precision, double extended floating point. Streaming SIMD Extensions support a maximum of 32-bit precision, and Streaming SIMD Extensions 2 supports a maximum of 64-bit precision.

In a Pentium 4, the x87 instructions are effectively deprecated. They're painfully slow, slower than they were in the Pentium III. Using 80-bit arithmetic for all intermediate operations will make floating point performance three times slower just for this reason alone. The optimization reference manual cited above explains why smaller operand size is so important -- memory bandwidth is often a performance bottleneck. And vectorization (if gcc ever gets around to supporting it) will make for another factor of two slowdown from 32-bit to 64-bit. -- TimStarling


In the context of the Intel doc., it looks like what they are suggesting is that the application programmer (as opposed to the compiler developer) use single precision when double precision is not needed. It's a common recommendation that the application programmer use single precision (floats) rather than doubles if the extra precision is not needed and there is a lot of floating point data moving around, because it is often faster.

On Intel (including the P4) the floating point registers are 80 bit. All the author is suggesting in the context of the D language is that compiler developers shouldn't have to limit precision to 32 bits (floats) or 64 bits (doubles) if keeping 80 bit precision results in faster code. D is allowing for this where other languages may specify a maximum precision regardless of what is best for the hardware.

The best contemporary (Fall, 2004) optimizing compilers all use 80 bit precision to/from the Intel floating point registers for intermediate data when "maximum performance" switches are set. And for cases when strict maximum precision is needed all also have a switch to "improve floating point consistency" by rounding/truncating intermediate values, which is often a speed "deoptimization" [this includes code generation for both the P4 and AMD64 chips]. D on the other hand follows IEEE 754 minimum precision guidelines for floats and doubles, doesn't specify a maximum precision and also offers the real (80 bit floating point) type for code that would benefit from that.
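To see which widths a given compiler and target actually use, something like the following C++ sketch can help. It is not D: long double only stands in for D's real, and it is only 80-bit extended on targets where the compiler maps it to the x87 format (some compilers make it the same as double). The variable names are made up; numeric_limits and FLT_EVAL_METHOD are standard.

```cpp
#include <cfloat>
#include <limits>

// Significand width in bits, counting the implicit leading bit.
constexpr int float_bits = std::numeric_limits<float>::digits;
constexpr int double_bits = std::numeric_limits<double>::digits;
constexpr int extended_bits = std::numeric_limits<long double>::digits;

// How the compiler evaluates intermediates: 0 means in the operand type,
// 1 means at least double, 2 means long double, -1 means indeterminable.
constexpr int eval_method = FLT_EVAL_METHOD;
```

On IEEE 754 hardware float_bits is 24 and double_bits is 53; extended_bits is 64 where long double is the x87 80-bit format. A value of 2 for eval_method corresponds to the "keep intermediates in the 80 bit registers" behaviour discussed here.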

I don't see anywhere in that Intel doc. where it says that 80 bit floating point register operations are "deprecated".

For operations (and compilers) that take advantage of SIMD instructions, it is probably best to stick to 32 or 64 bit floating point types for code that can be vectorized. From what I've seen, contemporary compilers often don't do better than a mediocre job of vectorizing and often fall back to using the 80 bit floating point register operations.

SSE2 optimization

The only reason Intel is keeping around the x87 math instructions is for backwards compatibility. Their documentation recommends switching to SSE and SSE2 for floating point functionality. Compiler optimizations that use SSE2 are now a reality (e.g. MS Visual C++). I and others have noted 50% to 100% speedups in floating point code using these optimizations. I would love to use D for some of my scientific computing, but without these optimizations it's a nonstarter. Contrary to the mantra of some developers, speed does matter. I still have floating point Monte Carlo simulations that take days to run. I wonder if there are any plans for backend support for SSE and/or SSE2 optimization in D?

Comment: The reason 80-bit instructions are being "deprecated" is because they aren't used by most compilers (especially Microsoft). So Intel and AMD are paying less attention to them. The change is driven by compilers, not by chip makers.

reals and ireals support the .re and .im properties. If

real x=7; ireal y=2;

then x.re = 7, x.im = 0, y.re = 0, y.im = 2.

Floating Point Quirks

It is a mistake to assume that in floating point, it is possible to design an algorithm that does not degrade with increased precision.

For instance, many computations (of the Gamma function, for example) rely on series expansions with pre-computed constants in order to calculate the result. It may make a great deal of difference if I use 3.14 for single precision, 3.14159 for double precision, and 3.14159265 for extended precision.
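To put numbers on the pi example, here is a small C++ sketch (not D) that measures how far each truncated constant is from pi as a double. The helper name pi_error is invented for illustration, and acos(-1.0) is used as a convenient stand-in for a correctly rounded pi.

```cpp
#include <cmath>

// Hypothetical helper: absolute error of a truncated approximation of pi,
// measured in double precision.
double pi_error(double approximation) {
    const double pi = std::acos(-1.0);
    return std::fabs(approximation - pi);
}
```

The errors come out at roughly 1.6e-3 for 3.14, 2.7e-6 for 3.14159 and 3.6e-9 for 3.14159265. A series whose constants were tabulated to one of these accuracies will not get any better just because the arithmetic around it is carried out at higher precision.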

Next, IEEE floating point traps are NOT the same as exceptions - sometimes a trap can be signaling and sometimes not. The same goes for NaNs - some signal, some don't.

Here is where you can get into a flame war - GCC has an option, -ffinite-math-only (also implied by -ffast-math), that allows the compiler to assume things like "a==a" is always true. With true IEEE arithmetic, "a==a" is false if 'a' is a NaN. Which is more important to you - fast or IEEE-correct?
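The "a==a" case is easy to try. A minimal C++ sketch (not D), with made-up helper names, assuming the code is compiled without any fast-math or finite-math options:

```cpp
#include <limits>

// Hypothetical helper: true for every ordinary value, false for a NaN
// under IEEE 754 comparison rules.
bool equals_itself(double a) {
    return a == a;
}

// Hypothetical helper: a quiet (non-signaling) NaN.
double quiet_nan() {
    return std::numeric_limits<double>::quiet_NaN();
}
```

With IEEE-conforming compilation, equals_itself(quiet_nan()) is false. If the compiler has been told to assume finite math, it is free to replace the body with "return true", which is exactly the trade-off in question.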

Ditto goes with vectorization primitives. Don't assume that you'll be running on SSE-type hardware - IBM builds lots of number crunchers with AltiVec, and few compilers (even Intel's) are super-good with auto-vectorization.

For many floating point issues, see - it is a pain to read, but full of good info. Also, look into the Fortran2000 community... those people KNOW how to implement floating point...

Cheers, -Andrew <andrew AT>

Floating point evaluation may be different at compile-time and runtime.

The compiler is allowed to evaluate intermediate results at a greater precision than that of the operands. The literal type suffix (like 'f') only indicates the type. The compiler may maintain internally as much precision as possible, for purposes of constant folding. Committing the actual precision of the result is done as late as possible.

For a low-precision constant, put the value into a static, non-const variable. Since this is not really a constant, it cannot be constant folded and is therefore not affected by a possible compile-time increase in precision. However, if it is mixed with a higher-precision value at runtime, an increase in precision will still occur.
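A loose C++ analogue (not D, and not a statement about what any D compiler does) shows the effect of committing a value to a low-precision variable. Here a volatile float plays the role of the static, non-const variable: the compiler has to store the value at float precision and cannot fold it. The helper name widen_stored_float is invented for illustration.

```cpp
// Hypothetical helper: commit a value to float storage, then widen it.
double widen_stored_float(float value) {
    volatile float stored = value;  // the store rounds to float precision
    return stored;                  // widening to double adds no new bits
}
```

widen_stored_float(0.1f) is not equal to the double constant 0.1, because the bits lost when rounding to float do not come back on widening. A value that is exact in float, such as 0.5f, is unchanged.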


See the corresponding page in the D Specification: DigitalMars:d/float.html