Resolution of Differences Between X86 Linux/glibc Floating-Point to Integer Conversion Behavior Relative to Other Platforms S. R. Whiteley, Whiteley Research Inc., 8/28/01 stevew@wrcad.com Summary: Under Linux/glibc on Intel X86 (and compatible) platforms, the floating point processor is operated by default in extended precision mode. This can lead to observed differences with respect to conversions from floating point to integers from other processors, and from other operating systems such as FreeBSD which operate the X86 floating point processor in double precision mode. In particular, a problem has arisen when porting legacy workstation and scientific software to Linux. Below is a description of a problem encountered in Red Hat Linux 6.0 involving floating point numerics. The problem arose after porting a large CAD application, which has run successfully for many years on Sun, HP, Dec, and FreeBSD workstations. The Linux port had been in existence for some time, however a user reported seeing numerical problems, which were traced to code illustrated by the following example: char *a = "0.3"; double d = atof(a); int i = (int)(1000*d); if (i != 300) printf("Linux result (299)\n"); else printf("Result is 300\n"); The "300" result is obtained on the workstations that were at hand from the vendors listed above, whereas Linux, and later it was learned, Windows, returned an "anomalous" result of 299. Note added: The Windows result was obtained with the mingw port of gcc, which uses MSVCRT.DLL. The version in use at the time returned the "299" result. The version presently in use (mingw runtime 2000-01-22) returns "300". It should be stressed that the code fragment above represents bad programming practice, as it assumes a particular behavior in the floating point to integer conversion which is not guaranteed by any standard. The problem is the buried existence of similar assumptions in legacy workstation software that one day will be ported to Linux, with not entirely successful results. The official specification for C/C++ does not require conditioned numerics. Nevertheless, It seems to have been true that conditioning is employed in UNIX workstations from vendors like Sun, HP, DEC, and others, to specifically eliminate the type of error seen above, providing the intuitive result 1000*(0.3) = 300. Given that many workstation users are physicists or other technical professionals without ingrained numerical pedantry, the use of technically incorrect code like that illustrated above is probably not uncommon. The initial response was to file a bug report with the glibc project, under the (later found to be incorrect) assumption that the underlying common code in the string to floating point conversion routines atof(), strtod(), scanf(), etc. was at fault. Number: 1595 Category: libc Synopsis: sscanf(), atof() conversion to double precision, roundoff error. Environment: Red Hat Linux 6.0, glibc 2.1.1 Description: Double precision returns from scanf() and atof() (at least) lack precision for certain values, causing behavior (i.e. a broken program) that doesn't happen on FreeBSD or Solaris. e.g., sscanf("0.3", "&lf", &d); (int)(1000*d) is 299. The response from the developers contained the following: "0.3 cannot be represented as an exact floating point number. There's nothing we can do." "What you see is the difference in the implementation of the conversion functions. Some implementation use the complex and slow method to ensure that the converted binary number afterwards results to the same output value (after appropriate rounding). This is *not* what ISO C and C++ require (it is required for LISP etc). The implementation in glibc is creating the correct result, i.e., the number with the smallest error from the real result. This is different. I will not change this since a) it is more correct b) it is much faster" At this point, the matter was shelved for several months, until the author was contacted by a software developer from a large EDA tool vendor who had encountered similar problems in porting legacy code to Linux. That individual submitted a request to the glibc developer's email group, which elicited, amongst several replies, the key to the problem (give credit to Ulrich Drepper, drepper@redhat.com). First, the string to double conversion functions are not the issue, and the "complex and slow method..." mentioned above is not the issue. It was accidental that the problem was first noticed in an expression that used a string conversion function, however code like double d = 0.3; int i = (int)(1000*d); exhibits the same behavior. Second, the processor rounding mode is not by itself the key. Setting the Linux rounding mode does not change the result of the indicated integer conversion in the default X86 FPU mode under Linux/glibc. The key is in fact the precision mode of the X86 FPU. The "conditioning", or roundoff error handling is actually implemented in the X86 FPU as expressed by the rounding mode. In both Linux and FreeBSD, the rounding mode is set by default to "round to nearest". That is, the FPU keeps three hidden bits of precision, and after a computation the extra bits are used to set the result to the closest representable number. This is the correct setting for general purpose math and is the default on Linux and FreeBSD X86 systems. The difference between Linux and FreeBSD is that in Linux, the FPU is operated by default in "extended precision" mode, where 80-bit internal registers are used for floating point numbers. In FreeBSD, the default is to use "double precision" mode, where 64-bit precision is retained. Note added: mingw runtime 2000-01-22 for Windows also uses double precison mode by default. Earlier versions apparently used extended precision. The problem illustrated above arises when using extended precision, since the operand is zero-extended from 53 to 64 mantissa bits. The round to nearest operation will choose the nearest representation of the exact result, which is then converted to an integer by truncation. Since the operand is already "exact" to extended precision, the value will not change, and truncation to an integer yields the integer 299, in the example. On the other hand, if double precision is used, the mantissa of the operand is correct to full precision, so that the round-to-nearest operation favors (in this case) the next larger representable number, which is larger than the nearest integer. The result when converting to an integer is in fact the closest integer (300). When FreeBSD is operated in extended precision mode (using the fpsetprec()) function) the results of the example match those of default glibc. Conversely, the example run under Linux/glibc that sets the processor to double precision mode will provide the "intuitive" result of default FreeBSD and the other platforms. The bottom line is this- to port a legacy C/C++ program to Linux, one can add the following to the main function: #ifdef linux #include #endif main(argc, **argv) { #ifdef linux /* This puts the X86 FPU in 64-bit precision mode. The default under Linux is to use 80-bit mode, which produces subtle differences from FreeBSD and other systems, eg, (int)(1000*atof("0.3")) is 300 in 64-bit mode, 299 in 80-bit mode. */ fpu_control_t cw; _FPU_GETCW(cw); cw &= ~_FPU_EXTENDED; cw |= _FPU_DOUBLE; _FPU_SETCW(cw); #endif This code, specific to X86 Linux and gcc, puts the FPU into double precision mode for the duration of the program execution. Caveat: It is unknown if there are subtle effects in glibc math operations resulting from this change. The glibc developers discourage the use of this option, recommending instead that the source code be modified to provide explicit rounding control where needed. Use of double precision mode should probably be employed only when modification of the source code is impractical. According to Intel, reduced precision is used only in in the FADD, FSUB, FMUL, FDIV, and FSQRT instructions. Otherwise, the extended precision is used. This would seem to imply that double precision mode would not affect returns from processor-implemented transcendental functions, for example. However, Bessel functions, for example, and other functions that are implemented in software may become more susceptible to round-off error, depending upon how intermediate results are stored. Since in conventional C/C++ all operands are delivered and returned as 64-bit (or less precise) quantities, it is not clear that the added precision of 80-bit mode is important practically, considering that there are already extra precision "hidden" bits employed for rounding. However, it is very likely that someone can come up with an example of where the extra precision does make an important difference. On the other hand, assuming that most proprietary workstations use 64-bit internal precision, the results should be consistent with the workstations. Additional Information ---------------------- A good tutorial on issues in floating-point arithmetic can be found in David Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys 23, vol 1 (1991-03), pp. 5-48 Of particular interest are the last few pages of the postscript document referenced above (which contain supplemental information apparently not in the original paper), where pitfalls of using extended precision are discussed. In particular, one potential pitfall is comparison between two naively identical results, where at the whim of the compiler one result is computed using internal registers only, and the other result has a double precision intermediate. The comparison can "mysteriously" fail. The gcc compiler, at least recent versions, supports a "long double" data type which, at 12 bytes on X86 systems, represents the full precision of the internal registers. This allows the round to nearest mode to work "correctly" in extended precision mode, i.e., long double d = 0.3; int i = (int)(1000*d) // i = 300 However, atof() and other standard math functions return ordinary doubles, thus the problem remains for returns from these functions. In glibc-2.1, there are long double versions of most if not all of the math functions, suffixed with an 'l', i.e., sinl, expl, etc. There is no mention of this in the man pages on RH 6.0, but the info file for glibc describes these extensions. Also in glibc, an "L" modifier in the printf/scanf functions and friends denotes "long double" when applied to e/f/g. There is no "atofl" function, but there is a strtold function which is supposed to return a long double result. However, in RH 6.0, this seems to be broken. The sscanf("%Ld") function appears to work correctly. // we're using extended precision int i; char *a = "0.3"; long double d; d = strtold(a, 0); printf("%Lg\n", d); // prints "-1.71799e+09" sscanf(a, "%Lf", &d); printf("%Lg\n", d); // prints "0.3" i = (int)(1000*d); // i = 300