Floating point
From Academic Kids

A floating-point number is a digital representation for a number in a certain subset of the rational numbers, and is often used to approximate an arbitrary real number on a computer. In particular, it represents an integer or fixed-point number (the significand or, informally, the mantissa) multiplied by a base (usually 2 in computers) raised to some integer power (the exponent). When the base is 2, it is the binary analogue of scientific notation (which uses base 10).
A floating-point calculation is an arithmetic calculation done with floating-point numbers; it often involves some approximation or rounding because the result of an operation may not be exactly representable.
A floating-point number a can be represented by two numbers m and e, such that a = m × b^{e}. In any such system we pick a base b (called the base of numeration, also the radix) and a precision p (how many digits to store). The significand m (informally, the mantissa) is a p-digit number of the form ±d.ddd...ddd (each digit being an integer between 0 and b−1 inclusive). If the leading digit of m is nonzero then the number is said to be normalized. Some descriptions use a separate sign bit (s, which represents −1 or +1) and require m to be positive. e is called the exponent.
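As a rough illustration of the decomposition a = m × b^{e} for b = 2, the following Python sketch splits a value into a normalized significand and an exponent. The helper name decompose is hypothetical, and the sketch assumes that Python floats are binary floating-point numbers, as they are on common platforms.

```python
import math

def decompose(a):
    """Return (m, e) with a == m * 2**e and 1 <= |m| < 2.

    Hypothetical helper for illustration; only nonzero finite values
    have a normalized form.
    """
    if a == 0.0 or math.isinf(a) or math.isnan(a):
        raise ValueError("only nonzero finite values can be normalized")
    m, e = math.frexp(a)       # frexp returns 0.5 <= |m| < 1
    return m * 2.0, e - 1      # rescale to the d.ddd... form used in the text

m, e = decompose(6.0)
print(m, e)                    # 1.5 2, since 6 = 1.5 * 2**2
print(math.ldexp(m, e))        # 6.0, the value rebuilt from m and e
```

The leading digit of m is nonzero here, so the result is normalized in the sense described above.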
This scheme allows a large range of magnitudes to be represented within a given size of field, which is not possible in a fixed-point notation.
As an example, a floating-point number with four decimal digits (b = 10, p = 4) and an exponent range of ±4 could be used to represent 43210, 4.321, or 0.0004321, but would not have enough precision to represent 432.123 and 43212.3 (which would have to be rounded to 432.1 and 43210). Of course, in practice, the number of digits is usually larger than four.
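The four-digit decimal example can be tried with Python's decimal module, which implements decimal floating point with a configurable precision. This is only a sketch: the round-half-even rounding mode is an assumption, since the text does not say how the rounding is done.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# b = 10, p = 4, exponent range of +/-4 as in the example above.
# The rounding mode is an assumption; the text does not specify one.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN, Emin=-4, Emax=4)

for text in ["43210", "4.321", "0.0004321", "432.123", "43212.3"]:
    stored = ctx.create_decimal(text)
    status = "exact" if stored == Decimal(text) else "rounded"
    print(text, "->", stored, "(" + status + ")")
```

The first three inputs are stored exactly, while 432.123 and 43212.3 lose their trailing digits.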
In addition, floating-point representations often include the special values +∞, −∞ (positive and negative infinity), and NaN ('Not a Number'). Infinities are used when results are too large to be represented, and NaNs indicate an invalid operation or undefined result.
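A short Python sketch of these special values follows, assuming Python floats are IEEE 754 double-precision numbers (true on common platforms).

```python
import math

big = 1e308
overflowed = big * 10.0               # too large for a double: becomes +infinity
print(overflowed, -overflowed)        # inf -inf

undefined = overflowed - overflowed   # infinity minus infinity is undefined: NaN
print(math.isnan(undefined))          # True
print(undefined == undefined)         # False: a NaN is unequal even to itself
```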
Usage in computing
While in the examples above the numbers are represented in the decimal system (that is, the base of numeration is b = 10), computers usually use the binary system, which means that b = 2. In computers, floating-point numbers are sized by the number of bits used to store them. This size is usually 32 bits or 64 bits, often called "single-precision" and "double-precision" respectively. A few machines offer larger sizes; Intel FPUs such as the Intel 8087 (and its descendants integrated into the x86 architecture) offer 80-bit floating-point numbers for intermediate results, and several systems offer 128-bit floating-point, generally implemented in software. This website (http://babbage.cs.qc.edu/courses/cs341/IEEE754.html) can be used to calculate the floating-point representation of a decimal number.
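The two common sizes can be observed from Python with the standard struct module, whose "f" and "d" format codes pack a value as a 32-bit and a 64-bit float. This sketch assumes a platform where Python floats are IEEE 754 doubles.

```python
import struct

x = 1.0 / 3.0

single = struct.pack(">f", x)    # single precision: 4 bytes
double = struct.pack(">d", x)    # double precision: 8 bytes
print(len(single) * 8, len(double) * 8)   # 32 64

as_single = struct.unpack(">f", single)[0]
as_double = struct.unpack(">d", double)[0]
print(as_double == x)   # True: a Python float is already a double
print(as_single == x)   # False: 32 bits keep fewer significand digits
```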
Problems with floating-point
Floating-point numbers usually behave very similarly to the real numbers they are used to approximate. However, this can easily lead programmers into overconfidently ignoring the need for numerical analysis. There are many cases where floating-point numbers do not model real numbers well, even in simple cases such as representing the decimal fraction 0.1, which cannot be exactly represented in any binary floating-point format. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/
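The case of 0.1 can be inspected directly in Python, where the decimal and fractions modules convert a binary float to its exact stored value. The sketch assumes Python floats are IEEE 754 doubles; the decimal arithmetic at the end is shown only as one example of a non-binary representation.

```python
from decimal import Decimal
from fractions import Fraction

# The exact value stored for the literal 0.1 (slightly more than 1/10).
print(Decimal(0.1))

# The stored value is a fraction whose denominator is a power of two.
print(Fraction(0.1) == Fraction(1, 10))                    # False
print(Fraction(0.1).denominator == 2**55)                  # True

# The small errors surface in ordinary arithmetic.
print(0.1 + 0.2 == 0.3)                                    # False

# A decimal representation holds these particular values exactly.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
```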
Errors in floatingpoint computation can include:
 Rounding
 Non-representable numbers: for example, the literal 0.1 cannot be represented exactly by a binary floating-point number
 Rounding of arithmetic operations: for example 2/3 might yield 0.6666667
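 Absorption: for example, in single precision 1×10^{15} + 1 = 1×10^{15}, because the smaller operand lies below the precision of the larger one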
 Cancellation: subtraction between nearly equivalent operands
 Overflow, which usually yields an infinity
 Underflow (often defined as an inexact tiny result outside the range of the normal numbers for a format), which yields zero, a subnormal number, or the smallest normal number
 Invalid operations (such as an attempt to calculate the square root of a nonzero negative number). Invalid operations yield a result of NaN (not a number).
 Rounding errors: unlike in fixed-point arithmetic, the application of dither in a floating-point environment is nearly impossible. See the external references for more information about the difficulty of applying dither and about rounding-error problems in floating-point systems
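Some of the errors listed above can be reproduced in a few lines of Python. The sketch assumes IEEE 754 double precision with the default round-to-nearest mode, so the operands differ from the single-precision-sized ones a reader might expect: in double precision, absorption of 1 only begins above 2^{53}.

```python
import sys

# Absorption: the smaller operand is lost entirely.
print(1e16 + 1.0 == 1e16)        # True
print(1e15 + 1.0 == 1e15)        # False: a double still has room for the 1

# Cancellation: subtracting nearly equal operands exposes earlier rounding.
diff = (1.0 + 1e-15) - 1.0
print(diff)                      # close to 1e-15, but off by about 11%

# Underflow: results below the normal range become subnormal, then zero.
smallest_normal = sys.float_info.min
print(smallest_normal / 2 > 0.0)   # True: a subnormal number
print(5e-324 / 2 == 0.0)           # True: below the smallest subnormal
```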
Floating point representation is more likely to be appropriate when proportional accuracy over a range of scales is needed. When fixed accuracy is required, fixed point is usually a better choice.
Properties of floating point arithmetic
Arithmetic using the floating point number system has two important properties that differ from those of arithmetic using real numbers.
Floating point arithmetic is not associative. This means that in general for floating point numbers x, y, and z:
 <math> (x + y) + z \neq x + (y + z) </math>
 <math> (x \cdot y) \cdot z \neq x \cdot (y \cdot z) </math>
Floating point arithmetic is also not distributive. This means that in general:
 <math> x \cdot (y + z) \neq (x \cdot y) + (x \cdot z) </math>
In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.
For example, with most floating-point implementations, (1e100 − 1e100) + 1.0 will give the result 1.0, whereas (1e100 + 1.0) − 1e100 gives 0.0.
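The dependence on evaluation order is easy to check in Python, assuming IEEE 754 double precision. The call to math.fsum at the end is not part of the discussion above; it is included only as one example of a summation routine that tracks the lost low-order parts.

```python
import math

# The same three terms, grouped differently.
print((1e100 - 1e100) + 1.0)     # 1.0
print((1e100 + 1.0) - 1e100)     # 0.0: the 1.0 was absorbed by 1e100

# Associativity also fails for everyday values.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)             # False

# math.fsum returns the correctly rounded sum regardless of term order.
print(math.fsum([1e100, 1.0, -1e100]))   # 1.0
```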
IEEE standard
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which have both hexadecimal and IEEE 754 data types, and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
As of 2000, the IEEE 754 standard is under revision. See: IEEE 754r
Examples
 The value of pi, π = 3.1415926..._{10} in decimal, is equivalent to binary 11.001001000011111..._{2}. When represented in a computer that allocates 17 bits for the significand, it becomes 0.11001001000011111 × 2^{2}. Hence the floating-point representation would start with the bits 011001001000011111 (a sign bit followed by the 17 significand bits) and end with the bits 10 (which represent the exponent 2 in the binary system). Note: the first zero indicates a positive number, and the ending 10_{2} = 2_{10}.
 The value of −0.375_{10} = −0.011_{2} or −0.11 × 2^{−1}. In two's complement notation, −1 is represented as 11111111 (assuming 8 bits are used for the exponent). In floating-point notation, the number would start with a 1 for the sign bit, followed by 110000... and then followed by 11111111 at the end, or 1110...011111111 (where ... are zeros).
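The illustrative layout used in these examples (a sign bit, then the significand digits of 0.ddd..., then a two's complement exponent) can be sketched in Python. The function toy_encode is hypothetical. It assumes a 17-bit significand and an 8-bit exponent for every value, and it truncates rather than rounds the significand, which is what yields the digits quoted for π.

```python
import math

def toy_encode(value, sig_bits=17, exp_bits=8):
    """Encode value as sign + significand + two's complement exponent.

    Hypothetical helper mirroring the article's illustrative layout;
    it is not the IEEE 754 encoding.
    """
    if value == 0.0 or math.isinf(value) or math.isnan(value):
        raise ValueError("only nonzero finite values can be encoded")
    sign = "1" if value < 0 else "0"
    m, e = math.frexp(abs(value))      # abs(value) == m * 2**e, 0.5 <= m < 1
    if not -(1 << (exp_bits - 1)) <= e < (1 << (exp_bits - 1)):
        raise OverflowError("exponent does not fit in exp_bits")
    sig = int(m * 2**sig_bits)         # keep sig_bits digits, truncating the rest
    exp = e % (1 << exp_bits)          # two's complement of e in exp_bits bits
    return sign + format(sig, f"0{sig_bits}b") + format(exp, f"0{exp_bits}b")

print(toy_encode(math.pi))   # 0, then 11001001000011111, then 00000010
print(toy_encode(-0.375))    # 1, then 11 followed by zeros, then 11111111
```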
Hidden bit
When using binary (b = 2), one bit, called the hidden bit or the implied bit, can be omitted if all numbers are required to be normalized. The leading digit (most significant bit) of the significand of a normalized binary floating-point number is always nonzero; in particular it is always 1. This means that this bit does not need to be stored explicitly, since for a normalized number it can be understood to be 1.
The IEEE 754 standard exploits this fact. Requiring all numbers to be normalized means that 0 cannot be represented; typically some special representation of zero is chosen. In the IEEE standard this special code also encompasses denormal numbers, which allow for gradual underflow. The normalized numbers are also known as the normal numbers.
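The hidden bit can be made visible by taking an IEEE 754 double apart and putting it back together. In this Python sketch the helper names fields and rebuild are hypothetical; the field widths (1 sign bit, 11 exponent bits with bias 1023, 52 stored fraction bits) are those of the double-precision format.

```python
import math
import struct

def fields(x):
    """Split a double into (sign, biased exponent, stored fraction bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def rebuild(sign, biased_exp, fraction):
    """Recompute the value, restoring the hidden bit for normal numbers."""
    if biased_exp == 0x7FF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    if biased_exp == 0:
        # Zero and denormal numbers: no hidden bit, fixed exponent.
        significand, exponent = fraction, -1074
    else:
        # Normal numbers: the leading 1 is implied, not stored.
        significand, exponent = (1 << 52) | fraction, biased_exp - 1075
    return (-1) ** sign * math.ldexp(significand, exponent)

print(fields(1.0))             # (0, 1023, 0): no fraction bits are set at all
print(rebuild(*fields(0.1)))   # 0.1
```

The stored fraction of 1.0 is all zeros; the value 1 comes entirely from the implied leading bit.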
Note
Note that although the examples in this article use a consistent system of floating-point notation, the notation is different from the IEEE standard. For example, in IEEE 754, the exponent is between the sign bit and the significand, not at the end of the number. Also, the IEEE exponent uses a biased integer instead of a two's complement number. The reader should note that the examples serve the purpose of illustrating how floating-point numbers could be represented, but the actual bits shown in the article are different from those in an IEEE 754-compliant representation. The placement of the bits in the IEEE standard enables two floating-point numbers to be compared bitwise (sans sign bit) to yield a result without interpreting the actual values. The arbitrary system used in this article cannot do the same.
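The ordering property of the IEEE layout can be checked with a small Python sketch. It assumes IEEE 754 doubles and restricts itself to non-negative values, in line with the "sans sign bit" caveat; the helper name bit_pattern is hypothetical.

```python
import struct

def bit_pattern(x):
    """Return the 64 bits of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# Non-negative values, including zero, a denormal and infinity.
values = [1e100, 0.375, float("inf"), 0.0, 3.14, 5e-324, 1.0, 1e-10]

by_bits = sorted(values, key=bit_pattern)   # compare raw bit patterns only
by_value = sorted(values)                   # compare as numbers
print(by_bits == by_value)                  # True
```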
See also
 Fixed-point arithmetic
 Computable number
 IEEE Floating Point Standard
 IBM Floating Point Architecture
 FLOPS
References
 An edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic (http://docs.sun.com/source/8063568/ncg_goldberg.html), by David Goldberg, published in the March 1991 issue of Computing Surveys.
 David Bindel’s Annotated Bibliography (http://www.cs.berkeley.edu/~dbindel/class/cs279/dsbbib.pdf) on computer support for scientific computation.
 Kahan, William and Darcy, Joseph (2001). How Java’s floating-point hurts everyone everywhere. Retrieved Sep. 5, 2003 from http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
 Introduction to Floating point calculations and IEEE 754 standard (http://www.geocities.com/SiliconValley/Pines/6639/docs/fp_summary.html) by Jamil Khatib