Toy floating point
Mathematically, if we had infinite precision, x is 0, 1-x is 1, and we are done. But if we do the calculations with floating point numbers, everything is more complicated. First, if x is not 0 but very small, the results of 1+x and 1-x are also 1. Almost all modern computers use the IEEE standard, but it is very complicated. To understand what is happening we will use a toy version without denormalized numbers, infinities, or NaNs, while trying to keep the part of the behavior that interests us.
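For instance, a quick illustration in Racket, whose literals like 1.0 are IEEE doubles (the exact thresholds are worked out below):

```racket
#lang racket
(= (+ 1.0 1e-20) 1.0)  ; → #t: the 1e-20 is lost entirely
(= (- 1.0 1e-20) 1.0)  ; → #t
```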
The toy version uses:
S|EXP|MANTISSA
- S is for the sign: it uses a single bit, where 0 indicates positive and 1 negative. It doesn't matter too much in what follows, so let's write it directly as + or -.
- EXP stores the exponent: the idea is that the number is SOMETHING·2^EXP. As all of this is in binary, we use SOMETHING·2^EXP instead of SOMETHING·10^EXP. If EXP is positive we have a large number, and if EXP is negative we have a fraction. Since it is a toy version, we will use magic and assume that EXP can be any integer. (In the real version, EXP only takes values in a certain range and also serves to identify infinities and NaNs. [*])
- MANTISSA: the toy version will assume that we have 4 bits ABCD. It is not very important that there are exactly 4 bits, but it is easier to do the calculations with 4 bits than with the 23 or 52 of the real versions.
To normalize a number, we multiply or divide it by 2 until we get a number between 1 (included) and 2 (excluded). When we write it in base 2, the integer part is 1 and the rest is the fractional part. We can then avoid writing the leading 1 because it is always there, and interpret the mantissa ABCD as the number 1.ABCD₂.
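As a sketch of this normalization in Racket (the helper `toy-encode` is invented for this post, and assumes its argument is a positive exact rational that fits the toy format exactly):

```racket
#lang racket
;; Multiply or divide by 2 until the value lands in [1, 2), then read off
;; the exponent and the 4 mantissa bits ABCD (as an integer between 0 and 15).
(define (toy-encode q)
  (define e
    (let loop ([v q] [e 0])
      (cond [(< v 1)  (loop (* v 2) (- e 1))]
            [(>= v 2) (loop (/ v 2) (+ e 1))]
            [else e])))
  (define abcd (* 16 (- (/ q (expt 2 e)) 1)))  ; the fractional part, as 4 bits
  (list e abcd))
(toy-encode 3/4)  ; → '(-1 8)  i.e. +|-1|1000
```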
Examples
Putting all of this together, the number S|EXP|ABCD is

X = S · 2^EXP · 1.ABCD₂
For example +|-3|1001 represents the number
X = + 2^-3 · 1.1001₂ = + 1/8 · (1 + 1·1/2 + 0·1/4 + 0·1/8 + 1·1/16) = 1/8 · 25/16 = 25/128 = 0.1953125
Another example -|2|0011 represents the number
X = - 2^2 · 1.0011₂ = - 4 · (1 + 0·1/2 + 0·1/4 + 1·1/8 + 1·1/16) = - 4 · 19/16 = - 19/4 = - 4.75
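Both examples can be checked with a few lines of Racket; a minimal decoding sketch (the name `toy-value` is invented for this post), taking the mantissa ABCD as an integer between 0 and 15:

```racket
#lang racket
;; Decode S|EXP|ABCD into an exact rational: sign · 2^e · (1 + mantissa/16).
(define (toy-value sign e mantissa)
  (* sign (expt 2 e) (+ 1 (/ mantissa 16))))
(toy-value +1 -3 #b1001)  ; → 25/128  (the first example)
(toy-value -1  2 #b0011)  ; → -19/4   (the second example)
```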
The astute reader will have noticed already that if we normalize the number to be something between 1 and 2, the number can never be 0. In other words, since we always prepend the implicit 1 to ABCD, there is no way to write 0. In the toy version, there is no 0! (In the actual version this case is also stored in EXP, and we have to add it to [*].) For the calculations we will do, the missing 0 is not a big problem, so let's move on.
The numbers closest to 1
With this notation we can't write all the real numbers, only some fractions whose denominator is a power of 2 (this includes the integers). For example, the number 1: we can write it as +|0|0000 because
X = + 2^0 · 1.0000₂ = + 1 · (1 + 0·1/2 + 0·1/4 + 0·1/8 + 0·1/16) = 1 · 16/16 = 1
(It's strange, but all zeros represents not 0 but 1.)
The next number we can write, going up as little as possible, is +|0|0001, which is
X = + 2^0 · 1.0001₂ = + 1 · (1 + 0·1/2 + 0·1/4 + 0·1/8 + 1·1/16) = 1 · 17/16 = 1.0625 = 1 + 0.0625
We cannot write any number between 1 and 1.0625: there are gaps. (In the IEEE formats the mantissa has more bits and the gaps are much smaller, but they are still there.)
Going downward is harder. The numbers of the form +|0|ABCD are between 1 (included) and 2 (excluded), so to get a smaller number we have to lower the exponent and look at the numbers +|-1|ABCD. The largest of these is clearly +|-1|1111, representing the number
X = + 2^-1 · 1.1111₂ = + 1/2 · (1 + 1·1/2 + 1·1/4 + 1·1/8 + 1·1/16) = 1/2 · 31/16 = 31/32 = 0.96875 = 1 - 0.03125
In other words, from 1 there is a gap of 0.0625 = 1/16 upwards and a gap of 0.03125 = 1/32 downwards. The size of the gaps between representable numbers depends on the range where we are working, and as we just saw, it changes exactly at 1.
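Jumping ahead for a moment to real IEEE doubles, the math/flonum library (bundled with the main Racket distribution) shows the same asymmetry around 1.0:

```racket
#lang racket
(require math/flonum)
;; Neighbors of 1.0 among the doubles: the gap upwards is twice the gap down.
(- (flnext 1.0) 1.0)  ; → 2.220446049250313e-16  (= 2^-52)
(- 1.0 (flprev 1.0))  ; → 1.1102230246251565e-16 (= 2^-53)
```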
If 1+x is 1, how much is x?
Actually, we are interested only in the case where x is positive; if x is negative the idea is equivalent. To get a result of 1+x that is not 1, it has to be at least 1+1/16, but in this representation the addition is rounded to the nearest representable number. Halfway between 1 and 1+1/16 is the number 1+1/32. Results below 1+1/32 are rounded to 1, and results above 1+1/32 are rounded to 1+1/16. If the result is exactly 1+1/32 the rounding is not clear; let's suppose that the criterion is to choose the even side, so it is rounded to 1.
Therefore, if 1+x equals 1, x has to be between 0 and almost 1/32. (The case x = 1/32 is doubtful; we would have to read the whole standard to be sure.)
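We can mimic this rounding with exact rationals; a minimal sketch (`round-toy` is a name invented here), relying on the fact that in [1, 2) the representable toy numbers are exactly the multiples of 1/16, and that Racket's `round` resolves ties in favor of the even integer, matching the criterion assumed above:

```racket
#lang racket
(define (round-toy q)  ; valid only for q in [1, 2)
  (* 1/16 (round (* 16 q))))
(round-toy (+ 1 1/64))  ; → 1      (below the halfway point 1+1/32)
(round-toy (+ 1 1/32))  ; → 1      (exactly halfway: 1 is the even side)
(round-toy (+ 1 3/64))  ; → 17/16  (above halfway: rounds up to 1+1/16)
```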
If 1-x is 1, how much is x?
To get a result of 1-x that is not 1, the result has to be at most 1-1/32, after rounding. Halfway between 1 and 1-1/32 is the number 1-1/64. Results greater than 1-1/64 are rounded to 1, and results smaller than 1-1/64 are rounded to 1-1/32. If the result is exactly 1-1/64, it depends again on the rounding details; let's suppose the criterion is again to choose the even side, so this is rounded to 1 too. Therefore, if 1-x equals 1, x has to be between 0 and almost 1/64. (The case x = 1/64 is doubtful; if we had read the whole standard we would know what happens.)

If 1+x is 1, how much is 1-x?
What matters is that the range of x (positive) where 1+x is 1 is different from the range of x in which 1-x is 1, and that's going to cause unexpected results.
x | 1+x | 1-x |
---|---|---|
0 < x < 1/64 | 1 | 1 |
x = 1/64 | 1 | 1? |
1/64 < x < 1/32 | 1 | 1-1/32 |
x = 1/32 | 1? | 1-1/32 |
1/32 < x | 1+1/16 or more | 1-1/32 or less |
So the interesting cases are when x is between 1/64 and 1/32 (and perhaps the edges). In exponential notation, that is between 1/2^6 and 1/2^5. With some optimism, we can write this as 1/2^(4+2) and 1/2^(4+1), remembering that the mantissa has 4 bits. Looking carefully at the calculations we did, we can turn this into a proof and rewrite the bounds as 1/2^(M+2) and 1/2^(M+1), where M is the number of bits we write explicitly in the mantissa.
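Under the same ties-to-even assumption, a short sketch with exact rationals can check the whole table (again, `round-toy` is invented here; it extends the earlier one to the grid below 1, where the step is 1/32):

```racket
#lang racket
;; The toy grid has step 1/16 in [1, 2) and step 1/32 in [1/2, 1); ties go
;; to the even mantissa, which Racket's `round` happens to match.
(define (round-toy q)  ; valid for q in [1/2, 2)
  (define step (if (>= q 1) 1/16 1/32))
  (* step (round (/ q step))))
(for ([x (list 1/128 1/64 3/128 1/32 3/64)])
  (printf "x=~a\t1+x → ~a\t1-x → ~a\n"
          x (round-toy (+ 1 x)) (round-toy (- 1 x))))
;; x=1/128  1+x → 1      1-x → 1
;; x=1/64   1+x → 1      1-x → 1       (the doubtful tie goes to the even side)
;; x=3/128  1+x → 1      1-x → 31/32
;; x=1/32   1+x → 1      1-x → 31/32   (the other doubtful tie)
;; x=3/64   1+x → 17/16  1-x → 15/16
```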
If 1+x is 1, how much is 1-x? (IEEE)
Using the IEEE standard rather than the toy version, the most important difference is that instead of having only 4 bits of mantissa we have 23 bits in single precision (32 bits in total, counting those used to store the sign and exponent) and 52 bits in double precision (64 bits in total). The exponent is no longer magical, and we have to be careful with NaNs, infinities, and zero, but luckily these changes don't affect what we want to see. Assuming the reader has redone the proof following the previous steps: to get a result of 1+x that is 1 while the result of 1-x is not 1, x must be in the following range:

Standard | Total | Mantissa | Xmin | Xmax |
---|---|---|---|---|
Toy | ∞ :) | 4 | 1/2^6 = 1/4 · 1/2^4 = 0.0156 | 1/2^5 = 1/2 · 1/2^4 = 0.0312 |
Single | 32 | 23 | 1/2^25 = 1/4 · 1/2^23 = 2.98·10^-8 | 1/2^24 = 1/2 · 1/2^23 = 5.96·10^-8 |
Double | 64 | 52 | 1/2^54 = 1/4 · 1/2^52 = 5.55·10^-17 | 1/2^53 = 1/2 · 1/2^52 = 1.11·10^-16 |
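Before running the full experiment, a quick sanity check of the Double row in Racket (assuming the default round-to-nearest, ties-to-even mode):

```racket
#lang racket
(= (+ 1.0 (expt 2.0 -54)) 1.0)     ; → #t  (x = Xmin: 1+x collapses to 1)
(= (- 1.0 (expt 2.0 -54)) 1.0)     ; → #t  (a tie, rounded to the even side)
(define x (* 1.5 (expt 2.0 -54)))  ; some x strictly between Xmin and Xmax
(= (+ 1.0 x) 1.0)                  ; → #t
(= (- 1.0 x) 1.0)                  ; → #f  (rounds to 1 - 2^-53 instead)
```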
Experimentally
Let's see the results experimentally, trying the following program in Racket. First we make a generic function that takes values of x between 0 and 1/2^bits, split into 2^extra steps. After that, we test it with double precision numbers (like 1.0) and single precision numbers (like 1.f0).

```racket
#lang racket
(define (almost-one bits extra one)
  (for ([i (in-range (add1 (expt 2 extra)))])
    ;; x runs over 0, 1/2^(bits+extra), ..., 1/2^bits as an exact rational
    (define x (/ i (expt 2 (+ bits extra))))
    (displayln (format "~a/2^~a ~a ~a ~a"
                       (~r (* x (expt 2 bits)) #:precision '(= 4))
                       bits
                       (= (+ one x) one)   ; does 1+x collapse to 1?
                       (= (- one x) one)   ; does 1-x collapse to 1?
                       (~r x #:notation 'exponential)))))
(almost-one 52 4 1.0)
(newline)
(almost-one 23 4 1.f0)
```
The results are:

Double precision (64 bits), (almost-one 52 4 1.0):

```
0.0000/2^52 #t #t 0e+00
0.0625/2^52 #t #t 1.387779e-17
0.1250/2^52 #t #t 2.775558e-17
0.1875/2^52 #t #t 4.163336e-17
0.2500/2^52 #t #t 5.551115e-17
0.3125/2^52 #t #f 6.938894e-17
0.3750/2^52 #t #f 8.326673e-17
0.4375/2^52 #t #f 9.714451e-17
0.5000/2^52 #t #f 1.110223e-16
0.5625/2^52 #f #f 1.249001e-16
0.6250/2^52 #f #f 1.387779e-16
0.6875/2^52 #f #f 1.526557e-16
0.7500/2^52 #f #f 1.665335e-16
0.8125/2^52 #f #f 1.804112e-16
0.8750/2^52 #f #f 1.94289e-16
0.9375/2^52 #f #f 2.081668e-16
1.0000/2^52 #f #f 2.220446e-16
```

Single precision (32 bits), (almost-one 23 4 1.f0):

```
0.0000/2^23 #t #t 0e+00
0.0625/2^23 #t #t 7.450581e-09
0.1250/2^23 #t #t 1.490116e-08
0.1875/2^23 #t #t 2.235174e-08
0.2500/2^23 #t #t 2.980232e-08
0.3125/2^23 #t #f 3.72529e-08
0.3750/2^23 #t #f 4.470348e-08
0.4375/2^23 #t #f 5.215406e-08
0.5000/2^23 #t #f 5.960464e-08
0.5625/2^23 #f #f 6.705523e-08
0.6250/2^23 #f #f 7.450581e-08
0.6875/2^23 #f #f 8.195639e-08
0.7500/2^23 #f #f 8.940697e-08
0.8125/2^23 #f #f 9.685755e-08
0.8750/2^23 #f #f 1.043081e-07
0.9375/2^23 #f #f 1.117587e-07
1.0000/2^23 #f #f 1.192093e-07
```
If we increase the extra value, we see that the maximum is always at 1/2 and the minimum gets closer and closer to 1/4 (both in units of 1/2^bits), as we marked in the table above.
To think about
- What about (almost-one 52 4 0.5)?
- What about (almost-one 52 4 1.5)?
- What about (almost-one 52 4 1)?
Edit: Thanks to Colin Wright for many comments and corrections on the first published version.