### Numerical Analysis and Optimization | Fundamentals

## Floating Point Arithmetic

We live in a continuous world with infinitely many real numbers, but a computer has only a finite number of bits, so real numbers must be represented approximately. Several representations of real numbers have been suggested in the past, but by far the most widely used today is the floating point representation. Each floating point representation has a base $\beta$ (always assumed to be even), which is typically 2 (binary), 8 (octal), 10 (decimal), or 16 (hexadecimal), and a precision $p$, which is the number of digits (of base $\beta$) held in a floating point number. For example, if $\beta=10$ and $p=5$, the number $0.1$ is represented as $1.0000 \times 10^{-1}$. On the other hand, if $\beta=2$ and $p=20$, the decimal number $0.1$ cannot be represented exactly but is approximately $1.1001100110011001100 \times 2^{-4}$.

We can write the representation as $\pm d_{0}.d_{1} \cdots d_{p-1} \times \beta^{e}$, where $d_{0}.d_{1} \cdots d_{p-1}$ is called the significand (or mantissa) and has $p$ digits, and $e$ is the exponent. If the leading digit $d_{0}$ is non-zero, the number is said to be normalized. More precisely, $\pm d_{0}.d_{1} \cdots d_{p-1} \times \beta^{e}$ is the number
$$\pm\left(d_{0}+d_{1} \beta^{-1}+d_{2} \beta^{-2}+\cdots+d_{p-1} \beta^{-(p-1)}\right) \beta^{e}, 0 \leq d_{i}<\beta$$
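The digit expansion above can be checked with exact rational arithmetic. The following is a minimal sketch, assuming a positive input and truncation (not rounding) after $p$ digits; the function name `truncated_digits` is hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def truncated_digits(x, beta, p):
    """Return ([d0, ..., d_{p-1}], e) with x ~ d0.d1...d_{p-1} * beta**e.

    Assumes x > 0. Digits beyond position p-1 are dropped (truncation).
    """
    x = Fraction(x)
    e = 0
    # Normalize so that 1 <= x < beta, which makes d0 non-zero.
    while x >= beta:
        x /= beta
        e += 1
    while x < 1:
        x *= beta
        e -= 1
    digits = []
    for _ in range(p):
        d = int(x)            # integer part is the next digit
        digits.append(d)
        x = (x - d) * beta    # shift the remaining fraction one place left
    return digits, e

digits, e = truncated_digits(Fraction(1, 10), beta=2, p=20)
print("".join(map(str, digits)), e)
```

With $\beta=2$ and $p=20$ this yields the digits `11001100110011001100` and exponent $-4$, matching the approximation of $0.1$ quoted in the text; with $\beta=10$ and $p=5$ it yields `10000` and exponent $-1$.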
If the exponents of two floating point numbers are the same, they are said to be of the same magnitude. Let’s look at two floating point numbers of the same magnitude which also have the same digits apart from the digit in position $p$, which has index $p-1$. We assume that they only differ by one in that digit. These floating point numbers are neighbours in the representation and differ by
$$1 \times \beta^{-(p-1)} \times \beta^{e}=\beta^{e-p+1} .$$
Thus, if the exponent is large, the difference between neighbouring floating point numbers is large, while if the exponent is small, the difference is small. In other words, floating point numbers are denser near zero.
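The spacing formula $\beta^{e-p+1}$ can be compared against actual neighbouring numbers. The sketch below assumes Python's `float` is an IEEE double ($\beta=2$, $p=53$), which holds on typical platforms, and considers only positive normalized values; `predicted_gap` and `next_up` are hypothetical helper names.

```python
import math
import struct

P = 53  # precision of an IEEE double (assumption: Python float is a double)

def predicted_gap(x):
    """Spacing beta**(e - p + 1) for a positive normalized double x."""
    # frexp returns x = m * 2**n with 0.5 <= m < 1, so the normalized
    # exponent (significand in [1, 2)) is e = n - 1.
    _, n = math.frexp(x)
    e = n - 1
    return 2.0 ** (e - P + 1)

def next_up(x):
    """Next larger double after a positive finite x, via its bit pattern."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<q", bits + 1))
    return y

for x in (1e-5, 1.0, 1000.0):
    print(x, next_up(x) - x, predicted_gap(x))
```

For each sample value the measured gap to the next double agrees with the predicted one, and the gap grows with the exponent.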

## Overflow and Underflow

Both overflow and underflow present difficulties, but in rather different ways. The representation of the exponent in the IEEE binary standard is chosen with this in mind. It uses a biased representation (as opposed to sign/magnitude and two's complement, for which see [12] I. Koren, Computer Arithmetic Algorithms). In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision, which uses 11 bits, it is 1023). If the exponent bits are interpreted as an unsigned integer $k$, then the exponent of the floating point number is $k-127$. This is often called the unbiased exponent to distinguish it from the biased exponent $k$.
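The relation between the stored exponent $k$ and the unbiased exponent $k-127$ can be read off from the bit pattern. This is a minimal sketch using Python's standard `struct` module to round a value to single precision; the function name `exponent_fields` is hypothetical, and the formula applies to normalized numbers only.

```python
import struct

BIAS = 127  # single-precision exponent bias

def exponent_fields(x):
    """Return (biased exponent k, unbiased exponent k - BIAS) for x
    after conversion to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    k = (bits >> 23) & 0xFF  # the 8 exponent bits sit above the 23 fraction bits
    return k, k - BIAS

for x in (0.15625, 1.0, 8.0):
    print(x, exponent_fields(x))
```

For example, $1.0 = 1.0 \times 2^{0}$ is stored with $k=127$, and $8.0 = 1.0 \times 2^{3}$ with $k=130$.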

In single precision the maximum and minimum allowable values for the unbiased exponent are $e_{\max }=127$ and $e_{\min }=-126$. The reason for having $\left|e_{\min }\right|<e_{\max }$ is so that the reciprocal of the smallest number (i.e., $1 / 2^{e_{\min }}$ ) will not overflow. However, the reciprocal of the largest number will underflow, but this is considered less serious than overflow.
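The same asymmetry can be observed in double precision, where $e_{\max}=1023$ and $e_{\min}=-1022$. The sketch below assumes Python's `float` is an IEEE double (true on typical platforms) and uses the limits reported by `sys.float_info` rather than single precision values.

```python
import math
import sys

smallest = sys.float_info.min  # smallest positive normalized double, 2**-1022
largest = sys.float_info.max   # largest finite double, just below 2**1024

recip_small = 1.0 / smallest   # 2**1022, still finite: no overflow
recip_large = 1.0 / largest    # about 2**-1024, below the normalized range

print(math.isinf(recip_small))         # reciprocal of the smallest stays finite
print(0.0 < recip_large < smallest)    # reciprocal of the largest underflows
```

The reciprocal of the smallest normalized number is representable, while the reciprocal of the largest number falls below the smallest normalized number.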

The exponents $e_{\max}+1$ and $e_{\min}-1$ are used to encode special quantities, as we will see below. This means that the unbiased exponents range between $e_{\min}-1=-127$ and $e_{\max}+1=128$, whereas the biased exponents range between 0 and 255, which are the non-negative numbers that can be represented using 8 bits. Since floating point numbers are always normalized, the most significant bit of the significand is always 1 when using base $\beta=2$, and thus this bit does not need to be stored. It is known as the hidden bit. Using this trick, the stored significand of the number 1 is entirely zero. However, the significand of the number 0 is also entirely zero. This requires a special convention to distinguish 0 from 1. The method is that an exponent of $e_{\min}-1$ and a significand of all zeros represents 0. The following table shows which other special quantities are encoded using $e_{\max}+1$ and $e_{\min}-1$.
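As an illustration of how the two reserved exponents are used, here is a minimal sketch that classifies a 32-bit single precision pattern according to the standard IEEE 754 conventions: biased exponent 0 encodes zero and subnormal numbers, biased exponent 255 encodes infinity and NaN. The helper names `float_to_bits` and `classify` are hypothetical.

```python
import struct

def float_to_bits(x):
    """Bit pattern of x after conversion to IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def classify(bits):
    """Classify a 32-bit single precision pattern by exponent and fraction."""
    k = (bits >> 23) & 0xFF   # biased exponent (8 bits)
    f = bits & 0x7FFFFF       # stored fraction (23 bits, hidden bit excluded)
    if k == 0:                # unbiased exponent e_min - 1
        return "zero" if f == 0 else "subnormal"
    if k == 255:              # unbiased exponent e_max + 1
        return "infinity" if f == 0 else "NaN"
    return "normal"           # value is (1 + f / 2**23) * 2**(k - 127)

# 1.0 and 0.0 both store an all-zero fraction; only the exponent differs.
print(hex(float_to_bits(1.0)), classify(float_to_bits(1.0)))
print(hex(float_to_bits(0.0)), classify(float_to_bits(0.0)))
```

The stored fraction of $1.0$ is all zeros with $k=127$, whereas $0.0$ has all-zero fraction and $k=0$, which is how the two are told apart.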

## Absolute, Relative Error, Machine Epsilon

Suppose that $x, y$ are real numbers well away from overflow or underflow. Let $x^{*}$ denote the floating-point representation of $x$. We define the absolute error $\epsilon$ by
$$x^{*}=x+\epsilon$$
and the relative error $\delta$ by
$$x^{*}=x(1+\delta)=x+x \delta$$
Thus
$$\epsilon=x \delta \quad \text { or, if } \quad x \neq 0, \quad \delta=\frac{\epsilon}{x} .$$
The absolute and relative error are zero if and only if $x$ can be represented exactly in the chosen floating point representation.
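These definitions can be evaluated exactly with rational arithmetic. The sketch below assumes Python's `float` is an IEEE double and that conversion rounds to the nearest representable value; `representation_errors` is a hypothetical helper name.

```python
from fractions import Fraction

def representation_errors(x):
    """Return (eps, delta) with x* = x + eps = x * (1 + delta),
    where x* is the double nearest to the exact rational x."""
    x = Fraction(x)
    x_star = Fraction(float(x))   # exact rational value of the stored double
    eps = x_star - x
    delta = eps / x if x != 0 else None   # delta is undefined for x = 0
    return eps, delta

print(representation_errors(Fraction(1, 2)))    # exactly representable in binary
print(representation_errors(Fraction(1, 10)))   # not exactly representable
```

Both errors vanish for $1/2$, which has a finite binary expansion, and both are non-zero for $1/10$, which does not.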

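In floating-point arithmetic, relative error seems appropriate because each number is represented to a similar relative accuracy. For example, consider $\beta=10$ and $p=3$ and the numbers $x=1.001 \times 10^{3}$ and $y=1.001 \times 10^{0}$ with representations $x^{*}=1.00 \times 10^{3}$ and $y^{*}=1.00 \times 10^{0}$. For $x$ the absolute error has magnitude $\left|\epsilon_{x}\right|=0.001 \times 10^{3}=1$, and for $y$ it has magnitude $\left|\epsilon_{y}\right|=0.001 \times 10^{0}=0.001$. The absolute errors differ by a factor of $10^{3}$, yet the relative errors have the same magnitude, $\left|\delta_{x}\right|=\left|\delta_{y}\right|=0.001 / 1.001 \approx 10^{-3}$.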
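The decimal example with $\beta=10$ and $p=3$ can be reproduced using Python's standard `decimal` module, which rounds to a chosen number of significant digits. This is a sketch under the assumption of round-half-even rounding; the helper name `errors` is hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN
from fractions import Fraction

# Decimal floating point with precision p = 3 (assumed rounding: half-even).
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

def errors(x_str):
    """Return exact (eps, delta) for the 3-digit representation of x_str."""
    x = Fraction(Decimal(x_str))
    x_star = Fraction(ctx.create_decimal(x_str))  # rounded to 3 digits
    eps = x_star - x
    delta = eps / x
    return eps, delta

eps_x, delta_x = errors("1001")    # x = 1.001 * 10**3
eps_y, delta_y = errors("1.001")   # y = 1.001 * 10**0
print(eps_x, eps_y)      # absolute errors differ by a factor of 1000
print(delta_x, delta_y)  # relative errors coincide
```

Following the sign convention $x^{*}=x+\epsilon$, both errors come out negative here because the representations are smaller than the exact values; the relative errors are identical even though the absolute errors differ by three orders of magnitude.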


