## Floating point arithmetic

A computer can store only a finite amount of data, so in general numbers are represented, and arithmetic is carried out, with some error. Often this ‘finite precision’ issue is minor and the theory can largely ignore it (accepting that the computed answer will carry some error). We will typically develop theory without much concern for rounding error unless it really matters.

It is important to be able to recognize rounding error, to understand how it manifests, and to have some intuition for when it matters (e.g. the catastrophic cancellation example above).

Let us define the set of machine numbers to be the number system used by a typical computer/language, that is, the ‘double precision’ numbers (a double in C/C++, and the default numeric type in Python/MATLAB). Such a number is stored in memory in the ‘floating point’ form
$$\text{(base 2)} \quad \pm 1.d_1 d_2 \cdots d_N \times 2^e=\pm\left(1+\sum_{k=1}^N d_k 2^{-k}\right) 2^e, \quad m \leq e \leq M$$
where the $d_k$ are binary digits (zero or one), $N=52$, and $m, M$ are limits for the exponent.
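To make the stored form concrete, the following is a minimal sketch that pulls the sign, the exponent $e$ and the digits $d_1 \cdots d_N$ out of a Python float and then rebuilds the value from the formula above. It assumes the platform stores floats as IEEE 754 doubles, where the exponent is kept with an offset of 1023 (a detail not stated in these notes), and it handles normalized numbers only. The helper names `decompose_double` and `rebuild` are illustrative, not part of any library.

```python
import struct

def decompose_double(x: float):
    """Split a double into its sign bit, exponent e, and N = 52 binary digits."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # Assumption: IEEE 754 layout, which stores e + 1023 (normalized numbers only).
    e = biased_exponent - 1023
    digits = format(fraction, "052b")  # d_1 d_2 ... d_52 as a string of 0s and 1s
    return sign, e, digits

def rebuild(sign: int, e: int, digits: str) -> float:
    """Evaluate +/- (1 + sum_k d_k 2^{-k}) * 2^e from the stored digits."""
    mantissa = 1.0 + sum(int(d) * 2.0 ** -(k + 1) for k, d in enumerate(digits))
    return (-1.0) ** sign * mantissa * 2.0 ** e

sign, e, digits = decompose_double(6.5)   # 6.5 = 1.101 (base 2) x 2^2
print(sign, e, digits[:8])
print(rebuild(sign, e, digits) == 6.5)
```

Running the same round trip on a value such as 0.1 returns the stored double exactly, which shows that the float is already the rounded machine number, not the real number 1/10.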

Further, let us define the ‘rounding’ operation, which maps a real number to the closest number of the floating point form above:
$$\mathrm{fl}(x)=\text{‘nearest’ machine number to } x \in \mathbb{R}.$$
Because there are only $N$ binary digits in a machine number (and the exponent is bounded), the machine numbers form a finite, discrete set. Starting from 1, the first few values are
$$1, \quad 1.\underbrace{00 \cdots 0}_{N-1 \text{ zeros}}1=1+2^{-N}, \quad \cdots$$
The distance from 1 to the next larger machine number is important and has a special name:
$$\text { machine epsilon }=\epsilon_m:=2^{-N} \quad\left(\approx 2.2 \times 10^{-16} \text { for a double }\right)$$
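The ‘rounding error’ incurred by representing a real number $x$ by a machine number $\mathrm{fl}(x)$ is bounded above by half this distance: for $x$ between 1 and 2, $|\mathrm{fl}(x)-x| \leq \epsilon_m / 2$. More generally, as long as $x$ lies within the range of the exponent, the relative error satisfies $|\mathrm{fl}(x)-x| \leq \tfrac{1}{2} \epsilon_m|x|$.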
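The statements about $\epsilon_m$ can be checked directly. The sketch below assumes Python floats are doubles with $N=52$, and uses the standard library's `fractions.Fraction` to measure the rounding error of the decimal 0.1 exactly. The variable names are illustrative.

```python
import sys
from fractions import Fraction

N = 52
eps_m = 2.0 ** -N

# The epsilon reported by the language matches 2^{-N} for a double.
print(eps_m == sys.float_info.epsilon)

# 1 + eps_m is the next machine number after 1, so it is distinguishable from 1 ...
print(1.0 + eps_m > 1.0)
# ... but 1 + eps_m/2 lies halfway between two machine numbers and rounds back to 1.
print(1.0 + eps_m / 2 == 1.0)

# Exact relative rounding error of fl(0.1): Fraction(0.1) is the stored double,
# Fraction(1, 10) is the real number it approximates.
exact = Fraction(1, 10)
stored = Fraction(0.1)
relative_error = abs(stored - exact) / exact
print(float(relative_error))
print(relative_error <= Fraction(eps_m) / 2)
```

The last line confirms the bound of half a machine epsilon for this particular input; it is a spot check, not a proof.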

## Condition

Suppose we wish to solve a problem with an input $x$ and output $f(x)$. If the value of $x$ is changed by an amount $\delta x$ of size $|\delta x| \leq \epsilon$, then the output $f$ changes by an amount $\delta f=f(x+\delta x)-f(x)$.

Conditioning: A problem is called well-conditioned if small changes in the input lead to small changes in the output ($\delta x$ small implies $\delta f$ small, with ‘small’ in whatever sense is relevant).

If the problem is sensitive to small changes in $x$, to the point of computational difficulty, the problem is called ill-conditioned.

For each type of problem, there is a measure of condition (the condition number). Given $\delta x$ of size at most $\epsilon$, we have that
$$\text{relative sensitivity to } \delta x=\sup_{|\delta x| \leq \epsilon}\left|\frac{\delta f / f}{\delta x / x}\right|.$$
Taking the limit as $\epsilon \rightarrow 0$ gives the desired measure of the problem’s sensitivity:
$$\text{(relative) condition number}=\lim_{\epsilon \searrow 0} \sup_{|\delta x| \leq \epsilon}\left|\frac{(f(x+\delta x)-f(x)) / f(x)}{\delta x / x}\right|.$$
The problem is ill-conditioned if this number is large, since then a small error made in the input can lead to a drastic difference in the output.
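The definition can be approximated numerically. The sketch below estimates the supremum by sampling perturbations $\delta x$ with $|\delta x| \leq \epsilon$ for one small, fixed $\epsilon$. This is a heuristic: sampling can only approach the supremum from below, and $\epsilon$ must not be so small that rounding error dominates the difference $f(x+\delta x)-f(x)$. For differentiable $f$ the limit is known to equal $|x f'(x) / f(x)|$ (a standard fact not derived in these notes), which gives reference values for the two test functions. The function name and default parameters are assumptions made for illustration.

```python
import math

def relative_condition_estimate(f, x: float, eps: float = 1e-6, samples: int = 200) -> float:
    """Approximate sup over |dx| <= eps of |(df / f) / (dx / x)| by sampling dx."""
    fx = f(x)
    worst = 0.0
    for i in range(1, samples + 1):
        for direction in (1.0, -1.0):
            dx = direction * eps * i / samples
            ratio = abs(((f(x + dx) - fx) / fx) / (dx / x))
            worst = max(worst, ratio)
    return worst

# sqrt has |x f'(x) / f(x)| = 1/2 for every x > 0: well-conditioned.
print(relative_condition_estimate(math.sqrt, 2.0))
# exp has |x f'(x) / f(x)| = |x|, so sensitivity grows with the size of the input.
print(relative_condition_estimate(math.exp, 3.0))
```

The two printed estimates should be close to 0.5 and 3 respectively.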

Key point (ill-conditioned problems): Unfortunately, the poor condition is inherent to the problem, so even a correct algorithm would likely inherit the same sensitivity. For this reason, ill-conditioned problems are hard to solve numerically (and best avoided if possible!).

For example, consider the problem of evaluating
$$f(x)=\tan x, \quad x \approx \pi / 2 .$$
Suppose we take $x_1=\pi / 2-0.001$ and $x_2=\pi / 2-0.002$. Then
$$\left|x_1-x_2\right|=0.001, \quad\left|f\left(x_1\right)-f\left(x_2\right)\right| \approx 500$$
so the small difference in the $x$-values leads to large differences in $f$.
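The numbers in this example are easy to reproduce. The sketch below evaluates both differences and also evaluates the closed form $|x f'(x) / f(x)|=|2 x / \sin (2 x)|$ for $f=\tan$, which follows from $f'(x)=1 / \cos ^2 x$. This closed form is not given in these notes, and the function name is illustrative.

```python
import math

x1 = math.pi / 2 - 0.001
x2 = math.pi / 2 - 0.002

dx = abs(x1 - x2)
df = abs(math.tan(x1) - math.tan(x2))
print(dx)  # about 0.001
print(df)  # about 500

def tan_condition_number(x: float) -> float:
    """|x f'(x) / f(x)| for f = tan, simplified to |2x / sin(2x)|."""
    return abs(2.0 * x / math.sin(2.0 * x))

# Moderate away from pi/2, large close to it.
print(tan_condition_number(1.0))
print(tan_condition_number(x1))
```

Near $x_1$ the condition number is on the order of $10^3$, so a relative error in the input is amplified by roughly three orders of magnitude in the output.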

