We now start with a formal definition of a probability space and related terms from measure theory [2].
Definition 1.9 (Probability Space) A probability space is a triple $(\Omega, \mathcal{F}, \mu)$ consisting of the sample space $\Omega$, an event space $\mathcal{F}$ composed of subsets of $\Omega$ (often called a $\sigma$-algebra), and the probability measure (or distribution) $\mu: \mathcal{F} \mapsto[0,1]$, a function such that:
$\mu$ must satisfy the countable additivity property that for all countable collections $\left\{E_i\right\}$ of pairwise disjoint sets: $$ \mu\left(\cup_i E_i\right)=\sum_i \mu\left(E_i\right) ; $$
the measure of the entire sample space is equal to one: $\mu(\Omega)=1$.
In fact, the probability measure is a special case of the general “measure” in measure theory [2]. Specifically, the general term “measure” is defined similarly to the probability measure above, except that only non-negativity and the countable additivity property are required. Another important special case of a measure is the counting measure $\nu(A)$, which assigns to a set $A$ the number of elements in $A$.
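As a minimal numerical illustration (added here, not from the original text, and assuming a finite sample space so that countable additivity reduces to finite additivity), the following Python sketch encodes the probability measure of a fair die together with the counting measure and checks the two defining properties.

```python
from fractions import Fraction

# A finite sample space (a fair six-sided die) with its probability measure.
omega = {1, 2, 3, 4, 5, 6}
mu = {w: Fraction(1, 6) for w in omega}          # probability of each outcome

def measure(event):
    """Probability measure of an event (a subset of omega)."""
    return sum(mu[w] for w in event)

def counting_measure(event):
    """Counting measure: the number of elements in the event."""
    return len(event)

# Total mass is one.
assert measure(omega) == 1

# (Finite) additivity for pairwise disjoint events.
E1, E2 = {1, 2}, {5}
assert measure(E1 | E2) == measure(E1) + measure(E2)

print(measure({2, 4, 6}), counting_measure({2, 4, 6}))   # 1/2 and 3
```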
To understand the concept of a probability space, we give two examples: one for the discrete case, the other for the continuous one.
Some Matrix Algebra
In the following, we introduce some matrix algebra that is useful in understanding the materials in this book.
A matrix is a rectangular array of numbers, denoted by an upper case letter, say $\boldsymbol{A}$. A matrix with $m$ rows and $n$ columns is called an $m \times n$ matrix, given by $$ \boldsymbol{A}=\left[\begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1 n} \\ a_{21} & a_{22} & \cdots & a_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m 1} & a_{m 2} & \cdots & a_{m n} \end{array}\right] . $$ The $k$-th column of matrix $\boldsymbol{A}$ is often denoted by $\boldsymbol{a}_k$. The maximal number of linearly independent columns of $\boldsymbol{A}$ is called the rank of the matrix $\boldsymbol{A}$. It is easy to show that $$ \operatorname{Rank}(\boldsymbol{A})=\operatorname{dim} \operatorname{span}\left(\left[\boldsymbol{a}_1, \cdots, \boldsymbol{a}_n\right]\right) . $$ The trace of a square matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$, denoted $\operatorname{Tr}(\boldsymbol{A})$, is defined to be the sum of the elements on the main diagonal (from the upper left to the lower right) of $\boldsymbol{A}$: $$ \operatorname{Tr}(\boldsymbol{A})=\sum_{i=1}^n a_{i i} . $$ Definition 1.11 (Range Space) The range space of a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, denoted by $\mathcal{R}(\boldsymbol{A})$, is defined by $\mathcal{R}(\boldsymbol{A}):=\left\{\boldsymbol{A} \boldsymbol{x} \mid \boldsymbol{x} \in \mathbb{R}^n\right\}$.
Definition 1.12 (Null Space) The null space of a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, denoted by $\mathcal{N}(\boldsymbol{A})$, is defined by $\mathcal{N}(\boldsymbol{A}):=\left\{\boldsymbol{x} \in \mathbb{R}^n \mid \boldsymbol{A} \boldsymbol{x}=\mathbf{0}\right\}$.
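The following Python sketch (an illustration added here, assuming NumPy and SciPy are available) computes the rank, the trace, and an orthonormal basis of the null space of a small rank-deficient matrix.

```python
import numpy as np
from scipy.linalg import null_space  # assumes SciPy is installed

# A rank-deficient 3x3 matrix: the third column is the sum of the first two.
A = np.array([[1., 2., 3.],
              [4., 5., 9.],
              [7., 8., 15.]])

print(np.linalg.matrix_rank(A))   # rank = dim span of the columns -> 2
print(np.trace(A))                # sum of diagonal entries -> 21.0

# An orthonormal basis of the null space N(A): vectors x with A @ x = 0.
N = null_space(A)
print(np.allclose(A @ N, 0))      # True

# The range space R(A) is spanned by any maximal set of independent columns.
```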
Banach and Hilbert Space
An inner product space is defined as a vector space that is equipped with an inner product. A normed space is a vector space on which a norm is defined. An inner product space is always a normed space, since we can define a norm as $\|\boldsymbol{f}\|=\sqrt{\langle\boldsymbol{f}, \boldsymbol{f}\rangle}$, which is often called the induced norm. Among the various forms of normed spaces, one of the most useful is the Banach space.

Definition 1.7 The Banach space is a complete normed space.

Here, the “completeness” is especially important from the optimization perspective, since most optimization algorithms are implemented in an iterative manner, so that the final solution of the iterative method should belong to the underlying space $\mathcal{H}$. Recall that convergence is a property of a metric space. Therefore, the Banach space can be regarded as a vector space equipped with desirable properties of a metric space. Similarly, we can define the Hilbert space.

Definition 1.8 The Hilbert space is a complete inner product space.

We can easily see that the Hilbert space is also a Banach space thanks to the induced norm. The inclusion relationship between vector spaces, normed spaces, inner product spaces, Banach spaces and Hilbert spaces is illustrated in Fig. 1.1. As shown in Fig. 1.1, the Hilbert space has many nice mathematical structures such as inner product, norm, completeness, etc., so it is widely used in the machine learning literature. The following are well-known examples of Hilbert spaces:
$l^2(\mathbb{Z})$ : a function space composed of square summable discrete-time signals, i.e. $$ l^2(\mathbb{Z})=\left\{x=\left\{x_l\right\}_{l=-\infty}^{\infty} \;\Big|\; \sum_{l=-\infty}^{\infty}\left|x_l\right|^2<\infty\right\} . $$
Basis and Frames
The set of vectors $\left\{\boldsymbol{x}_1, \cdots, \boldsymbol{x}_k\right\}$ is said to be linearly independent if a linear combination $$ \alpha_1 \boldsymbol{x}_1+\alpha_2 \boldsymbol{x}_2+\cdots+\alpha_k \boldsymbol{x}_k=\mathbf{0} $$ implies that $$ \alpha_i=0, \quad i=1, \cdots, k . $$ The set of all vectors reachable by taking linear combinations of vectors in a set $\mathcal{S}$ is called the span of $\mathcal{S}$. For example, if $\mathcal{S}=\left\{\boldsymbol{x}_i\right\}_{i=1}^k$, then we have $$ \operatorname{span}(\mathcal{S})=\left\{\sum_{i=1}^k \alpha_i \boldsymbol{x}_i, \ \forall \alpha_i \in \mathbb{R}\right\} . $$ A set $\mathcal{B}=\left\{\boldsymbol{b}_i\right\}_{i=1}^m$ of elements (vectors) in a vector space $\mathcal{V}$ is called a basis if every element of $\mathcal{V}$ may be written in a unique way as a linear combination of elements of $\mathcal{B}$, that is, for all $\boldsymbol{f} \in \mathcal{V}$, there exist unique coefficients $\left\{c_i\right\}$ such that $$ \boldsymbol{f}=\sum_{i=1}^m c_i \boldsymbol{b}_i . $$ A set $\mathcal{B}$ is a basis of $\mathcal{V}$ if and only if the elements of $\mathcal{B}$ are linearly independent and $\operatorname{span}(\mathcal{B})=\mathcal{V}$. The coefficients of this linear combination are referred to as expansion coefficients, or coordinates on $\mathcal{B}$, of the vector. The elements of a basis are called basis vectors. In general, for an $m$-dimensional space, the number of basis vectors is $m$. For example, when $\mathcal{V}=\mathbb{R}^2$, the following two sets are examples of a basis: $$ \left\{\left[\begin{array}{l} 1 \\ 0 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \end{array}\right]\right\}, \quad\left\{\left[\begin{array}{c} 1 \\ 1 \end{array}\right],\left[\begin{array}{c} 1 \\ -1 \end{array}\right]\right\} . $$
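As a small illustration (added here, not part of the original text), the following Python sketch computes the expansion coefficients of a vector of $\mathbb{R}^2$ on the two example bases above by solving the linear system $\boldsymbol{B}\boldsymbol{c}=\boldsymbol{f}$, where the columns of $\boldsymbol{B}$ are the basis vectors.

```python
import numpy as np

# Two bases of R^2 from the text: the standard basis and {(1,1), (1,-1)}.
B1 = np.array([[1., 0.],
               [0., 1.]])          # columns are the basis vectors
B2 = np.array([[1., 1.],
               [1., -1.]])

f = np.array([3., 1.])             # an arbitrary vector of R^2

# Expansion coefficients c solve B c = f; they are unique since B has full rank.
c1 = np.linalg.solve(B1, f)        # -> [3., 1.]
c2 = np.linalg.solve(B2, f)        # -> [2., 1.]   since f = 2*(1,1) + 1*(1,-1)

print(c1, c2)
print(np.allclose(B2 @ c2, f))     # True: f is recovered from its coordinates
```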
A metric space $(\mathcal{X}, d)$ is a set $\mathcal{X}$ together with a metric $d$ on the set. Here, a metric is a function that defines a concept of distance between any two members of the set, which is formally defined as follows.
Definition 1.1 (Metric) A metric on a set $\mathcal{X}$ is a function called the distance $d: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_{+}$, where $\mathbb{R}_{+}$ is the set of non-negative real numbers. For all $x, y, z \in \mathcal{X}$, this function is required to satisfy the following conditions:
$d(x, y) \geq 0$ (non-negativity).
$d(x, y)=0$ if and only if $x=y$.
$d(x, y)=d(y, x)$ (symmetry).
$d(x, z) \leq d(x, y)+d(y, z)$ (triangle inequality). A metric on a space induces topological properties like open and closed sets, which lead to the study of more abstract topological spaces. Specifically, about any point $x$ in a metric space $\mathcal{X}$, we define the open ball of radius $r>0$ about $x$ as the set $$ B_r(x)=\{y \in \mathcal{X}: d(x, y)<r\} . $$ A subset $U$ of $\mathcal{X}$ is called open if for every point $x \in U$ there exists $r>0$ such that $B_r(x)$ is contained in $U$. The complement of an open set is called closed.
A sequence $\left(x_n\right)$ in a metric space $\mathcal{X}$ is said to converge to the limit $x \in \mathcal{X}$ if and only if for every $\varepsilon>0$, there exists a natural number $N$ such that $d\left(x_n, x\right)<\varepsilon$ for all $n>N$. A subset $\mathcal{S}$ of the metric space $\mathcal{X}$ is closed if and only if every sequence in $\mathcal{S}$ that converges to a limit in $\mathcal{X}$ has its limit in $\mathcal{S}$. In addition, a sequence of elements $\left(x_n\right)$ is a Cauchy sequence if and only if for every $\varepsilon>0$, there is some $N \geq 1$ such that $$ d\left(x_n, x_m\right)<\varepsilon, \quad \forall m, n \geq N . $$ We are now ready to define the important concepts in metric spaces.
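The following Python sketch (an added illustration, using the Euclidean metric on $\mathbb{R}^5$ as a concrete example) numerically checks the metric axioms for a random triple of points and the $\varepsilon$-$N$ convergence criterion for a simple sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 5))

def d(u, v):
    """Euclidean metric on R^5."""
    return np.linalg.norm(u - v)

# Metric axioms, checked numerically for this particular triple of points.
assert d(x, y) >= 0 and np.isclose(d(x, x), 0.0)
assert np.isclose(d(x, y), d(y, x))                  # symmetry
assert d(x, z) <= d(x, y) + d(y, z) + 1e-12          # triangle inequality

# The sequence x_n = x + z / n converges to x: for every eps > 0 there is an N
# with d(x_n, x) < eps for all n > N (here N is exhibited explicitly).
eps = 1e-2
N = int(np.ceil(np.linalg.norm(z) / eps))
assert all(d(x + z / n, x) < eps for n in range(N + 1, N + 100))
print("d(x_n, x) <", eps, "for all n >", N)
```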
Vector Space
A vector space $\mathcal{V}$ is a set that is closed under finite vector addition and scalar multiplication. In machine learning applications, the scalars are usually real or complex numbers, in which case $\mathcal{V}$ is called a vector space over the real numbers or over the complex numbers, respectively.
For example, the Euclidean $n$-space $\mathbb{R}^n$ is called a real vector space, and $\mathbb{C}^n$ is called a complex vector space. In the $n$-dimensional Euclidean space $\mathbb{R}^n$, every element is represented by a list of $n$ real numbers, addition is component-wise, and scalar multiplication is multiplication on each term separately. More specifically, we define a column $n$-real-valued vector $\boldsymbol{x}$ to be an array of $n$ real numbers, denoted by $$ \boldsymbol{x}=\left[\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array}\right]=\left[\begin{array}{llll} x_1 & x_2 & \cdots & x_n \end{array}\right]^{\top} \in \mathbb{R}^n, $$
where the superscript ${ }^{\top}$ denotes the adjoint. Note that for a real vector, the adjoint is just a transpose. Then, the sum of the two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, denoted by $\boldsymbol{x}+\boldsymbol{y}$, is defined by $$ \boldsymbol{x}+\boldsymbol{y}=\left[\begin{array}{llll} x_1+y_1 & x_2+y_2 & \cdots & x_n+y_n \end{array}\right]^{\top} . $$ Similarly, the scalar multiplication with a scalar $\alpha \in \mathbb{R}$ is defined by $$ \alpha \boldsymbol{x}=\left[\begin{array}{llll} \alpha x_1 & \alpha x_2 & \cdots & \alpha x_n \end{array}\right]^{\top} . $$ In addition, we formally define the inner product and the norm in a vector space as follows.
Definition 1.5 (Inner Product) Let $\mathcal{V}$ be a vector space over $\mathbb{R}$. A function $\langle\cdot, \cdot\rangle_{\mathcal{V}}: \mathcal{V} \times \mathcal{V} \mapsto \mathbb{R}$ is an inner product on $\mathcal{V}$ if:
Symmetric: $\langle\boldsymbol{f}, \boldsymbol{g}\rangle_{\mathcal{V}}=\langle\boldsymbol{g}, \boldsymbol{f}\rangle_{\mathcal{V}}$ for all $\boldsymbol{f}, \boldsymbol{g} \in \mathcal{V}$.
Linear: $\left\langle\alpha_1 \boldsymbol{f}_1+\alpha_2 \boldsymbol{f}_2, \boldsymbol{g}\right\rangle_{\mathcal{V}}=\alpha_1\left\langle\boldsymbol{f}_1, \boldsymbol{g}\right\rangle_{\mathcal{V}}+\alpha_2\left\langle\boldsymbol{f}_2, \boldsymbol{g}\right\rangle_{\mathcal{V}}$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $\boldsymbol{f}_1, \boldsymbol{f}_2, \boldsymbol{g} \in \mathcal{V}$.
$\langle\boldsymbol{f}, \boldsymbol{f}\rangle_{\mathcal{V}} \geq 0$ and $\langle\boldsymbol{f}, \boldsymbol{f}\rangle_{\mathcal{V}}=0$ if and only if $\boldsymbol{f}=\mathbf{0}$. If the underlying vector space $\mathcal{V}$ is obvious, we usually represent the inner product without the subscript $\mathcal{V}$, i.e. $\langle\boldsymbol{f}, \boldsymbol{g}\rangle$. For example, the inner product of the two vectors $\boldsymbol{f}, \boldsymbol{g} \in \mathbb{R}^n$ is defined as $$ \langle\boldsymbol{f}, \boldsymbol{g}\rangle=\sum_{i=1}^n f_i g_i=\boldsymbol{f}^{\top} \boldsymbol{g} . $$ Two nonzero vectors $\boldsymbol{x}, \boldsymbol{y}$ are called orthogonal when $$ \langle\boldsymbol{x}, \boldsymbol{y}\rangle=0 . $$
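As a quick numerical illustration (added here), the sketch below evaluates the inner product $\boldsymbol{f}^{\top}\boldsymbol{g}$ of two vectors of $\mathbb{R}^3$, checks their orthogonality, and verifies that the induced norm $\sqrt{\langle\boldsymbol{f},\boldsymbol{f}\rangle}$ coincides with the usual Euclidean norm.

```python
import numpy as np

f = np.array([1., 2., 2.])
g = np.array([2., -1., 0.])

# Inner product on R^n and the induced norm ||f|| = sqrt(<f, f>).
inner = f @ g                      # equals f^T g = sum_i f_i g_i
norm_f = np.sqrt(f @ f)            # 3.0, same as np.linalg.norm(f)

print(inner)                       # 0.0 -> f and g are orthogonal
print(np.isclose(norm_f, np.linalg.norm(f)))   # True
```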
Suppose the prediction target is whether a ship is detained in an inspection, a binary variable with “1” indicating ship detention and “0” otherwise. Neither the simple nor the multiple linear regression model can be directly applied to this classification problem, as their output is continuous and unbounded, while we expect the output to be categorical and bounded. An intuitive method is to predict the probability of $y=1$ given the input features $\mathbf{x}$, i.e., $P(y=1 \mid \mathbf{x})$, and then apply a threshold. The unit-step function is a popular method to map a continuous output (denoted by $z$) to a probability (denoted by $\widetilde{y}$), which takes the form shown in Figure 5.1.
However, Figure 5.1 shows that the final output given by the unit-step function is discontinuous, which makes it hard to optimize. Therefore, a continuous, monotonic, and differentiable surrogate of the unit-step function called the logistic function is used, which takes the following form: $$ \widetilde{y}=\frac{1}{1+e^{-z}}, $$ where $z=\tilde{\boldsymbol{x}} \tilde{\boldsymbol{w}}$ is the continuous output given by a multiple linear regression model. An illustration of the logistic function is shown in Figure 5.2. Equation (5.12) can also be transformed as follows:
$$ \begin{aligned} \widetilde{y} & =\frac{1}{1+e^{-z}} \\ \Rightarrow z & =\tilde{\boldsymbol{x}} \tilde{\boldsymbol{w}}=\ln \frac{\widetilde{y}}{1-\widetilde{y}} . \end{aligned} $$ In Equation (5.13), $\widetilde{y}$ is the probability that a sample with features $\mathbf{x}$ is of class “1” and $1-\widetilde{y}$ is the probability that it is of class “0”. Therefore, $\frac{\widetilde{y}}{1-\widetilde{y}}$ is the relative probability of sample $\mathbf{x}$ being of class “1”, which is called the odds. $\ln \frac{\widetilde{y}}{1-\widetilde{y}}$ is the natural log of the odds, called the log odds, or logit. Therefore, Equation (5.13) can be interpreted as using the output of a multiple linear regression model to approximate the log odds, so as to map a continuous target to a probability.
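The logistic–logit relation in Equations (5.12)–(5.13) can be verified numerically; the following short sketch (added here for illustration, with arbitrary values of $z$ standing in for the linear-model output) maps linear outputs to probabilities and recovers them through the log odds.

```python
import numpy as np

def logistic(z):
    """Map a continuous output z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 3.0])     # outputs of a linear model x_tilde @ w_tilde
p = logistic(z)                    # predicted P(y = 1 | x)

print(p)                           # approx [0.119, 0.5, 0.953]
print(np.allclose(logit(p), z))    # True: the logit recovers the linear output
```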
Ridge regression
Ridge regression imposes a penalty on the size of the regression coefficients using L2 regularization, where the loss function takes the following form: $$ l=\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2+\lambda \sum_{j=1}^m w_j^2, \text { where } \lambda>0 . $$ $\lambda$ is a complexity parameter that controls the degree of shrinkage: a larger $\lambda$ means a greater amount of shrinkage. The objective of ridge regression is to find the optimal $\mathbf{w}^*$ such that $$ \mathbf{w}^*=\arg \min _{\mathbf{w}}\left\{\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2+\lambda \sum_{j=1}^m w_j^2\right\} . $$ This is equivalent to solving the following optimization problem: $$ \begin{array}{r} \mathbf{w}^*=\arg \min _{\mathbf{w}}\left\{\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2\right\}, \\ \text { s.t. } \sum_{j=1}^m w_j^2 \leq t, \end{array} $$ where there is a one-to-one relationship between $\lambda$ and $t$, and the size constraint on the parameters (i.e., the constraint on parameter values) is imposed explicitly in Equation (5.18). Ridge regression effectively alleviates the high variance brought about by correlated variables in multiple linear regression by shrinking coefficients close to (but not exactly to) zero. Note that the bias $b$, which is not directly related to the feature coefficients, is excluded from the penalty term, as the penalty aims to regularize only the coefficients; its value is still determined in Equation (5.18). An example of using ridge regression to predict the ship deficiency number using the features of Example 5.2 based on the scikit-learn API is as follows. Example 5.5: Min-max scaling is first applied to the numerical features age, GT, last inspection time, and last deficiency number. Ridge regression with hyperparameter tuning for $\lambda$ based on 5-fold cross-validation can easily be implemented by the RidgeCV method provided by the scikit-learn API.
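A possible realization of Example 5.5 is sketched below. It is an illustrative reconstruction rather than the book's original code: the inspection dataset is replaced by synthetic data whose four columns stand in for the features age, GT, last inspection time, and last deficiency number, and the grid of candidate $\lambda$ values is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the inspection data: columns play the role of the
# numerical features [age, GT, last inspection time, last deficiency number].
rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 4)) * [30.0, 150000.0, 36.0, 20.0]
y = 0.3 * X[:, 0] + 1e-5 * X[:, 1] + 0.1 * X[:, 3] + rng.normal(scale=1.0, size=200)

# Min-max scaling followed by ridge regression; RidgeCV selects the penalty
# (called alpha in scikit-learn, lambda in the text) by 5-fold cross-validation.
model = make_pipeline(
    MinMaxScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5),
)
model.fit(X, y)

print("selected lambda:", model[-1].alpha_)
print("coefficients:", model[-1].coef_, "bias:", model[-1].intercept_)
```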
The background and development of PSC
PSC was then developed, aiming to inspect foreign visiting ships in national ports to verify that “the condition of the ship and its equipment comply with the requirements of international regulations and that the ship is manned and operated in compliance with these rules,” as mentioned by the IMO [7]. During an inspection, a condition onboard that does not comply with the requirements of the relevant convention is called a deficiency. The number and nature of the deficiencies found onboard determine the corresponding action taken by the PSC officer(s) (PSCO[s]). Common actions include rectifying a deficiency at the next port, within 14 days, or before departure, as well as ship detention. In particular, ship detention is an intervention action taken by the port state that prevents a severely substandard ship from proceeding to sea until it no longer presents a danger to the ship or persons onboard or to the marine environment.
PSC inspection is carried out on a regional level. The Memorandum of Understanding (MoU) on PSC was first signed in 1982 by 14 European countries; it is called the Paris MoU and marks the establishment of PSC. Since then, the number of member states of the Paris MoU has constantly increased, and it contains 27 participating maritime administrations covering the waters of the European coastal states and the North Atlantic basin as of January 2022. Another large regional MoU is in the Far East, responsible for the Asia Pacific region; it is called the Tokyo MoU and was signed in 1993. It now contains 22 member states. In addition, there are another seven MoUs on PSC, namely Acuerdo de Viña del Mar (Latin America), the Caribbean MoU (the Caribbean), the Abuja MoU (West and Central Africa), the Black Sea MoU (the Black Sea region), the Mediterranean MoU (the Mediterranean), the Indian Ocean MoU (the Indian Ocean), and the Riyadh MoU. The main objectives of the MoUs are to construct an improved and harmonized PSC system, to strengthen cooperation and information exchange among member states, and to avoid multiple inspections within a short period. Apart from the nine regional MoUs, the United States Coast Guard maintains the tenth PSC regime.
Simple linear regression and the least squares
Simple linear regression uses only one feature to predict the target. For example, we use ship age to predict the number of deficiencies found in a PSC inspection. Denote the training set with $n$ samples by $D=\left\{\left(x_1, y_1\right),\left(x_2, y_2\right), \ldots,\left(x_n, y_n\right)\right\}$ and the feature vector by $x$. Simple linear regression aims to develop a model taking the following form: $$ \hat{y}_i=w x_i+b, $$ where $\hat{y}_i$ is the predicted target for sample $i$, $w$ is the weight parameter, and $b$ is the bias. $w$ and $b$ need to be learned from $D$. Then, a natural question is: what are good $w$ and $b$? Or in other words, how can we find the values of $w$ and $b$ such that the predicted target is as accurate as possible? The key point of developing a simple linear regression model is to evaluate the difference between $\hat{y}_i$ and $y_i$, $i=1, \ldots, n$, using a loss function, and to adopt the values of $w$ and $b$ that minimize the loss function. In a regression problem, the most commonly used loss function is the mean squared error (MSE), where $MSE=\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2$. Therefore, the learning objective of simple linear regression is to find the optimal $\left(w^*, b^*\right)$ such that the MSE is minimized. The above idea can be expressed by the following mathematical functions:
$$ \begin{aligned} \left(w^*, b^*\right) & =\underset{(w, b)}{\arg \min } \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2 \\ & =\underset{(w, b)}{\arg \min } \sum_{i=1}^n\left(y_i-w x_i-b\right)^2 . \end{aligned} $$ This idea is called the least squares method. The intuition behind it is to minimize the sum of squared lengths of the vertical lines between all the samples and the regression line determined by $w$ and $b$. It can easily be shown that the MSE is convex in $w$ and $b$, and thus $\left(w^*, b^*\right)$ can be found by $$ \begin{aligned} \frac{\partial M S E}{\partial w} & =2\left(\sum_{i=1}^n x_i\left[w x_i-\left(y_i-b\right)\right]\right)=0 \\ \Rightarrow w^* & =\frac{\sum_{i=1}^n y_i\left(x_i-\frac{1}{n} \sum_{i=1}^n x_i\right)}{\sum_{i=1}^n x_i^2-\frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2} . \end{aligned} $$ The optimal $w^*$ is first found by Equation (5.2), and then it can be used to calculate the optimal value of $b$, denoted by $b^*$, as follows: $$ \begin{aligned} & \frac{\partial M S E}{\partial b}=2 \sum_{i=1}^n\left(w^* x_i+b-y_i\right)=0 \\ & \Rightarrow b^*=\frac{1}{n} \sum_{i=1}^n\left(y_i-w^* x_i\right) . \end{aligned} $$ Simple linear regression can easily be realized by the scikit-learn API [1] in Python. Here is an example of using ship age to predict the ship deficiency number using simple linear regression.
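A possible form of this example is sketched below. It is an illustrative reconstruction, not the book's original code: ship ages and deficiency numbers are generated synthetically, the closed-form solution of Equations (5.2)–(5.3) is computed directly, and scikit-learn's LinearRegression is used as a cross-check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: ship age (years) vs. number of deficiencies.
rng = np.random.default_rng(0)
x = rng.uniform(0, 30, size=100)                      # ship age
y = 0.4 * x + 2.0 + rng.normal(scale=1.5, size=100)   # deficiency number

# Closed-form least-squares solution of Equations (5.2)-(5.3).
w = (np.sum(y * (x - x.mean()))
     / (np.sum(x ** 2) - (np.sum(x) ** 2) / len(x)))
b = np.mean(y - w * x)

# scikit-learn gives the same estimates.
reg = LinearRegression().fit(x.reshape(-1, 1), y)
print(w, b)
print(reg.coef_[0], reg.intercept_)                   # matches (w, b)
```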
A majority of the cargoes in supermarkets, such as fruits and vegetables, kitchen appliances, furniture, garments, meats, fish, dairy products, and toys, are transported in containers by ship. Container volumes are usually expressed in terms of TEUs (twenty-foot equivalent units), a TEU being a box that is 20 feet $(6.1 \mathrm{~m})$ long. Throughout this book, unless otherwise specified, we use “TEU” to express “the number of containers” or “the volume of containers.”
Containers are transported by ship on liner services, which are similar to bus services. Figure 1.1 shows the Central China 2 (CC2) service operated by Orient Overseas Container Line (OOCL), a Hong Kong-based shipping company. We call it a service, a route, or a service route. A route is a loop, and the port rotation of a route is the sequence of ports of call on the route. Any port of call can be defined as the first port of call. For example, if we define Ningbo as the first port of call, then Shanghai is the second port of call, and Los Angeles is the third port of call. We can therefore represent the port rotation of the route as follows: Ningbo (1) $\rightarrow$ Shanghai (2) $\rightarrow$ Los Angeles (3) $\rightarrow$ Ningbo (1).
Note that on a route, different ports of call may be the same physical port. For example, the Central China 1 (CC1) service of OOCL has the port rotation shown in Figure 1.2.
Both the second and the seventh ports of call are Kwangyang, and both the third and the sixth ports of call are Pusan.
A leg is the voyage from one port of call to the next. Leg $i$ is the voyage from the $i$th port of call to port of call $i+1$. The last leg is the voyage from the last port of call to the first port of call. On CC1, the second leg is the voyage from Kwangyang (the second) to Pusan (the third), and the seventh leg is the voyage from Kwangyang (the seventh) to Shanghai (the first).
The rotation time of a route is the time required for a ship to start from the first port of call, visit all ports of call on the route, and return to the first port of call. As can be read from Figures 1.1 and 1.2, the rotation time of CC2 is 35 days, and the rotation time of CC1 is 42 days. Each route provides a weekly frequency, which means that each port of call is visited on the same day every week. Therefore, a string of five ships is deployed on CC2, and the headway between two adjacent ships is 7 days. These five ships usually have the same TEU capacity and other characteristics. Unless otherwise specified, we assume weekly frequencies for all routes.
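The fleet sizes quoted above follow directly from the weekly frequency: with a 7-day headway, the number of ships required on a route equals its rotation time divided by 7, so that
$$ \text{ships on CC2}=\frac{35 \text{ days}}{7 \text{ days}}=5, \qquad \text{ships on CC1}=\frac{42 \text{ days}}{7 \text{ days}}=6 . $$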
Key issues in maritime transport
Maritime transport is a highly globalized industry in terms of operation and management. For ship operation, ocean-going vessels sail on the high seas from the origin port in one country/region to the destination port in another country/region. For ship management, the parties responsible for ship ownership, crewing, and operating may be located in different countries and regions. Even the country of registration, i.e., the ship's flag state, may not have a direct link to a ship's activities, as the ship may not frequently visit the ports belonging to its flag state. For landlocked countries such as Mongolia, the ships registered under them never visit their ports. Such a complex and disintegrated nature of the shipping industry makes it hard to control and regulate international shipping activities, and thus poses a danger to maritime safety, the marine environment, and the crew and cargoes carried by ocean-going vessels. Shipping is one of the world's most dangerous industries due to the complex and ever-changing environment at sea, the dangerous goods carried, and the difficulties in search and rescue. Safety at sea is always given the highest priority in ship operation and management. It is widely believed that the most effective and efficient way of improving safety at sea is to develop international regulations that should be followed by all shipping nations [1]. A unified and permanent international body for regulation and supervision had been hoped for by several nations from the mid-19th century onward, and these hopes came true when the International Maritime Organization (IMO, whose original name was the Inter-Governmental Maritime Consultative Organization) was established at an international conference held in Geneva in 1948. Through the hard efforts of all parties, the members of the IMO met for the first time in 1959, one year after the IMO convention came into force. The IMO's task was to adopt a new version of the most important convention on maritime safety, i.e., the International Convention for the Safety of Life at Sea, which specifies minimum safety standards for ship construction, equipment, and operation. It covers comprehensive aspects of shipping safety, including vessel construction, fire safety, life-saving arrangements, radio communications, navigation safety, cargo carriage, the transport of dangerous goods, the mandatory application of the International Safety Management (ISM) Code, verification of compliance, and measures for specific ships, and it is constantly amended [2]. The Maritime Safety Committee is responsible for every aspect of maritime safety and security, and it is the highest technical body of the IMO.
Intuition and Main Results
Consider first the training error $E_{\text {train }}$ defined in (5.3). Since $$ \operatorname{tr} \mathbf{Y} \mathbf{Q}^2(\gamma) \mathbf{Y}^{\top}=-\frac{\partial}{\partial \gamma} \operatorname{tr} \mathbf{Y} \mathbf{Q}(\gamma) \mathbf{Y}^{\top}, $$ a deterministic equivalent for the resolvent $\mathbf{Q}(\gamma)$ is sufficient to access the asymptotic behavior of $E_{\text {train }}$. With a linear activation $\sigma(t)=t$, the resolvent of interest $$ \mathbf{Q}(\gamma)=\left(\frac{1}{n} \sigma(\mathbf{W X})^{\top} \sigma(\mathbf{W} \mathbf{X})+\gamma \mathbf{I}_n\right)^{-1} $$ is the same as in Theorem 2.6. In a sense, the evaluation of $\mathbf{Q}(\gamma)$ (and subsequently $E_{\text {train }}$) calls for an extension of Theorem 2.6 to handle the case of nonlinear activations. Recall now that the main ingredients used to derive a deterministic equivalent for (the linear case) $\mathbf{Q}=\left(\mathbf{X}^{\top} \mathbf{W}^{\top} \mathbf{W} \mathbf{X} / n+\gamma \mathbf{I}_n\right)^{-1}$ are that (i) $\mathbf{X}^{\top} \mathbf{W}^{\top}$ has i.i.d. columns and (ii) its $i$th column $\mathbf{X}^{\top}\left[\mathbf{W}^{\top}\right]_{\cdot i}$ has i.i.d. (or linearly dependent) entries, so that the key Lemma 2.11 applies. These hold, in the linear case, due to the i.i.d. property of the entries of $\mathbf{W}$. However, while for Item (i) the nonlinear $\Sigma^{\top}=\sigma(\mathbf{W X})^{\top}$ still has i.i.d. columns, for Item (ii) its $i$th column $\sigma\left(\left[\mathbf{X}^{\top} \mathbf{W}^{\top}\right]_{\cdot i}\right)$ no longer has i.i.d. or linearly dependent entries. Therefore, the main technical difficulty here is to obtain a nonlinear version of the trace lemma, Lemma 2.11. That is, we expect that the concentration of quadratic forms around their expectation remains valid despite the application of the entry-wise nonlinearity $\sigma$. This naturally falls within the concentration of measure theory discussed in Section 2.7 and is given by the following lemma.
Lemma 5.1 (Concentration of nonlinear quadratic form, Louart et al. [2018, Lemma 1]). For $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$, 1-Lipschitz $\sigma(\cdot)$, and $\mathbf{A} \in \mathbb{R}^{n \times n}, \mathbf{X} \in \mathbb{R}^{p \times n}$ such that $\|\mathbf{A}\| \leq 1$ and $\|\mathbf{X}\|$ bounded with respect to $p, n$, then, $$ \mathbb{P}\left(\left|\frac{1}{n} \sigma\left(\mathbf{w}^{\top} \mathbf{X}\right) \mathbf{A} \sigma\left(\mathbf{X}^{\top} \mathbf{w}\right)-\frac{1}{n} \operatorname{tr} \mathbf{A} \mathbf{K}\right|>t\right) \leq C e^{-c n \min \left(t, t^2\right)} $$ for some $C, c>0$, $p / n \in(0, \infty)$, with $$ \mathbf{K} \equiv \mathbf{K}_{\mathbf{X X}} \equiv \mathbb{E}_{\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)}\left[\sigma\left(\mathbf{X}^{\top} \mathbf{w}\right) \sigma\left(\mathbf{w}^{\top} \mathbf{X}\right)\right] \in \mathbb{R}^{n \times n} . $$
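Lemma 5.1 can be illustrated numerically. The following Monte Carlo sketch (added here, with ReLU as the 1-Lipschitz activation and $\mathbf{A}=\mathbf{I}_n$; the matrix $\mathbf{K}$ is itself estimated by averaging over many independent draws of $\mathbf{w}$, so all quantities are approximate) shows that the quadratic form $\frac{1}{n}\sigma(\mathbf{w}^{\top}\mathbf{X})\mathbf{A}\sigma(\mathbf{X}^{\top}\mathbf{w})$ fluctuates little around $\frac{1}{n}\operatorname{tr}\mathbf{A}\mathbf{K}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 256, 512
relu = lambda t: np.maximum(t, 0.0)            # a 1-Lipschitz activation

X = rng.standard_normal((p, n)) / np.sqrt(p)   # data matrix with bounded norm
A = np.eye(n)                                  # ||A|| <= 1

# Estimate K = E_w[ sigma(X^T w) sigma(w^T X) ] by Monte Carlo over w ~ N(0, I_p).
W = rng.standard_normal((10000, p))
S = relu(W @ X)                                # each row is sigma(w^T X)
K = (S.T @ S) / W.shape[0]
target = np.trace(A @ K) / n

# The quadratic form (1/n) sigma(w^T X) A sigma(X^T w), over fresh rows of S,
# concentrates around the deterministic target.
S_sub = S[:2000]
quad = np.sum((S_sub @ A) * S_sub, axis=1) / n
print("target:", target)
print("mean:", quad.mean(), "std:", quad.std())   # std is small relative to the mean
```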
Consequences for Learning with Large Neural Networks
To validate the asymptotic analysis in Theorem 5.1 and Corollary 5.1 on real-world data, Figures 5.2 and 5.3 compare the empirical MSEs with their limiting behavior predicted in Corollary 5.1, for a random network of $N=512$ neurons and various types of Lipschitz and non-Lipschitz activations $\sigma(\cdot)$, respectively. The regressor $\boldsymbol{\beta} \in \mathbb{R}^p$ maps the vectorized images from the Fashion-MNIST dataset (classes 1 and 2) [Xiao et al., 2017] to their corresponding uni-dimensional ($d=1$) output labels $\mathbf{Y}_{1 i}, \hat{\mathbf{Y}}_{1 j} \in \{\pm 1\}$. For $n, p, N$ of order a few hundred (so not very large when compared to typical modern neural network dimensions), a close match between theory and practice is observed for the Lipschitz activations in Figure 5.2. The agreement is less accurate but still quite good for the non-Lipschitz activations in Figure 5.3, which, we recall, are formally not supported by the theorem statement – here for $\sigma(t)=1-t^2 / 2$, $\sigma(t)=1_{t>0}$, and $\sigma(t)=\operatorname{sign}(t)$. For all activations, the deviation from theory is more acute for small values of the regularization $\gamma$.
Figures 5.2 and 5.3 confirm that while the training error is a monotonically increasing function of the regularization parameter $\gamma$, there always exists an optimal value of $\gamma$ which minimizes the test error. In particular, the theoretical formulas derived in Corollary 5.1 allow for a (data-dependent) fast offline tuning of the hyperparameter $\gamma$ of the network, in the setting where $n, p, N$ are not too small and comparable. In terms of activation functions (those listed here), we observe that, on the Fashion-MNIST dataset, the ReLU nonlinearity $\sigma(t)=\max (t, 0)$ is optimal and achieves the minimum test error, while the quadratic activation $\sigma(t)=1-t^2 / 2$ is the worst and produces much higher training and test errors compared to the others. This observation will be theoretically explained through a deeper analysis of the corresponding kernel matrix $\mathbf{K}$, as performed in Section 5.1.2. Lastly, although not immediate at first sight, the training and test error curves of $\sigma(t)=1_{t>0}$ and $\sigma(t)=\operatorname{sign}(t)$ are indeed the same, up to a shift in $\gamma$, as a consequence of the fact that $\operatorname{sign}(t)=2 \cdot 1_{t>0}-1$.
Although much less popular than modern deep neural networks, neural networks with random fixed weights are simpler to analyze. Such networks have frequently arisen in the past decades as an appropriate solution to handle a possibly restricted number of training data and to reduce the computational and memory complexity and, from another viewpoint, can be seen as efficient random feature extractors. These neural networks in fact find their roots in Rosenblatt's perceptron [Rosenblatt, 1958] and have since been revisited, rediscovered, and analyzed many times in a number of works, both in their feedforward [Schmidt et al., 1992] and recurrent [Gelenbe, 1993] versions. The simplest modern versions of these random networks are the so-called extreme learning machine [Huang et al., 2012] for the feedforward case, which one may see as a mere linear regression method on nonlinear random features, and the echo state network [Jaeger, 2001] for the recurrent case. See also Scardapane and Wang [2017] for a more exhaustive overview of randomness in neural networks.
It is also to be noted that deep neural networks are initialized at random and that random operations (such as random node deletions or voluntarily not learning a large proportion of randomly initialized neural network weights, that is, random dropout) are common and efficient in neural network learning [Srivastava et al., 2014, Frankle and Carbin, 2019]. We may also point to the recent endeavor toward neural network “learning without backpropagation,” which, inspired by biological neural networks (which naturally do not operate backpropagation learning), proposes learning mechanisms with fixed random backward weights and asymmetric forward learning procedures [Lillicrap et al., 2016, Nøkland, 2016, Baldi et al., 2018, Frenkel et al., 2019, Han et al., 2019]. As such, the study of random neural network structures may be instrumental to future improved understanding and designs of advanced neural network structures.
As shall be seen subsequently, these simple models of random neural networks are to a large extent connected to kernel matrices. More specifically, the classification or regression performance at the output of these random neural networks is a functional of random matrices that fall into the wide class of kernel random matrices, yet of a slightly different form than those studied in Chapter 4. Perhaps more surprisingly, this connection still exists for deep neural networks which are (i) randomly initialized and (ii) then trained with gradient descent, via the so-called neural tangent kernel [Jacot et al., 2018], by considering the “infinitely many neurons” limit, that is, the limit where the network widths of all layers go to infinity simultaneously. This close connection between neural networks and kernels has triggered a renewed interest in the theoretical investigation of deep neural networks from various perspectives, including optimization [Du et al., 2019, Chizat et al., 2019], generalization [Allen-Zhu et al., 2019, Arora et al., 2019a, Bietti and Mairal, 2019], and learning dynamics [Lee et al., 2020, Advani et al., 2020, Liao and Couillet, 2018a]. These works shed new light on our theoretical understanding of deep neural network models and specifically demonstrate the significance of studying simple networks with random weights and their associated kernels to assess the intrinsic mechanisms of more elaborate and practical deep networks.
Regression with Random Neural Networks
Throughout this section, we consider a feedforward single-hidden-layer neural network, as illustrated in Figure 5.1 (displayed, for notational convenience, from right to left). A similar class of single-hidden-layer neural network models, however with a recurrent structure, will be discussed later in Section 5.3.
Given input data $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right] \in \mathbb{R}^{p \times n}$, we denote by $\Sigma \equiv \sigma(\mathbf{W} \mathbf{X}) \in \mathbb{R}^{N \times n}$ the output of the first layer comprising $N$ neurons. This output arises from the premultiplication of $\mathbf{X}$ by some random weight matrix $\mathbf{W} \in \mathbb{R}^{N \times p}$ with i.i.d. (say standard Gaussian) entries and the entry-wise application of the nonlinear activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$. As such, the columns $\sigma\left(\mathbf{W} \mathbf{x}_i\right)$ of $\Sigma$ can be seen as random nonlinear features of $\mathbf{x}_i$. The second layer weight $\boldsymbol{\beta} \in \mathbb{R}^{N \times d}$ is then learned to adapt the feature matrix $\Sigma$ to some associated target $\mathbf{Y}=\left[\mathbf{y}_1, \ldots, \mathbf{y}_n\right] \in \mathbb{R}^{d \times n}$, for instance, by minimizing the Frobenius norm $\left\|\mathbf{Y}-\boldsymbol{\beta}^{\top} \Sigma\right\|_F^2$.
Remark 5.1 (Random neural networks, random feature maps and random kernels). The columns of $\Sigma$ may be seen as the output of the $\mathbb{R}^p \rightarrow \mathbb{R}^N$ random feature map $\phi: \mathbf{x}_i \mapsto \sigma\left(\mathbf{W} \mathbf{x}_i\right)$ for some given $\mathbf{W} \in \mathbb{R}^{N \times p}$. In Rahimi and Recht [2008], it is shown that, for every nonnegative definite “shift-invariant” kernel of the form $(\mathbf{x}, \mathbf{y}) \mapsto f\left(\|\mathbf{x}-\mathbf{y}\|^2\right)$, there exist appropriate choices for $\sigma$ and the law of the entries of $\mathbf{W}$ so that, as the number of neurons or random features $N \rightarrow \infty$, $$ \sigma\left(\mathbf{W} \mathbf{x}_i\right)^{\top} \sigma\left(\mathbf{W} \mathbf{x}_j\right) \stackrel{\text { a.s. }}{\longrightarrow} f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right) . $$ As such, for large enough $N$ (that in general must scale with $n, p$), the bivariate function $(\mathbf{x}, \mathbf{y}) \mapsto \sigma(\mathbf{W} \mathbf{x})^{\top} \sigma(\mathbf{W} \mathbf{y})$ approximates a kernel function of the type $f\left(\|\mathbf{x}-\mathbf{y}\|^2\right)$ studied in Chapter 4. This result is then generalized, in subsequent works, to a larger family of kernels including inner-product kernels [Kar and Karnick, 2012], additive homogeneous kernels [Vedaldi and Zisserman, 2012], etc. Another, possibly more marginal, connection with the previous sections is that $\sigma\left(\mathbf{w}^{\top} \mathbf{x}\right)$ can be interpreted as a “properly scaling” inner-product kernel function applied to the “data” pair $\mathbf{w}, \mathbf{x} \in \mathbb{R}^p$. This technically induces another strong relation between the study of kernels and that of neural networks. Again, similar to the concentration of (Euclidean) distances extensively explored in this chapter, the entry-wise convergence in (5.1) does not imply convergence in the operator norm sense, which, as we shall see, leads directly to the so-called “double descent” test curve in random feature/neural network models. If the network output weight matrix $\boldsymbol{\beta}$ is designed to minimize the regularized MSE $L(\boldsymbol{\beta})=\frac{1}{n} \sum_{i=1}^n\left\|\mathbf{y}_i-\boldsymbol{\beta}^{\top} \sigma\left(\mathbf{W} \mathbf{x}_i\right)\right\|^2+\gamma\|\boldsymbol{\beta}\|_F^2$, for some regularization parameter $\gamma>0$, then $\boldsymbol{\beta}$ takes the explicit form of a ridge regressor $$ \boldsymbol{\beta} \equiv \frac{1}{n} \Sigma\left(\frac{1}{n} \Sigma^{\top} \Sigma+\gamma \mathbf{I}_n\right)^{-1} \mathbf{Y}^{\top}, $$ which follows from differentiating $L(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ to obtain $0=\gamma \boldsymbol{\beta}+\frac{1}{n} \Sigma\left(\Sigma^{\top} \boldsymbol{\beta}-\mathbf{Y}^{\top}\right)$, so that $\left(\frac{1}{n} \Sigma \Sigma^{\top}+\gamma \mathbf{I}_N\right) \boldsymbol{\beta}=\frac{1}{n} \Sigma \mathbf{Y}^{\top}$, which, along with $\left(\frac{1}{n} \Sigma \Sigma^{\top}+\gamma \mathbf{I}_N\right)^{-1} \Sigma=\Sigma\left(\frac{1}{n} \Sigma^{\top} \Sigma+\gamma \mathbf{I}_n\right)^{-1}$ for $\gamma>0$, gives the result.
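The ridge-regressor formula and the matrix identity used in its derivation can be checked numerically; the sketch below (an added illustration with arbitrary synthetic data, a ReLU activation, and hand-picked dimensions) computes $\boldsymbol{\beta}$ in both its $n\times n$ and $N\times N$ resolvent forms and verifies that they coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, N, d, gamma = 64, 400, 256, 1, 1e-2

X = rng.standard_normal((p, n))                     # input data (as columns)
Y = np.sign(X[:1, :])                               # some d x n target (d = 1)

W = rng.standard_normal((N, p))                     # fixed random first layer
Sigma = np.maximum(W @ X, 0.0)                      # sigma = ReLU, N x n features

# Ridge regressor: beta = (1/n) Sigma (Sigma^T Sigma / n + gamma I_n)^{-1} Y^T.
beta = Sigma @ np.linalg.solve(Sigma.T @ Sigma / n + gamma * np.eye(n), Y.T) / n

# Equivalent "N x N" form obtained from the normal equations.
beta2 = np.linalg.solve(Sigma @ Sigma.T / n + gamma * np.eye(N), Sigma @ Y.T / n)
print(np.allclose(beta, beta2))                     # True

train_mse = np.mean((Y - beta.T @ Sigma) ** 2)
print("training MSE:", train_mse)
```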
Before the present chapter, the first part of the book was mostly concerned with the sample covariance matrix model $\mathbf{X} \mathbf{X}^{\top} / n$ (and more marginally with the Wigner model $\mathbf{X} / \sqrt{n}$ for symmetric $\mathbf{X}$), where the columns of $\mathbf{X}$ are independent and the entries of each column are independent or linearly dependent. Historically, this model and its numerous variations (with a variance profile, with right-side correlation, summed with other independent matrices of the same form, etc.) have covered most of the mathematical and applied interest of the first two decades (since the early nineties) of intense random matrix advances. The main drivers for these early developments were statistics, signal processing, and wireless communications. The present chapter leaped much further by considering random matrix models with possibly highly correlated entries, with a specific focus on kernel matrices. When (moderately) large-dimensional data are considered, the intuition and theoretical understanding of kernel matrices in the small-dimensional setting are no longer accurate, and random matrix theory provides accurate (and asymptotically exact) performance assessments along with the possibility to largely improve the performance of kernel-based machine learning methods. This, in effect, creates a small revolution in our understanding of machine learning on realistic large datasets.
A first important finding of the analysis of large-dimensional kernel statistics reported here is the ubiquitous character of the Marčenko-Pastur and semicircular laws. As a matter of fact, all random matrix models studied in this chapter, and in particular the kernel regimes $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / p\right)$ (which concentrate around $f(0)$) and $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / \sqrt{p}\right)$ (which tends to $f(\mathcal{N}(0,1))$), have a limiting eigenvalue distribution akin to a combination of the two laws. This combination may vary from case to case (compare for instance the results of Practical Lecture 3 to Theorem 4.4), but is often parametrized in such a way that the Marčenko-Pastur and semicircle laws appear as limiting cases (in the context of Practical Lecture 3, they correspond to the limiting cases of dense versus sparse kernels, and in Theorem 4.4 to the limiting cases of linear versus “purely” nonlinear kernels).
Practical Course Material
In this section, Practical Lecture 3 (which evaluates the spectral behavior of uniformly sparsified kernels), related to the present Chapter 4, is discussed, where we shall see, as for $\alpha$-$\beta$ and properly scaling kernels in Sections 4.2.4 and 4.3, that, depending on the “level of sparsity,” a combination of Marčenko-Pastur and semicircle laws is observed. Practical Lecture Material 3 (Complexity-performance trade-off in spectral clustering with sparse kernel, Zarrouk et al. [2020]). In this exercise, we study the spectrum of a “punctured” version $\mathbf{K}=\mathbf{B} \odot\left(\mathbf{X}^{\top} \mathbf{X} / p\right)$ (with the Hadamard product $[\mathbf{A} \odot \mathbf{B}]_{i j}=[\mathbf{A}]_{i j}[\mathbf{B}]_{i j}$) of the linear kernel $\mathbf{X}^{\top} \mathbf{X} / p$, with data matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$ and a symmetric random mask-matrix $\mathbf{B} \in\{0,1\}^{n \times n}$ having independent entries $[\mathbf{B}]_{i j} \sim \operatorname{Bern}(\epsilon)$ for $i \neq j$ (up to symmetry) and $[\mathbf{B}]_{i i}=b \in\{0,1\}$ fixed, in the limit $p, n \rightarrow \infty$ with $p / n \rightarrow c \in(0, \infty)$. This matrix mimics the computation of only a proportion $\epsilon \in(0,1)$ of the entries of $\mathbf{X}^{\top} \mathbf{X} / p$, and its impact on spectral clustering. Letting $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right]$ with $\mathbf{x}_i$ independently and uniformly drawn from the following symmetric two-class Gaussian mixture $$ \mathcal{C}_1: \mathbf{x}_i \sim \mathcal{N}\left(-\boldsymbol{\mu}, \mathbf{I}_p\right), \quad \mathcal{C}_2: \mathbf{x}_i \sim \mathcal{N}\left(+\boldsymbol{\mu}, \mathbf{I}_p\right), $$ for $\boldsymbol{\mu} \in \mathbb{R}^p$ such that $\|\boldsymbol{\mu}\|=O(1)$ with respect to $n, p$, we wish to study the effect of a uniform “zeroing out” of the entries of $\mathbf{X}^{\top} \mathbf{X}$ on the presence of an isolated spike in the spectrum of $\mathbf{K}$, and thus on the spectral clustering performance.
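Before turning to the resolvent analysis, the setting can be simulated directly; the following sketch (an added illustration with arbitrary values of $p$, $n$, $\epsilon$, and $\|\boldsymbol{\mu}\|$, and diagonal $b=1$) builds the punctured kernel $\mathbf{K}=\mathbf{B}\odot(\mathbf{X}^{\top}\mathbf{X}/p)$ for the two-class mixture and inspects its largest eigenvalues and the alignment of the top eigenvector with the class labels.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps = 400, 800, 0.3                       # p/n -> c = 0.5, keep 30% of entries

mu = np.zeros(p); mu[0] = 2.0                   # ||mu|| = O(1)
j = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
X = mu[:, None] * j[None, :] + rng.standard_normal((p, n))   # X = M + Z

# Symmetric Bernoulli(eps) mask B with fixed diagonal b = 1.
U = rng.random((n, n)) < eps
B = np.triu(U, 1)
B = (B + B.T).astype(float)
np.fill_diagonal(B, 1.0)

K = B * (X.T @ X / p)                           # punctured linear kernel
eigvals, eigvecs = np.linalg.eigh(K)

print("largest eigenvalues:", eigvals[-3:])     # an isolated spike typically detaches here
# The top eigenvector correlates with the class labels j when the spike exists.
v = eigvecs[:, -1]
print("alignment with classes:", abs(v @ j) / np.linalg.norm(j))
```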
We will study the spectrum of $\mathbf{K}$ using Stein's lemma and the Gaussian method discussed in Section 2.2.2. Let $\mathbf{Z}=\left[\mathbf{z}_1, \ldots, \mathbf{z}_n\right] \in \mathbb{R}^{p \times n}$ for $\mathbf{z}_i=\mathbf{x}_i-(-1)^a \boldsymbol{\mu} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$ with $\mathbf{x}_i \in \mathcal{C}_a$, and $\mathbf{M}=\boldsymbol{\mu} \mathbf{j}^{\top}$ with $\mathbf{j}=\left[-\mathbf{1}_{n / 2}, \mathbf{1}_{n / 2}\right]^{\top} \in \mathbb{R}^n$, so that $\mathbf{X}=\mathbf{M}+\mathbf{Z}$. First show that, for $\mathbf{Q} \equiv \mathbf{Q}(z)=\left(\mathbf{K}-z \mathbf{I}_n\right)^{-1}$, $$ \begin{aligned} \mathbf{Q}= & -\frac{1}{z} \mathbf{I}_n+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{M}}{p} \odot \mathbf{B}\right) \mathbf{Q} \\ & +\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{M}}{p} \odot \mathbf{B}\right) \mathbf{Q} . \end{aligned} $$ To proceed, we need to go slightly beyond the study of these four terms.
Distance and Inner-Product Random Kernel Matrices
The most widely used kernel model in machine learning applications is the heat kernel $\mathbf{K}=\left\{\exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / 2 \sigma^2\right)\right\}_{i, j=1}^n$, for some $\sigma>0$. It is thus natural to start the large-dimensional analysis of kernel random matrices by focusing on this model. As mentioned in the previous sections, for the Gaussian mixture model above, as the dimension $p$ increases, $\sigma^2$ needs to scale as $O(p)$, say $\sigma^2=\tilde{\sigma}^2 p$ for some $\tilde{\sigma}^2=O(1)$, to avoid evaluating the exponential at increasingly large values for $p$ large. As such, the prototypical kernel of present interest is $$ \mathbf{K}=\left\{f\left(\frac{1}{p}\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)\right\}_{i, j=1}^n, $$ for $f$ a sufficiently smooth function (specifically, $f(t)=\exp \left(-t / 2 \tilde{\sigma}^2\right)$ for the heat kernel). As we will see, though, it is desirable not to restrict ourselves to $f(t)=\exp \left(-t / 2 \tilde{\sigma}^2\right)$, so as to better appreciate the impact of the nonlinear kernel function $f$ on the (asymptotic) structural behavior of the kernel matrix $\mathbf{K}$.
Euclidean Random Matrices with Equal Covariances
In order to get a first picture of the large-dimensional behavior of $\mathbf{K}$, let us first expand the distance $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p$ for $\mathbf{x}_i \in \mathcal{C}_a$ and $\mathbf{x}_j \in \mathcal{C}_b$, with $i \neq j$.
For simplicity, let us assume for the moment that $\mathbf{C}_1=\cdots=\mathbf{C}_k=\mathbf{I}_p$, and recall the notation $\mathbf{x}_i=\boldsymbol{\mu}_a+\mathbf{z}_i$. We have, for $i \neq j$, that “entry-wise,” $$ \begin{aligned} \frac{1}{p}\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2= & \frac{1}{p}\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2+\frac{2}{p}\left(\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right)^{\top}\left(\mathbf{z}_i-\mathbf{z}_j\right) \\ & +\frac{1}{p}\left\|\mathbf{z}_i\right\|^2+\frac{1}{p}\left\|\mathbf{z}_j\right\|^2-\frac{2}{p} \mathbf{z}_i^{\top} \mathbf{z}_j . \end{aligned} $$ For $\left\|\mathbf{x}_i\right\|$ of order $O(\sqrt{p})$, if $\left\|\boldsymbol{\mu}_a\right\|=O(\sqrt{p})$ for all $a \in\{1, \ldots, k\}$ (which would be natural), then $\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2 / p$ is a priori of order $O(1)$ while, by the central limit theorem, $\left\|\mathbf{z}_i\right\|^2 / p=1+O\left(p^{-1 / 2}\right)$. Also, again by the central limit theorem, $\mathbf{z}_i^{\top} \mathbf{z}_j / p=O\left(p^{-1 / 2}\right)$ and $\left(\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right)^{\top}\left(\mathbf{z}_i-\mathbf{z}_j\right) / p=O\left(p^{-1 / 2}\right)$.
As a consequence, for $p$ large, the distance $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p$ is dominated by $\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2 / p+2$ and easily discriminates the classes from the pairwise observations of $\mathbf{x}_i, \mathbf{x}_j$, making the classification asymptotically trivial (without having to resort to any kernel method). It is thus of interest to consider the situations where the class distances are less significant, to understand how the choice of kernel comes into play in such a more practical scenario. To this end, we now demand that $$ \left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|=O(1), $$ which is also the minimal distance rate that can be discriminated by a mere Bayesian inference analysis, as thoroughly discussed in Section 1.1.3. Since the kernel function $f(\cdot)$ operates only on the distances $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|$, we may even request (up to centering all data by, say, the constant vector $\frac{1}{n} \sum_{a=1}^k n_a \boldsymbol{\mu}_a$) for simplicity that $\left\|\boldsymbol{\mu}_a\right\|=O(1)$ for each $a$.
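This concentration of pairwise distances is easy to observe numerically; the sketch below (an added illustration with $p=1000$, $\|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2\|=2$, and only a few samples per class) shows that all normalized squared distances, within or across classes, gather around the value $2$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 1000, 6
mu = np.zeros(p); mu[0] = 1.0                    # ||mu_1 - mu_2|| = 2 = O(1)

# Three points per class from N(-mu, I_p) and N(+mu, I_p).
X1 = -mu[:, None] + rng.standard_normal((p, n // 2))
X2 = +mu[:, None] + rng.standard_normal((p, n // 2))
X = np.concatenate([X1, X2], axis=1)

# Pairwise squared distances divided by p.
D = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0) / p
print(np.round(D, 3))
# All off-diagonal entries are close to 2 (within O(1/sqrt(p))), whether the two
# points come from the same class or not: the class information is no longer
# visible "entry-wise" and must be extracted from the spectrum of the kernel.
```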