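statistics-lab™ supports you throughout your studies abroad. It has built a reputation for machine learning assignment writing and promises reliable, high-quality, and original statistics writing services. Our experts are highly experienced in machine learning assignment writing, so machine learning assignments of every kind are well within their reach.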

## Logistic regression

Suppose the prediction target is whether a ship is detained in an inspection, a binary variable with “1” indicating ship detention and “0” otherwise. Neither simple nor multiple linear regression can be applied directly to this classification problem, because their output is continuous and unbounded, whereas the target is categorical and bounded. An intuitive approach is to predict the probability of $y=1$ given the input features $\mathbf{x}$, i.e., $P(y=1 \mid \mathbf{x})$, and then apply a threshold to it. The unit-step function is a popular way to map a continuous output (denoted by $z$) to a probability (denoted by $\widetilde{y}$); its form is shown in Figure 5.1.

However, Figure 5.1 shows that the output of the unit-step function is discontinuous, which makes it hard to optimize. Therefore, a continuous, monotonic, and differentiable surrogate of the unit-step function, called the logistic function, is used:
$$\widetilde{y}=\frac{1}{1+e^{-z}},$$
where $z=\tilde{\boldsymbol{x}} \tilde{\boldsymbol{w}}$ is the continuous output given by a multiple linear regression model. An illustration of the logistic function is shown in Figure 5.2.

Equation (5.12) can also be transformed as follows:
$$\begin{aligned} \widetilde{y} & =\frac{1}{1+e^{-z}} \\ \Rightarrow z & =\tilde{\boldsymbol{x}} \tilde{\boldsymbol{w}}=\ln \frac{\widetilde{y}}{1-\widetilde{y}}. \end{aligned}$$
In Equation (5.13), $\widetilde{y}$ is the probability that a sample with features $\mathbf{x}$ belongs to class “1” and $1-\widetilde{y}$ is the probability that it belongs to class “0.” Therefore, $\frac{\widetilde{y}}{1-\widetilde{y}}$ is the relative probability of sample $\mathbf{x}$ being of class “1,” which is called the odds. $\ln \frac{\widetilde{y}}{1-\widetilde{y}}$ is the natural logarithm of the odds and is called the log odds, or logit. Equation (5.13) can therefore be interpreted as using the output of a multiple linear regression model to approximate the log odds, so that a continuous output is mapped to a probability.
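The relationship between the logistic function and the log odds can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the 0.5 decision threshold are illustrative assumptions, not part of the original text.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1).

    Fine for moderate |z|; a very negative z would overflow math.exp.
    """
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Log odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def predict_detention(z: float, threshold: float = 0.5) -> int:
    """Return 1 (detained) if the predicted probability reaches the threshold.

    The threshold of 0.5 is an assumption made for this sketch.
    """
    return 1 if logistic(z) >= threshold else 0


if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        p = logistic(z)
        # logit(p) recovers z, i.e. the linear output approximates the log odds
        print(f"z={z:+.1f}  p={p:.3f}  odds={p / (1 - p):.3f}  logit={logit(p):+.3f}")
```

Applying `logit` to the output of `logistic` returns the original score, which is exactly the statement that the linear model output plays the role of the log odds.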

## Ridge regression

Ridge regression imposes a penalty on the size of the regression coefficients using L2 regularization, where the loss function takes the following form:
$$l=\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2+\lambda \sum_{j=1}^m w_j^2, \text { where } \lambda>0 .$$
$\lambda$ is a complexity parameter that controls the degree of shrinkage: a larger $\lambda$ means a greater amount of shrinkage. The objective of ridge regression is to find the optimal $\mathbf{w}^*$ such that
$$\mathbf{w}^*=\arg \min _{\mathbf{w}}\left\{\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2+\lambda \sum_{j=1}^m w_j^2\right\} .$$
This is equivalent to solving the following optimization problem:
$$\begin{aligned} \mathbf{w}^*=\arg \min _{\mathbf{w}} & \left\{\sum_{i=1}^n\left(y_i-b-\sum_{j=1}^m x_{i j} w_j\right)^2\right\}, \\ \text { s.t. } & \sum_{j=1}^m w_j^2 \leq t, \end{aligned}$$
where there is a one-to-one correspondence between $\lambda$ and $t$, and the size constraint on the parameters (i.e., the constraint on the parameter values) is imposed explicitly in Equation (5.18). Ridge regression alleviates the high variance caused by correlated variables in multiple linear regression by shrinking the coefficients toward (but not exactly to) zero. Note that the bias $b$ is excluded from the penalty term, because the penalty aims to regularize only the coefficients of the features; its value is nevertheless determined together with $\mathbf{w}$ in Equation (5.18). An example of using ridge regression to predict the ship deficiency number from the features of Example 5.2, based on the scikit-learn API, is as follows.

Example 5.5: Min-max scaling is also applied first to the numerical features age, GT, last inspection time, and last deficiency number. Ridge regression with hyperparameter tuning for $\lambda$ based on 5-fold cross-validation can then be implemented easily with the RidgeCV class provided by the scikit-learn API.
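The shrinkage effect of $\lambda$ can be illustrated in the single-feature case, where the ridge loss has a simple closed form: with an unpenalized bias, $w^*=S_{xy}/(S_{xx}+\lambda)$ and $b^*=\bar{y}-w^*\bar{x}$, where $S_{xy}$ and $S_{xx}$ are the centered cross-product and sum of squares. The sketch below is a plain-Python illustration of that formula, not the scikit-learn workflow of Example 5.5; the data values are hypothetical, and cross-validation over $\lambda$ is not shown.

```python
def ridge_1d(x, y, lam):
    """Single-feature ridge regression with an unpenalized bias.

    Minimizes sum_i (y_i - b - w * x_i)^2 + lam * w^2 in closed form.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    w = s_xy / (s_xx + lam)  # the penalty only enters the denominator
    b = y_bar - w * x_bar    # the bias is not shrunk
    return w, b


# Hypothetical data: min-max scaled ship age and deficiency counts.
AGE_SCALED = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
DEFICIENCIES = [1, 2, 2, 4, 5, 6]

if __name__ == "__main__":
    for lam in (0.0, 0.1, 1.0, 10.0):
        w, b = ridge_1d(AGE_SCALED, DEFICIENCIES, lam)
        print(f"lambda={lam:5.1f}  w={w:.3f}  b={b:.3f}")
```

As $\lambda$ grows, the coefficient moves toward zero without reaching it, while the bias adjusts so that the fitted line still passes through the sample means.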

## COMP30027

• Statistical Inference
• Statistical Computing
• (Generalized) Linear Models
• Statistical Machine Learning
• Longitudinal Data Analysis
• Foundations of Data Science

## The background and development of PSC

PSC was then developed, aiming to inspect foreign ships visiting national ports to verify that “the condition of the ship and its equipment comply with the requirements of international regulations and that the ship is manned and operated in compliance with these rules,” as stated by the IMO [7]. During an inspection, a condition onboard that does not comply with the requirements of the relevant convention is called a deficiency. The number and nature of the deficiencies found onboard determine the corresponding action taken by the PSC officer(s) (PSCO[s]). Common actions include requiring a deficiency to be rectified at the next port, within 14 days, or before departure, as well as ship detention. In particular, ship detention is an intervention by the port state that prevents a severely substandard ship from proceeding to sea until it no longer presents a danger to the ship, the persons onboard, or the marine environment.

PSC inspection is carried out on a regional level. The first Memorandum of Understanding (MoU) on PSC, known as the Paris MoU, was signed in 1982 by 14 European countries and marks the establishment of PSC. Since then, the number of member states of the Paris MoU has steadily increased; as of January 2022, it comprises 27 participating maritime administrations covering the waters of the European coastal States and the North Atlantic basin. Another large regional MoU is the Tokyo MoU, which is responsible for the Asia Pacific region and was signed in 1993. It now has 22 member states. In addition, there are another seven MoUs on PSC, namely the Acuerdo de Viña del Mar (Latin America), the Caribbean MoU (Caribbean), the Abuja MoU (West and Central Africa), the Black Sea MoU (the Black Sea region), the Mediterranean MoU (the Mediterranean), the Indian Ocean MoU (the Indian Ocean), and the Riyadh MoU. The main objectives of establishing MoUs are to build an improved and harmonized PSC system, to strengthen cooperation and information exchange among member states, and to avoid multiple inspections of the same ship within a short period. Apart from the nine regional MoUs, the United States Coast Guard maintains the tenth PSC regime.

## Simple linear regression and the least squares

Simple linear regression uses only one feature to predict the target. For example, we can use ship age to predict the number of deficiencies found in a PSC inspection. Denote the training set with $n$ samples by $D=\left\{\left(x_1, y_1\right),\left(x_2, y_2\right), \ldots,\left(x_n, y_n\right)\right\}$ and the feature vector by $x$. Simple linear regression aims to develop a model of the following form:
$$\hat{y}_i=w x_i+b,$$
where $\hat{y}_i$ is the predicted target for sample $i$, $w$ is the parameter weight, and $b$ is the bias. $w$ and $b$ need to be learned from $D$. A natural question is then: what are good values of $w$ and $b$? In other words, how can we find the values of $w$ and $b$ such that the predicted target is as accurate as possible? The key to developing a simple linear regression model is to evaluate the difference between $\hat{y}_i$ and $y_i$, $i=1, \ldots, n$, using a loss function and to adopt the values of $w$ and $b$ that minimize it. In a regression problem, the most commonly used loss function is the mean squared error (MSE), where $MSE=\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2$. Therefore, the learning objective of simple linear regression is to find the optimal $\left(w^*, b^*\right)$ such that the MSE is minimized. This idea can be expressed mathematically as follows:
$$\begin{aligned} \left(w^*, b^*\right) & =\underset{(w, b)}{\arg \min } \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2 \\ & =\underset{(w, b)}{\arg \min } \sum_{i=1}^n\left(y_i-w x_i-b\right)^2. \end{aligned}$$
This idea is called the least squares method. The intuition behind it is to minimize the sum of the squared lengths of the vertical segments between the samples and the regression line determined by $w$ and $b$. It can easily be shown that the MSE is convex in $w$ and $b$, and thus $\left(w^*, b^*\right)$ can be found by setting the partial derivatives to zero:
$$\begin{aligned} \frac{\partial MSE}{\partial w} & =\frac{2}{n}\left(\sum_{i=1}^n x_i\left[w x_i-\left(y_i-b\right)\right]\right)=0 \\ \Rightarrow w^* & =\frac{\sum_{i=1}^n y_i\left(x_i-\frac{1}{n} \sum_{i=1}^n x_i\right)}{\sum_{i=1}^n x_i^2-\frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2}. \end{aligned}$$
The optimal $w^*$ is first found by Equation (5.2), and it is then used to calculate the optimal value of $b$, denoted by $b^*$, as follows:
$$\begin{aligned} & \frac{\partial MSE}{\partial b}=\frac{2}{n} \sum_{i=1}^n\left(w^* x_i+b-y_i\right)=0 \\ & \Rightarrow b^*=\frac{1}{n} \sum_{i=1}^n\left(y_i-w^* x_i\right). \end{aligned}$$
Simple linear regression can easily be implemented with the scikit-learn API [1] in Python; for example, ship age can be used to predict the ship deficiency number.
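The closed-form expressions for $w^*$ and $b^*$ translate directly into a few lines of code. The sketch below implements them in plain Python rather than through scikit-learn, so that each term can be matched against the formulas above; the ship ages and deficiency counts are hypothetical values chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Least squares fit of y ~ w * x + b using the closed-form solution."""
    n = len(x)
    sum_x = sum(x)
    # w* = sum_i y_i (x_i - mean(x)) / (sum_i x_i^2 - (sum_i x_i)^2 / n)
    numerator = sum(yi * (xi - sum_x / n) for xi, yi in zip(x, y))
    denominator = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    w = numerator / denominator
    # b* = (1/n) sum_i (y_i - w* x_i)
    b = sum(yi - w * xi for xi, yi in zip(x, y)) / n
    return w, b


def mean_squared_error(x, y, w, b):
    """MSE of the fitted line on the given samples."""
    return sum((yi - (w * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical data: ship age in years and number of deficiencies found.
SHIP_AGE = [2, 5, 8, 12, 15, 20]
DEFICIENCY_NUMBER = [1, 2, 2, 4, 5, 7]

if __name__ == "__main__":
    w, b = fit_simple_linear_regression(SHIP_AGE, DEFICIENCY_NUMBER)
    print(f"w*={w:.4f}, b*={b:.4f}")
    print(f"MSE={mean_squared_error(SHIP_AGE, DEFICIENCY_NUMBER, w, b):.4f}")
```

Because the MSE is convex, any other choice of $w$ and $b$ gives an MSE at least as large as the one printed for the fitted pair.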

## Container liner shipping

Most cargoes found in supermarkets, such as fruits and vegetables, kitchen appliances, furniture, garments, meats, fish, dairy products, and toys, are transported in containers by ship. Container volumes are usually expressed in twenty-foot equivalent units (TEUs), where one TEU is a box that is 20 feet (6.1 m) long. Throughout this book, unless otherwise specified, we use “TEU” to express “the number of containers” or “the volume of containers.”

Containers are transported by ship on liner services, which are similar to bus services. Figure 1.1 shows the Central China 2 (CC2) service operated by Orient Overseas Container Line (OOCL), a Hong Kong-based shipping company. We call it a service, a route, or a service route. A route is a loop, and the port rotation of a route is the sequence of ports of call on the route. Any port of call can be defined as the first port of call. For example, if we define Ningbo as the first port of call, then Shanghai is the second port of call, and Los Angeles is the third port of call. We can therefore represent the port rotation of the route as follows:

Ningbo (1) $\rightarrow$ Shanghai (2) $\rightarrow$ Los Angeles (3) $\rightarrow$ Ningbo (1)

Note that on a route, different ports of call may be the same physical port. For example, the Central China 1 (CC1) service of OOCL shown in Figure $1.2$ has the port rotation below:

Shanghai (1) $\rightarrow$ Kwangyang (2) $\rightarrow$ Pusan (3) $\rightarrow$ Los Angeles (4) $\rightarrow$ Oakland (5) $\rightarrow$ Pusan (6) $\rightarrow$ Kwangyang (7) $\rightarrow$ Shanghai (1)

Both the second and the seventh ports of call are Kwangyang, and both the third and the sixth ports of call are Pusan.

A leg is the voyage from one port of call to the next. Leg $i$ is the voyage from the $i$th port of call to port of call $i+1$. The last leg is the voyage from the last port of call to the first port of call. On CC1, the second leg is the voyage from Kwangyang (the second port of call) to Pusan (the third), and the seventh leg is the voyage from Kwangyang (the seventh) to Shanghai (the first).

The rotation time of a route is the time required for a ship to start from the first port of call, visit all ports of call on the route, and return to the first port of call. As can be read from Figures 1.1 and 1.2, the rotation time of CC2 is 35 days, and the rotation time of CC1 is 42 days. Each route provides a weekly frequency, which means that each port of call is visited on the same day every week. Therefore, a string of five ships (35 days divided by a 7-day headway) is deployed on CC2, and the headway between two adjacent ships is 7 days. These five ships usually have the same TEU capacity and other characteristics. Unless otherwise specified, we assume weekly frequencies for all routes.
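The definitions of port rotation, leg, and rotation time can be captured in a small data structure. The following sketch is illustrative only: the `Route` class and its method names are assumptions introduced here, and requiring the rotation time to be a whole multiple of the headway is a simplifying assumption that happens to hold for CC1 and CC2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    port_rotation: tuple  # ports of call in order; the loop closes back to the first
    rotation_days: int    # time to visit all ports of call and return to the first

    def legs(self):
        """Leg i runs from port of call i to i + 1; the last leg returns to the first."""
        n = len(self.port_rotation)
        return [
            (self.port_rotation[i], self.port_rotation[(i + 1) % n])
            for i in range(n)
        ]

    def ships_required(self, headway_days: int = 7) -> int:
        """Number of ships needed to keep a fixed headway (weekly by default)."""
        if self.rotation_days % headway_days != 0:
            # Simplifying assumption of this sketch, not a rule from the text.
            raise ValueError("rotation time must be a multiple of the headway")
        return self.rotation_days // headway_days


CC2 = Route("CC2", ("Ningbo", "Shanghai", "Los Angeles"), 35)
CC1 = Route(
    "CC1",
    ("Shanghai", "Kwangyang", "Pusan", "Los Angeles", "Oakland", "Pusan", "Kwangyang"),
    42,
)

if __name__ == "__main__":
    for route in (CC2, CC1):
        print(route.name, "ships:", route.ships_required())
        for i, (origin, destination) in enumerate(route.legs(), start=1):
            print(f"  leg {i}: {origin} -> {destination}")
```

Storing the rotation as an ordered tuple, rather than a set of ports, is what allows the same physical port (Pusan or Kwangyang on CC1) to appear as two distinct ports of call.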

## Key issues in maritime transport

Maritime transport is a highly globalized industry in terms of operation and management. In ship operation, ocean-going vessels sail on the high seas from an origin port in one country/region to a destination port in another country/region. In ship management, the parties responsible for ship ownership, crewing, and operation may be located in different countries and regions. Even the country of registration, i.e., the ship’s flag state, may have no direct link with a ship’s activities, as the ship may not frequently visit the ports of its flag state. For landlocked countries such as Mongolia, the ships registered under their flags never visit their ports. This complex and fragmented nature of the shipping industry makes it hard to control and regulate international shipping activities, and thus poses a danger to maritime safety, the marine environment, and the crew and cargoes carried by ocean-going vessels.

Shipping is one of the world’s most dangerous industries due to the complex and ever-changing environment at sea, the dangerous goods carried, and the difficulties of search and rescue. Safety at sea is always given the highest priority in ship operation and management. It is widely believed that the most effective and efficient way of improving safety at sea is to develop international regulations that are followed by all shipping nations [1]. From the mid-19th century onward, several nations called for a unified and permanent international body for regulation and supervision, and these hopes were realized when the International Maritime Organization (IMO, originally named the Inter-Governmental Maritime Consultative Organization) was established at an international conference held in Geneva in 1948. Through the efforts of all parties, the members of the IMO met for the first time in 1959, one year after the IMO convention came into force. The IMO’s task was to adopt a new version of the most important convention on maritime safety, i.e., the International Convention for the Safety of Life at Sea, which specifies minimum safety standards for ship construction, equipment, and operation. The convention covers a comprehensive range of shipping safety aspects, including vessel construction, fire safety, life-saving arrangements, radio communications, navigation safety, cargo carriage, the transport of dangerous goods, the mandatory International Safety Management (ISM) Code, verification of compliance, and measures for specific ships, and it is constantly amended [2]. The Maritime Safety Committee is responsible for every aspect of maritime safety and security, and it is the highest technical body of the IMO.

## Intuition and Main Results

Consider first the training error $E_{\text {train }}$ defined in (5.3). Since
$$\operatorname{tr} \mathbf{Y} \mathbf{Q}^2(\gamma) \mathbf{Y}^{\boldsymbol{\top}}=-\frac{\partial}{\partial \gamma} \operatorname{tr} \mathbf{Y} \mathbf{Q}(\gamma) \mathbf{Y}^{\top},$$
a deterministic equivalent for the resolvent $\mathbf{Q}(\gamma)$ is sufficient to access the asymptotic behavior of $E_{\text {train }}$.
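The trace identity follows from differentiating the resolvent with respect to $\gamma$. As a short worked step, write $\mathbf{Q}(\gamma)=\left(\mathbf{M}+\gamma \mathbf{I}_n\right)^{-1}$, where $\mathbf{M}$ is an auxiliary symbol introduced here for the $\gamma$-independent matrix inside the resolvent. Then

$$\frac{\partial}{\partial \gamma} \mathbf{Q}(\gamma)=-\mathbf{Q}(\gamma) \frac{\partial\left(\mathbf{M}+\gamma \mathbf{I}_n\right)}{\partial \gamma} \mathbf{Q}(\gamma)=-\mathbf{Q}^2(\gamma) \quad \Rightarrow \quad -\frac{\partial}{\partial \gamma} \operatorname{tr} \mathbf{Y} \mathbf{Q}(\gamma) \mathbf{Y}^{\top}=\operatorname{tr} \mathbf{Y} \mathbf{Q}^2(\gamma) \mathbf{Y}^{\top},$$

using the standard rule $\partial \mathbf{B}^{-1}=-\mathbf{B}^{-1}(\partial \mathbf{B}) \mathbf{B}^{-1}$ and the linearity of the trace.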
With a linear activation $\sigma(t)=t$, the resolvent of interest
$$\mathbf{Q}(\gamma)=\left(\frac{1}{n} \sigma(\mathbf{W X})^{\top} \sigma(\mathbf{W} \mathbf{X})+\gamma \mathbf{I}_n\right)^{-1}$$
is the same as in Theorem 2.6. In a sense, the evaluation of $\mathbf{Q}(\gamma)$ (and subsequently $E_{\text {train }}$) calls for an extension of Theorem 2.6 to handle the case of nonlinear activations. Recall now that the main ingredients to derive a deterministic equivalent for (the linear case) $\mathbf{Q}=\left(\mathbf{X}^{\top} \mathbf{W}^{\top} \mathbf{W} \mathbf{X} / n+\gamma \mathbf{I}_n\right)^{-1}$ are that (i) $\mathbf{X}^{\top} \mathbf{W}^{\top}$ has i.i.d. columns and (ii) its $i$th column $\left[\mathbf{W}^{\top}\right]_i$ has i.i.d. (or linearly dependent) entries so that the key Lemma 2.11 applies. These hold, in the linear case, due to the i.i.d. property of the entries of $\mathbf{W}$. However, while for Item (i) the nonlinear $\Sigma^{\top}=\sigma(\mathbf{W X})^{\top}$ still has i.i.d. columns, for Item (ii) its $i$th column $\sigma\left(\left[\mathbf{X}^{\top} \mathbf{W}^{\top}\right]_{\cdot i}\right)$ no longer has i.i.d. or linearly dependent entries. Therefore, the main technical difficulty here is to obtain a nonlinear version of the trace lemma, Lemma 2.11. That is, we expect that the concentration of quadratic forms around their expectation remains valid despite the application of the entry-wise nonlinear $\sigma$. This naturally falls within the scope of the concentration of measure theory discussed in Section 2.7 and is given by the following lemma.

Lemma 5.1 (Concentration of nonlinear quadratic form, Louart et al. [2018, Lemma 1]). For $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$, 1-Lipschitz $\sigma(\cdot)$, and $\mathbf{A} \in \mathbb{R}^{n \times n}$, $\mathbf{X} \in \mathbb{R}^{p \times n}$ such that $\|\mathbf{A}\| \leq 1$ and $\|\mathbf{X}\|$ is bounded with respect to $p, n$, we have
$$\mathbb{P}\left(\left|\frac{1}{n} \sigma\left(\mathbf{w}^{\top} \mathbf{X}\right) \mathbf{A} \sigma\left(\mathbf{X}^{\top} \mathbf{w}\right)-\frac{1}{n} \operatorname{tr} \mathbf{A} \mathbf{K}\right|>t\right) \leq C e^{-c n \min \left(t, t^2\right)}$$
for some $C, c>0$ and $p / n \in(0, \infty)$, with
$$\mathbf{K} \equiv \mathbf{K}_{\mathbf{X X}} \equiv \mathbb{E}_{\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)}\left[\sigma\left(\mathbf{X}^{\top} \mathbf{w}\right) \sigma\left(\mathbf{w}^{\top} \mathbf{X}\right)\right] \in \mathbb{R}^{n \times n}.$$
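The statement of Lemma 5.1 can be probed with a small simulation. The sketch below makes several choices that are assumptions of this illustration and not part of the lemma: $\mathbf{A}=\mathbf{I}_n$, the ReLU activation (which is 1-Lipschitz), and data columns normalized to unit Euclidean norm. Under these choices $\mathbf{x}_i^{\top} \mathbf{w} \sim \mathcal{N}(0,1)$, so $[\mathbf{K}]_{i i}=\mathbb{E}\left[\max (g, 0)^2\right]=1 / 2$ for standard Gaussian $g$, and $\frac{1}{n} \operatorname{tr} \mathbf{A} \mathbf{K}=1 / 2$ exactly.

```python
import math
import random


def relu(t: float) -> float:
    return max(t, 0.0)


def random_unit_columns(p: int, n: int, rng: random.Random):
    """Return n columns of dimension p, each normalized to unit Euclidean norm."""
    columns = []
    for _ in range(n):
        col = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in col))
        columns.append([c / norm for c in col])
    return columns


def quadratic_form(columns, w) -> float:
    """(1/n) * sigma(w^T X) A sigma(X^T w) with A = I_n and sigma = ReLU."""
    n = len(columns)
    return sum(
        relu(sum(xk * wk for xk, wk in zip(col, w))) ** 2 for col in columns
    ) / n


def deviation(p: int, n: int, rng: random.Random) -> float:
    """Absolute gap between one draw of the quadratic form and (1/n) tr(A K) = 0.5."""
    columns = random_unit_columns(p, n, rng)
    w = [rng.gauss(0.0, 1.0) for _ in range(p)]
    return abs(quadratic_form(columns, w) - 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    for size in (25, 100, 400):
        # p = n here, so p / n stays fixed while both dimensions grow
        print(f"n = p = {size:3d}: deviation = {deviation(size, size, rng):.4f}")
```

A single random draw is used at each size, so the printed gaps fluctuate from run to run with a different seed; the lemma only asserts that large gaps become exponentially unlikely as $n$ grows.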

## Consequences for Learning with Large Neural Networks

To validate the asymptotic analysis in Theorem 5.1 and Corollary 5.1 on real-world data, Figures 5.2 and 5.3 compare the empirical MSEs with their limiting behavior predicted in Corollary 5.1, for a random network of $N=512$ neurons and various types of Lipschitz and non-Lipschitz activations $\sigma(\cdot)$, respectively. The regressor $\boldsymbol{\beta} \in \mathbb{R}^p$ maps the vectorized images from the Fashion-MNIST dataset (classes 1 and 2) [Xiao et al., 2017] to their corresponding uni-dimensional ($d=1$) output labels $\mathbf{Y}_{1 i}, \hat{\mathbf{Y}}_{1 j} \in\{\pm 1\}$. For $n, p, N$ of the order of a few hundred (so not very large when compared to typical modern neural network dimensions), a close match between theory and practice is observed for the Lipschitz activations in Figure 5.2. The match is less precise but still quite good for the non-Lipschitz activations in Figure 5.3 (here $\sigma(t)=1-t^2 / 2$, $\sigma(t)=1_{t>0}$, and $\sigma(t)=\operatorname{sign}(t)$), which, we recall, are formally not supported by the theorem statement. For all activations, the deviation from theory is more acute for small values of the regularization $\gamma$.

Figures 5.2 and 5.3 confirm that while the training error is a monotonically increasing function of the regularization parameter $\gamma$, there always exists an optimal value of $\gamma$ that minimizes the test error. In particular, the theoretical formulas derived in Corollary 5.1 allow for a (data-dependent) fast offline tuning of the hyperparameter $\gamma$ of the network, in the setting where $n, p, N$ are not too small and comparable. In terms of activation functions (those listed here), we observe that, on the Fashion-MNIST dataset, the ReLU nonlinearity $\sigma(t)=\max (t, 0)$ is optimal and achieves the minimum test error, while the quadratic activation $\sigma(t)=1-t^2 / 2$ is the worst and produces much higher training and test errors than the others. This observation will be theoretically explained through a deeper analysis of the corresponding kernel matrix $\mathbf{K}$, as performed in Section 5.1.2. Lastly, although not immediate at first sight, the training and test error curves of $\sigma(t)=1_{t>0}$ and $\sigma(t)=\operatorname{sign}(t)$ are indeed the same, up to a shift in $\gamma$, as a consequence of the fact that $\operatorname{sign}(t)=2 \cdot 1_{t>0}-1$.

## Random Neural Networks

Although much less popular than modern deep neural networks, neural networks with random fixed weights are simpler to analyze. Such networks have frequently arisen in the past decades as an appropriate solution to handle a possibly limited amount of training data and to reduce the computational and memory complexity; from another viewpoint, they can be seen as efficient random feature extractors. These neural networks in fact find their roots in Rosenblatt’s perceptron [Rosenblatt, 1958] and have since been revisited, rediscovered, and analyzed many times in a number of works, both in their feedforward [Schmidt et al., 1992] and recurrent [Gelenbe, 1993] versions. The simplest modern versions of these random networks are the so-called extreme learning machine [Huang et al., 2012] for the feedforward case, which may be seen as a mere linear regression method on nonlinear random features, and the echo state network [Jaeger, 2001] for the recurrent case. See also Scardapane and Wang [2017] for a more exhaustive overview of randomness in neural networks.

It is also to be noted that deep neural networks are initialized at random and that random operations (such as random node deletions or voluntarily not learning a large proportion of randomly initialized neural network weights, that is, random dropout) are common and efficient in neural network learning [Srivastava et al., 2014, Frankle and Carbin, 2019]. We may also point to the recent endeavor toward neural network “learning without backpropagation,” which, inspired by biological neural networks (which naturally do not operate backpropagation learning), proposes learning mechanisms with fixed random backward weights and asymmetric forward learning procedures [Lillicrap et al., 2016, Nøkland, 2016, Baldi et al., 2018, Frenkel et al., 2019, Han et al., 2019]. As such, the study of random neural network structures may be instrumental to a better future understanding and design of advanced neural network structures.

As shall be seen subsequently, the simple models of random neural networks are to a large extent connected to kernel matrices. More specifically, the classification or regression performance at the output of these random neural networks is a functional of random matrices that fall into the wide class of kernel random matrices, yet of a slightly different form than those studied in Section 4. Perhaps more surprisingly, this connection still exists for deep neural networks that are (i) randomly initialized and (ii) then trained with gradient descent, via the so-called neural tangent kernel [Jacot et al., 2018], by considering the “infinitely many neurons” limit, that is, the limit where the network widths of all layers go to infinity simultaneously. This close connection between neural networks and kernels has triggered a renewed interest in the theoretical investigation of deep neural networks from various perspectives, including optimization [Du et al., 2019, Chizat et al., 2019], generalization [Allen-Zhu et al., 2019, Arora et al., 2019a, Bietti and Mairal, 2019], and learning dynamics [Lee et al., 2020, Advani et al., 2020, Liao and Couillet, 2018a]. These works shed new light on our theoretical understanding of deep neural network models and specifically demonstrate the significance of studying simple networks with random weights and their associated kernels to assess the intrinsic mechanisms of more elaborate and practical deep networks.

## Regression with Random Neural Networks

Throughout this section, we consider a feedforward single-hidden-layer neural network, as illustrated in Figure $5.1$ (displayed, for notational convenience, from right to left). A similar class of single-hidden-layer neural network models, however with a recurrent structure, will be discussed later in Section 5.3.

Given input data $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right] \in \mathbb{R}^{p \times n}$, we denote by $\Sigma \equiv \sigma(\mathbf{W} \mathbf{X}) \in \mathbb{R}^{N \times n}$ the output of the first layer comprising $N$ neurons. This output arises from the premultiplication of $\mathbf{X}$ by some random weight matrix $\mathbf{W} \in \mathbb{R}^{N \times p}$ with i.i.d. (say standard Gaussian) entries and the entry-wise application of the nonlinear activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$. As such, the columns $\sigma\left(\mathbf{W x}_i\right)$ of $\Sigma$ can be seen as random nonlinear features of $\mathbf{x}_i$. The second layer weight $\boldsymbol{\beta} \in \mathbb{R}^{N \times d}$ is then learned to adapt the feature matrix $\Sigma$ to some associated target $\mathbf{Y}=\left[\mathbf{y}_1, \ldots, \mathbf{y}_n\right] \in \mathbb{R}^{d \times n}$, for instance, by minimizing the Frobenius norm $\left\|\mathbf{Y}-\boldsymbol{\beta}^{\top} \Sigma\right\|_F^2$.

Remark 5.1 (Random neural networks, random feature maps and random kernels). The columns of $\Sigma$ may be seen as the output of the $\mathbb{R}^p \rightarrow \mathbb{R}^N$ random feature map $\phi: \mathbf{x}_i \mapsto \sigma\left(\mathbf{W} \mathbf{x}_i\right)$ for some given $\mathbf{W} \in \mathbb{R}^{N \times p}$. In Rahimi and Recht [2008], it is shown that, for every nonnegative definite “shift-invariant” kernel of the form $(\mathbf{x}, \mathbf{y}) \mapsto f\left(\|\mathbf{x}-\mathbf{y}\|^2\right)$, there exist appropriate choices for $\sigma$ and the law of the entries of $\mathbf{W}$ so that, as the number of neurons or random features $N \rightarrow \infty$,
$$\sigma\left(\mathbf{W} \mathbf{x}_i\right)^{\top} \sigma\left(\mathbf{W} \mathbf{x}_j\right) \stackrel{\text { a.s. }}{\longrightarrow} f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right) .$$
As such, for large enough $N$ (that in general must scale with $n, p$), the bivariate function $(\mathbf{x}, \mathbf{y}) \mapsto \sigma(\mathbf{W} \mathbf{x})^{\top} \sigma(\mathbf{W y})$ approximates a kernel function of the type $f\left(\|\mathbf{x}-\mathbf{y}\|^2\right)$ studied in Chapter 4. This result is then generalized, in subsequent works, to a larger family of kernels including inner-product kernels [Kar and Karnick, 2012], additive homogeneous kernels [Vedaldi and Zisserman, 2012], etc. Another, possibly more marginal, connection with the previous sections is that $\sigma\left(\mathbf{w}^{\top} \mathbf{x}\right)$ can be interpreted as a “properly scaling” inner-product kernel function applied to the “data” pair $\mathbf{w}, \mathbf{x} \in \mathbb{R}^p$. This technically induces another strong relation between the study of kernels and that of neural networks.
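One concrete instance of this convergence is the random Fourier feature construction for the Gaussian kernel. The sketch below is an illustration under stated assumptions rather than the exact setting of the remark: it uses a pair of activations ($\cos$ and $\sin$) instead of a single $\sigma$, standard Gaussian rows of $\mathbf{W}$, and an explicit $1 / \sqrt{N}$ scaling of the features. With these choices the inner product of the features equals $\frac{1}{N} \sum_k \cos \left(\mathbf{w}_k^{\top}(\mathbf{x}-\mathbf{y})\right)$, which converges to $f\left(\|\mathbf{x}-\mathbf{y}\|^2\right)=\exp \left(-\|\mathbf{x}-\mathbf{y}\|^2 / 2\right)$.

```python
import math
import random


def random_fourier_features(x, W):
    """Map x to [cos(w_k^T x), sin(w_k^T x)]_k scaled by 1/sqrt(N)."""
    scale = 1.0 / math.sqrt(len(W))
    features = []
    for w in W:
        projection = sum(wk * xk for wk, xk in zip(w, x))
        features.append(scale * math.cos(projection))
        features.append(scale * math.sin(projection))
    return features


def gaussian_kernel(x, y) -> float:
    """Target kernel f(||x - y||^2) = exp(-||x - y||^2 / 2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / 2.0)


def feature_inner_product(x, y, W) -> float:
    fx = random_fourier_features(x, W)
    fy = random_fourier_features(y, W)
    return sum(a * b for a, b in zip(fx, fy))


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.3, -0.2, 0.5]   # hypothetical data points in R^3
    y = [-0.1, 0.4, 0.2]
    for N in (10, 100, 1000, 10000):
        W = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(N)]
        approx = feature_inner_product(x, y, W)
        print(f"N = {N:5d}: features = {approx:.4f}, kernel = {gaussian_kernel(x, y):.4f}")
```

The approximation error for a single pair of points shrinks roughly like $1 / \sqrt{N}$, which is an entry-wise statement only.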
Again, similar to the concentration of (Euclidean) distance extensively explored in this chapter, the entry-wise convergence in (5.1) does not imply convergence in the operator norm sense, which, as we shall see, leads directly to the so-called “double descent” test curve in random feature/neural network models. If the network output weight matrix $\boldsymbol{\beta}$ is designed to minimize the regularized MSE $L(\boldsymbol{\beta})=\frac{1}{n} \sum_{i=1}^n\left\|\mathbf{y}_i-\boldsymbol{\beta}^{\top} \sigma\left(\mathbf{W} \mathbf{x}_i\right)\right\|^2+\gamma\|\boldsymbol{\beta}\|_F^2$, for some regularization parameter $\gamma>0$, then $\boldsymbol{\beta}$ takes the explicit form of a ridge-regressor
$$\boldsymbol{\beta} \equiv \frac{1}{n} \Sigma\left(\frac{1}{n} \Sigma^{\top} \Sigma+\gamma \mathbf{I}_n\right)^{-1} \mathbf{Y}^{\top},$$
which follows from differentiating $L(\boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ to obtain $0=\gamma \boldsymbol{\beta}+\frac{1}{n} \Sigma\left(\Sigma^{\top} \boldsymbol{\beta}-\mathbf{Y}^{\top}\right)$ so that $\left(\frac{1}{n} \Sigma \Sigma^{\top}+\gamma \mathbf{I}_N\right) \boldsymbol{\beta}=\frac{1}{n} \Sigma \mathbf{Y}^{\top}$ which, along with $\left(\frac{1}{n} \Sigma \Sigma^{\top}+\gamma \mathbf{I}_N\right)^{-1} \Sigma=\Sigma\left(\frac{1}{n} \Sigma^{\top} \Sigma+\gamma \mathbf{I}_n\right)^{-1}$ for $\gamma>0$, gives the result.
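
The ridge solution above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the activation (`tanh`), the dimensions, the seeds, and the helper name `random_feature_ridge` are all illustrative assumptions. It computes $\boldsymbol{\beta}$ both from the $N \times N$ system obtained by differentiation and from the $n \times n$ closed form, which should coincide for $\gamma>0$.

```python
import numpy as np

def random_feature_ridge(X, Y, N, gamma, sigma=np.tanh, seed=0):
    """Fit the second-layer weights of a random-feature network (illustrative helper).

    X: (p, n) data, Y: (d, n) targets, N: number of random features.
    Returns W, Sigma and the two algebraically equivalent forms of beta.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.standard_normal((N, p))        # i.i.d. standard Gaussian first-layer weights
    S = sigma(W @ X)                       # Sigma = sigma(W X), shape (N, n)
    # N x N form: (Sigma Sigma^T / n + gamma I_N) beta = Sigma Y^T / n
    beta_N = np.linalg.solve(S @ S.T / n + gamma * np.eye(N), S @ Y.T / n)
    # n x n form: beta = (1/n) Sigma (Sigma^T Sigma / n + gamma I_n)^{-1} Y^T
    beta_n = S @ np.linalg.solve(S.T @ S / n + gamma * np.eye(n), Y.T) / n
    return W, S, beta_N, beta_n

# Toy sizes chosen only for illustration
rng = np.random.default_rng(1)
p, n, d, N, gamma = 5, 40, 2, 60, 0.1
X = rng.standard_normal((p, n))
Y = rng.standard_normal((d, n))
W, S, beta_N, beta_n = random_feature_ridge(X, Y, N, gamma)

# First-order condition from the text: gamma * beta + Sigma (Sigma^T beta - Y^T) / n = 0
grad = gamma * beta_N + S @ (S.T @ beta_N - Y.T) / n
print(np.allclose(beta_N, beta_n), np.abs(grad).max())
```

In practice one would solve whichever system is smaller: the $N \times N$ one when $N<n$, the $n \times n$ one otherwise.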

## Concluding Remarks

Before the present chapter, the first part of the book was mostly concerned with the sample covariance matrix model $\mathbf{X} \mathbf{X}^{\top} / n$ (and more marginally with the Wigner model $\mathbf{X} / \sqrt{n}$ for symmetric $\mathbf{X}$ ), where the columns of $\mathbf{X}$ are independent and the entries of each column are independent or linearly dependent. Historically, this model and its numerous variations (with a variance profile, with right-side correlation, summed up to other independent matrices of the same form, etc.) have covered most of the mathematical and applied interest of the first two decades (since the early nineties) of intense random matrix advances. The main drivers for these early developments were statistics, signal processing, and wireless communications. The present chapter leaped much further by considering random matrix models with possibly highly correlated entries, with a specific focus on kernel matrices. When (moderately) large-dimensional data are considered, the intuition and theoretical understanding of kernel matrices in the small-dimensional setting are no longer accurate; random matrix theory then provides accurate (and asymptotically exact) performance assessment along with the possibility to largely improve the performance of kernel-based machine learning methods. This, in effect, creates a small revolution in our understanding of machine learning on realistic large datasets.

A first important finding of the analysis of large-dimensional kernel statistics reported here is the ubiquitous character of the Marčenko-Pastur and semicircle laws. As a matter of fact, all random matrix models studied in this chapter, and in particular the kernel regimes $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / p\right)$ (which concentrate around $f(0)$ ) and $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / \sqrt{p}\right)$ (which tend to $f(\mathcal{N}(0,1))$ ), have a limiting eigenvalue distribution akin to a combination of the two laws. This combination may vary from case to case (compare for instance the results of Practical Lecture 3 to Theorem 4.4), but is often parametrized in such a way that the Marčenko-Pastur and semicircle laws appear as limiting cases (in the context of Practical Lecture 3, they correspond to the limiting cases of dense versus sparse kernels, and in Theorem 4.4 to the limiting cases of linear versus “purely” nonlinear kernels).

## Practical Course Material

This section discusses Practical Lecture 3, related to the present Chapter 4, which evaluates the spectral behavior of uniformly sparsified kernels. As for the $\alpha$-$\beta$ and properly scaling kernels of Sections 4.2.4 and 4.3, we shall see that, depending on the “level of sparsity,” a combination of the Marčenko-Pastur and semicircle laws is observed.
Practical Lecture Material 3 (Complexity-performance trade-off in spectral clustering with sparse kernel, Zarrouk et al. [2020]). In this exercise, we study the spectrum of a “punctured” version $\mathbf{K}=\mathbf{B} \odot\left(\mathbf{X}^{\top} \mathbf{X} / p\right)$ (with the Hadamard product $[\mathbf{A} \odot \mathbf{B}]_{i j}=[\mathbf{A}]_{i j}[\mathbf{B}]_{i j}$ ) of the linear kernel $\mathbf{X}^{\top} \mathbf{X} / p$, with data matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$ and a symmetric random mask matrix $\mathbf{B} \in\{0,1\}^{n \times n}$ having independent $[\mathbf{B}]_{i j} \sim \operatorname{Bern}(\epsilon)$ entries for $i \neq j$ (up to symmetry) and $[\mathbf{B}]_{i i}=b \in\{0,1\}$ fixed, in the limit $p, n \rightarrow \infty$ with $p / n \rightarrow c \in(0, \infty)$. This matrix mimics the computation of only a proportion $\epsilon \in(0,1)$ of the entries of $\mathbf{X}^{\top} \mathbf{X} / p$, and its impact on spectral clustering. Letting $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right]$ with $\mathbf{x}_i$ independently and uniformly drawn from the following symmetric two-class Gaussian mixture
$$\mathcal{C}_1: \mathbf{x}_i \sim \mathcal{N}\left(-\boldsymbol{\mu}, \mathbf{I}_p\right), \quad \mathcal{C}_2: \mathbf{x}_i \sim \mathcal{N}\left(+\boldsymbol{\mu}, \mathbf{I}_p\right)$$
for $\boldsymbol{\mu} \in \mathbb{R}^p$ such that $\|\boldsymbol{\mu}\|=O(1)$ with respect to $n, p$, we wish to study the effect of a uniform “zeroing out” of the entries of $\mathbf{X}^{\top} \mathbf{X}$ on the presence of an isolated spike in the spectrum of $\mathbf{K}$, and thus on the spectral clustering performance.

We will study the spectrum of $\mathbf{K}$ using Stein’s lemma and the Gaussian method discussed in Section 2.2.2. Let $\mathbf{Z}=\left[\mathbf{z}_1, \ldots, \mathbf{z}_n\right] \in \mathbb{R}^{p \times n}$ for $\mathbf{z}_i=\mathbf{x}_i-(-1)^a \boldsymbol{\mu} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$ with $\mathbf{x}_i \in \mathcal{C}_a$ and $\mathbf{M}=\boldsymbol{\mu} \mathbf{j}^{\top}$ with $\mathbf{j}=\left[-\mathbf{1}_{n / 2}, \mathbf{1}_{n / 2}\right]^{\top} \in \mathbb{R}^n$ so that $\mathbf{X}=\mathbf{M}+\mathbf{Z}$. First show that, for $\mathbf{Q} \equiv \mathbf{Q}(z)=\left(\mathbf{K}-z \mathbf{I}_n\right)^{-1}$,
$$\begin{aligned} \mathbf{Q}= & -\frac{1}{z} \mathbf{I}_n+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{M}}{p} \odot \mathbf{B}\right) \mathbf{Q} \\ & +\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{M}}{p} \odot \mathbf{B}\right) \mathbf{Q} . \end{aligned}$$
To proceed, we need to go slightly beyond the study of these four terms.
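
The resolvent identity of this exercise follows from $(\mathbf{K}-z \mathbf{I}_n) \mathbf{Q}=\mathbf{I}_n$ together with the expansion of $\mathbf{X}^{\top} \mathbf{X}=(\mathbf{M}+\mathbf{Z})^{\top}(\mathbf{M}+\mathbf{Z})$ into four terms. As a sanity check, here is a small NumPy sketch that builds the punctured kernel and verifies the identity numerically. The sizes, the sparsity level `eps`, the diagonal value `b`, the choice of $\boldsymbol{\mu}$ and the point `z` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps, b = 200, 100, 0.3, 1          # illustrative sizes and sparsity level

mu = np.zeros(p)
mu[0] = 2.0                              # ||mu|| = O(1)
j = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
M = np.outer(mu, j)                      # M = mu j^T: class means -mu and +mu
Z = rng.standard_normal((p, n))
X = M + Z

# Symmetric Bernoulli(eps) mask with fixed diagonal entries b
U = np.triu(rng.random((n, n)) < eps, k=1).astype(float)
B = U + U.T + b * np.eye(n)

K = B * (X.T @ X / p)                    # Hadamard product B ⊙ (X^T X / p)
z = -1.0 + 0.5j                          # non-real z, hence outside the spectrum of K
Q = np.linalg.inv(K - z * np.eye(n))     # resolvent Q(z)

# Right-hand side of the identity: -I/z + (1/z) * sum of the four masked terms times Q
terms = [Z.T @ Z, Z.T @ M, M.T @ Z, M.T @ M]
rhs = -np.eye(n) / z + sum((B * (T / p)) @ Q for T in terms) / z
print(np.allclose(Q, rhs))
```

The check is purely algebraic: it holds for any mask and any $z$ outside the spectrum, which is why the statistical analysis must go beyond these four terms.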

## Distance and Inner-Product Random Kernel Matrices

The most widely used kernel model in machine learning applications is the heat kernel $\mathbf{K}=\left\{\exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / 2 \sigma^2\right)\right\}_{i, j=1}^n$, for some $\sigma>0$. It is thus natural to start the large-dimensional analysis of kernel random matrices by focusing on this model.
As mentioned in the previous sections, for the Gaussian mixture model above, as the dimension $p$ increases, $\sigma^2$ needs to scale as $O(p)$, say $\sigma^2=\tilde{\sigma}^2 p$ for some $\tilde{\sigma}^2=O(1)$, to avoid evaluating the exponential at increasingly large values when $p$ is large. As such, the prototypical kernel of present interest is
$$\mathbf{K}=\left\{f\left(\frac{1}{p}\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)\right\}_{i, j=1}^n,$$
for $f$ a sufficiently smooth function (specifically, $f(t)=\exp \left(-t / 2 \tilde{\sigma}^2\right)$ for the heat kernel). As we will see though, it is highly desirable not to restrict ourselves to $f(t)=\exp \left(-t / 2 \tilde{\sigma}^2\right)$, so as to better appreciate the impact of the nonlinear kernel function $f$ on the (asymptotic) structural behavior of the kernel matrix $\mathbf{K}$.

## Euclidean Random Matrices with Equal Covariances

In order to get a first picture of the large-dimensional behavior of $\mathbf{K}$, let us first develop the distance $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p$ for $\mathbf{x}_i \in \mathcal{C}_a$ and $\mathbf{x}_j \in \mathcal{C}_b$, with $i \neq j$.

For simplicity, let us assume for the moment $\mathbf{C}_1=\cdots=\mathbf{C}_k=\mathbf{I}_p$ and recall the notation $\mathbf{x}_i=\boldsymbol{\mu}_a+\mathbf{z}_i$. We have, for $i \neq j$, that “entry-wise,”
$$\begin{aligned} \frac{1}{p}\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2= & \frac{1}{p}\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2+\frac{2}{p}\left(\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right)^{\top}\left(\mathbf{z}_i-\mathbf{z}_j\right) \\ & +\frac{1}{p}\left\|\mathbf{z}_i\right\|^2+\frac{1}{p}\left\|\mathbf{z}_j\right\|^2-\frac{2}{p} \mathbf{z}_i^{\top} \mathbf{z}_j . \end{aligned}$$
For $\left\|\mathbf{x}_i\right\|$ of order $O(\sqrt{p})$, if $\left\|\boldsymbol{\mu}_a\right\|=O(\sqrt{p})$ for all $a \in\{1, \ldots, k\}$ (which would be natural), then $\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2 / p$ is a priori of order $O(1)$ while, by the central limit theorem, $\left\|\mathbf{z}_i\right\|^2 / p=1+O\left(p^{-1 / 2}\right)$. Also, again by the central limit theorem, $\mathbf{z}_i^{\top} \mathbf{z}_j / p=O\left(p^{-1 / 2}\right)$ and $\left(\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right)^{\top}\left(\mathbf{z}_i-\mathbf{z}_j\right) / p=O\left(p^{-1 / 2}\right)$.

As a consequence, for $p$ large, the distance $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p$ is dominated by $\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|^2 / p+2$ and easily discriminates classes from the pairwise observations of $\mathbf{x}_i, \mathbf{x}_j$, making the classification asymptotically trivial (without having to resort to any kernel method). It is thus of interest to consider situations where the class distances are less significant, in order to understand how the choice of kernel comes into play in such a more practical scenario. To this end, we now demand that $$\left\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\right\|=O(1),$$ which is also the minimal distance rate that can be discriminated from a mere Bayesian inference analysis, as thoroughly discussed in Section 1.1.3. Since the kernel function $f(\cdot)$ operates only on the distances $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|$, we may even request (up to centering all data by, say, the constant vector $\frac{1}{n} \sum_{a=1}^k n_a \boldsymbol{\mu}_a$ ) for simplicity that $\left\|\boldsymbol{\mu}_a\right\|=O(1)$ for each $a$.
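
To make the nontrivial regime $\|\boldsymbol{\mu}_a-\boldsymbol{\mu}_b\|=O(1)$ concrete, the following NumPy sketch draws a two-class mixture with identity covariances and compares within-class and between-class values of $\|\mathbf{x}_i-\mathbf{x}_j\|^2 / p$. The function name, the class separation `delta`, the sample size and the seed are illustrative assumptions. One would expect both groups of distances to concentrate around $2$, with a spread shrinking like $p^{-1 / 2}$, so that pairwise distances alone no longer reveal the classes.

```python
import numpy as np

def scaled_sq_distances(p, n=200, delta=2.0, seed=0):
    """Return ||x_i - x_j||^2 / p for within-class and between-class pairs.

    Two classes N(-mu, I_p) and N(+mu, I_p) with ||mu_1 - mu_2|| = delta,
    kept fixed as p grows (hypothetical toy setting).
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = delta / 2
    labels = np.repeat([0, 1], n // 2)
    signs = np.where(labels == 0, -1.0, 1.0)
    X = rng.standard_normal((p, n)) + mu[:, None] * signs[None, :]   # columns are x_i
    G = X.T @ X
    sq = np.diag(G)
    D = (sq[:, None] + sq[None, :] - 2 * G) / p     # all pairwise squared distances / p
    iu = np.triu_indices(n, k=1)                    # pairs with i < j
    same = (labels[:, None] == labels[None, :])[iu]
    return D[iu][same], D[iu][~same]

for p in (100, 1600):
    within, between = scaled_sq_distances(p)
    print(f"p={p:5d}  within mean={within.mean():.3f}  "
          f"between mean={between.mean():.3f}  spread={within.std():.3f}")
```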

## The Nontrivial Growth Rates

In classical large-$n$-only asymptotic statistics, laws of large numbers demand a scaling by $1 / n$ of the summed observations. When centered, central limit theorems then occur after multiplication of the average by $\sqrt{n}$. A similar requirement is needed when we now consider that the dimension $p$ of the data is also large. In particular, we will demand that the norm of each observation remains bounded. Assuming $\mathbf{x} \in \mathbb{R}^p$ is a vector of bounded entries, that is, each of order $O(1)$ with respect to $p$, the natural normalization is typically $\mathbf{x} / \sqrt{p}$.

In the context of kernel methods, for data $\mathbf{x}_1, \ldots, \mathbf{x}_n$, one wishes that the argument of $f(\cdot)$ in the inner-product kernel $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j\right)$ or the distance kernel $f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)$ be of order $O(1)$, when $f$ is assumed independent of $p$.

The “correct” scaling however appears not to be so immediate. Letting $\mathbf{x}_i$ have entries of order $O(1)$, one naturally has that $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2=\left\|\mathbf{x}_i\right\|^2+\left\|\mathbf{x}_j\right\|^2-2 \mathbf{x}_i^{\top} \mathbf{x}_j=O(p)$ and it thus appears natural to scale $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2$ by $1 / p$. Similarly, if the norm of the mean $\left\|\mathbb{E}\left[\mathbf{x}_i\right]\right\|$ of $\mathbf{x}_i$ has the same order of magnitude as $\left\|\mathbf{x}_i\right\|$ itself (as it should in general), then for $\mathbf{x}_i, \mathbf{x}_j$ independent, $\mathbb{E}\left[\mathbf{x}_i^{\top} \mathbf{x}_j\right]=O(p)$. So again, one should scale the inner-product also by $1 / p$, to obtain kernel matrices of the type $$\mathbf{K}=\left\{f\left(\frac{1}{p}\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)\right\}_{i, j=1}^n, \text { and }\left\{f\left(\frac{1}{p} \mathbf{x}_i^{\top} \mathbf{x}_j\right)\right\}_{i, j=1}^n .$$
Section 4.2 (and most applications thereafter) will be placed under these kernel forms. The most commonly used Gaussian kernel matrix, defined as $\mathbf{K}=\left\{\exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / 2 \sigma^2\right)\right\}_{i, j=1}^n$, falls into this family as one usually demands that $\sigma^2 \sim \mathbb{E}\left[\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right]$ (to avoid evaluating the exponential close to zero or infinity).

However, as already demonstrated in Section 1.1.3, if $n$ scales like $p$, then, for the classification problem to be asymptotically nontrivial, the difference $\left\|\mathbb{E}\left[\mathbf{x}_i\right]-\mathbb{E}\left[\mathbf{x}_j\right]\right\|^2$ needs to scale like $O(1)$ rather than $O(p)$ (otherwise data classes would be too easy to cluster for all large $n, p$ ), resulting in $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p$ possibly converging to a constant value irrespective of the data classes (of $\mathbf{x}_i$ and $\mathbf{x}_j$ ), with a typical “spread” of order $O(1 / \sqrt{p})$. Similarly, up to re-centering, $\mathbf{x}_i^{\top} \mathbf{x}_j / p$ scales like $O(1 / \sqrt{p})$ rather than $O(1)$. As such, it seems more appropriate to normalize the kernel matrix entries as $$[\mathbf{K}]_{i j}=f\left(\frac{\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2}{\sqrt{p}}-\frac{1}{n(n-1)} \sum_{i^{\prime}, j^{\prime}} \frac{\left\|\mathbf{x}_{i^{\prime}}-\mathbf{x}_{j^{\prime}}\right\|^2}{\sqrt{p}}\right), \text { or }[\mathbf{K}]_{i j}=f\left(\frac{1}{\sqrt{p}} \mathbf{x}_i^{\top} \mathbf{x}_j\right)$$
in order here to avoid evaluating $f$ essentially at a single value (equal to zero for the inner-product kernel or equal to the average “common” limiting intra-data distance for the distance kernel).

This “properly scaling” setting is in fact much richer than the $1 / p$ normalization when $n, p$ are of the same order of magnitude. Sections 4.2.4 and 4.3 elaborate on this scenario.
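
The difference between the two normalizations can be seen on a toy example. The sketch below, assuming centered data with i.i.d. standard Gaussian entries (an illustrative choice, as are the function name, sizes and seed), measures the spread of the off-diagonal inner products $\mathbf{x}_i^{\top} \mathbf{x}_j$ under the $1 / p$ and $1 / \sqrt{p}$ scalings. Under $1 / p$ the arguments of $f$ collapse toward zero as $p$ grows, whereas under $1 / \sqrt{p}$ they keep an $O(1)$ spread.

```python
import numpy as np

def inner_product_spread(p, n=300, seed=0):
    """Std of the off-diagonal x_i^T x_j under 1/p and 1/sqrt(p) scaling (toy check)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((p, n))            # centered data with O(1) entries
    G = X.T @ X
    off = G[np.triu_indices(n, k=1)]           # entries with i < j
    return np.std(off / p), np.std(off / np.sqrt(p))

for p in (100, 400, 1600):
    s_p, s_sqrt = inner_product_spread(p)
    print(f"p={p:5d}  std under 1/p: {s_p:.4f}   std under 1/sqrt(p): {s_sqrt:.4f}")
```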

## Statistical Data Model

In the remainder of the section, we assume the observation of $n$ independent data vectors from a total of $k$ classes gathered as $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right] \in \mathbb{R}^{p \times n}$, where $$\begin{array}{c} \mathbf{x}_1, \ldots, \mathbf{x}_{n_1} \sim \mathcal{N}\left(\boldsymbol{\mu}_1, \mathbf{C}_1\right) \\ \vdots \\ \mathbf{x}_{n-n_k+1}, \ldots, \mathbf{x}_n \sim \mathcal{N}\left(\boldsymbol{\mu}_k, \mathbf{C}_k\right), \end{array}$$ which is a $k$-class Gaussian mixture model (GMM) with a fixed cardinality $n_1, \ldots, n_k$ in each class. The fact that the data are indexed according to classes simplifies the notation but has no practical consequence in the analysis. We will denote $\mathcal{C}_a$ the class number “$a$,” so in particular $$\mathbf{x}_i \sim \mathcal{N}\left(\boldsymbol{\mu}_a, \mathbf{C}_a\right) \Leftrightarrow \mathbf{x}_i \in \mathcal{C}_a$$ for $a \in\{1, \ldots, k\}$, and will use for convenience the matrix $$\mathbf{J}=\left[\mathbf{j}_1, \ldots, \mathbf{j}_k\right] \in \mathbb{R}^{n \times k}, \quad \mathbf{j}_a=[\underbrace{0, \ldots, 0}_{n_1+\ldots+n_{a-1}}, \underbrace{1, \ldots, 1}_{n_a}, \underbrace{0, \ldots, 0}_{n_{a+1}+\ldots+n_k}]^{\top},$$
which is the indicator matrix of the class labels ($\mathbf{J}$ is a priori known under a supervised learning setting and is to be fully or partially recovered under a semi-supervised or unsupervised learning setting).

We shall systematically make the following simplifying growth rate assumption for $p, n$ and $n_1, \ldots, n_k$.

Assumption 1 (Growth rate of data size and number). As $n \rightarrow \infty, p / n \rightarrow c \in(0, \infty)$ and $n_a / n \rightarrow c_a \in(0,1)$.

This assumption, in particular, implies that each class is “large” in the sense that its cardinality increases with $n$.

In accordance with the discussions in Chapter 2, from a random matrix “universality” perspective, the Gaussian mixture assumption will often (yet not always) turn out to be equivalent to demanding that
$$\mathbf{x}_i \in \mathcal{C}_a: \mathbf{x}_i=\boldsymbol{\mu}_a+\mathbf{C}_a^{\frac{1}{2}} \mathbf{z}_i$$
with $\mathbf{z}_i \in \mathbb{R}^p$ a random vector with i.i.d. entries of zero mean, unit variance, and bounded higher-order (e.g., fourth) moments.

This hypothesis is indeed quite restrictive as it imposes that the data, up to centering and linear scaling, are composed of i.i.d. entries. Equivalently, only data which result from affine transformations of vectors with i.i.d. entries can be studied, whereas “real data” are deemed much more complex.

Exploring the notion of concentrated random vectors introduced in Section 2.7, Chapter 8 will open up this discussion by showing that a much larger class of (statistical) data models embrace the same asymptotic statistics, and that most results discussed in the present section apply identically to broader models of data irreducible to vectors of independent entries.
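
The data model of this section, $\mathbf{x}_i=\boldsymbol{\mu}_a+\mathbf{C}_a^{1 / 2} \mathbf{z}_i$ with classes stored in order, together with the indicator matrix $\mathbf{J}$, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `sample_gmm`, the toy means, covariances and class sizes are hypothetical, and $\mathbf{z}_i$ is taken Gaussian here although the universality discussion allows other i.i.d. entries.

```python
import numpy as np

def sample_gmm(means, covs, sizes, seed=0):
    """Draw X = [x_1, ..., x_n] in R^{p x n} with x_i = mu_a + C_a^{1/2} z_i.

    Classes are stored in order (first n_1 columns from class 1, and so on).
    Also returns the n x k class-indicator matrix J.
    """
    rng = np.random.default_rng(seed)
    p, n, k = len(means[0]), sum(sizes), len(sizes)
    X = np.empty((p, n))
    J = np.zeros((n, k))
    start = 0
    for a, (mu, C, n_a) in enumerate(zip(means, covs, sizes)):
        # Symmetric square root of C_a from its eigendecomposition
        w, V = np.linalg.eigh(C)
        C_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
        Z_a = rng.standard_normal((p, n_a))      # i.i.d. zero-mean, unit-variance entries
        X[:, start:start + n_a] = mu[:, None] + C_half @ Z_a
        J[start:start + n_a, a] = 1.0            # j_a has ones on the rows of class a
        start += n_a
    return X, J

# Toy example: k = 2 classes in dimension p = 4
p = 4
means = [np.zeros(p), np.ones(p)]
covs = [np.eye(p), 2.0 * np.eye(p)]
sizes = [3, 5]
X, J = sample_gmm(means, covs, sizes)
print(X.shape, J.sum(axis=0))
```

Note that $\mathbf{J}^{\top} \mathbf{J}$ is the diagonal matrix of class sizes, which is convenient when normalizing class-wise averages.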

## Kernel Methods

In a broad sense, kernel methods are at the core of many, if not most, machine learning algorithms [Schölkopf and Smola, 2018]. Given a set of data $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p$, most learning mechanisms rely on extracting the structural data information from direct or indirect pairwise comparisons $\kappa\left(\mathbf{x}_i, \mathbf{x}_j\right)$ for some affinity metric $\kappa(\cdot, \cdot)$. Gathered in an $n \times n$ matrix $$\mathbf{K}=\left\{\kappa\left(\mathbf{x}_i, \mathbf{x}_j\right)\right\}_{i, j=1}^n$$
the “cumulative” effect of these comparisons for numerous $(n \gg 1)$ data is at the source of various supervised, semi-supervised, or unsupervised methods such as support vector machines, graph Laplacian-based learning, kernel spectral clustering, and has deep connections to neural networks.

These applications will be thoroughly discussed in Section 4.4. For the moment though, our main interest lies in the spectral characterization of the kernel matrix $\mathbf{K}$ itself for various (classical) choices of affinity functions $\kappa$ and for various statistical models of the data $\mathbf{x}_i$.

Clearly, from a purely machine learning perspective, the choice of the affinity function $\kappa(\cdot, \cdot)$ is central to a good performance of the learning method under study. Since real data in general have highly complex structures, a typical viewpoint is to assume that the data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are not directly comparable in their ambient space but that there exists a convenient feature extraction function $\phi: \mathbb{R}^p \rightarrow \mathbb{R}^q$ $(q \in \mathbb{N} \cup\{+\infty\})$ such that $\phi\left(\mathbf{x}_i\right)$ and $\phi\left(\mathbf{x}_j\right)$ are more amenable to comparison. Otherwise stated, in the image of $\phi(\cdot)$, the data are more “linear” (or more “linearly separable” if one seeks to group the data in affinity classes). The simplest affinity function between $\mathbf{x}_i$ and $\mathbf{x}_j$ would in this case be $\kappa\left(\mathbf{x}_i, \mathbf{x}_j\right)=\phi\left(\mathbf{x}_i\right)^{\top} \phi\left(\mathbf{x}_j\right)$.

Since $q$ may be larger (if not much larger) than $p$, the mere cost of evaluating $\phi\left(\mathbf{x}_i\right)^{\top} \phi\left(\mathbf{x}_j\right)$ can be deleterious to practical implementation. The so-called kernel trick is anchored in the remark that, for a certain class of such functions $\phi$, $\phi\left(\mathbf{x}_i\right)^{\top} \phi\left(\mathbf{x}_j\right)=f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)$ or $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j\right)$ for some function $f: \mathbb{R} \rightarrow \mathbb{R}$. It thus suffices to evaluate $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2$ or $\mathbf{x}_i^{\top} \mathbf{x}_j$ in the ambient space and then apply $f$ in an entry-wise manner to evaluate all data affinities, leading to more practically convenient methods.

Although the class of such functions $f$ is inherently restricted by the need for a mapping $\phi$ to exist such that, say, $\phi\left(\mathbf{x}_i\right)^{\top} \phi\left(\mathbf{x}_j\right)=f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)$ for all possible $\mathbf{x}_i, \mathbf{x}_j$ pairs (these are sometimes called Mercer kernel functions), with time, practitioners have started to use arbitrary functions $f$ and worked with generic kernel matrices of the form
$$\mathbf{K}=\left\{f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2\right)\right\}_{i, j=1}^n, \quad \text { or } \quad \mathbf{K}=\left\{f\left(\mathbf{x}_i^{\top} \mathbf{x}_j\right)\right\}_{i, j=1}^n,$$
irrespective of the actual form or even the existence of an underlying feature extraction function $\phi$. There is, in particular, empirical evidence showing that well-chosen “indefinite” (i.e., non-Mercer type) kernels, which are not associated with a mapping $\phi$, can sometimes outperform conventional nonnegative definite kernels that satisfy Mercer’s condition [Haasdonk, 2005, Luss and D’Aspremont, 2008].
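
As a small illustration of the kernel trick described in this section, consider the inner-product kernel with $f(t)=t^2$, for which an explicit feature map is known: $\phi(\mathbf{x})$ collects all $p^2$ products $x_k x_l$, so that $\phi(\mathbf{x})^{\top} \phi(\mathbf{y})=(\mathbf{x}^{\top} \mathbf{y})^2$. The sketch below (the choice of $f$, the sizes and the seed are assumptions picked for illustration) compares the kernel matrix built in the feature space of dimension $q=p^2$ with the one obtained by applying $f$ entry-wise to the Gram matrix in the ambient space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for f(t) = t^2: all p^2 products x_k * x_l."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
p, n = 6, 10
X = rng.standard_normal((p, n))                 # columns are the data points x_i

# Feature-space route: map each x_i to R^{p^2}, then take inner products
Phi = np.stack([phi(X[:, i]) for i in range(n)], axis=1)    # shape (p^2, n)
K_explicit = Phi.T @ Phi

# Kernel-trick route: Gram matrix in the ambient space, then f entry-wise
K_trick = (X.T @ X) ** 2

print(np.allclose(K_explicit, K_trick))
# This f is a Mercer kernel, so K is nonnegative definite (up to rounding)
print(np.linalg.eigvalsh(K_trick).min() > -1e-9)
```

The second route costs $O(n^2 p)$ operations instead of $O(n^2 p^2)$, and the gap widens for feature maps of higher (or infinite) dimension.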

## Basic Setting

As pointed out in Remark 4.1, and as shall become evident from the coming analysis, the small-dimensional intuition according to which $f$ should be a nonincreasing “valid” Mercer function becomes rather meaningless when dealing with large-dimensional data, essentially due to the “curse of dimensionality” and the concentration phenomenon in high dimensions.

To fully capture this aspect, a first important consideration is, as already mentioned in Section 1.1.3, to deal with “nontrivial” relative growth rates of the statistical data parameters with respect to the dimensions $p, n$. By nontrivial, we mean that the underlying classification or regression problem for which the kernel method is designed should neither be impossible nor trivially easy to solve as $p, n \rightarrow \infty$. The reason behind this request is fundamental, and also departs from many research works in machine learning which, instead, seek to prove that the method under study performs perfectly in the limit of large $n$ (with $p$ fixed in general). Here, we rather wish to account for the fact that, at finite but large $p, n$, the machine learning methods of practical interest are those which have nontrivial performances; thus, in what follows, “$n, p \rightarrow \infty$ in nontrivial growth rates” should really be understood as “$n, p$ are both large and the problem at hand is neither trivially easy nor impossibly hard to solve.”

In this section, we will mostly focus on the use of kernel methods for classification, and thus the nontrivial settings are given in terms of the growth rate of the “distance” between (the statistics of) data classes. It will appear, in particular, that the very definition of the appropriate growth rates to ensure the nontrivial character of a machine learning problem to be solved through kernel methods depends on the kernel design itself, and that flagship kernels such as the Gaussian kernel $\kappa\left(\mathbf{x}_i, \mathbf{x}_j\right)=\exp \left(-\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / 2 \sigma^2\right)$ are in general quite suboptimal.

## Explaining Kernel Methods with Random Matrix Theory

The fundamental reason behind this surprising behavior lies in the accumulated effect of the $n / 2$ small “hidden” informative terms $\|\boldsymbol{\mu}\|^2, \operatorname{tr} \mathbf{E}$ and $\operatorname{tr}\left(\mathbf{E}^2\right)$ in each class, which collectively “steer” the several top eigenvectors of $\mathbf{K}$. More explicitly, we shall see in the course of this book that the Gaussian kernel matrix $\mathbf{K}$ can be asymptotically expanded as
$$\mathbf{K}=\exp (-1)\left(\mathbf{1}_n \mathbf{1}_n^{\top}+\frac{1}{p} \mathbf{Z}^{\top} \mathbf{Z}\right)+f(\boldsymbol{\mu}, \mathbf{E}) \cdot \frac{1}{p} \mathbf{j} \mathbf{j}^{\top}+*+o_{\|\cdot\|}(1),$$
where $\mathbf{Z}=\left[\mathbf{z}_1, \ldots, \mathbf{z}_n\right] \in \mathbb{R}^{p \times n}$ is a Gaussian noise matrix, $f(\boldsymbol{\mu}, \mathbf{E})=O(1)$, and $\mathbf{j}=\left[\mathbf{1}_{n / 2} ;-\mathbf{1}_{n / 2}\right]$ is the class-information “label” vector (as in the setting of Figure 1.2). Here “$*$” symbolizes extra terms of marginal importance to the present discussion, and $o_{\|\cdot\|}(1)$ represents terms of asymptotically vanishing operator norm as $n, p \rightarrow \infty$. The important remark to be made here is that
(i) Under this description, $[\mathbf{K}]_{i j}=\exp (-1)\left(1+\mathbf{z}_i^{\top} \mathbf{z}_j / p\right) \pm f(\boldsymbol{\mu}, \mathbf{E}) / p+*$, with $f(\boldsymbol{\mu}, \mathbf{E}) / p \ll \mathbf{z}_i^{\top} \mathbf{z}_j / p=O\left(p^{-1 / 2}\right)$; this is consistent with our previous discussion: The statistical information is entry-wise dominated by noise.
(ii) From a spectral viewpoint, $\left\|\mathbf{Z}^{\top} \mathbf{Z} / p\right\|=O(1)$, as per the Marčenko-Pastur theorem [Marčenko and Pastur, 1967] discussed in Section 1.1.2 and visually confirmed in Figure 1.1, while $\left\|f(\boldsymbol{\mu}, \mathbf{E}) \cdot \mathbf{j} \mathbf{j}^{\top} / p\right\|=O(1)$: Thus, spectrum-wise, the information stands on even ground with noise.

The mathematical magic at play here lies in $f(\boldsymbol{\mu}, \mathbf{E}) \cdot \mathbf{j} \mathbf{j}^{\top} / p$ having entries of order $O\left(p^{-1}\right)$ while being a low-rank (here unit-rank) matrix: All its “energy” concentrates in a single nonzero eigenvalue. As for $\mathbf{Z}^{\top} \mathbf{Z} / p$, with larger $O\left(p^{-1 / 2}\right)$ amplitude entries, it is composed of “essentially independent” zero-mean random variables and tends to be of full rank and spreads its energy over its $n$ eigenvalues. Spectrum-wise, both $f(\boldsymbol{\mu}, \mathbf{E}) \cdot \mathbf{j} \mathbf{j}^{\top} / p$ and $\mathbf{Z}^{\top} \mathbf{Z} / p$ meet on even ground under the nontrivial classification setting of (1.7).
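
The contrast between entry-wise and spectral orders of magnitude can be observed directly. In the NumPy sketch below, $f(\boldsymbol{\mu}, \mathbf{E})$ is simply set to $1$, and the sizes and seed are arbitrary illustrative choices. The rank-one “signal” $\mathbf{j} \mathbf{j}^{\top} / p$ has entries of size exactly $1 / p$, much smaller than the typical $p^{-1 / 2}$ off-diagonal entries of the “noise” $\mathbf{Z}^{\top} \mathbf{Z} / p$, yet both matrices have operator norms of order one.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400                                   # illustrative sizes with n / p = 2
Z = rng.standard_normal((p, n))
j = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

noise = Z.T @ Z / p                               # full-rank noise term
signal = np.outer(j, j) / p                       # unit-rank term, taking f(mu, E) = 1

# Entry-wise comparison: signal entries are exactly 1/p, noise entries about p^{-1/2}
max_entry_signal = np.abs(signal).max()
typical_entry_noise = noise[np.triu_indices(n, k=1)].std()

# Spectral comparison: both operator norms are O(1)
norm_signal = np.linalg.norm(signal, 2)           # equals ||j||^2 / p = n / p
norm_noise = np.linalg.norm(noise, 2)             # close to (1 + sqrt(n / p))^2

print(f"entries: signal {max_entry_signal:.4f} vs noise {typical_entry_noise:.4f}")
print(f"norms:   signal {norm_signal:.3f} vs noise {norm_noise:.3f}")
```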

We shall see in Section 4 that things are actually not as clear-cut and, in particular, that not all choices of kernel functions can achieve the same nontrivial classification rates. In particular, the popular Gaussian (radial basis function [RBF]) kernel will be shown to be largely suboptimal in this respect.

## 计算机代写|机器学习代写machine learning代考|Random Matrix Theory as an Answer

Random matrix theory originates from the work of John Wishart [Wishart, 1928] on the study of the eigenvalues of the matrix $\mathbf{X} \mathbf{X}^{\top}$ (now referred to as a Wishart matrix) for $\mathbf{X} \in \mathbb{R}^{p \times n}$ with standard Gaussian entries $[\mathbf{X}]_{i j} \sim \mathcal{N}(0,1)$. Wishart managed to determine a closed-form expression for the joint eigenvalue distribution of $\mathbf{X X ^ { \top }}$ for every pair of $p, n$. Few progress however followed, as matrices with non-Gaussian entries are hardly amenable to similar analysis and, even if they were, the actual study of more elaborate functionals of $\mathbf{X X}^{\top}$ is at best cumbersome and often simply intractable.

The works of the physicist Eugene Wigner [Wigner, 1955] gave a new impulse to the theory. Interested in the eigenvalues of symmetric matrices $\mathbf{X} \in \mathbb{R}^{n \times n}$ with independent Bernoulli entries (particle spins in his application context), Wigner opted for an asymptotic analysis of the eigenvalue distribution, thereby initiating the important and much richer branch of large-dimensional random matrix theory. Despite this important inspiration, Wigner exploited standard asymptotic statistics tools (the method of moments) to prove that the discrete distribution of the eigenvalues of $\mathbf{X}$ has a continuous semicircle looking density in the $n \rightarrow \infty$ limit (the now popular semicircular law). This approach was particularly convenient as the limiting law is simple and could be visually anticipated (which is not the case of the next-to-come Marčenko-Pastur limiting distribution of Wishart matrices).

It was not until 1967, with the tour de force of Marčenko and Pastur [1967], that random matrix theory took on a new dimension. Marčenko and Pastur determined the limiting spectral distribution of Wishart's sample covariance matrix model $\mathbf{X} \mathbf{X}^{\top}$, but under relaxed conditions: the $[\mathbf{X}]_{ij}$ are independent entries with zero mean and unit variance, subject to additional moment assumptions (all discarded in subsequent works). The independence (or weak dependence) property is key to their proof, which exploits the powerful Stieltjes transform $\frac{1}{p} \operatorname{tr}\left(\frac{1}{n} \mathbf{X} \mathbf{X}^{\top}-z \mathbf{I}_p\right)^{-1}=\int(t-z)^{-1} \mu_p(dt)$ of the empirical spectral distribution $\mu_p \equiv \frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i\left(\frac{1}{n} \mathbf{X} \mathbf{X}^{\top}\right)}$ of $\frac{1}{n} \mathbf{X} \mathbf{X}^{\top}$, a tool borrowed from operator theory in Hilbert spaces [Akhiezer and Glazman, 2013], rather than the moments $\frac{1}{p} \operatorname{tr}\left(\frac{1}{n} \mathbf{X} \mathbf{X}^{\top}\right)^k$ (which may not converge since $\mathbb{E}\left[[\mathbf{X}]_{ij}^{\ell}\right]$ need not be finite for $\ell>2$).
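The identity between the normalized trace of the resolvent and the integral against $\mu_p$ can be verified directly on a finite matrix. The following is a minimal sketch, assuming Gaussian entries and arbitrary illustrative values for `p`, `n`, and the evaluation point `z` (none of which come from the text); it checks only the identity itself, not the Marčenko-Pastur limit.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400                  # illustrative dimensions (assumption)
X = rng.standard_normal((p, n))  # independent, zero-mean, unit-variance entries
S = X @ X.T / n                  # sample covariance matrix (1/n) X X^T

z = 0.5 + 0.1j                   # any point off the real axis (assumption)

# Left-hand side: (1/p) tr (S - z I_p)^{-1}
m_resolvent = np.trace(np.linalg.inv(S - z * np.eye(p))) / p

# Right-hand side: integral of 1/(t - z) against the empirical spectral
# distribution, i.e. the average of 1/(lambda_i - z) over the eigenvalues of S
eigvals = np.linalg.eigvalsh(S)
m_spectrum = np.mean(1.0 / (eigvals - z))

print(abs(m_resolvent - m_spectrum))
```

The two quantities agree up to floating-point error, which is why the Stieltjes transform can be studied through the resolvent without ever computing moments.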

The technical approach devised by Marčenko and Pastur was then largely embraced at the turn of the twenty-first century by Bai and Silverstein who, in a series of significant breakthroughs (the most notable of which are [Silverstein and Bai, 1995, Bai and Silverstein, 1998]), extended the results in [Marčenko and Pastur, 1967] to an exhaustive study of sample covariance matrices.


## 计算机代写|机器学习代写machine learning代考|Explaining Kernel Methods with Random Matrix Theory

$$\mathbf{K}=\exp (-1)\left(\mathbf{1}_n \mathbf{1}_n^{\top}+\frac{1}{p} \mathbf{Z}^{\top} \mathbf{Z}\right)+f(\boldsymbol{\mu}, \mathbf{E}) \cdot \frac{1}{p} \mathbf{j} \mathbf{j}^{\top}+*+o_{\|\cdot\|}(1)$$

where $\mathbf{Z}=\left[\mathbf{z}_1, \ldots, \mathbf{z}_n\right] \in \mathbb{R}^{p \times n}$ is a Gaussian noise matrix, $f(\boldsymbol{\mu}, \mathbf{E})=O(1)$, and $\mathbf{j}=\left[\mathbf{1}_{n / 2} ;-\mathbf{1}_{n / 2}\right]$ is the class-information "label" vector (as in the setting of Figure 1.2). Here "$*$" denotes extra terms that are unimportant to the present discussion, and $o_{\|\cdot\|}(1)$ denotes terms whose operator norm vanishes asymptotically as $n, p \rightarrow \infty$. The important points to make here are the following.

(i) Under this description, $[\mathbf{K}]_{ij}=\exp (-1)\left(1+\mathbf{z}_i^{\top} \mathbf{z}_j / p\right) \pm f(\boldsymbol{\mu}, \mathbf{E}) / p+*$, with $f(\boldsymbol{\mu}, \mathbf{E}) / p \ll \mathbf{z}_i^{\top} \mathbf{z}_j / p=O\left(p^{-1 / 2}\right)$. This is consistent with our previous discussion: entry-wise, the statistical information is dominated by noise.

(ii) From a spectral viewpoint, $\left\|\mathbf{Z}^{\top} \mathbf{Z} / p\right\|=O(1)$, as follows from the Marčenko-Pastur theorem [Marčenko and Pastur, 1967] discussed in Section 1.1.2 and visually confirmed in Figure 1.1, while $\left\|f(\boldsymbol{\mu}, \mathbf{E}) \cdot \mathbf{j} \mathbf{j}^{\top} / p\right\|=O(1)$. Spectrum-wise, therefore, the information is on a par with the noise.
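Points (i) and (ii) can be illustrated with a small numerical experiment. This is a hedged sketch rather than the construction used in the text: it takes $f(\boldsymbol{\mu}, \mathbf{E})=1$ as a placeholder, ignores the factor $\exp(-1)$ and the "$*$" terms, and uses arbitrary sizes `p = 256`, `n = 512`.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 256
n = 2 * p  # fixed ratio n/p = 2 (assumption)

Z = rng.standard_normal((p, n))                          # Gaussian noise matrix
j = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])  # class "label" vector

noise = Z.T @ Z / p        # noise term Z^T Z / p
info = np.outer(j, j) / p  # information term j j^T / p, with f = 1 (placeholder)

# (i) Entry-wise comparison: typical off-diagonal noise entry vs. information entry
off_diag = noise[~np.eye(n, dtype=bool)]
noise_entry = np.sqrt(np.mean(off_diag**2))  # of order p^{-1/2}
info_entry = np.abs(info).max()              # exactly 1/p

# (ii) Spectral comparison: operator (spectral) norms of both terms
noise_norm = np.linalg.norm(noise, 2)
info_norm = np.linalg.norm(info, 2)          # equals n/p for this rank-one matrix

print(noise_entry, info_entry)
print(noise_norm, info_norm)
```

Entry-wise, the information term is smaller than the noise by a factor of order $\sqrt{p}$, whereas the two operator norms are of the same order, which is the contrast drawn in (i) and (ii).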


