## Concluding Remarks

Before the present chapter, the first part of the book was mostly concerned with the sample covariance matrix model $\mathbf{X} \mathbf{X}^{\top} / n$ (and, more marginally, with the Wigner model $\mathbf{X} / \sqrt{n}$ for symmetric $\mathbf{X}$), where the columns of $\mathbf{X}$ are independent and the entries of each column are independent or linearly dependent. Historically, this model and its numerous variations (with a variance profile, with right-side correlation, summed with other independent matrices of the same form, etc.) attracted most of the mathematical and applied interest during the first two decades (since the early nineties) of intense random matrix advances. The main drivers of these early developments were statistics, signal processing, and wireless communications. The present chapter went much further by considering random matrix models with possibly highly correlated entries, with a specific focus on kernel matrices. For (moderately) large-dimensional data, the intuition and theoretical understanding of kernel matrices gained in the small-dimensional setting are no longer accurate. Random matrix theory then provides accurate (and asymptotically exact) performance assessment, along with the possibility of largely improving the performance of kernel-based machine learning methods. This, in effect, creates a small revolution in our understanding of machine learning on realistic large datasets.

A first important finding of the analysis of large-dimensional kernel statistics reported here is the ubiquitous character of the Marčenko-Pastur and semicircle laws. As a matter of fact, all random matrix models studied in this chapter, and in particular the kernel regimes $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / p\right)$ (which concentrates around $f(0)$) and $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / \sqrt{p}\right)$ (which tends to $f(\mathcal{N}(0,1))$), have a limiting eigenvalue distribution akin to a combination of the two laws. This combination may vary from case to case (compare for instance the results of Practical Lecture 3 to Theorem 4.4), but is often parametrized in such a way that the Marčenko-Pastur and semicircle laws appear as limiting cases (in the context of Practical Lecture 3, they correspond to the limiting cases of dense versus sparse kernels, and in Theorem 4.4 to the limiting cases of linear versus "purely" nonlinear kernels).
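The two scaling regimes mentioned above can be illustrated numerically. The following is a minimal sketch, assuming standard Gaussian data $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p)$ and arbitrarily chosen sizes `p` and `n` (neither is prescribed by the text): for $i \neq j$, the argument $\mathbf{x}_i^{\top} \mathbf{x}_j / p$ has fluctuations of order $1/\sqrt{p}$ and thus concentrates around $0$, while $\mathbf{x}_i^{\top} \mathbf{x}_j / \sqrt{p}$ keeps fluctuations of order one.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the text).
rng = np.random.default_rng(0)
p, n = 2000, 200

# Columns x_i ~ N(0, I_p), stored as a p x n data matrix.
X = rng.standard_normal((p, n))
G = X.T @ X  # Gram matrix of inner products x_i^T x_j

# Keep only the off-diagonal entries (i != j).
off_diag = ~np.eye(n, dtype=bool)
std_p = (G / p)[off_diag].std()                # expected to be close to 1/sqrt(p)
std_sqrt_p = (G / np.sqrt(p))[off_diag].std()  # expected to be close to 1

print(f"std of x_i^T x_j / p       : {std_p:.4f} (1/sqrt(p) = {1 / np.sqrt(p):.4f})")
print(f"std of x_i^T x_j / sqrt(p) : {std_sqrt_p:.4f}")
```

In the first regime, $f$ is only probed in a shrinking neighborhood of $0$, which is why its local behavior there drives the limiting spectrum; in the second regime, $f$ is probed over the whole range of a standard Gaussian variable.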

## Practical Course Material

This section discusses Practical Lecture 3, related to the present Chapter 4, which evaluates the spectral behavior of uniformly sparsified kernels. As for the $\alpha$-$\beta$ and properly scaling kernels of Sections 4.2.4 and 4.3, we shall see that a combination of the Marčenko-Pastur and semicircle laws is observed, depending on the "level of sparsity."

Practical Lecture Material 3 (Complexity-performance trade-off in spectral clustering with sparse kernel, Zarrouk et al. [2020]). In this exercise, we study the spectrum of a "punctured" version $\mathbf{K}=\mathbf{B} \odot\left(\mathbf{X}^{\top} \mathbf{X} / p\right)$ (with the Hadamard product $[\mathbf{A} \odot \mathbf{B}]_{i j}=[\mathbf{A}]_{i j}[\mathbf{B}]_{i j}$) of the linear kernel $\mathbf{X}^{\top} \mathbf{X} / p$, with data matrix $\mathbf{X} \in \mathbb{R}^{p \times n}$ and a symmetric random mask matrix $\mathbf{B} \in\{0,1\}^{n \times n}$ having independent $[\mathbf{B}]_{i j} \sim \operatorname{Bern}(\epsilon)$ entries for $i \neq j$ (up to symmetry) and $[\mathbf{B}]_{i i}=b \in\{0,1\}$ fixed, in the limit $p, n \rightarrow \infty$ with $p / n \rightarrow c \in(0, \infty)$. This matrix mimics the computation of only a proportion $\epsilon \in(0,1)$ of the entries of $\mathbf{X}^{\top} \mathbf{X} / p$, and its impact on spectral clustering. Letting $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right]$ with $\mathbf{x}_i$ independently and uniformly drawn from the following symmetric two-class Gaussian mixture
$$\mathcal{C}_1: \mathbf{x}_i \sim \mathcal{N}\left(-\boldsymbol{\mu}, \mathbf{I}_p\right), \quad \mathcal{C}_2: \mathbf{x}_i \sim \mathcal{N}\left(+\boldsymbol{\mu}, \mathbf{I}_p\right)$$
for $\boldsymbol{\mu} \in \mathbb{R}^p$ such that $\|\boldsymbol{\mu}\|=O(1)$ with respect to $n, p$, we wish to study the effect of a uniform "zeroing out" of the entries of $\mathbf{X}^{\top} \mathbf{X}$ on the presence of an isolated spike in the spectrum of $\mathbf{K}$, and thus on the spectral clustering performance.
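The setting above can be simulated directly. The sketch below is only illustrative and rests on several assumptions not fixed by the text: the helper name `punctured_kernel`, the sizes `p` and `n`, the values of $\epsilon$ and $\|\boldsymbol{\mu}\|$, the choice $b=0$, the direction of $\boldsymbol{\mu}$, and classes of exactly equal size ordered as $\mathcal{C}_1$ then $\mathcal{C}_2$. It builds $\mathbf{K}$, then checks whether the largest eigenvalue separates from the rest of the spectrum and whether its eigenvector aligns with the class labels.

```python
import numpy as np

def punctured_kernel(p, n, eps, mu_norm, b=0, seed=0):
    """Hypothetical helper: build K = B * (X^T X / p) for the two-class mixture."""
    rng = np.random.default_rng(seed)
    # Mean vector with ||mu|| = mu_norm; its direction is an arbitrary choice.
    mu = np.zeros(p)
    mu[0] = mu_norm
    # Labels: -1 for C_1 (mean -mu), +1 for C_2 (mean +mu); balanced and ordered.
    j = np.concatenate([-np.ones(n // 2), np.ones(n - n // 2)])
    X = np.outer(mu, j) + rng.standard_normal((p, n))  # X = M + Z
    # Symmetric Bernoulli(eps) mask with constant diagonal b.
    upper = np.triu(rng.random((n, n)) < eps, k=1)
    B = (upper | upper.T).astype(float)
    np.fill_diagonal(B, b)
    return B * (X.T @ X / p), j

K, j = punctured_kernel(p=400, n=800, eps=0.5, mu_norm=3.0)
eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order

top_vec = eigvecs[:, -1]
alignment = abs(top_vec @ j) / np.sqrt(len(j))  # in [0, 1]; 1 means perfect alignment
gap = eigvals[-1] - eigvals[-2]

print(f"two largest eigenvalues: {eigvals[-2]:.3f}, {eigvals[-1]:.3f} (gap {gap:.3f})")
print(f"alignment of top eigenvector with labels: {alignment:.3f}")
```

Rerunning with smaller `eps` or `mu_norm` gives an empirical feel for when the spike merges into the bulk, which is the trade-off between computational cost and clustering performance that the exercise studies.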

We will study the spectrum of $\mathbf{K}$ using Stein's lemma and the Gaussian method discussed in Section 2.2.2. Let $\mathbf{Z}=\left[\mathbf{z}_1, \ldots, \mathbf{z}_n\right] \in \mathbb{R}^{p \times n}$ for $\mathbf{z}_i=\mathbf{x}_i-(-1)^a \boldsymbol{\mu} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$ with $\mathbf{x}_i \in \mathcal{C}_a$ and $\mathbf{M}=\boldsymbol{\mu} \mathbf{j}^{\top}$ with $\mathbf{j}=\left[-\mathbf{1}_{n / 2}, \mathbf{1}_{n / 2}\right]^{\top} \in \mathbb{R}^n$ so that $\mathbf{X}=\mathbf{M}+\mathbf{Z}$. First show that, for $\mathbf{Q} \equiv \mathbf{Q}(z)=\left(\mathbf{K}-z \mathbf{I}_n\right)^{-1}$,
$$\begin{aligned} \mathbf{Q}= & -\frac{1}{z} \mathbf{I}_n+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{Z}^{\top} \mathbf{M}}{p} \odot \mathbf{B}\right) \mathbf{Q} \\ & +\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right) \mathbf{Q}+\frac{1}{z}\left(\frac{\mathbf{M}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}\right)^{\top\!\top}\!\!\!\!\!\!\!\!\phantom{\mathbf{Q}}\end{aligned}$$
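One possible route to this identity, sketched here using only the definitions above (and assuming $z \neq 0$ is not an eigenvalue of $\mathbf{K}$), is to rearrange the defining relation of the resolvent and then expand $\mathbf{X}=\mathbf{M}+\mathbf{Z}$, using the linearity of the Hadamard product:

$$\begin{aligned} \left(\mathbf{K}-z \mathbf{I}_n\right) \mathbf{Q}=\mathbf{I}_n \quad & \Longrightarrow \quad \mathbf{Q}=-\frac{1}{z} \mathbf{I}_n+\frac{1}{z} \mathbf{K} \mathbf{Q}, \\ \mathbf{K}=\frac{(\mathbf{M}+\mathbf{Z})^{\top}(\mathbf{M}+\mathbf{Z})}{p} \odot \mathbf{B} \quad & =\frac{\mathbf{Z}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}+\frac{\mathbf{Z}^{\top} \mathbf{M}}{p} \odot \mathbf{B}+\frac{\mathbf{M}^{\top} \mathbf{Z}}{p} \odot \mathbf{B}+\frac{\mathbf{M}^{\top} \mathbf{M}}{p} \odot \mathbf{B} . \end{aligned}$$

Substituting the second line into the first yields the four terms, each of the form $\frac{1}{z}(\cdot \odot \mathbf{B}) \mathbf{Q}$.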
To proceed, we need to go slightly beyond the study of these four terms.

