### Developments and Applications

## Principal Component Analysis

PCA is a data analysis technique that relies on a simple transformation of recorded observations, stored in a vector $\mathbf{z} \in \mathbb{R}^{N}$, to produce statistically uncorrelated score variables (which are also independent if the observations are Gaussian distributed), stored in $\mathbf{t} \in \mathbb{R}^{n}$, $n \leq N$:
$$\mathbf{t}=\mathbf{P}^{T} \mathbf{z}$$
Here, $\mathbf{P}$ is a transformation matrix, constructed from orthonormal column vectors. Since the first applications of PCA [21], this technique has found its way into a wide range of application areas, for example signal processing [75], factor analysis [29, 44], system identification [77], chemometrics [20, 66] and, more recently, general data mining [11, 58, 70] including image processing [17, 72] and pattern recognition [10, 47], as well as process monitoring and quality control [1, 82] including multiway [48], multiblock [52] and multiscale [3] extensions. This success is mainly related to the ability of PCA to describe the significant information/variation within the recorded data, typically by the first few score variables, which simplifies data analysis tasks accordingly.
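The claim that the first few score variables carry most of the variation can be illustrated with a minimal NumPy sketch. The synthetic data set, the choice of $n = 2$ retained components and all variable names are assumptions made for illustration only; the loading matrix is obtained here through a singular value decomposition, which is one of several possible routes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: K observations of N = 5 variables driven by n = 2 latent sources.
K, N, n = 500, 5, 2
sources = rng.normal(size=(K, n))
mixing = rng.normal(size=(n, N))
Z = sources @ mixing + 0.05 * rng.normal(size=(K, N))
Z = Z - Z.mean(axis=0)            # mean-centre each variable

# Thin SVD of the data matrix: the rows of Vt are orthonormal loading vectors.
_, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:n].T                      # N x n transformation matrix
T = Z @ P                         # row i holds the scores t_i = P^T z_i

# Fraction of the total variance described by the first n score variables.
explained = singular_values**2 / np.sum(singular_values**2)
retained_fraction = float(explained[:n].sum())
print(f"variance captured by the first {n} scores: {retained_fraction:.3f}")
```

Because the synthetic variables are driven by only two latent sources plus a small amount of noise, two score variables are expected to describe almost all of the recorded variation.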

Sylvester [67] formulated the idea behind PCA in his work on the removal of redundancy in bilinear quantics, which are polynomial expressions in which the sum of the exponents is of an order greater than 2, and Pearson [51] laid the conceptual basis for PCA by defining lines and planes in a multivariable space that present the closest fit to a given set of points. Hotelling [28] then refined this formulation to the one used today. Numerically, PCA is closely related to an eigenvector-eigenvalue decomposition of a data covariance or correlation matrix. Numerical algorithms to obtain this decomposition include the iterative NIPALS algorithm [78], which was defined similarly by Fisher and MacKenzie earlier in [80], and the singular value decomposition. Good overviews of PCA are given in Mardia et al. [45], Jolliffe [32], Wold et al. [80] and Jackson [30].

The aim of this article is to review and examine nonlinear extensions of PCA that have been proposed over the past two decades. This is an important research field, as the application of linear PCA to nonlinear data may be inadequate [49]. The first attempts to present nonlinear PCA extensions include a generalization, utilizing a nonmetric scaling, that produces a nonlinear optimization problem [42], and the construction of curves through a given cloud of points, referred to as principal curves [25]. Inspired by the fact that the reconstruction of the original variables, $\widehat{\mathbf{z}}$, is given by:
$$\widehat{\mathbf{z}}=\mathbf{P t}=\overbrace{\mathbf{P} \underbrace{\left(\mathbf{P}^{T} \mathbf{z}\right)}_{\text {mapping }}}^{\text {demapping }},$$
which comprises the determination of the score variables (mapping stage) and the determination of $\widehat{\mathbf{z}}$ (demapping stage), Kramer [37] proposed an autoassociative neural network (ANN) structure that defines the mapping and demapping stages by neural network layers. Tan and Mavrovouniotis [68] pointed out, however, that the five-layer network topology of autoassociative neural networks may be difficult to train, i.e. the network weights become more difficult to determine as the number of layers increases [27].
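For the linear case, the mapping and demapping stages can be written as two small functions. The following is only a sketch under stated assumptions: the data are synthetic, a single score variable is retained, and the function names `mapping` and `demapping` are hypothetical labels chosen to mirror the text rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: N = 3 recorded variables, n = 1 retained score variable.
direction = np.array([1.0, 2.0, -1.0])
K = 400
Z = rng.normal(size=(K, 1)) * direction + 0.1 * rng.normal(size=(K, 3))
Z = Z - Z.mean(axis=0)

# Loading matrix P from the eigenvectors of the sample covariance matrix.
S = Z.T @ Z / (K - 1)
eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigenvalues in ascending order
P = eigenvectors[:, [-1]]                       # dominant eigenvector, shape (3, 1)

def mapping(z):
    """Mapping stage: score variables t = P^T z."""
    return P.T @ z

def demapping(t):
    """Demapping stage: reconstruction z_hat = P t."""
    return P @ t

z = Z[0]
z_hat = demapping(mapping(z))
residual = z - z_hat
print("reconstruction:", z_hat)
print("residual norm :", np.linalg.norm(residual))
```

In this linear setting the residual is orthogonal to the columns of $\mathbf{P}$, and applying the two stages a second time leaves $\widehat{\mathbf{z}}$ unchanged. An autoassociative network replaces both functions by nonlinear layers.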

To reduce the network complexity, Tan and Mavrovouniotis proposed an input training (IT) network topology, which omits the mapping layer. Thus, only a three-layer network remains, where the reduced set of nonlinear principal components is obtained as part of the training procedure for establishing the IT network. Dong and McAvoy [16] introduced an alternative approach that divides the five-layer autoassociative network topology into two three-layer topologies, which, in turn, represent the nonlinear mapping and demapping functions. The outputs of the first network, that is the mapping function, are the score variables, which are determined using the principal curve approach.

## PCA Preliminaries

Let the recorded data be stored in a matrix $\mathbf{Z} \in \mathbb{R}^{K \times N}$, where $N$ and $K$ are the number of recorded variables and the number of available observations, respectively. Defining the rows and columns of $\mathbf{Z}$ by vectors $\mathbf{z}_{i} \in \mathbb{R}^{N}$ and $\boldsymbol{\zeta}_{j} \in \mathbb{R}^{K}$, respectively, $\mathbf{Z}$ can be rewritten as shown below:
$$\mathbf{Z}=\left[\begin{array}{c} \mathbf{z}_{1}^{T} \\ \mathbf{z}_{2}^{T} \\ \mathbf{z}_{3}^{T} \\ \vdots \\ \mathbf{z}_{i}^{T} \\ \vdots \\ \mathbf{z}_{K-1}^{T} \\ \mathbf{z}_{K}^{T} \end{array}\right]=\left[\begin{array}{ccccccc} \boldsymbol{\zeta}_{1} & \boldsymbol{\zeta}_{2} & \boldsymbol{\zeta}_{3} & \cdots & \boldsymbol{\zeta}_{j} & \cdots & \boldsymbol{\zeta}_{N} \end{array}\right]$$
The first and second order statistics of the original set of variables $\mathbf{z}^{T}=\left(z_{1} \; z_{2} \; z_{3} \cdots z_{j} \cdots z_{N}\right)$ are:
$$E\{\mathbf{z}\}=\mathbf{0} \quad E\left\{\mathbf{z z}^{T}\right\}=\mathbf{S}_{Z Z}$$
with the correlation matrix of $\mathbf{z}$ being defined as $\mathbf{R}_{Z Z}$.
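The quantities defined above can be estimated from a data matrix as sketched below. The data are synthetic, the names `S_zz` and `R_zz` are chosen here to mirror $\mathbf{S}_{ZZ}$ and $\mathbf{R}_{ZZ}$, and the $1/(K-1)$ normalization is an assumption, since the text defines the statistics through expectations rather than sample estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

K, N = 200, 4                                  # K observations, N variables
raw = rng.normal(loc=5.0, scale=[1.0, 2.0, 0.5, 3.0], size=(K, N))
raw[:, 1] += 1.5 * raw[:, 0]                   # introduce some correlation

Z = raw - raw.mean(axis=0)                     # enforce E{z} = 0 empirically

S_zz = Z.T @ Z / (K - 1)                       # sample covariance matrix (N x N)
std = np.sqrt(np.diag(S_zz))
R_zz = S_zz / np.outer(std, std)               # sample correlation matrix

z_3 = Z[2]                                     # third row of Z, a vector in R^N
zeta_2 = Z[:, 1]                               # second column of Z, a vector in R^K
print(S_zz.shape, z_3.shape, zeta_2.shape)
```

Rows of `Z` correspond to the observation vectors $\mathbf{z}_{i}$ and columns to the variable vectors $\boldsymbol{\zeta}_{j}$, matching the two ways of partitioning $\mathbf{Z}$.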
The PCA analysis entails the determination of a set of score variables $t_{k}$, $k \in\{1, 2, 3, \cdots, n\}$, $n \leq N$, by applying a linear transformation of $\mathbf{z}$:
$$t_{k}=\sum_{j=1}^{N} p_{k j} z_{j}$$
under the following constraint for the parameter vector $\mathbf{p}_{k}$:
$$\begin{gathered} \mathbf{p}_{k}^{T}=\left(p_{k 1} \; p_{k 2} \; p_{k 3} \cdots p_{k j} \cdots p_{k N}\right) \\ \sqrt{\sum_{j=1}^{N} p_{k j}^{2}}=\left\|\mathbf{p}_{k}\right\|_{2}=1 . \end{gathered}$$
The score variables, stored in a vector $\mathbf{t}^{T}=\left(t_{1} \; t_{2} \; t_{3} \cdots t_{k} \cdots t_{n}\right)$, $\mathbf{t} \in \mathbb{R}^{n}$, have the following first and second order statistics:
$$E\{\mathbf{t}\}=\mathbf{0} \quad E\left\{\mathbf{t t}^{T}\right\}=\boldsymbol{\Lambda},$$
where $\boldsymbol{\Lambda}$ is a diagonal matrix. An important property of PCA is that the variance of each score variable represents the following maximum:
$$\lambda_{k}=\arg \max _{\mathbf{p}_{k}}\left\{E\left\{t_{k}^{2}\right\}\right\}=\arg \max _{\mathbf{p}_{k}}\left\{E\left\{\mathbf{p}_{k}^{T} \mathbf{z z}^{T} \mathbf{p}_{k}\right\}\right\}$$

which is constrained by:
$$E\left\{\left(\begin{array}{c} t_{1} \\ t_{2} \\ t_{3} \\ \vdots \\ t_{k-1} \end{array}\right) t_{k}\right\}=\mathbf{0} \quad\left\|\mathbf{p}_{k}\right\|_{2}^{2}-1=0$$
Anderson [2] indicated that the formulation of the above constrained optimization can alternatively be written as:
$$\lambda_{k}=\arg \max _{\mathbf{p}}\left\{E\left\{\mathbf{p}^{T} \mathbf{z z}^{T} \mathbf{p}\right\}-\lambda_{k}\left(\mathbf{p}^{T} \mathbf{p}-1\right)\right\}$$
under the assumption that $\lambda_{k}$ is predetermined. Reformulating (1.11) to determine $\mathbf{p}_{k}$ gives rise to:
$$\mathbf{p}_{k}=\arg \frac{\partial}{\partial \mathbf{p}}\left\{E\left\{\mathbf{p}^{T} \mathbf{z z}^{T} \mathbf{p}\right\}-\lambda_{k}\left(\mathbf{p}^{T} \mathbf{p}-1\right)\right\}=\mathbf{0}$$
and produces
$$\mathbf{p}_{k}=\arg \left\{2 E\left\{\mathbf{z z}^{T}\right\} \mathbf{p}-2 \lambda_{k} \mathbf{p}\right\}=\mathbf{0},$$
that is, $\mathbf{p}_{k}$ is an eigenvector of $\mathbf{S}_{Z Z}=E\left\{\mathbf{z z}^{T}\right\}$ associated with the eigenvalue $\lambda_{k}$: $\mathbf{S}_{Z Z} \mathbf{p}_{k}=\lambda_{k} \mathbf{p}_{k}$.
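The stationarity condition and the resulting properties of the score variables can be checked numerically. The sketch below assumes a synthetic data set and uses `numpy.linalg.eigh` on a sample covariance matrix; the variable names are illustrative and the sample estimate only approximates the expectations in the text.

```python
import numpy as np

rng = np.random.default_rng(3)

K, N = 1000, 3
A = rng.normal(size=(N, N))                    # hypothetical mixing matrix
Z = rng.normal(size=(K, N)) @ A
Z = Z - Z.mean(axis=0)
S_zz = Z.T @ Z / (K - 1)

# Eigendecomposition of the symmetric covariance matrix, sorted by
# descending eigenvalue so that column k of P is the loading vector p_k.
eigenvalues, eigenvectors = np.linalg.eigh(S_zz)
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]
P = eigenvectors[:, order]

stationarity = S_zz @ P - P * lam              # S p_k - lambda_k p_k for every k
norms = np.linalg.norm(P, axis=0)              # each p_k should have unit length

T = Z @ P                                      # score variables for all observations
Lambda_hat = T.T @ T / (K - 1)                 # should be diagonal with entries lam

# Variance along an arbitrary unit direction cannot exceed lambda_1.
q = rng.normal(size=N)
q /= np.linalg.norm(q)
variance_along_q = q @ S_zz @ q

print("max |S p - lambda p|:", np.abs(stationarity).max())
print("score covariance diagonal:", np.diag(Lambda_hat))
```

The last check reflects the maximization property: no unit-length direction yields a larger variance than the first loading vector.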

## Nonlinearity Test for PCA Models

This section discusses how to determine whether the underlying structure within the recorded data is linear or nonlinear. Kruger et al. [38] introduced this nonlinearity test using the principle outlined in Fig. 1.1. The left plot in this figure shows that the first principal component describes the underlying linear relationship between the two variables, $z_{1}$ and $z_{2}$, while the right plot describes some basic nonlinear function, indicated by the curve.

By dividing the operating region into several disjunct regions, where the first region is centered around the origin of the coordinate system, a PCA model can be obtained from the data of each of these disjunct regions. With respect to Fig. 1.1, this would produce a total of three PCA models, one for each disjunct region, in both the linear case (left plot) and the nonlinear case (right plot). To determine whether a linear or nonlinear variable interrelationship can be extracted from the data, the principal idea is to take advantage of the residual variance in each of the regions. More precisely, accuracy bounds that are based on the residual variance are obtained for one of the PCA models, for example that of disjunct region I, and the residual variances of the remaining PCA models (for disjunct regions II and III) are benchmarked against these bounds. The test is complete once each of the PCA models has been used to determine accuracy bounds, which are then benchmarked against the residual variances of the respective remaining PCA models.
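The principle can be illustrated with a deliberately simplified sketch. It is not the test of Kruger et al. [38]: the accuracy bounds are omitted, only the model of region I is used as a reference, and the region boundaries, the two example functions, the noise level and the helper names `region_masks` and `residual_ratios` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def region_masks(z1, boundary=1.0):
    """Three disjunct regions along z1; region I is centred on the origin."""
    return [np.abs(z1) < boundary, z1 >= boundary, z1 <= -boundary]

def residual_ratios(Z):
    """Residual variance of regions II and III relative to region I,
    all measured against the one-component PCA model of region I."""
    masks = region_masks(Z[:, 0])
    Z_I = Z[masks[0]]
    mean_I = Z_I.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(Z_I, rowvar=False))
    p = vecs[:, -1]                            # first loading vector of region I

    def residual_variance(block):
        centred = block - mean_I
        residual = centred - np.outer(centred @ p, p)
        return np.mean(np.sum(residual**2, axis=1))

    reference = residual_variance(Z_I)
    return [residual_variance(Z[m]) / reference for m in masks[1:]]

K = 3000
z1 = rng.uniform(-3.0, 3.0, size=K)
noise = 0.1 * rng.normal(size=K)
Z_linear = np.column_stack([z1, 0.5 * z1 + noise])     # linear interrelationship
Z_nonlinear = np.column_stack([z1, z1**2 + noise])     # nonlinear interrelationship

linear_ratios = residual_ratios(Z_linear)
nonlinear_ratios = residual_ratios(Z_nonlinear)
print("linear   :", np.round(linear_ratios, 2))
print("nonlinear:", np.round(nonlinear_ratios, 2))
```

For the linear data the ratios stay close to one, since the residual variance does not depend on the region. For the nonlinear data the model of region I describes regions II and III poorly, so the ratios become large. In the actual test this comparison is made against statistically derived accuracy bounds and is repeated with each region serving as the reference.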

The reason for using the residual variance instead of the variance of the retained score variables is as follows. The residual variance is independent of the region if the underlying interrelationship between the original variables is linear, which the left plot in Fig. 1.1 indicates. In contrast, observations that have a larger distance from the origin of the coordinate system will, by default, produce a larger projection distance from the origin, that is, a larger score value. In this respect, observations that are associated with a disjunct region that lies further from the origin will logically produce a larger score variance, irrespective of whether the variable interrelationships are linear or nonlinear.

The detailed presentation of the nonlinearity test in the remainder of this section is structured as follows. Next, the assumptions imposed on the nonlinearity test are shown, prior to a detailed discussion of the construction of disjunct regions. Subsection 3.3 then shows how to obtain statistical confidence limits for the nondiagonal elements of the correlation matrix. This is followed by the definition of the accuracy bounds. Finally, a summary of the nonlinearity test is given and some example studies are presented to demonstrate the working of this test.


