## The Machinery

Currently, PCA is implemented even in low-level numerical software such as spreadsheets. Nevertheless, it is good to know the basics behind the computations. In almost all cases, the algorithm used to calculate the PCs is Singular Value Decomposition (SVD). It decomposes an $n \times p$ mean-centered data matrix $\boldsymbol{X}$ into three parts:
$$\boldsymbol{X}=\boldsymbol{U} \boldsymbol{D} \boldsymbol{V}^{T}$$
where $\boldsymbol{U}$ is an $n \times a$ orthonormal matrix containing the left singular vectors, $\boldsymbol{D}$ is an $a \times a$ diagonal matrix containing the singular values, and $\boldsymbol{V}$ is a $p \times a$ orthonormal matrix containing the right singular vectors. Here $a$ is the number of components, at most $\min(n, p)$. The right singular vectors are what PCA terminology calls the loadings; the product of the first two matrices forms the scores:
$$\boldsymbol{X}=(\boldsymbol{U} \boldsymbol{D}) \boldsymbol{V}^{T}=\boldsymbol{T} \boldsymbol{P}^{T}$$
The interpretation of the matrices $\boldsymbol{T}$, $\boldsymbol{P}$, $\boldsymbol{U}$, $\boldsymbol{D}$ and $\boldsymbol{V}$ is straightforward. The loadings, the columns of $\boldsymbol{P}$ (or equivalently, the right singular vectors, the columns of $\boldsymbol{V}$), give the weights of the original variables in the PCs. Variables with values close to zero in a specific column of $\boldsymbol{V}$ contribute very little to that particular latent variable. The scores, the columns of $\boldsymbol{T}$, constitute the coordinates in the space of the latent variables. Put differently: these are the coordinates of the samples as we see them from our new PCA viewpoint. The columns of $\boldsymbol{U}$ give the same coordinates in a normalized form: each has unit length, so they all have the same variance, whereas the columns of $\boldsymbol{T}$ have variances corresponding to the variances of each particular PC. These variances $\lambda_{i}$ are proportional to the squares of the diagonal elements of $\boldsymbol{D}$:
$$\lambda_{i}=d_{i}^{2} /(n-1)$$
The fraction of variance explained by PC $i$ can therefore be expressed as
$$\mathrm{FV}(i)=\lambda_{i} / \sum_{j=1}^{a} \lambda_{j}$$
One main problem in the application of PCA is the decision on how many PCs to retain; we will come back to this in Section 4.3.
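
The relations above can be checked numerically. The following is a minimal sketch on a small random matrix; the data, the seed, and all variable names are hypothetical and chosen only for illustration. It mean-centers the matrix, computes the SVD, and verifies that the scores and loadings reproduce the data, that the singular vectors are orthonormal, and that the score variances equal $d_{i}^{2}/(n-1)$.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 20
p <- 5
X.raw <- matrix(rnorm(n * p), nrow = n, ncol = p)  # hypothetical data
X <- scale(X.raw, center = TRUE, scale = FALSE)    # mean-center the columns

X.svd <- svd(X)
U <- X.svd$u
D <- diag(X.svd$d)
V <- X.svd$v

scores <- U %*% D                 # T = U D
loadings <- V                     # P = V

# The product T P^T should reproduce the centered data
recon.err <- max(abs(scores %*% t(loadings) - X))

# The columns of U and V should be orthonormal
orth.U <- max(abs(crossprod(U) - diag(p)))
orth.V <- max(abs(crossprod(V) - diag(p)))

# The variance of each score column should equal d_i^2 / (n - 1)
lambda <- X.svd$d^2 / (n - 1)
var.err <- max(abs(apply(scores, 2, var) - lambda))

# Fractions of explained variance; these should sum to one
FV <- lambda / sum(lambda)

print(c(recon.err, orth.U, orth.V, var.err))
print(round(FV, 3))
```

All four error terms should be zero up to floating-point rounding, and the fractions in `FV` decrease from the first PC to the last because `svd` returns the singular values in decreasing order.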

## Doing It Yourself

Calculating scores and loadings is easy. Consider the wine data first. To remove the effects of the different scales of the variables, we perform PCA on the autoscaled data, using the svd function provided by R:

> wines.svd <- svd(wines.sc)
> wines.scores <- wines.svd$u %*% diag(wines.svd$d)
> wines.loadings <- wines.svd$v

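
As a cross-check, the same steps can be compared with R's built-in `prcomp` function. The sketch below uses the `USArrests` data set that ships with base R as a stand-in, since the wine data may not be at hand; it is not the data analyzed in the text, and the variable names are hypothetical. Because each PC is only defined up to its sign, the comparison is made on absolute values.

```r
# Stand-in data: USArrests ships with base R; it is NOT the wine data.
X.sc <- scale(USArrests)          # autoscale: zero mean, unit variance per column

# Scores and loadings via the SVD, as in the text
X.svd <- svd(X.sc)
X.scores <- X.svd$u %*% diag(X.svd$d)
X.loadings <- X.svd$v

# The same analysis with prcomp, which centers and scales internally
X.pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# PCs are only defined up to sign, so compare absolute values
load.diff <- max(abs(abs(X.loadings) - abs(X.pca$rotation)))
score.diff <- max(abs(abs(X.scores) - abs(X.pca$x)))

# prcomp reports standard deviations, i.e. d_i / sqrt(n - 1)
sdev.diff <- max(abs(X.svd$d / sqrt(nrow(X.sc) - 1) - X.pca$sdev))

print(c(load.diff, score.diff, sdev.diff))
```

All three differences should be zero up to rounding error: `prcomp` stores the loadings in `rotation`, the scores in `x`, and the square roots of the PC variances in `sdev`.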
The first two PCs represent the plane that contains most of the variance; how much exactly is given by the squares of the values on the diagonal of $\boldsymbol{D}$. The importance of individual PCs is usually given by the percentage of the overall variance that is explained:
> wines.vars <- wines.svd$d^2 / (nrow(wines) - 1)
> wines.totalvar <- sum(wines.vars)
> wines.relvars <- wines.vars / wines.totalvar
> variances <- 100 * round(wines.relvars, digits = 3)
> variances[1:5]
[1] 36.0 19.2 11.2  7.1  6.6

The first PC covers more than one third of the total variance; for the fifth PC this amount is down to one fifteenth.

## Scree Plots

The amount of variance per PC is usually depicted in a scree plot: either the variances themselves or the logarithms of the variances are shown as bars. Often, one also considers the fraction of the total variance explained by every single PC. The last few PCs usually contain no information and, especially on a log scale, tend to make the scree plot less interpretable, so they are usually not taken into account in the plot.
> barplot(wines.vars[1:10], main = "Variances",
+         names.arg = paste("PC", 1:10))
> barplot(log(wines.vars[1:10]), main = "log(Variances)",
+         names.arg = paste("PC", 1:10))
> barplot(wines.relvars[1:10], main = "Relative variances",
+         names.arg = paste("PC", 1:10))
> barplot(cumsum(100 * wines.relvars[1:10]),
+         main = "Cumulative variances (%)",
+         names.arg = paste("PC", 1:10), ylim = c(0, 100))

This leads to the plots in Fig. 4.2. Clearly, PCs 1 and 2 explain much more variance than the others: together they cover $55 \%$ of the variance. The scree plots show no clear cut-off, which in real life is the rule rather than the exception. Depending on the goal of the investigation, for these data one could consider three or five PCs. Choosing four PCs would not make much sense in this case, since the fifth PC would explain almost the same amount of variance: if the fourth is included, the fifth should be, too.
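
The argument for three or five PCs can be traced with simple arithmetic on the five percentages printed earlier. The sketch below is only an illustration: the numbers are copied from that output, and the variable names are hypothetical. It computes the running total of explained variance and the decrease from each PC to the next.

```r
# Percentages of explained variance for the first five PCs,
# copied from the output printed earlier in the text
pct <- c(36.0, 19.2, 11.2, 7.1, 6.6)
names(pct) <- paste("PC", 1:5)

cum.pct <- cumsum(pct)            # running total of explained variance
drops <- -diff(pct)               # decrease from each PC to the next

print(cum.pct)
print(drops)
```

The running total reaches 55.2 after two PCs and 80.1 after five. The smallest decrease, 0.5 percentage points, lies between PCs 4 and 5, which is why a cut-off between those two is hard to justify.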
