## Validation

In this section we illustrate our procedure with data from the leukemia data set of Golub et al. (1999) and the lymphoma data set of Alizadeh et al. (2000).

In these examples our aim is to validate our procedure for adding input-variable information to the KPCA representation. We proceed in three steps. First, in each data set, we build a list of differentially expressed genes. The selection follows previous studies (Golub et al. (1999), Pittelkow \& Wilson (2003), Reverter et al. (2010)). In addition, we compute the expression profile of each selected gene; this profile confirms the evidence of differential expression.

Second, we compute the curves through each sample point associated with each gene in the list. These curves are given by the $\phi$-image of points of the form:
$$\mathbf{y}(s)=\mathbf{x}_{i}+s \mathbf{e}_{k}$$
where $\mathbf{x}_{i}$ is the $1 \times n$ expression vector of the $i$-th sample, $i=1, \ldots, m$; $k$ denotes the index in the expression matrix of the gene selected to be represented; and $\mathbf{e}_{k}=(0, \ldots, 1, \ldots, 0)$ is a $1 \times n$ vector with zeros everywhere except for a one in the $k$-th position. These curves describe locally the change in the sample $\mathbf{x}_{i}$ induced by a change in the expression of gene $k$.

Third, we project the tangent vector of each curve at $s=0$, that is, at the sample points $\mathbf{x}_{i}$, $i=1, \ldots, m$, onto the KPCA subspace spanned by the eigenvectors (9). This representation captures the direction of maximum variation induced in the samples when the expression of the gene increases.
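For a concrete kernel, the projected tangent vector can be computed in closed form. The sketch below is illustrative only and rests on several assumptions not stated in this section: the radial basis kernel is taken as $k(\mathbf{x}, \mathbf{y})=\exp(-c\|\mathbf{x}-\mathbf{y}\|^{2})$, samples are the rows of `X`, and `alphas` is an $m \times r$ matrix holding the expansion coefficients of the leading eigenvectors of the centred kernel matrix (computed beforehand). The function names are hypothetical. Under these assumptions, the derivative of the kernel along gene $k$ at $\mathbf{y}=\mathbf{x}_{i}$ is $-2c\,(x_{ik}-x_{jk})\,k(\mathbf{x}_{i}, \mathbf{x}_{j})$, and only the terms of the centred kernel that depend on $\mathbf{y}$ contribute.

```python
import numpy as np

def rbf_kernel(A, B, c):
    """Assumed kernel: k(a, b) = exp(-c * ||a - b||^2) for all row pairs."""
    d2 = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-c * np.maximum(d2, 0.0))

def project_points(Y, X, alphas, c):
    """KPCA coordinates of the rows of Y, centred with the training samples X."""
    K = rbf_kernel(X, X, c)
    Ky = rbf_kernel(Y, X, c)
    Ky_centred = Ky - Ky.mean(axis=1, keepdims=True) - K.mean(axis=0) + K.mean()
    return Ky_centred @ alphas

def tangent_projections(X, alphas, c, k):
    """Derivative at s = 0 of project_points(x_i + s * e_k), for every sample i."""
    K = rbf_kernel(X, X, c)
    diff = X[:, k][:, None] - X[:, k][None, :]   # diff[i, j] = x_ik - x_jk
    G = -2.0 * c * diff * K                      # G[i, j] = d k(y, x_j) / d y_k at y = x_i
    # Of the centring terms, only the row mean depends on y, so only it is differentiated.
    G_centred = G - G.mean(axis=1, keepdims=True)
    return G_centred @ alphas                    # shape (m, r): one arrow per sample
```

To draw the vector field, each row of `tangent_projections(X, alphas, c, k)` would be plotted as an arrow attached at the corresponding row of `project_points(X, X, alphas, c)`.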

By simultaneously displaying both the samples and the gene information on the same plot it is possible both to visually detect genes which have similar profiles and to interpret this pattern by reference to the sample groups.

## Leukemia data sets

The leukemia data set is composed of 3051 gene expressions in three classes of leukemia: 19 cases of B-cell acute lymphoblastic leukemia (ALL), 8 cases of T-cell ALL and 11 cases of acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays.

The data were preprocessed according to the protocol described in Dudoit et al. (2002). In addition, we complete the preprocessing of the gene expression data with a microarray standardization and gene centring.
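The text does not give formulas for these two extra steps, so the following is only a minimal sketch under stated assumptions: rows of the matrix are microarrays (samples) and columns are genes; "microarray standardization" is read as scaling each row to zero mean and unit sample standard deviation; "gene centring" is read as subtracting each column's mean. The function name is hypothetical.

```python
import numpy as np

def standardize_arrays_then_centre_genes(X):
    """Assumed preprocessing: standardize each row (array), then centre each column (gene)."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)
    row_std = X.std(axis=1, ddof=1, keepdims=True)   # sample standard deviation per array
    Z = (X - row_mean) / row_std
    return Z - Z.mean(axis=0, keepdims=True)         # every gene now has mean zero
```

Centring the genes last guarantees that the matrix passed to the kernel has zero-mean columns, at the cost of the rows no longer having exactly unit variance.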

In this example we perform KPCA as detailed in the previous section. We compute the kernel matrix using the radial basis kernel with $c=0.01$; this value is set heuristically. The resulting plot is given in Figure 1. It shows the projection of the microarrays onto the two leading kernel principal components. In this figure we can see that KPCA detects the group structure in reduced dimension: AML, T-cell ALL and B-cell ALL are fully separated.
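As a rough illustration of this computation, the sketch below builds a radial basis kernel matrix, centres it in feature space, and extracts the two leading kernel principal components. It assumes the kernel has the form $k(\mathbf{x}, \mathbf{y})=\exp(-c\|\mathbf{x}-\mathbf{y}\|^{2})$ and that samples are the rows of `X`; both the parameterisation and the function names are assumptions, not taken from the original analysis.

```python
import numpy as np

def rbf_kernel_matrix(X, c):
    """Assumed kernel: K[i, j] = exp(-c * ||x_i - x_j||^2)."""
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-c * np.maximum(d2, 0.0))

def kpca_scores(K, n_components=2):
    """Project the training samples onto the leading kernel principal components."""
    m = K.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centring matrix
    Kc = J @ K @ J                             # kernel matrix centred in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[top], eigvecs[:, top]
    alphas = eigvecs / np.sqrt(eigvals)        # unit-norm eigenvectors in feature space
    return Kc @ alphas, alphas, eigvals        # scores have shape (m, n_components)
```

With a preprocessed expression matrix `X`, `kpca_scores(rbf_kernel_matrix(X, c=0.01))` would return the coordinates used for a two-dimensional scatter plot of the samples.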

To validate our procedure we select a list of differentially expressed genes proposed by (Golub et al. (1999), Pittelkow \& Wilson (2003), Reverter et al. (2010)) and a list of genes that are not differentially expressed. In particular, Figures 2, 3, 4 and 5 show the results for the genes X76223_s_at, X82240_rna1_at, Y00787_s_at and D50857_at, respectively. The first three genes belong to the list of differentially expressed genes; the last gene is not differentially expressed.

Figure 2 (top) shows the tangent vectors associated with the X76223_s_at gene, attached at each sample point. This vector field reveals higher expression towards the T-cell cluster, as expected from the references mentioned above. This gene is well represented by the second principal component. The length of the arrows indicates the strength of the gene's influence on the sample position despite the dimension reduction. Figure 2 (bottom) shows the expression profile of the X76223_s_at gene. We can observe that X76223_s_at is up-regulated in the T-cell class. This profile agrees with our procedure, because the direction in which the expression of X76223_s_at increases points to the T-cell cluster.


