## Clusters and Flat Clustering

Clusters are groups of points that are similar to each other and dissimilar to points from other clusters. In terms of the underlying distribution, a cluster constitutes a connected area of high density around a mode of the distribution. Clusters may be determined automatically, by clustering algorithms that provide a flat clustering, or visually, by relying on the ability of human cognition to identify groups (see the Gestalt laws of proximity and continuity detailed in Sect. 1.3.2). Indeed, by looking at Fig. 1.7a, the reader gets an intuitive idea of what the clusters are for this dataset (a priori close to the automatic clustering of Fig. 1.8b).

Clustering algorithms identify a latent categorical variable indicating the cluster to which a given point belongs. Namely, they determine a mapping $\Omega: \mathcal{D} \longrightarrow \mathcal{L}$ assigning each data point $\xi_i$ to a category with a label $L_i=\Omega\left(\xi_i\right)$. The number of clusters, that is, the number of possible values of that categorical variable, is a key parameter of a flat clustering. We may distinguish two main approaches to clustering multidimensional data: the parametric approach used by partitioning algorithms and the density-based approach. For network data, the equivalent of clustering is community detection. In terms of graphs, communities (i.e. clusters) may be defined as groups of vertices linked together by many edges and linked to their surroundings by fewer edges [19].

### Parametric Clustering

Partitioning algorithms, such as $k$-means [118] and $k$-medoids [96], split the space into $k$ convex regions parametrized by associated prototypes. They assign each point of the dataset to one of the clusters so as to minimize the distances separating points from their cluster's prototype. This prototype, which is a centroid for $k$-means and a medoid for $k$-medoids, provides a central tendency of the cluster. Formally, those algorithms seek the clustering that minimizes the cumulated Fréchet variance of all clusters, each measured around its Fréchet mean, which is the aforementioned prototype.
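
The assignment and prototype-update steps described above can be sketched with Lloyd's algorithm, a common heuristic for the $k$-means objective. This is a minimal illustration under stated assumptions, not the procedure of any cited reference: the function names `kmeans` and `within_cluster_variance`, the toy data, and the choice of passing the initial centroids explicitly are all hypothetical. For Euclidean distance, the quantity returned by `within_cluster_variance` is the summed squared distance of points to their centroid, which plays the role of the cumulated variance mentioned in the text.

```python
import numpy as np

def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = np.array(init_centroids, dtype=float)
    k = len(centroids)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def within_cluster_variance(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    return float(np.sum((points - centroids[labels]) ** 2))

# Toy data (an assumption for illustration): two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])

# Initialise with one point from each end of the array to keep the run deterministic.
labels, centroids = kmeans(data, init_centroids=data[[0, -1]])
print(within_cluster_variance(data, labels, centroids))
```

Lloyd's algorithm only reaches a local minimum of the objective, so in practice the result depends on the initial prototypes; the explicit initialisation here is chosen for reproducibility rather than as a recommendation.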

## Latent Variables Extraction and Manifold Learning

Under the i.i.d. hypothesis, the support of the theoretical probability distribution generating data points $\left\{\xi_i\right\}$ is considered as a manifold $\mathcal{M}$ immersed in the ambient data space $\mathcal{D}$ [9, 81]. The distribution of points along a manifold may be explained by the strong dependency between data space variables. In addition, one may assume that all these variables are local functions of a few independent latent variables with an additional noise [176], thus constituting a low-dimensional structure. That noise may induce small variations around the smooth structure of that manifold. Note that the manifold hypothesis may extend to datasets that are not generated by random processes. For instance, for the two open boxes and COIL-20 datasets (see Sect. 1.1.7), data lie on a low-dimensional manifold which is regularly sampled rather than randomly sampled.
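
To make the generative picture concrete, the sketch below draws points whose three observed variables are smooth functions of a single latent variable, then adds small noise. The helix, the sample size, and the noise scale are arbitrary assumptions chosen for illustration; they do not come from the datasets discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

# One latent variable t, drawn i.i.d. from a uniform distribution.
t = rng.uniform(0.0, 4.0 * np.pi, size=n_points)

# Each observed variable is a smooth function of t: the noiseless points
# lie on a one-dimensional manifold (a helix) immersed in a 3-D data space.
clean = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])

# Small isotropic Gaussian noise perturbs the points around the manifold.
noise_scale = 0.02
xi = clean + rng.normal(scale=noise_scale, size=clean.shape)

# The observed variables are strongly dependent: the distance to the helix
# axis stays close to 1 for every point, up to the noise.
radius = np.hypot(xi[:, 0], xi[:, 1])
print(radius.mean(), radius.std())
```

Although the data space has three variables, a single parameter `t` explains almost all of the variability, which is the situation the manifold hypothesis describes.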

Dimensionality Reduction (DR) in general aims at finding a mapping $\Phi: \mathcal{D} \longrightarrow \mathcal{E}$ that associates each data point $\xi_i$ to a point $x_i=\Phi\left(\xi_i\right)$ in a low-dimensional embedding space $\mathcal{E}$. A key parameter of dimensionality reduction is the embedding dimensionality $d$ (i.e. the dimensionality of $\mathcal{E}$). We distinguish here two sub-cases of DR: manifold learning and spatialization. The ideal goal of manifold learning is to extract latent variables parametrizing the manifold, which explain the variability of data. Those hypothetical variables may also be referred to as curvilinear components of the manifold [54]. In that case, the embedding dimensionality defines the number of variables to extract. A possible value for that parameter is the intrinsic dimensionality, which corresponds locally to the number of curvilinear components required to parametrize the manifold (see Sect. 2.2). Manifold learning may be used as a pre-processing step for other machine learning applications (e.g., classification or clustering), in order to mitigate the curse of dimensionality [155], to compress the data [179], or to filter out the noise [176]. Conversely, spatialization aims at providing a visual representation of high-dimensional data (see Sect. 1.3.2). As a result, the embedding dimensionality is constrained by the perceptual capabilities of the data analyst, limiting the number of dimensions to at most three for visualization with a single scatter plot. Satisfying this strong constraint on dimensionality often requires distortions of the underlying data structure. Note that the equivalent of DR for network data is graph embedding (also called graph layout).
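
As a minimal sketch of a mapping $\Phi$ with embedding dimensionality $d$, the example below uses principal component analysis computed through a singular value decomposition. PCA is a linear technique swapped in here for simplicity; it recovers latent variables only when the manifold is close to a linear subspace and is not the method of any reference cited above. The function name `pca_embed`, the toy data, and the mixing matrix are hypothetical.

```python
import numpy as np

def pca_embed(data, d):
    """Map each row of `data` to d coordinates along the top principal axes."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    embedding = centered @ vt[:d].T
    # Fraction of the total variance retained by the d kept axes.
    explained = (singular_values[:d] ** 2).sum() / (singular_values ** 2).sum()
    return embedding, float(explained)

# Toy data (an assumption for illustration): 10 observed variables that are
# linear functions of 2 independent latent variables, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = latent @ mixing + rng.normal(scale=0.01, size=(200, 10))

x, explained = pca_embed(data, d=2)
print(x.shape, explained)
```

Here $d = 2$ matches the intrinsic dimensionality of the toy data, so almost no variance is lost. For spatialization of data with a higher intrinsic dimensionality, forcing $d \le 3$ would discard variance, which is one form of the distortion mentioned in the text.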

