Tag: CS7641

Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP5318

Posted on 27 December 2022 by statistics-lab

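Machine learning is a field of study devoted to understanding and building methods that "learn", that is, methods that use data to improve performance on some set of tasks. Machine learning algorithms build a model from sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. They are used in a wide variety of applications, such as medicine, email filtering, speech recognition, and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the required tasks.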

The Pitfalls of Large-Dimensional Statistics

The big data revolution comes along with the challenging need to parse, mine, and compress large amounts of large-dimensional and possibly heterogeneous data. In many applications, the dimension $p$ of the observations is as large as, if not much larger than, their number $n$. In array processing and wireless communications, the number of antennas required for fine localization resolution or increased communication throughput may be as large (today in the order of hundreds) as the number of available independent signal observations [Li and Stoica, 2007, Lu et al., 2014]. In genomics, the identification of correlations among hundreds of thousands of genes based on a limited number of independent (and expensive) samples induces an even larger ratio $p / n$ [Arnold et al., 1994]. In statistical finance, portfolio optimization relies on the need to invest in a large number $p$ of assets to reduce volatility but at the same time to estimate the current (rather than past) asset statistics from a relatively small number $n$ of asset return records [Laloux et al., 2000].

As we shall demonstrate in the following section, the fact that in these problems $n$ is not much larger than $p$ annihilates most of the results from standard asymptotic statistics that assume $n$ alone is large [Vaart, 2000]. As a rule of thumb, by “much larger” we mean here that $n$ must be at least 100 times larger than $p$ for standard asymptotic statistics to be of practical convenience (see our argument in Section 1.1.2). Many algorithms in statistics, signal processing, and machine learning are precisely derived from this $n \gg p$ assumption that is no longer appropriate today. A major objective of this book is to cast some light on the resulting biases and problems incurred and to provide a systematic random matrix framework to improve these algorithms.

Possibly more importantly, we will see in this book that small-dimensional (small $p$) intuitions at the core of many machine learning algorithms (starting with spectral clustering [Ng et al., 2002, Luxburg, 2007]) may strikingly fail when applied in a simultaneously large $n, p$ setting. A compelling example lies in the notion of “distance” between vectors. Most classification methods in machine learning are rooted in the observation that random data vectors arising from a mixture distribution (say Gaussian) gather in “groups” of close-by vectors in the Euclidean norm. When dealing with large-dimensional data, however, concentration phenomena arise that make Euclidean distances useless, if not counterproductive: Vectors from the same mixture class may be further away in Euclidean distance than vectors arising from different classes. While classification may still be doable, it works in a rather different way from our small-dimensional intuition. The book intends to prepare the reader for the multiple traps caused by this “curse of dimensionality.”
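
The concentration effect on Euclidean distances can be checked numerically. The following is a minimal sketch, not taken from the book: it assumes a two-class Gaussian mixture $\mathcal{N}(\pm\boldsymbol{\mu}, \mathbf{I}_p)$ whose separation $\|\boldsymbol{\mu}\| = 2$ is held fixed while $p$ grows, and the function name `frac_within_farther` and all parameter values are illustrative choices.

```python
import numpy as np

def frac_within_farther(p, mu_norm=2.0, trials=1000, seed=0):
    """Fraction of trials in which two vectors of the same class are farther
    apart (Euclidean distance) than two vectors of different classes."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = mu_norm  # class means are +mu and -mu (illustrative assumption)
    # One same-class pair (a, b) and one different-class pair (c, d) per trial.
    a = mu + rng.standard_normal((trials, p))
    b = mu + rng.standard_normal((trials, p))
    c = mu + rng.standard_normal((trials, p))
    d = -mu + rng.standard_normal((trials, p))
    dist_within = np.linalg.norm(a - b, axis=1)
    dist_between = np.linalg.norm(c - d, axis=1)
    return float(np.mean(dist_within > dist_between))

low_dim = frac_within_farther(p=2)
high_dim = frac_within_farther(p=2000)
print(f"p=2:    same-class pair farther in {low_dim:.1%} of trials")
print(f"p=2000: same-class pair farther in {high_dim:.1%} of trials")
```

Under these assumptions the fraction should be small for $p = 2$ and move toward one half for large $p$, because the fluctuation of the squared distances grows with $p$ while the gap between the two classes stays fixed.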

Sample Covariance Matrices in the Large n,p Regime

Let us consider the following example that illustrates a first elementary, yet counterintuitive, result: For simultaneously large $n, p$, the sample covariance matrix $\hat{\mathbf{C}} \in \mathbb{R}^{p \times p}$ based on $n$ samples $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$ is an entry-wise consistent estimator of the population covariance $\mathbf{C} \in \mathbb{R}^{p \times p}$ (i.e., $\|\hat{\mathbf{C}}-\mathbf{C}\|_{\infty} \rightarrow 0$ as $p, n \rightarrow \infty$ for $\|\mathbf{A}\|_{\infty} \equiv \max_{i j}\left|\mathbf{A}_{i j}\right|$) while overall being an extremely poor estimator in a (more practical) operator norm sense (i.e., $\|\hat{\mathbf{C}}-\mathbf{C}\| \nrightarrow 0$, with $\|\cdot\|$ being the operator norm here). Matrix norms are, in particular, not equivalent in the large $n, p$ scenario.

Let us detail this claim, in the simplest case where $\mathbf{C}=\mathbf{I}_p$. Consider a dataset $\mathbf{X}=\left[\mathbf{x}_1, \ldots, \mathbf{x}_n\right] \in \mathbb{R}^{p \times n}$ of $n$ independent and identically distributed (i.i.d.) observations from a $p$-dimensional standard Gaussian distribution, that is, $\mathbf{x}_i \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}_p\right)$ for $i \in\{1, \ldots, n\}$. We wish to estimate the population covariance matrix $\mathbf{C}=\mathbf{I}_p$ from the $n$ available samples. The maximum likelihood estimator in this zero-mean Gaussian setting is the sample covariance matrix $\hat{\mathbf{C}}$ defined by
$$
\hat{\mathbf{C}}=\frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^{\top}=\frac{1}{n} \mathbf{X} \mathbf{X}^{\top} .
$$
By the strong law of large numbers, for fixed $p$, $\hat{\mathbf{C}} \rightarrow \mathbf{I}_p$ almost surely as $n \rightarrow \infty$, so that $\left\|\hat{\mathbf{C}}-\mathbf{I}_p\right\| \stackrel{\text{a.s.}}{\longrightarrow} 0$ holds for any standard matrix norm and in particular for the operator norm.

One must be more careful when dealing with the case $n, p \rightarrow \infty$ with the ratio $p / n \rightarrow c \in(0, \infty)$ (or, from a practical standpoint, when $n$ is not much larger than $p$). First, note that the entry-wise convergence still holds since, invoking the law of large numbers again,
$$
[\hat{\mathbf{C}}]_{i j}=\frac{1}{n} \sum_{l=1}^n[\mathbf{X}]_{i l}[\mathbf{X}]_{j l} \stackrel{\text{a.s.}}{\longrightarrow} \begin{cases}1, & i=j \\ 0, & i \neq j .\end{cases}
$$
Besides, by a concentration inequality argument, it can even be shown that
$$
\max_{1 \leq i, j \leq p}\left|\left[\hat{\mathbf{C}}-\mathbf{I}_p\right]_{i j}\right| \stackrel{\text{a.s.}}{\longrightarrow} 0,
$$
which holds as long as $p$ is no larger than a polynomial function of $n$, and thus:
$$
\left\|\hat{\mathbf{C}}-\mathbf{I}_p\right\|_{\infty} \stackrel{\text{a.s.}}{\longrightarrow} 0 .
$$
Consider now the case $p>n$. Since $\hat{\mathbf{C}}=\frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^{\top}$ is the sum of $n$ rank-one matrices, the rank of $\hat{\mathbf{C}}$ is at most equal to $n$ and thus, being a $p \times p$ matrix with $p>n$, the sample covariance matrix $\hat{\mathbf{C}}$ must be a singular matrix having at least $p-n>0$ null eigenvalues. Any unit eigenvector $\mathbf{v}$ of $\hat{\mathbf{C}}$ associated with a null eigenvalue satisfies $(\hat{\mathbf{C}}-\mathbf{I}_p) \mathbf{v}=-\mathbf{v}$, so that $\|\hat{\mathbf{C}}-\mathbf{I}_p\| \geq 1$ for every such $n, p$. As a consequence,
$$
\left\|\hat{\mathbf{C}}-\mathbf{I}_p\right\| \nrightarrow 0
$$
for $\|\cdot\|$ the matrix operator (or spectral) norm.
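
The gap between entry-wise and operator-norm accuracy is easy to observe in simulation. The sketch below is illustrative only: the helper name `covariance_errors` and the sizes $p = 400$, $n = 200$ are assumptions chosen to give a $p > n$ case, not values from the text.

```python
import numpy as np

def covariance_errors(p, n, seed=0):
    """Compare entry-wise and operator-norm errors of the sample covariance
    of n i.i.d. N(0, I_p) samples."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((p, n))        # columns are the samples x_i
    C_hat = X @ X.T / n                    # sample covariance (zero mean known)
    E = C_hat - np.eye(p)
    max_entry = np.max(np.abs(E))          # max-entry norm of the error
    op_norm = np.linalg.norm(E, 2)         # spectral (operator) norm
    rank = np.linalg.matrix_rank(C_hat)    # numerical rank, at most n
    return max_entry, op_norm, rank

max_entry, op_norm, rank = covariance_errors(p=400, n=200)
print(f"max entry error: {max_entry:.3f}")
print(f"operator norm error: {op_norm:.3f}")
print(f"rank of sample covariance: {rank} (p = 400)")
```

With $p > n$ the numerical rank cannot exceed $n$, and the operator-norm error stays at or above 1 however small the individual entry errors are.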

Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP4702

Posted on 24 December 2022 by statistics-lab

Text Clustering

Text clustering methods partition the corpus into groups of related documents belonging to particular topics or categories. However, these categories are not known a priori, because specific examples of desired categories (e.g., politics) of documents are not provided up front. Such learning problems are also referred to as unsupervised, because no guidance is provided to the learning problem. In supervised applications, one might provide examples of news articles belonging to several natural categories like sports, politics, and so on. In the unsupervised setting, the documents are partitioned into similar groups, which is sometimes achieved with a domain-specific similarity function like the cosine measure. In most cases, an optimization model can be formulated, so that some direct or indirect measure of similarity within a cluster is maximized. A detailed discussion of clustering methods is provided in Chapter 4.

Many matrix factorization methods like probabilistic latent semantic analysis and latent Dirichlet allocation also achieve a similar goal of assigning documents to topics, albeit in a soft and probabilistic way. A soft assignment refers to the fact that the probability of assignment of each document to a cluster is determined rather than a hard partitioning of the data into clusters. Such methods not only assign documents to topics but also infer the significance of the words to various topics. In the following, we provide a brief overview of various clustering methods.

Most forms of non-negative matrix factorization methods can be used for clustering text data. Therefore, certain types of matrix factorization methods play the dual role of clustering and dimensionality reduction, although this is not true across every matrix factorization method. Many forms of non-negative matrix factorization are probabilistic mixture models, in which the entries of the document-term matrix are assumed to be generated by a probabilistic process. The parameters of this random process can then be estimated in order to create a factorization of the data, which has a natural probabilistic interpretation. This type of model is also referred to as a generative model because it assumes that the document-term matrix is created by a hidden generative process, and the data are used to estimate the parameters of this process.
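
To make the link between non-negative factorization and soft clustering concrete, here is a minimal sketch under stated assumptions. It uses the classical multiplicative updates for the squared-error objective rather than one of the probabilistic variants mentioned above, and the function name `nmf`, the toy matrix, and the row normalization used to read off soft memberships are all illustrative choices.

```python
import numpy as np

def nmf(D, k, iters=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix D (n x d) as U @ V.T with
    U (n x k) and V (d x k) non-negative, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    U = rng.random((n, k)) + eps
    V = rng.random((d, k)) + eps
    for _ in range(iters):
        # Each update rescales entries by a non-negative ratio,
        # so U and V stay non-negative throughout.
        U *= (D @ V) / (U @ (V.T @ V) + eps)
        V *= (D.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Toy document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
D = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 2, 3]], dtype=float)
U, V = nmf(D, k=2)
soft = U / U.sum(axis=1, keepdims=True)   # heuristic soft memberships per document
hard = soft.argmax(axis=1)                # hard assignment, if one is needed
print(np.round(soft, 3))
print(hard)
```

Each row of `soft` can be read as a degree of membership of a document in each of the $k$ latent groups, while the columns of `V` indicate which terms matter for each group.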

Similarity-Based Algorithms

Similarity-based algorithms are typically either representative-based methods or hierarchical methods. In all these cases, a distance or similarity function between points is used to partition them into clusters in a deterministic way. Representative-based algorithms use representatives in combination with similarity functions in order to perform the clustering. The basic idea is that each cluster is represented by a multi-dimensional vector, which represents the “typical” frequency of words in that cluster. For example, the centroid of a set of documents can be used as its representative. Similarly, clusters can be created by assigning documents to their closest representatives according to a similarity function such as the cosine similarity. Such algorithms often use iterative techniques in which the cluster representatives are extracted as central points of clusters, whereas the clusters are created from these representatives by using cosine similarity-based assignment. This two-step process is repeated to convergence, and the corresponding algorithm is also referred to as the $k$-means algorithm. There are many variations of representative-based algorithms although only a small subset of them work with the sparse and high-dimensional representation of text. Nevertheless, one can use a broader variety of methods if one is willing to transform the text data to a reduced representation with dimensionality reduction techniques.
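
The two-step loop described above (assign documents to the most similar representative, then recompute representatives) can be sketched as follows. This is an illustrative implementation, not the book's: the function name `cosine_kmeans`, the choice of initial representatives by row index, and the toy matrix are assumptions, and every document is assumed to contain at least one term so that its norm is non-zero.

```python
import numpy as np

def cosine_kmeans(X, init_idx, max_iters=100):
    """k-means with cosine-similarity assignment on the rows of X.
    init_idx lists the rows used as initial cluster representatives."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length documents
    centroids = X[list(init_idx)].copy()
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each document to its most similar representative.
        new_labels = (X @ centroids.T).argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                      # assignments are stable
        labels = new_labels
        # Step 2: recompute each representative as the normalized centroid.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    return labels, centroids

# Toy term-frequency rows: two documents per topic.
D = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)
labels, centroids = cosine_kmeans(D, init_idx=(0, 2))
print(labels)
```

Because all vectors are scaled to unit length, the dot product `X @ centroids.T` is exactly the cosine similarity, which keeps the assignment step a single matrix product.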

In hierarchical clustering algorithms, similar pairs of clusters are aggregated into larger clusters using an iterative approach. The approach starts by assigning each document to its own cluster and then merges the closest pair of clusters together. There are many variations in terms of how the pairwise similarity between clusters is computed, which has a direct impact on the type of clusters discovered by the algorithm. In many cases, hierarchical clustering algorithms can be combined with representative clustering methods to create more robust methods.
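
A bottom-up merge loop of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: it uses the average pairwise cosine similarity between clusters (one of the many possible variations), a quadratic search for the closest pair, and a hypothetical function name `agglomerative`.

```python
import numpy as np

def agglomerative(X, num_clusters):
    """Merge the most similar pair of clusters until num_clusters remain.
    Cluster similarity is the average pairwise cosine similarity."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T                                  # pairwise cosine similarities
    clusters = [[i] for i in range(len(X))]      # start with singleton clusters
    while len(clusters) > num_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = S[np.ix_(clusters[a], clusters[b])].mean()
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

D = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 1, 2, 3]], dtype=float)
clusters = agglomerative(D, num_clusters=2)
print(clusters)
```

Replacing `.mean()` with `.max()` or `.min()` would give single-link or complete-link behaviour, which is one way the choice of pairwise similarity changes the clusters that are found.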

Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP30027

Posted on 24 December 2022 by statistics-lab

Text Preprocessing and Similarity Computation

Text preprocessing is required to convert the unstructured format into a structured and multidimensional representation. Text often co-occurs with a lot of extraneous data such as tags, anchor text, and other irrelevant features. Furthermore, different words have different significance in the text domain. For example, commonly occurring words such as “a,” “an,” and “the” have little significance for text mining purposes. In many cases, words are variants of one another because of the choice of tense or plurality. Some words are simply misspellings. The process of converting a character sequence into a sequence of words (or tokens) is referred to as tokenization. Note that each occurrence of a word in a document is a token, even if it occurs more than once in the document. Therefore, the occurrence of the same word three times will create three corresponding tokens. The process of tokenization often requires a substantial amount of domain knowledge about the specific language at hand, because the word boundaries have ambiguities caused by vagaries of punctuation in different languages.
Some common steps for preprocessing raw text are as follows:

Text extraction: In cases where the source of the text is the Web, it occurs in combination with various other types of data such as anchors, tags, and so on. Furthermore, in the Web-centric setting, a specific page may contain a (useful) primary block and other blocks that contain advertisements or unrelated content. Extracting the useful text from the primary block is important for high-quality mining. These types of settings require specialized parsing and extraction techniques.
Stop-word removal: Stop words are commonly occurring words that have little discriminative power for the mining process. Common pronouns, articles, and prepositions are considered stop words. Such words need to be removed to improve the mining process.
Stemming, case-folding, and punctuation: Words with common roots are consolidated into a single representative. For example, words like “sinking” and “sank” are consolidated into the single token “sink.” The case (i.e., capitalization) of the first letter of a word may or may not be important to its semantic interpretation. For example, the word “Rose” might either be a flower or the name of a person depending on the case. In other settings, the case may not be important to the semantic interpretation of the word because it is caused by grammar-specific constraints like the beginning of a sentence. Therefore, language-specific heuristics are required in order to make decisions on how the case is treated. Punctuation marks such as hyphens need to be parsed carefully in order to ensure proper tokenization.
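
The steps above can be sketched as a small pipeline. Everything here is a simplifying assumption made for illustration: the stop-word list is a tiny hand-picked set, tags are removed with a regular expression rather than a real HTML parser, case is folded unconditionally, and `crude_stem` only strips a few English suffixes. It is not a real stemmer and cannot, for instance, map “sank” to “sink.”

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}  # illustrative subset

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Text extraction, tokenization, case-folding, stop-word removal, stemming."""
    text = re.sub(r"<[^>]+>", " ", text)                       # drop HTML-like tags
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())   # keep hyphenated words whole
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<p>The ship is sinking in the well-known bay</p>")
print(tokens)
```

Each occurrence of a word yields its own token, so a word repeated three times in the input appears three times in the output list unless it is a stop word.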

Dimensionality Reduction and Matrix Factorization

Dimensionality reduction and matrix factorization fall in the general category of methods that are also referred to as latent factor models. Sparse and high-dimensional representations like text work well with some learning methods but not with others. Therefore, a natural question arises as to whether one can somehow compress the data representation to express it in a smaller number of features. Since these features are not observed in the original data but represent hidden properties of the data, they are also referred to as latent features.
Dimensionality reduction is intimately related to matrix factorization. Most types of dimensionality reduction transform the data matrices into factorized form. In other words, the original data matrix $D$ can be approximately represented as a product of two or more matrices, so that the total number of entries in the factorized matrices is far fewer than the number of entries in the original data matrix. A common way of representing an $n \times d$ document-term matrix as the product of an $n \times k$ matrix $U$ and a $d \times k$ matrix $V$ is as follows:
$$
D \approx U V^T
$$
The value of $k$ is typically much smaller than $n$ and $d$. The total number of entries in $D$ is $n \cdot d$, whereas the total number of entries in $U$ and $V$ is only $(n+d) \cdot k$. For small values of $k$, the representation of $D$ in terms of $U$ and $V$ is much more compact. The $n \times k$ matrix $U$ contains the $k$-dimensional reduced representation of each document in its rows, and the $d \times k$ matrix $V$ contains the $k$ basis vectors in its columns. In other words, matrix factorization methods create reduced representations of the data with (approximate) linear transforms. Note that the factorization $D \approx U V^T$ (Equation 1.2) is written as an approximate equality. In fact, all forms of dimensionality reduction and matrix factorization are expressed as optimization models in which the error of this approximation is minimized. Therefore, dimensionality reduction effectively compresses the large number of entries in a data matrix into a smaller number of entries with the lowest possible error.
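
As a concrete instance of $D \approx U V^T$, the sketch below builds the factors from a truncated singular value decomposition, which is only one of several possible factorization methods. The function name `low_rank_factors` and the synthetic rank-3 matrix are assumptions made for illustration.

```python
import numpy as np

def low_rank_factors(D, k):
    """Return U (n x k) and V (d x k) such that U @ V.T is the best
    rank-k approximation of D in the least-squares sense."""
    P, s, Qt = np.linalg.svd(D, full_matrices=False)
    U = P[:, :k] * s[:k]      # reduced k-dimensional representation of each row
    V = Qt[:k, :].T           # k basis vectors, one per column
    return U, V

rng = np.random.default_rng(0)
n, d, k = 50, 30, 3
D = rng.random((n, k)) @ rng.random((k, d))   # synthetic matrix of rank 3
U, V = low_rank_factors(D, k)
rel_error = np.linalg.norm(D - U @ V.T) / np.linalg.norm(D)
print(f"entries in D: {D.size}, entries in U and V: {U.size + V.size}")
print(f"relative approximation error: {rel_error:.2e}")
```

Here $D$ has $n \cdot d = 1500$ entries while $U$ and $V$ together hold $(n+d) \cdot k = 240$, which is the compression the paragraph above describes.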

Popular methods for dimensionality reduction in text include latent semantic analysis, non-negative matrix factorization, probabilistic latent semantic analysis, and latent Dirichlet allocation. We will address most of these methods for dimensionality reduction and matrix factorization in Chapter 3. Latent semantic analysis is the text-centric avatar of singular value decomposition.

Dimensionality reduction and matrix factorization are extremely important because they are intimately connected to the representational issues associated with text data. In data mining and machine learning applications, the representation of the data is the key in designing an effective learning method. In this sense, singular value decomposition methods enable high-quality retrieval, whereas certain types of non-negative matrix factorization methods enable high-quality clustering. In fact, clustering is an important application of dimensionality reduction, and some of its probabilistic variants are also referred to as topic models. Similarly, certain types of decision trees for classification show better performance with reduced representations. Furthermore, one can use dimensionality reduction and matrix factorization to convert a heterogeneous combination of text and another data type into multidimensional format (cf. Chapter 8).

Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP5318

Posted on 24 December 2022 by statistics-lab

What Is Special About Learning from Text

Most machine learning applications in the text domain work with the bag-of-words representation in which the words are treated as dimensions with values corresponding to word frequencies. A data set corresponds to a collection of documents, which is also referred to as a corpus. The complete and distinct set of words used to define the corpus is also referred to as the lexicon. Dimensions are also referred to as terms or features. Some applications of text work with a binary representation in which the presence of a term in a document corresponds to a value of 1, and 0 otherwise. Other applications use a normalized function of the word frequencies as the values of the dimensions. In each of these cases, the dimensionality of data is very large, and may be of the order of $10^5$ or even $10^6$. Furthermore, most values of the dimensions are 0s, and only a few dimensions take on positive values. In other words, text is a high-dimensional, sparse, and non-negative representation.
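
The bag-of-words representation can be illustrated on a toy corpus. The sketch below is a simplification under stated assumptions: documents are split on whitespace only, the matrix is stored densely for readability (a real corpus would need a sparse format), and the function name `document_term_matrix` is hypothetical.

```python
import numpy as np

def document_term_matrix(docs):
    """Build the lexicon and the n x d matrix of word frequencies."""
    lexicon = sorted({w for doc in docs for w in doc.split()})
    index = {w: j for j, w in enumerate(lexicon)}
    D = np.zeros((len(docs), len(lexicon)), dtype=int)
    for i, doc in enumerate(docs):
        for w in doc.split():
            D[i, index[w]] += 1          # every occurrence counts
    return D, lexicon

docs = ["the cat sat", "the dog sat on the mat", "stocks fell"]
D, lexicon = document_term_matrix(docs)
B = (D > 0).astype(int)                  # binary presence/absence representation
print(lexicon)
print(D)
print(f"fraction of zero entries: {(D == 0).mean():.2f}")
```

Even in this three-document corpus more than half of the entries are zero, and the proportion grows quickly as the lexicon gets larger.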

These properties of text create both challenges and opportunities. The sparsity of text implies that the positive word frequencies are more informative than the zeros. There is also wide variation in the relative frequencies of words, which leads to differential importance of the different words in mining applications. For example, a commonly occurring word like “the” is often less significant and needs to be down-weighted (or completely removed) with normalization. In other words, it is often more important to statistically normalize the relative importance of the dimensions (based on frequency of presence) compared to traditional multidimensional data. One also needs to normalize for the varying lengths of different documents while computing distances between them. Furthermore, although most multidimensional mining methods can be generalized to text, the sparsity of the representation has an impact on the relative effectiveness of different types of mining and learning methods. For example, linear support-vector machines are relatively effective on sparse representations, whereas methods like decision trees need to be designed and tuned with some caution to enable their accurate use. All these observations suggest that the sparsity of text can either be a blessing or a curse depending on the methodology at hand. In fact, some techniques such as sparse coding sometimes convert non-textual data to text-like representations in order to enable efficient and effective learning methods like support-vector machines [405].
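
One way to carry out both normalizations (down-weighting frequent words and correcting for document length) is sketched below. The weighting $\log(n / \text{df})$ is one common inverse-document-frequency variant chosen here as an assumption, not a formula given in the text. The function name `tfidf_normalize` is hypothetical, and every term is assumed to occur in at least one document.

```python
import numpy as np

def tfidf_normalize(D):
    """Down-weight frequent terms, then scale each document to unit length."""
    n = D.shape[0]
    df = (D > 0).sum(axis=0)                 # number of documents containing each term
    idf = np.log(n / df)                     # terms present in every document get weight 0
    W = D * idf
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.where(norms == 0, 1.0, norms)   # leave all-zero rows unchanged

# Term 0 plays the role of a word like "the": it occurs in every document.
D = np.array([[2, 1, 0],
              [1, 0, 3],
              [1, 0, 0]], dtype=float)
W = tfidf_normalize(D)
cosine = W @ W.T                             # cosine similarities between documents
print(np.round(W, 3))
print(np.round(cosine, 3))
```

After this normalization the ubiquitous term contributes nothing to any similarity, and the dot product of two rows is their cosine similarity regardless of the original document lengths.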

Analytical Models for Text

This section will provide a comprehensive overview of text mining algorithms and applications. The next chapter of this book primarily focuses on data preparation and similarity computation. Issues related to preprocessing and data representation are also discussed in this chapter. Aside from the first two introductory chapters, the topics covered in this book fall into three primary categories:

Fundamental mining applications: Many data mining applications like matrix factorization, clustering, and classification can be used for any type of multidimensional data. Nevertheless, the use of these methods in the text domain has specialized characteristics. These represent the core building blocks of the vast majority of text mining applications. Chapters 3 through 8 will discuss core data mining methods. The interaction of text with other data types will be covered in Chapter 8.
Information retrieval and ranking: Many aspects of information retrieval and ranking are closely related to text mining. For example, ranking methods like ranking SVM and link-based ranking are often used in text mining applications. Chapter 9 will provide an overview of information retrieval methods from the point of view of text mining.
Sequence- and natural language-centric text mining: Although multidimensional mining methods can be used for basic applications, the true power of mining text can be leveraged in more complex applications by treating text as sequences. Chapters 10 through 16 will discuss these advanced topics like sequence embedding, neural learning, information extraction, summarization, opinion mining, text segmentation, and event extraction. Many of these methods are closely related to natural language processing. Although this book is not focused on natural language processing, the basic building blocks of natural language processing will be used as off-the-shelf tools for text mining applications.

In the following, we will provide an overview of the different text mining models covered in this book. In cases where the multidimensional representation of text is used for mining purposes, it is relatively easy to use a consistent notation. In such cases, we assume that a document corpus with $n$ documents and $d$ different terms can be represented as an $n \times d$ document-term matrix $D$, which is typically very sparse. The $i$th row of $D$ is represented by the $d$-dimensional row vector $\bar{X}_i$. One can also represent a document corpus as a set of these $d$-dimensional vectors, which is denoted by $\mathcal{D}=\left\{\bar{X}_1 \ldots \bar{X}_n\right\}$. This terminology will be used consistently throughout the book. Many information retrieval books prefer the use of a term-document matrix, which is the transpose of the document-term matrix, so that its rows correspond to the frequencies of terms. However, using a document-term matrix, in which data instances are rows, is consistent with the notations used in books on multidimensional data mining and machine learning. Therefore, we have chosen to use a document-term matrix in order to be consistent with the broader literature on machine learning.
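As a rough illustration of this notation, the sketch below builds a document-term representation for a made-up three-document corpus. Storing each row as a dictionary of nonzero entries is an assumption made here for brevity; a production system would more likely use a dedicated sparse-matrix library.

```python
from collections import Counter

# Hypothetical toy corpus with n = 3 documents.
corpus = ["text mining is fun", "mining text data", "sparse data"]
tokenized = [doc.split() for doc in corpus]

# Lexicon: the d distinct terms, each mapped to a column index.
lexicon = sorted({t for toks in tokenized for t in toks})
col = {t: j for j, t in enumerate(lexicon)}
n, d = len(corpus), len(lexicon)

# Sparse rows: row i keeps only the nonzero entries of X_i as {column: frequency}.
rows = [dict(Counter(col[t] for t in toks)) for toks in tokenized]

def dense_row(i):
    # Expand row i into the full d-dimensional vector X_i.
    return [rows[i].get(j, 0) for j in range(d)]

def transpose(sparse_rows):
    # Term-document view: one sparse row per term, keyed by document index.
    out = [dict() for _ in range(d)]
    for i, row in enumerate(sparse_rows):
        for j, v in row.items():
            out[j][i] = v
    return out
```

Calling `transpose(rows)` gives the term-document layout preferred in many information retrieval texts; the underlying data are the same, only the roles of rows and columns are swapped.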

Much of the book will be devoted to data mining and machine learning rather than the database management issues of information retrieval. Nevertheless, there is some overlap between the two areas, as they are both related to problems of ranking and search engines. Therefore, a comprehensive chapter is devoted to information retrieval and search engines. Throughout this book, we will use the term “learning algorithm” as a broad umbrella term to describe any algorithm that discovers patterns from the data or discovers how such patterns may be used for predicting specific values in the data.



Machine Learning | COMP4702

Posted on December 23, 2022 by statistics-lab


Maximum Likelihood Estimation of Model Parameters

Having defined the density function, we can now reason more formally about what it means for a particular model to be a ‘good’ fit to the data. In other words, we would like to ask how likely a particular model is in terms of a given error distribution.

Specifically, the density function in Equation (2.21) gives us a means of assigning a probability (or likelihood) to a particular set of labels $y$, given features $X$, and a model $\theta$, under some particular error distribution (in this case a Gaussian):
$$
\mathcal{L}_\theta(y \mid X)=\prod_{i=1}^{|y|} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{y_i-f_\theta\left(x_i\right)}{\sigma}\right)^2} .
$$
Essentially, we want to choose $\theta$ so as to maximize this likelihood. Intuitively our goal is to choose a value of $\theta$ that is consistent with this error distribution, that is, a model that makes many small errors and few large ones.

Precisely, we would like to find $\arg \max_\theta \mathcal{L}_\theta(y \mid X)$. This procedure (finding a model $\theta$ that maximizes the likelihood under some error distribution) is known as maximum likelihood estimation (MLE). We solve this by taking logarithms and removing irrelevant terms $(\pi, \sigma)$:
$$
\begin{aligned}
\underset{\theta}{\arg \max } \mathcal{L}_\theta(y \mid X) & =\underset{\theta}{\arg \max } \ell_\theta(y \mid X) \\
& =\underset{\theta}{\arg \max } \log \prod_{i=1}^{|y|} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{y_i-f_\theta\left(x_i\right)}{\sigma}\right)^2} \\
& =\underset{\theta}{\arg \max } \sum_i \log e^{-\frac{1}{2}\left(\frac{y_i-f_\theta\left(x_i\right)}{\sigma}\right)^2} \\
& =\underset{\theta}{\arg \max }-\sum_i\left(y_i-f_\theta\left(x_i\right)\right)^2 \\
& =\underset{\theta}{\arg \min } \sum_i\left(y_i-f_\theta\left(x_i\right)\right)^2 \\
& =\underset{\theta}{\arg \min } \frac{1}{|y|} \sum_i\left(y_i-f_\theta\left(x_i\right)\right)^2 .
\end{aligned}
$$
Note crucially in the above derivation that the maximum likelihood solution for $\theta$ under our Gaussian error model is precisely the one that minimizes the MSE. This demonstrates the relationship between the MSE and MLE (which we summarize in fig. 2.6).
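The equivalence can also be checked numerically. The following is a minimal sketch, assuming a made-up one-parameter model $f_\theta(x)=\theta x$, a handful of invented data points, and a fixed $\sigma=1$; it searches a grid of candidate values and confirms that the likelihood maximizer and the MSE minimizer coincide.

```python
import math

# Hypothetical data and a one-parameter model f_theta(x) = theta * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 1.0  # assumed (known) noise scale

def f(theta, x):
    return theta * x

def log_likelihood(theta):
    # Sum of Gaussian log-densities of the residuals y_i - f_theta(x_i).
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi))
        - 0.5 * ((y - f(theta, x)) / sigma) ** 2
        for x, y in zip(xs, ys)
    )

def mse(theta):
    return sum((y - f(theta, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Candidate values of theta on a grid from 1.000 to 3.000.
grid = [i / 1000 for i in range(1000, 3001)]
theta_mle = max(grid, key=log_likelihood)
theta_mse = min(grid, key=mse)
```

Changing `sigma` rescales and shifts the log-likelihood but does not move its maximizer, which is why $\sigma$ and $\pi$ can be dropped in the derivation.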

The $R^2$ Coefficient

Having motivated our choice of the MSE at some length, it is worth asking how low the MSE should be before we consider our model to be ‘good enough’?
This quantity turns out not to be well defined: the MSE will depend on the scale and variability of our data, and the difficulty of our task. For example, predicted ratings on a 5-point scale would likely have lower MSEs than predicted ratings on a 100-point scale; on the other hand, this might not be the case if ratings on a 100-point scale were highly concentrated (e.g., nearly all ratings were in the 92-95 range). Finally, the MSE in either setting could be higher simply due to a lack of available features that allow us to predict ratings accurately.

As such, we would like a calibrated measurement of model error. As we just argued, the MSE is related to the variance of the data: this relationship is easy to see as follows:
$$
\begin{aligned}
\bar{y} & =\frac{1}{|y|} \sum_i y_i, \\
\operatorname{var}(y) & =\frac{1}{|y|} \sum_i\left(y_i-\bar{y}\right)^2,
\end{aligned}
$$

$$
\operatorname{MSE}\left(y, f_\theta(X)\right)=\frac{1}{|y|} \sum_i\left(y_i-f_\theta\left(x_i\right)\right)^2 .
$$
In other words, the MSE would be equal to the variance if we had a trivial predictor that always estimated $f_\theta\left(x_i\right)=\bar{y}$. ${ }^5$ Thus, the variance might be used as a way of normalizing the MSE:
$$
\operatorname{FVU}\left(y, f_\theta(X)\right)=\frac{\operatorname{MSE}\left(y, f_\theta(X)\right)}{\operatorname{var}(y)} .
$$
This quantity, known as the Fraction of Variance Unexplained (FVU), essentially measures how much of the variability in the data the model fails to explain, relative to a predictor that always predicts the mean (i.e., one which explains no variability at all).

This quantity will now take a value between 0 and 1: 0 for a perfect predictor (MSE of zero) and 1 for a trivial predictor. ${ }^6$
Often, one reports the $R^2$ coefficient, which is simply 1 minus the FVU:
$$
R^2=1-\frac{\operatorname{MSE}\left(y, f_\theta(X)\right)}{\operatorname{var}(y)},
$$
which now takes a value of 1 for a perfect predictor and 0 for a trivial predictor. The name ‘$R^2$’ comes from a different way of deriving the same quantity, in terms of the correlation between the predictions and the labels. ${ }^7$
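A small sketch of these definitions follows; the labels are invented for illustration and the function names are our own. It evaluates a perfect predictor, the trivial mean predictor, and a deliberately poor predictor.

```python
def mean(values):
    return sum(values) / len(values)

def variance(y):
    # Population variance, matching the 1/|y| normalization in the text.
    y_bar = mean(y)
    return sum((y_i - y_bar) ** 2 for y_i in y) / len(y)

def mse(y, preds):
    return sum((y_i - p_i) ** 2 for y_i, p_i in zip(y, preds)) / len(y)

def fvu(y, preds):
    # Fraction of Variance Unexplained: MSE normalized by the label variance.
    return mse(y, preds) / variance(y)

def r2(y, preds):
    return 1.0 - fvu(y, preds)

y = [3.0, 4.0, 5.0, 4.0, 2.0]      # hypothetical labels
perfect = list(y)                  # predicts every label exactly
trivial = [mean(y)] * len(y)       # always predicts the mean
poor = [0.0] * len(y)              # worse than predicting the mean
```

One thing the sketch makes visible is that a predictor worse than the mean (such as `poor`) yields an FVU above 1 and therefore a negative $R^2$, so the 0-to-1 range applies to models that do at least as well as the trivial predictor.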



Machine Learning | COMP30027

Posted on December 23, 2022 by statistics-lab


Evaluating Regression Models

When developing the earlier linear models, we were somewhat imprecise about what is meant by a ‘line of best fit’ (or generally a model of best fit). Indeed, the pseudoinverse is not a ‘solution’ to the system of equations given in Equation (2.8), but is merely an approximation (naturally, the line of best fit does not pass through all points exactly).

Here, we would like to be more precise about what it means for a model to be ‘good.’ This is a key issue when fitting and evaluating any machine learning model: one needs a way of quantifying how closely a model fits the given data. Given a desired measure of success, we can compare alternative models against this measure and design optimization schemes that optimize the desired measure directly.

A commonly used evaluation criterion when evaluating regression algorithms is called the mean squared error, or MSE. The MSE between a model $f_\theta(X)$ and a set of labels $y$ is defined as
$$
\operatorname{MSE}\left(y, f_\theta(X)\right)=\frac{1}{|y|} \sum_{i=1}^{|y|}\left(f_\theta\left(x_i\right)-y_i\right)^2,
$$
in other words, the average squared difference between the model’s predictions and the labels. Often reported is also the root mean squared error (RMSE), that is, $\sqrt{\operatorname{MSE}\left(y, f_\theta(X)\right)}$; the RMSE is sometimes preferable as it is consistent in scale with the original labels.

With some effort, it can be shown that the linear model $f_\theta(X)$ that minimizes the MSE compared to the labels $y$ is given by using the pseudoinverse as in Equation (2.10). We leave this as an exercise (Exercise 2.6).
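To make the claim concrete, here is a minimal sketch for the special case of a line with an intercept, using four invented data points. Instead of forming the pseudoinverse explicitly, it solves the equivalent $2 \times 2$ normal equations by hand (an assumption made to keep the example dependency-free), then checks that nudging the fitted parameters only increases the MSE.

```python
import math

def mse(y, preds):
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def rmse(y, preds):
    # Same scale as the labels themselves.
    return math.sqrt(mse(y, preds))

def fit_line(xs, ys):
    # Solve (X^T X) theta = X^T y for rows X_i = [1, x_i] via Cramer's rule.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

def predict(theta0, theta1, xs):
    return [theta0 + theta1 * x for x in xs]

xs = [0.0, 1.0, 2.0, 3.0]          # hypothetical features
ys = [1.0, 2.9, 5.2, 6.9]          # hypothetical labels
theta0, theta1 = fit_line(xs, ys)
best = mse(ys, predict(theta0, theta1, xs))

# MSE at slightly perturbed parameters, for comparison with the fitted ones.
perturbed = [
    mse(ys, predict(theta0 + d0, theta1 + d1, xs))
    for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
```

This is a numerical spot check rather than a proof; the exercise referenced above asks for the general argument.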

Why the Mean Squared Error

Although the MSE has a convenient relationship with the pseudoinverse, it may otherwise seem a somewhat arbitrary choice of error measure. For instance, it may seem more obvious at first to compute an error measure such as the mean absolute error (or MAE):
$$
\operatorname{MAE}\left(y, f_\theta(X)\right)=\frac{1}{|y|} \sum_{i=1}^{|y|}\left|f_\theta\left(x_i\right)-y_i\right| \text {. }
$$
Or, why not count the number of times the model is wrong by more than one star? For that matter, why not measure the mean cubed error?

To defend the MSE as a reasonable choice, we need to characterize what types of errors are more ‘likely’ than others. Essentially, the MSE assigns very small penalties to small errors and very large penalties to large errors. This is in contrast to, say, the MAE, which assigns penalties precisely in proportion to how large the error is. What the MSE therefore seems to be assuming is that small errors are common and large errors are particularly uncommon.
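The difference in how the two measures penalize errors is easy to see on two made-up error profiles with the same total absolute error: one spreads it over many small mistakes, the other concentrates it in a single large mistake.

```python
def mse(errors):
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical error profiles with identical total absolute error (4.0).
many_small = [0.5] * 8             # eight errors of half a star
one_large = [4.0] + [0.0] * 7      # one four-star error, seven exact predictions

mae_small, mae_large = mae(many_small), mae(one_large)
mse_small, mse_large = mse(many_small), mse(one_large)
```

The MAE cannot tell the two profiles apart (0.5 in both cases), whereas the MSE rates the single large error eight times worse (2.0 versus 0.25).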

What we are talking about informally here is a notion of how errors are distributed under some model. Formally, we say that the labels are equal to our model’s predictions, plus some error:
$$
y=\underbrace{f_\theta(X)}_{\text {prediction }}+\underbrace{\epsilon}_{\text {error }},
$$
and that our error follows some probability distribution. Our argument here said that small errors are common and large errors are very rare. This suggests that errors may be distributed following a bell curve, which we could capture with a Gaussian (or ‘Normal’) distribution:
$$
\epsilon \sim \mathcal{N}\left(0, \sigma^2\right) .
$$
The density function for a (zero mean) Gaussian distribution is given by
$$
f^{\prime}\left(x^{\prime}\right)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x^{\prime}}{\sigma}\right)^2}
$$



Machine Learning | COMP5318

Posted on December 23, 2022 by statistics-lab


Supervised Learning

All of the techniques presented in this chapter, and most of the personalization techniques we will explore throughout this book, are forms of supervised learning. Supervised learning techniques assume that our prediction tasks (or our datasets) can be separated into the following two components: labels (denoted $y$) that we would like to predict, and features (denoted $X$) that we believe will help us to predict those labels. ${ }^1$
For example, given a sentiment analysis task (chap. 8), our data might be (the text of) reviews from Amazon or Yelp, and our labels would be the ratings associated with those reviews.

Given this distinction between features and labels in a dataset, the goal of a supervised learning algorithm is to infer the underlying function
$$
f(x) \rightarrow y
$$
that explains the relationship between the features and the labels. Usually, this function will be parameterized by model parameters $\theta$, that is,
$$
f_\theta(x) \rightarrow y .
$$
For example, in this chapter, $\theta$ might describe which features are positively or negatively correlated (or uncorrelated) with the labels; later, $\theta$ might capture the preferences of a particular user in a recommender system (chap. 5). Figure 2.1 explains how this type of supervised approach relates to other types of learning.

Throughout this chapter, we will assume that we are given labels in the form of a vector $y$ and features in the form of a matrix $X$, so that each $y_i$ is the label associated with the $i$ th observation and $x_i$ is a vector of features associated with that observation.

The two categories of supervised learning that we will cover in this and the next chapter include:

Regression, in which our goal is to predict real-valued labels $y$ as closely as possible (sec. 2.1). When building personalized models in later chapters, such targets may include ratings, sentiment, the number of votes a social media post receives, or a patient’s heart rate.
Classification, in which $y$ is an element of a discrete set (chap. 3). In later chapters, these will correspond to outcomes such as whether a user clicks on or purchases an item. We will also see how such approaches can be adapted to learn rankings over items (sec. 3.3.3).

Linear Regression

Perhaps the simplest association we could assume between our features $X$ and labels $y$ would be a linear relationship, that is, the relationship between $X$ and $y$ is defined as
$$
y=X \theta .
$$
Using our notation from Equation (2.2):
$$
f_\theta(X)=X \theta,
$$
or equivalently for a single observation $x_i$ (a row of $X$ )
$$
f_\theta\left(x_i\right)=x_i \cdot \theta=\sum_k x_{i k} \theta_k .
$$
Here $\theta$ is our set of model parameters: a vector of unknowns that describes which features are relevant to predicting the labels.

Ignoring strict notation for now, a trivial example might consist of predicting a review’s rating as a function of its length. To do so, let us consider a small dataset of 100 (length, rating) pairs from Goodreads fantasy novels (Wan and McAuley, 2018). Figure 2.2 plots the relationship between review length (in characters) and the rating.

From Figure 2.2, there appears to be a (rough) association between ratings and review length, that is, more positive reviews tend to be longer. A very simple model might attempt to describe that relationship with a line, that is,
$$
\text { rating } \simeq \theta_0+\theta_1 \times \text { (review length). }
$$
Note that Equation (2.6) is just the standard equation for a line $(y=m x+b)$, where $\theta_1$ is a slope and $\theta_0$ is an intercept.

If we can identify a line that approximately describes this relationship, we can use it to estimate a rating from a given review, even though we may never have seen a review of some specific length before. In this sense, the line is a simple model of the data, as it allows us to predict labels from previously unseen features. To do so, we formalize the problem of finding a line of best fit. Specifically, we are interested in identifying the values of $\theta_0$ and $\theta_1$ that most closely match the trend in Figure 2.2. To solve for $\theta=\left[\theta_0, \theta_1\right]$, we can write out the problem as a system of equations in matrix form:
$$
y \simeq X \cdot \theta,
$$
where $y$ is our vector of observed ratings and $X$ is our matrix of observed features (in this case the reviews’ lengths).
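The matrix form can be sketched as follows. The review lengths and the parameter values here are invented for illustration (they are not fitted to the Goodreads data); the point is only how a column of ones in $X$ lets $\theta_0$ act as the intercept.

```python
# Hypothetical review lengths, in characters.
lengths = [120, 450, 800, 1500]

# Each row of X is [1, length]; the constant 1 multiplies the intercept theta_0.
X = [[1.0, float(length)] for length in lengths]

def predict(X, theta):
    # Matrix-vector product X . theta: one dot product per row of X.
    return [sum(x_k * t_k for x_k, t_k in zip(row, theta)) for row in X]

# Hypothetical parameters [theta_0, theta_1]; illustrative values, not a fit.
theta = [3.5, 0.0005]
ratings = predict(X, theta)

# A length not present in the data can still be scored.
unseen = predict([[1.0, 1000.0]], theta)[0]
```

With these made-up parameters, each additional 1,000 characters raises the predicted rating by half a star, which is how the slope $\theta_1$ would be read.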



Machine Learning | COMP4702

Posted on December 21, 2022 by statistics-lab


Decision fn is degree-p polynomial

E.g., a cubic in $\mathbb{R}^2$:
$$
\begin{aligned}
& \Phi(x)=\left[\begin{array}{lllllllll}
x_1^3 & x_1^2 x_2 & x_1 x_2^2 & x_2^3 & x_1^2 & x_1 x_2 & x_2^2 & x_1 & x_2
\end{array}\right]^{\top} \\
& \Phi(x): \mathbb{R}^d \rightarrow \mathbb{R}^{O\left(d^p\right)}
\end{aligned}
$$
[Now we’re really blowing up the number of features! If you have, say, 100 features per sample point and you want to use degree-4 decision functions, then each lifted feature vector has a length of roughly 4 million, and your learning algorithm will take approximately forever to run.]
[However, later in the semester we will learn an extremely clever trick that allows us to work with these huge feature vectors very quickly, without ever computing them. It’s called “kernelization” or “the kernel trick.” So even though it appears now that working with degree-4 polynomials is computationally infeasible, it can actually be done quickly.]
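The lifting map and its size can be sketched explicitly. The helper names `lift` and `lifted_dim` below are our own; the code enumerates all monomials of degree 1 through $p$ (omitting the constant term, as in the cubic example) and counts them with a binomial coefficient.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def lift(x, p):
    # All monomials of degree p down to 1 in the entries of x (no constant term).
    feats = []
    for degree in range(p, 0, -1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(prod(x[i] for i in idx))
    return feats

def lifted_dim(d, p):
    # Number of monomials of degree 1..p in d variables.
    return comb(d + p, p) - 1

# Cubic lift of a point in R^2, in the same order as the example above:
# x1^3, x1^2 x2, x1 x2^2, x2^3, x1^2, x1 x2, x2^2, x1, x2
phi = lift([2.0, 3.0], 3)
```

For $d=100$ and $p=4$, `lifted_dim` gives 4,598,125 features, which is the "roughly 4 million" figure quoted above.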

[Increasing the degree like this accomplishes two things.

First, the data might become linearly separable when you lift them to a high enough degree, even if the original data are not linearly separable.
Second, raising the degree can widen the margin, so you might get a more robust decision boundary that generalizes better to test data.

However, if you raise the degree too high, you will overfit the data and then generalization will get worse.]

[You should search for the ideal degree: not too small, not too big. It’s a balancing act between underfitting and overfitting. The degree is an example of a hyperparameter that can be optimized by validation.]
[If you’re using both polynomial features and a soft-margin SVM, now you have two hyperparameters: the degree and the regularization hyperparameter $C$. Generally, the optimal $C$ will be different for every polynomial degree, so when you change the degree, you have to run validation again to find the best $C$ for that degree.]
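The nested search described here can be sketched as below. Note the loud assumption: `toy_validation_error` is a made-up stand-in for "train a soft-margin SVM with degree-$p$ features and regularization $C$, then measure its error on held-out data"; it is shaped so that the best $C$ shifts with the degree, but it is not a real SVM.

```python
import math

def toy_validation_error(p, C):
    # Hypothetical stand-in for real validation error (NOT an actual SVM).
    # Best degree is 3; the best C for degree p is 10 ** (p - 2).
    return (p - 3) ** 2 + (math.log10(C) - (p - 2)) ** 2

def select_hyperparameters(degrees, Cs, validation_error):
    best = None          # (degree, C, error) of the best pair seen so far
    best_C = {}          # best C found for each degree
    for p in degrees:
        # Re-run validation over C for every degree; the optimum may move.
        C_p = min(Cs, key=lambda C: validation_error(p, C))
        err = validation_error(p, C_p)
        best_C[p] = C_p
        if best is None or err < best[2]:
            best = (p, C_p, err)
    return best, best_C

degrees = [1, 2, 3, 4, 5]
Cs = [0.1, 1.0, 10.0, 100.0, 1000.0]
best, best_C = select_hyperparameters(degrees, Cs, toy_validation_error)
```

Tuning $C$ once and reusing it for every degree would miss most of the per-degree optima in `best_C`, which is the reason for re-validating whenever the degree changes.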
[So far I’ve talked only about polynomial features. But features can get much more complicated than polynomials, and they can be tailored to fit a specific problem. Let’s consider a type of feature you might use if you wanted to implement, say, a handwriting recognition algorithm.]

Machine Learning Abstractions and Numerical Optimization

[When you write a large computer program, you break it down into subroutines and modules. Many of you know from experience that you need to have the discipline to impose strong abstraction barriers between different modules, or your program will become so complex you can no longer manage nor maintain it.]
[When you learn a new subject, it helps to have mental abstraction barriers, too, so you know when you can replace one approach with a different approach. I want to give you four levels of abstraction that can help you think about machine learning. It’s important to make mental distinctions between these four things, and the code you write should have modules that reflect these distinctions as well.]

[In this course, we focus primarily on the middle two levels. As a data scientist, you might be given an application, and your challenge is to turn it into an optimization problem that we know how to solve. We will talk a bit about optimization algorithms, but usually you’ll use an optimization code that’s faster and more robust than what you would write yourself.]
[The second level, the model, has a huge effect on the success of your learning algorithm. Sometimes you get a big improvement by tailoring the model or its features to fit the structure of your specific data. The model also has a big effect on whether you overfit or underfit. And if you want a model that you can interpret so you can do inference, the model has to have a simple structure. Lastly, you have to pick a model that leads to an optimization problem that can be solved. Some optimization problems are just too hard.]
[It’s important to understand that when you change something in one level of this diagram, you probably have to change all the levels underneath it. If you switch your model from a linear classifier to a neural net, your optimization problem changes, and your optimization algorithm changes too.]

[Not all machine learning methods fit this four-level decomposition. Nevertheless, for everything you learn in this class, think about where it fits in this hierarchy. If you don’t distinguish which math is part of the model and which math is part of the optimization algorithm, this course will be very confusing for you.]
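One way to reflect these distinctions in code is to keep the model, the optimization problem, and the optimization algorithm in separate functions. The sketch below is an assumption-laden illustration: the level names are taken from the surrounding discussion, and the one-parameter model, the data, and both solvers are invented for the example.

```python
# Application / data: hypothetical (x, y) pairs.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Model: a one-parameter linear predictor.
def predict(theta, x):
    return theta * x

# Optimization problem: minimize the mean squared error of the model on the data.
def objective(theta):
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def gradient(theta):
    return sum(2 * (predict(theta, x) - y) * x for x, y in data) / len(data)

# Optimization algorithm 1: gradient descent with a fixed step size.
def gradient_descent(grad, theta0=0.0, step=0.05, iters=100):
    theta = theta0
    for _ in range(iters):
        theta -= step * grad(theta)
    return theta

# Optimization algorithm 2: brute-force search over candidate values.
def grid_search(obj, candidates):
    return min(candidates, key=obj)

theta_gd = gradient_descent(gradient)
theta_grid = grid_search(objective, [i / 10 for i in range(41)])
```

Both algorithms solve the same optimization problem, so either can be swapped in without touching the model; changing `predict`, by contrast, changes `objective` and `gradient` beneath it.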

机器学习代考

计算机代写|机器学习代写machine learning代考|Decision fn is degree-p polynomial

例如，一个立方体 $\mathbb{R}^2$ :
[现在我们真的在炸毁功能的数量! 如果你有，比如说，每个样本点有 100 个特征，并且你想使用 4 阶决策函数，那么每个提升特征向量的长度大约为 400 万，你的学习算法将需要大约永远运行。]
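[However, later in the semester we will learn an extremely clever trick that allows us to work with these huge feature vectors very quickly, without ever computing them. It’s called “kernelization” or “the kernel trick.” So even though it appears now that working with degree-4 polynomials is computationally infeasible, it can actually be done quickly.]
[Increasing the degree like this accomplishes two things.

First, the data might become linearly separable when you lift them to a high enough degree, even if the original data are not linearly separable.
Second, raising the degree can increase the margin, so you might get a more robust decision boundary that generalizes better to test data.
However, if you raise the degree too high, you will overfit the data and then generalization will get worse.]
[You should search for the ideal degree, not too small, not too big. It’s a balancing act between underfitting and overfitting. The degree is an example of a hyperparameter that can be optimized by validation.]
[If you’re using both polynomial features and a soft-margin SVM, now you have two hyperparameters: the degree and the regularization hyperparameter $C$. Generally, the optimal $C$ will be different for every polynomial degree, so when you change the degree, you have to run validation again to find the best $C$.]
[So far I’ve talked only about polynomial features. But features can get much more complicated than polynomials, and they can be tailored to fit a specific problem. Let’s consider a type of feature you might use if you wanted to implement, say, a handwriting recognition algorithm.]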

Machine Learning Abstractions and Numerical Optimization

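[When you write a large computer program, you break it down into subroutines and modules. Many of you know from experience that you need the discipline to impose strong abstraction barriers between different modules, or your program will become so complex that you can no longer manage or maintain it.]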

Machine Learning: COMP30027

Posted on December 21, 2022 by statistics-lab

Soft-Margin Support Vector Machines; Features

Idea: Allow some points to violate the margin, with slack variables.
Modified constraint for point $i$ :
$$
y_i\left(X_i \cdot w+\alpha\right) \geq 1-\xi_i
$$
[Observe that the only difference between these constraints and the hard-margin constraints we saw last lecture is the extra slack term $\xi_i$.]
[We also impose new constraints, that the slack variables are never negative.]
$$
\xi_i \geq 0
$$
[This inequality ensures that all sample points that don’t violate the margin are treated the same; they all have $\xi_i=0$. Point $i$ has nonzero $\xi_i$ if and only if it violates the margin.]

[One way to think about slack is to pretend that slack is money we can spend to buy permission for a sample point to violate the margin. The further a point penetrates the margin, the bigger the fine you have to pay. We want to make the margin as wide as possible, but we also want to spend as little money as possible. If the regularization parameter $C$ is small, it means we’re willing to spend lots of money on violations so we can get a wider margin. If $C$ is big, it means we’re cheap and we won’t pay much for violations, even though we’ll suffer a narrower margin. If $C$ is infinite, we’re back to a hard-margin SVM.]
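
The two constraints can be made concrete with a small sketch. For a fixed $w$ and $\alpha$, the smallest slack that satisfies both constraints for point $i$ is $\max(0,\ 1-y_i(X_i \cdot w+\alpha))$. The sketch assumes the optimizer drives each $\xi_i$ down to that value. The function name and the toy data are hypothetical, and the objective that weighs slack against margin width via $C$ is not shown in this section, so the sketch stops at the slacks.

```python
def slack(X, y, w, alpha):
    """Smallest slack xi_i satisfying both constraints for fixed w and alpha."""
    xs = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(a * b for a, b in zip(x_i, w)) + alpha)
        xs.append(max(0.0, 1.0 - margin))
    return xs

# Three points labeled +1, with w = [1, 0] and alpha = 0.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, 1]
xi = slack(X, y, w=[1.0, 0.0], alpha=0.0)   # [0.0, 0.5, 2.0]
```

The first point respects the margin and pays nothing, the second sits inside the margin, and the third is on the wrong side of the decision boundary, so it pays the largest fine.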

Axis-aligned ellipsoid/hyperboloid decision boundaries

[Draw examples of axis-aligned ellipse & hyperbola.]
In 3D, these have the formula
$$
A x_1^2+B x_2^2+C x_3^2+D x_1+E x_2+F x_3+\alpha=0
$$
[Here, the capital letters are scalars, not matrices.]
$$
\begin{aligned}
& \Phi: \mathbb{R}^d \rightarrow \mathbb{R}^{2 d} \\
& \Phi(x)=\left[\begin{array}{llllll}
x_1^2 & \ldots & x_d^2 & x_1 & \ldots & x_d
\end{array}\right]^{\top}
\end{aligned}
$$
[We’ve turned $d$ input features into $2 d$ features for our linear classifier. If the points are separable by an axis-aligned ellipsoid or hyperboloid, per the formula above, then the points lifted to $\Phi$-space are separable by a hyperplane whose normal vector is $\left[\begin{array}{llllll}A & B & C & D & E & F\end{array}\right]$.]
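
A minimal sketch of this lift, using a 2D circle as the example boundary. The function names and the specific coefficients are illustrative assumptions.

```python
def phi_axis_aligned(x):
    """Lift x in R^d to [x_1^2, ..., x_d^2, x_1, ..., x_d] in R^(2d)."""
    return [xi * xi for xi in x] + list(x)

def decision(x, normal, alpha):
    """Linear decision function evaluated in Phi-space."""
    return sum(n * f for n, f in zip(normal, phi_axis_aligned(x))) + alpha

# Unit circle x_1^2 + x_2^2 - 1 = 0: normal vector [1, 1, 0, 0], alpha = -1.
normal, alpha = [1.0, 1.0, 0.0, 0.0], -1.0
inside = decision([0.5, 0.0], normal, alpha)     # negative
outside = decision([2.0, 0.0], normal, alpha)    # positive
on_curve = decision([1.0, 0.0], normal, alpha)   # zero
```

The classifier is still linear in the lifted features, and only the lift $\Phi$ makes the boundary curved in the original space.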

[Draw example of non-axis-aligned ellipse.]
3D formula: [for a general ellipsoid or hyperboloid]
$$
\begin{aligned}
& A x_1^2+B x_2^2+C x_3^2+D x_1 x_2+E x_2 x_3+F x_3 x_1+G x_1+H x_2+I x_3+\alpha=0 \\
& \Phi(x): \mathbb{R}^d \rightarrow \mathbb{R}^{\left(d^2+3 d\right) / 2}
\end{aligned}
$$
[Now, our decision function can be any degree- 2 polynomial.]
Isosurface defined by this equation is called a quadric. [In the special case of two dimensions, it’s also known as a conic section. So our decision boundary can be an arbitrary conic section.]
[You’ll notice that there is a quadratic blowup in the number of features, because every pair of input features creates a new feature in $\Phi$-space. If the dimension is large, these feature vectors are getting huge, and that’s going to impose a serious computational cost. But it might be worth it to find good classifiers for data that aren’t linearly separable.]
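
The quadratic blowup can be seen in a short sketch of the full degree-2 lift. The ordering of the cross terms here is an arbitrary choice that differs from the formula above (it only permutes the coefficients), and the function names are hypothetical.

```python
from itertools import combinations

def phi_quadratic(x):
    """Squares, then all pairwise products, then the original features."""
    squares = [xi * xi for xi in x]
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return squares + cross + list(x)

def lifted_dim(d):
    """Length of the lifted vector: (d^2 + 3d) / 2."""
    return (d * d + 3 * d) // 2

features = phi_quadratic([1.0, 2.0, 3.0])
# → [1.0, 4.0, 9.0, 2.0, 3.0, 6.0, 1.0, 2.0, 3.0]
```

With $d=3$ the lift has 9 features, matching the nine coefficients $A$ through $I$. With $d=100$, `lifted_dim(100)` gives 5150.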

Machine Learning: COMP5318

Posted on December 21, 2022 by statistics-lab

Linear Classifiers and Perceptrons

You are given a sample of $n$ observations, each with $d$ features [aka predictors]. Some observations belong to class C; some do not.
Example: Observations are bank loans
Features are income & age $(d=2)$
Some are in class “defaulted,” some are not
Goal: Predict whether future borrowers will default, based on their income & age.
Represent each observation as a point in $d$-dimensional space, called a sample point / a feature vector / independent variables.

[We draw these lines/curves separating C’s from X’s. Then we use these curves to predict which future borrowers will default. In the last example, though, we’re probably overfitting, which could hurt our predictions.]
decision boundary: the boundary chosen by our classifier to separate items in the class from those not.
overfitting: When a sinuous decision boundary fits sample points so well that it doesn’t classify future points well.
[A reminder that underlined phrases are definitions, worth memorizing.]
Some (not all) classifiers work by computing a
decision function: A function $f(x)$ that maps a point $x$ to a scalar such that
$$
\begin{array}{ll}f(x)>0 & \text { if } x \in \text { class C; } \\ f(x) \leq 0 & \text { if } x \notin \text { class C. }\end{array}
$$
Aka predictor function.
For these classifiers, the decision boundary is $\left\{x \in \mathbb{R}^d: f(x)=0\right\}$
[That is, the set of all points where the decision function is zero.]
Usually, this set is a $(d-1)$-dimensional surface in $\mathbb{R}^d$.
$\{x: f(x)=0\}$ is also called an isosurface of $f$ for the isovalue 0.
$f$ has other isosurfaces for other isovalues, e.g., $\{x: f(x)=1\}$.
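
As a sketch of how a decision function drives classification, the snippet below assumes a linear $f(x)=w \cdot x+\alpha$ for the loan example. The weights are hypothetical, not fitted to any data, and the function names are illustrative.

```python
def make_linear_decision(w, alpha):
    """Build f(x) = w . x + alpha for fixed w and alpha."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + alpha
    return f

def in_class(f, x):
    """x is predicted to be in class C exactly when f(x) > 0."""
    return f(x) > 0

# Hypothetical weights on (income in $1000s, age); class C is "defaulted".
f = make_linear_decision(w=[-0.05, -0.02], alpha=3.0)
low_income = in_class(f, [20.0, 25.0])    # f is about 1.5, so predicted to default
high_income = in_class(f, [100.0, 40.0])  # f is about -2.8, so predicted not to default
```

Points where `f` returns exactly zero lie on the decision boundary and, by the definition above, are classified as not in class C.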

Perceptron Learning; Maximum Margin Classifiers

Recall:

linear decision fn $f(x)=w \cdot x$
(for simplicity, no $\alpha$)
decision boundary $\{x: f(x)=0\}$
(a hyperplane through the origin)
sample points $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$; class labels $y_1, \ldots, y_n=\pm 1$
goal: find weights $w$ such that $y_i X_i \cdot w \geq 0$
goal, revised: find $w$ that minimizes $R(w)=\sum_{i \in V}-y_i X_i \cdot w$
[risk function] where $V$ is the set of indices $i$ for which $y_i X_i \cdot w<0$.
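
The risk function above can be sketched directly. With $V$ held fixed, differentiating $R$ gives $\nabla R(w)=-\sum_{i \in V} y_i X_i$, so a plain gradient descent loop follows. The step size, iteration cap, stopping rule, and toy data are illustrative assumptions, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def violators(w, X, y):
    """The index set V: points with y_i X_i . w < 0."""
    return [i for i in range(len(X)) if y[i] * dot(X[i], w) < 0]

def risk(w, X, y):
    """R(w) = sum over i in V of -y_i X_i . w."""
    return sum(-y[i] * dot(X[i], w) for i in violators(w, X, y))

def descend(w, X, y, step=0.1, max_iters=1000):
    """Batch gradient descent on R; with V fixed, grad R = -sum_{i in V} y_i X_i."""
    w = list(w)
    for _ in range(max_iters):
        V = violators(w, X, y)
        if not V:          # no violators, so R(w) = 0
            break
        grad = [-sum(y[i] * X[i][j] for i in V) for j in range(len(w))]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w

# Toy data separable by a hyperplane through the origin.
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w0 = [-1.0, 0.0]           # starts with every point misclassified
w = descend(w0, X, y)
```

On this toy data the starting risk is 6, and the loop stops once no point has $y_i X_i \cdot w<0$.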
[Our original problem was to find a separating hyperplane in one space, which I’ll call $x$-space. But we’ve transformed this into a problem of finding an optimal point in a different space, which I’ll call $w$-space. It’s important to understand transformations like this, where a geometric structure in one space becomes a point in another space.]
Point $x$ lies on hyperplane $\{z: w \cdot z=0\} \Leftrightarrow w \cdot x=0 \Leftrightarrow$ point $w$ lies on hyperplane $\{z: x \cdot z=0\}$ in $w$-space.
[So a hyperplane transforms to a point that represents its normal vector. And a sample point transforms to the hyperplane whose normal vector is the sample point.]
[In this algorithm, the transformations happen to be symmetric: a hyperplane in $x$-space transforms to a point in $w$-space the same way that a hyperplane in $w$-space transforms to a point in $x$-space. That won’t always be true for the decision boundaries we use this semester.]
If we want to enforce inequality $x \cdot w \geq 0$, that means
in $x$-space, $x$ should be on the same side of $\{z: w \cdot z=0\}$ as $w$
