Tag: Convergence Analysis

Mathematics | Optimization for Machine Learning | ISE633

Posted on December 1, 2022 by statistics-lab

Optimization for Machine Learning is the branch of mathematics devoted to solving optimization problems. An optimization problem is one in which we want to minimize or maximize the value of a mathematical function. Problems of this type abound in computer science and applied mathematics.

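Every optimization problem has three components: an objective function, decision variables, and constraints. Formulating an optimization problem means translating a "real-world" problem into mathematical equations and variables that comprise these three components. The objective function, usually denoted $f$ or $z$, reflects the single quantity to be maximized or minimized. Examples from transportation include "minimize congestion", "maximize safety", "maximize accessibility", "minimize cost", "maximize pavement quality", "minimize emissions", "maximize revenue", and so on.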



Mathematics | Optimization for Machine Learning | Least Squares

The most important gradient formula is that of the square loss (3), which can be obtained by expanding the norm
$$
\begin{aligned}
f(x+\varepsilon) &=\frac{1}{2}\|A x-y+A \varepsilon\|^2=\frac{1}{2}\|A x-y\|^2+\langle A x-y, A \varepsilon\rangle+\frac{1}{2}\|A \varepsilon\|^2 \\
&=f(x)+\left\langle\varepsilon, A^{\top}(A x-y)\right\rangle+o(\|\varepsilon\|) .
\end{aligned}
$$
Here, we have used the fact that $\|A \varepsilon\|^2=o(\|\varepsilon\|)$ and introduced the transpose matrix $A^{\top}$. This matrix is obtained by exchanging the rows and the columns, i.e. $A^{\top}=\left(A_{j, i}\right)_{i=1, \ldots, p}^{j=1, \ldots, n}$, but the way it should be remembered and used is that it obeys the following swapping rule of the inner product,
$$
\forall(u, v) \in \mathbb{R}^p \times \mathbb{R}^n, \quad\langle A u, v\rangle_{\mathbb{R}^n}=\left\langle u, A^{\top} v\right\rangle_{\mathbb{R}^p} .
$$
Computing the gradient of a function involving a linear operator necessarily requires such a transposition step. This computation shows that
$$
\nabla f(x)=A^{\top}(A x-y)
$$
This implies that any solution $x^{\star}$ minimizing $f(x)$ satisfies the linear system $\left(A^{\top} A\right) x^{\star}=A^{\top} y$. If $A^{\top} A \in \mathbb{R}^{p \times p}$ is invertible, then $f$ has a single minimizer, namely
$$
x^{\star}=\left(A^{\top} A\right)^{-1} A^{\top} y .
$$
This shows that in this case, $x^{\star}$ depends linearly on the data $y$, and the corresponding linear operator $\left(A^{\top} A\right)^{-1} A^{\top}$ is often called the Moore-Penrose pseudo-inverse of $A$ (which is itself not invertible in general, since typically $p \neq n$). The condition that $A^{\top} A$ is invertible is equivalent to $\operatorname{ker}(A)=\{0\}$, since
$$
A^{\top} A x=0 \quad \Longrightarrow \quad \|A x\|^2=\left\langle A^{\top} A x, x\right\rangle=0 \quad \Longrightarrow \quad A x=0 .
$$
In particular, if $n<p$ (the under-determined regime, with too many parameters or too few data points), this can never hold. If $n \geqslant p$ and the features $a_i$ are “random”, then $\operatorname{ker}(A)=\{0\}$ with probability one. In this overdetermined situation $n \geqslant p$, $\operatorname{ker}(A) \neq\{0\}$ only holds if the features $\left\{a_i\right\}_{i=1}^n$ span a linear space $\operatorname{Im}\left(A^{\top}\right)$ of dimension strictly smaller than the ambient dimension $p$.
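
As a quick numerical sanity check of the formulas above, the following sketch solves the normal equations with NumPy and compares the gradient $A^{\top}(A x-y)$ with finite differences. The sizes `n`, `p`, the random data, and all variable names are assumptions made for illustration; they do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes: n samples (rows of A), p parameters, with n >= p.
n, p = 50, 3
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)


def f(x):
    """Square loss f(x) = 1/2 * ||A x - y||^2."""
    r = A @ x - y
    return 0.5 * float(r @ r)


def grad_f(x):
    """Gradient A^T (A x - y) derived in the text."""
    return A.T @ (A @ x - y)


# Solve the normal equations (A^T A) x = A^T y; valid here because ker(A) = {0}.
x_star = np.linalg.solve(A.T @ A, A.T @ y)

# Reference solution from NumPy's least-squares solver.
x_ref = np.linalg.lstsq(A, y, rcond=None)[0]

# Central finite-difference approximation of the gradient at a random point.
x0 = rng.standard_normal(p)
eta = 1e-6
fd_grad = np.array([
    (f(x0 + eta * e) - f(x0 - eta * e)) / (2 * eta)
    for e in np.eye(p)
])

grad_norm_at_min = float(np.linalg.norm(grad_f(x_star)))
solver_gap = float(np.linalg.norm(x_star - x_ref))
fd_gap = float(np.linalg.norm(fd_grad - grad_f(x0)))
```

In practice one would usually call a least-squares solver directly rather than form $A^{\top} A$, since forming the product squares the condition number; the explicit solve is kept here only because it mirrors the formula.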

Mathematics | Optimization for Machine Learning | Link with PCA

Let us assume the $\left(a_i\right)_{i=1}^n$ are centered, i.e. $\sum_i a_i=0$. If this is not the case, one needs to replace $a_i$ by $a_i-m$ where $m \stackrel{\text { def. }}{=} \frac{1}{n} \sum_{i=1}^n a_i \in \mathbb{R}^p$ is the empirical mean. In this case, $\frac{C}{n}=A^{\top} A / n \in \mathbb{R}^{p \times p}$ is the empirical covariance of the point cloud $\left(a_i\right)_i$; it encodes the covariances between the coordinates of the points. Denoting $a_i=\left(a_{i, 1}, \ldots, a_{i, p}\right)^{\top} \in \mathbb{R}^p$ the coordinates (so that $A=\left(a_{i, j}\right)_{i, j}$), one has
$$
\forall(k, \ell) \in\{1, \ldots, p\}^2, \quad \frac{C_{k, \ell}}{n}=\frac{1}{n} \sum_{i=1}^n a_{i, k} a_{i, \ell} .
$$
In particular, $C_{k, k} / n$ is the variance along the axis $k$. More generally, for any unit vector $u \in \mathbb{R}^p$, $\langle C u, u\rangle / n \geqslant 0$ is the variance along the axis $u$.
For instance, in dimension $p=2$,
$$
\frac{C}{n}=\frac{1}{n}\begin{pmatrix}\sum_{i=1}^n a_{i, 1}^2 & \sum_{i=1}^n a_{i, 1} a_{i, 2} \\ \sum_{i=1}^n a_{i, 1} a_{i, 2} & \sum_{i=1}^n a_{i, 2}^2\end{pmatrix} .
$$
Since $C$ is symmetric, it diagonalizes in an orthonormal basis $U=\left(u_1, \ldots, u_p\right) \in \mathbb{R}^{p \times p}$. Here, the vectors $u_k \in \mathbb{R}^p$ are stored in the columns of the matrix $U$. The diagonalization means that there exist scalars (the eigenvalues) $\left(\lambda_1, \ldots, \lambda_p\right)$ so that $\left(\frac{1}{n} C\right) u_k=\lambda_k u_k$. Since the matrix $U$ is orthogonal, $U U^{\top}=U^{\top} U=\mathrm{Id}_p$, or equivalently $U^{-1}=U^{\top}$. The diagonalization property can be conveniently written as $\frac{1}{n} C=U \operatorname{diag}\left(\lambda_k\right) U^{\top}$. One can thus rewrite the covariance quadratic form in the basis $U$ as a separable sum of $p$ squares
$$
\frac{1}{n}\langle C x, x\rangle=\left\langle U \operatorname{diag}\left(\lambda_k\right) U^{\top} x, x\right\rangle=\left\langle\operatorname{diag}\left(\lambda_k\right)\left(U^{\top} x\right),\left(U^{\top} x\right)\right\rangle=\sum_{k=1}^p \lambda_k\left\langle x, u_k\right\rangle^2 .
$$
Here $\left(U^{\top} x\right)_k=\left\langle x, u_k\right\rangle$ is the coordinate $k$ of $x$ in the basis $U$. Since $\langle C x, x\rangle=\|A x\|^2 \geqslant 0$, taking $x=u_k$ gives $\lambda_k=\left\|A u_k\right\|^2 / n$, which shows that all the eigenvalues are non-negative, $\lambda_k \geqslant 0$.
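
The diagonalization above can be checked numerically. The sketch below builds a synthetic point cloud (an assumption, as are the sizes and variable names), centers it, and uses `numpy.linalg.eigh` to verify the factorization $\frac{1}{n} C=U \operatorname{diag}\left(\lambda_k\right) U^{\top}$ and the separable form of the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical point cloud: n points in dimension p, stored as the rows of A.
n, p = 200, 3
A = rng.standard_normal((n, p)) @ np.diag([3.0, 1.0, 0.2])

# Center the points by subtracting the empirical mean m.
m = A.mean(axis=0)
A = A - m

# Empirical covariance C / n = A^T A / n.
cov = A.T @ A / n

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order and orthonormal eigenvectors as the columns of U.
lam, U = np.linalg.eigh(cov)

# Check orthogonality U^T U = Id_p and the factorization C / n = U diag(lam) U^T.
orth_err = float(np.linalg.norm(U.T @ U - np.eye(p)))
recon_err = float(np.linalg.norm(U @ np.diag(lam) @ U.T - cov))

# Check the separable form <C x, x> / n = sum_k lam_k <x, u_k>^2 for a random x.
x = rng.standard_normal(p)
quad = float(x @ cov @ x)
separable = float(np.sum(lam * (U.T @ x) ** 2))
```

The columns of `U` associated with the largest entries of `lam` are the directions of largest variance, which is how this decomposition connects to PCA.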


Mathematics | Optimization for Machine Learning | EECS559

Posted on December 1, 2022 by statistics-lab

Mathematics | Optimization for Machine Learning | Derivative and gradient

If $f$ is differentiable along each axis, we denote
$$
\nabla f(x) \stackrel{\text { def. }}{=}\left(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_p}\right)^{\top} \in \mathbb{R}^p
$$
the gradient vector, so that $\nabla f: \mathbb{R}^p \rightarrow \mathbb{R}^p$ is a vector field. Here the partial derivatives (when they exist) are defined as
$$
\frac{\partial f(x)}{\partial x_k} \stackrel{\text { def. }}{=} \lim _{\eta \rightarrow 0} \frac{f\left(x+\eta \delta_k\right)-f(x)}{\eta}
$$
where $\delta_k=(0, \ldots, 0,1,0, \ldots, 0)^{\top} \in \mathbb{R}^p$ is the $k^{\text {th }}$ canonical basis vector.
Beware that $\nabla f(x)$ can exist without $f$ being differentiable. Differentiability of $f$ at each $x$ reads
$$
f(x+\varepsilon)=f(x)+\langle\varepsilon, \nabla f(x)\rangle+o(\|\varepsilon\|) .
$$
Here $R(\varepsilon)=o(\|\varepsilon\|)$ denotes a quantity which decays faster than $\|\varepsilon\|$ toward 0, i.e. $\frac{R(\varepsilon)}{\|\varepsilon\|} \rightarrow 0$ as $\varepsilon \rightarrow 0$. Existence of the partial derivatives corresponds to $f$ being differentiable along the axes, while differentiability should hold for any sequence $\varepsilon \rightarrow 0$ (i.e. not only along a fixed direction). A counter-example in 2-D is $f(x)=\frac{2 x_1 x_2\left(x_1+x_2\right)}{x_1^2+x_2^2}$ with $f(0)=0$, which is linear along each radial line, with a slope that depends on the line.
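
The 2-D counter-example can be probed numerically. The following sketch (helper names and the step `eta` are illustrative choices) estimates the slope of $f$ at the origin along the two axes and along the diagonal.

```python
import math


def f(x1, x2):
    """Counter-example from the text, extended by f(0) = 0."""
    if x1 == 0.0 and x2 == 0.0:
        return 0.0
    return 2.0 * x1 * x2 * (x1 + x2) / (x1 ** 2 + x2 ** 2)


def directional_slope(u1, u2, eta=1e-6):
    """Finite-difference slope of f at the origin along the direction (u1, u2)."""
    return (f(eta * u1, eta * u2) - f(0.0, 0.0)) / eta


# Both partial derivatives at the origin vanish, so the gradient there is (0, 0).
d1 = directional_slope(1.0, 0.0)
d2 = directional_slope(0.0, 1.0)

# If f were differentiable at 0, the slope along any direction u would be
# <grad f(0), u> = 0. Along the diagonal it is sqrt(2) instead.
s = 1.0 / math.sqrt(2.0)
diag_slope = directional_slope(s, s)
```

Here `d1` and `d2` are zero because $f$ vanishes on both axes, while `diag_slope` is close to $\sqrt{2}$, so no single vector $g$ can satisfy the first-order expansion at the origin.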

Also, $\nabla f(x)$ is the only vector such that the relation (7) holds. This means that a possible strategy to both prove that $f$ is differentiable and obtain a formula for $\nabla f(x)$ is to show a relation of the form
$$
f(x+\varepsilon)=f(x)+\langle\varepsilon, g\rangle+o(\|\varepsilon\|),
$$
in which case one necessarily has $\nabla f(x)=g$.
The following proposition shows that convexity is equivalent to the graph of the function being above its tangents.

Mathematics | Optimization for Machine Learning | First Order Conditions

The main theoretical interest of the gradient vector (we will see later that it also has algorithmic interest) is that its vanishing is a necessary condition for optimality, as stated below.

Proposition 2. If $x^{\star}$ is a local minimum of the function $f$ (i.e. that $f\left(x^{\star}\right) \leqslant f(x)$ for all $x$ in some ball around $x^{\star}$ ) then
$$
\nabla f\left(x^{\star}\right)=0 .
$$
Proof. For $\varepsilon>0$ small enough and $u$ fixed, one has
$$
f\left(x^{\star}\right) \leqslant f\left(x^{\star}+\varepsilon u\right)=f\left(x^{\star}\right)+\varepsilon\left\langle\nabla f\left(x^{\star}\right), u\right\rangle+o(\varepsilon) \quad \Longrightarrow\left\langle\nabla f\left(x^{\star}\right), u\right\rangle \geqslant o(1) \quad \Longrightarrow \quad\left\langle\nabla f\left(x^{\star}\right), u\right\rangle \geqslant 0 .
$$
Applying this to both $u$ and $-u$ shows that $\left\langle\nabla f\left(x^{\star}\right), u\right\rangle=0$ for all $u$, and hence $\nabla f\left(x^{\star}\right)=0$.

Note that the converse is not true in general, since one might have $\nabla f(x)=0$ while $x$ is not a local minimum. For instance $x=0$ for $f(x)=-x^2$ (here $x$ is a maximizer) or $f(x)=x^3$ (here $x$ is neither a maximizer nor a minimizer, it is a saddle point), see Fig. 6. Note however that in practice, if $\nabla f\left(x^{\star}\right)=0$ but $x^{\star}$ is not a local minimum, then $x^{\star}$ tends to be an unstable equilibrium. Thus most often a gradient-based algorithm will converge to points with $\nabla f\left(x^{\star}\right)=0$ that are local minimizers. The following proposition shows that a much stronger result holds if $f$ is convex.
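
To illustrate the remark on unstable equilibria, here is a minimal sketch using plain fixed-step gradient descent on a double-well function. Both the function $f(x)=x^4 / 4-x^2 / 2$ and the step size are assumptions chosen for illustration; the text does not specify an algorithm at this point.

```python
def grad(x):
    """Derivative of the hypothetical double-well f(x) = x**4 / 4 - x**2 / 2."""
    return x ** 3 - x


def gradient_descent(x0, step=0.1, n_iter=500):
    """Fixed-step gradient descent x <- x - step * f'(x)."""
    x = x0
    for _ in range(n_iter):
        x = x - step * grad(x)
    return x


# Stationary points of f are -1, 0 and 1; x = 0 is a local maximizer.
stuck = gradient_descent(0.0)        # starts exactly at the stationary point
right = gradient_descent(1e-3)       # tiny perturbation to the right
left = gradient_descent(-1e-3)       # tiny perturbation to the left
```

Started exactly at $x=0$ the iterates never move, but the slightest perturbation sends them to one of the local minimizers $\pm 1$, which is the sense in which the stationary point at $0$ is unstable.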

Proposition 3. If $f$ is convex and $x^{\star}$ a local minimum, then $x^{\star}$ is also a global minimum. If $f$ is differentiable and convex,
$$
x^{\star} \in \underset{x}{\operatorname{argmin}} f(x) \Longleftrightarrow \nabla f\left(x^{\star}\right)=0 .
$$
Proof. For any $x$, there exists $0<t<1$ small enough such that $t x+(1-t) x^{\star}$ is close enough to $x^{\star}$, and so, since $x^{\star}$ is a local minimizer and $f$ is convex,
$$
f\left(x^{\star}\right) \leqslant f\left(t x+(1-t) x^{\star}\right) \leqslant t f(x)+(1-t) f\left(x^{\star}\right) \quad \Longrightarrow \quad f\left(x^{\star}\right) \leqslant f(x)
$$
and thus $x^{\star}$ is a global minimum.
For the second part, we already saw in (2) the $\Rightarrow$ part. Conversely, assume that $\nabla f\left(x^{\star}\right)=0$. Since the graph of $f$ is above its tangent by convexity (as stated in Proposition 1),
$$
f(x) \geqslant f\left(x^{\star}\right)+\left\langle\nabla f\left(x^{\star}\right), x-x^{\star}\right\rangle=f\left(x^{\star}\right) .
$$
Thus in this case, optimizing a function is the same as solving the equation $\nabla f(x)=0$ (actually $p$ equations in $p$ unknowns). In most cases it is impossible to solve this equation in closed form, but it often provides interesting information about the solutions $x^{\star}$.



Mathematics | Optimization for Machine Learning | COMS4995

Posted on December 1, 2022 by statistics-lab

Mathematics | Optimization for Machine Learning | Basics of Convex Analysis

In general, there might be no solution to the optimization problem (1). This is of course the case if $f$ is unbounded from below, for instance $f(x)=-x^2$, in which case the value of the infimum is $-\infty$. But this might also happen if $f$ does not grow at infinity, for instance $f(x)=e^{-x}$, for which $\inf f=0$ but there is no minimizer. In order to show existence of a minimizer, and that the set of minimizers is bounded (otherwise one can have problems with optimization algorithms that could escape to infinity), one needs to show that one can replace the whole space $\mathbb{R}^p$ by a compact subset $\Omega \subset \mathbb{R}^p$ (i.e. $\Omega$ is bounded and closed) and that $f$ is continuous on $\Omega$ (one can replace this by a weaker condition, that $f$ is lower semi-continuous, but we ignore this here). A way to show that one can consider only a bounded set is to show that $f(x) \rightarrow+\infty$ when $\|x\| \rightarrow+\infty$. Such a function is called coercive. In this case, one can choose any $x_0 \in \mathbb{R}^p$ and consider its associated lower-level set
$$
\Omega=\left\{x \in \mathbb{R}^p ; f(x) \leqslant f\left(x_0\right)\right\}
$$
which is bounded because of coercivity, and closed because $f$ is continuous. One can actually show that for a convex function, having a non-empty bounded set of minimizers is equivalent to the function being coercive (this is not the case for non-convex functions, for instance $f(x)=\min \left(1, x^2\right)$ has a single minimizer but is not coercive).
Example 1 (Least squares). For instance, for the quadratic loss function $f(x)=\frac{1}{2}\|A x-y\|^2$, coercivity holds if and only if $\operatorname{ker}(A)=\{0\}$ (this corresponds to the overdetermined setting). Indeed, if $\operatorname{ker}(A) \neq\{0\}$ and $x^{\star}$ is a solution, then $x^{\star}+u$ is also a solution for any $u \in \operatorname{ker}(A)$, so that the set of minimizers is unbounded. On the contrary, if $\operatorname{ker}(A)=\{0\}$, we will show later that the minimizer is unique, see Fig. 3. If $\ell$ is strictly convex, the same conclusion holds in the case of classification.
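
The failure of coercivity in Example 1 can be observed on a small under-determined problem. In this sketch, the sizes, the random data, and the use of the SVD to extract a kernel direction are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical under-determined problem: n < p forces ker(A) != {0}.
n, p = 3, 5
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)


def f(x):
    """Quadratic loss f(x) = 1/2 * ||A x - y||^2."""
    r = A @ x - y
    return 0.5 * float(r @ r)


# One minimizer: the minimum-norm least-squares solution.
x_star = np.linalg.lstsq(A, y, rcond=None)[0]

# A kernel direction u: the right singular vectors beyond rank(A) span ker(A).
_, _, Vt = np.linalg.svd(A, full_matrices=True)
u = Vt[-1]
kernel_residual = float(np.linalg.norm(A @ u))

# Moving far along u leaves the loss unchanged, up to rounding error.
values = [f(x_star + t * u) for t in (0.0, 1.0, 1e3, 1e6)]
```

All entries of `values` agree up to rounding, so the loss stays at its minimum along the whole line $x^{\star}+t u$ and the set of minimizers is unbounded.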

Mathematics | Optimization for Machine Learning | Convexity

Convex functions define the main class of functions which are somehow “simple” to optimize, in the sense that all local minimizers are global minimizers, and that there are often efficient methods to find these minimizers (at least for smooth convex functions). A convex function is such that for any pair of points $(x, y) \in\left(\mathbb{R}^p\right)^2$,
$$
\forall t \in[0,1], \quad f((1-t) x+t y) \leqslant(1-t) f(x)+t f(y)
$$
which means that the function is below its secant (and actually also above its tangent when this is well defined), see Fig. 4. If $x^{\star}$ is a local minimizer of a convex $f$, then $x^{\star}$ is a global minimizer, i.e. $x^{\star} \in \operatorname{argmin} f$. Convex functions are very convenient because they are stable under many transformations. In particular, if $f, g$ are convex and $a, b$ are positive, $a f+b g$ is convex (the set of convex functions is itself an infinite-dimensional convex cone!) and so is $\max (f, g)$. If $g: \mathbb{R}^q \rightarrow \mathbb{R}$ is convex and $B \in \mathbb{R}^{q \times p}, b \in \mathbb{R}^q$, then $f(x)=g(B x+b)$ is convex. This shows immediately that the square loss appearing in (3) is convex, since $\|\cdot\|^2 / 2$ is convex (as a sum of squares). Similarly, if $\ell$ and hence $L$ is convex, then the classification loss function (4) is itself convex.
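
The stability properties above can be spot-checked by sampling the convexity inequality at random points. This is only a sanity check under assumed example functions ($x^2$ and $|x-1|$, with weights $a=2$, $b=3$), not a proof of convexity.

```python
import random

random.seed(0)


def f(x):
    """Convex example: f(x) = x**2."""
    return x * x


def g(x):
    """Convex example: g(x) = |x - 1|."""
    return abs(x - 1.0)


def h_sum(x, a=2.0, b=3.0):
    """Positive combination a*f + b*g."""
    return a * f(x) + b * g(x)


def h_max(x):
    """Pointwise maximum max(f, g)."""
    return max(f(x), g(x))


def convexity_gap(h, x, y, t):
    """h((1-t)x + t y) - [(1-t) h(x) + t h(y)]; non-positive when h is convex."""
    return h((1 - t) * x + t * y) - ((1 - t) * h(x) + t * h(y))


def worst_gap(h, n_trials=10_000):
    """Largest gap over random triples (x, y, t); a sanity check, not a proof."""
    worst = float("-inf")
    for _ in range(n_trials):
        x, y = random.uniform(-5, 5), random.uniform(-5, 5)
        t = random.random()
        worst = max(worst, convexity_gap(h, x, y, t))
    return worst


gap_sum = worst_gap(h_sum)
gap_max = worst_gap(h_max)
# A non-convex function for contrast: min(1, x**2).
gap_nonconvex = worst_gap(lambda x: min(1.0, x * x))
```

The two convex combinations never exceed their secant (up to rounding), while $\min \left(1, x^2\right)$ violates the inequality by a clear margin.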

Strict convexity. When $f$ is convex, one can strengthen the condition (5) and impose that the inequality is strict for $x \neq y$ and $t \in] 0,1[$ (see Fig. 4, right), i.e.
$$
\forall t \in] 0,1[, \quad f((1-t) x+t y)<(1-t) f(x)+t f(y) .
$$
In this case, if a minimizer $x^{\star}$ exists, then it is unique. Indeed, if $x_1^{\star} \neq x_2^{\star}$ were two different minimizers, one would have by strict convexity $f\left(\frac{x_1^{\star}+x_2^{\star}}{2}\right)<f\left(x_1^{\star}\right)$, which is impossible.
Example 2 (Least squares). For the quadratic loss function $f(x)=\frac{1}{2}\|A x-y\|^2$, strict convexity is equivalent to $\operatorname{ker}(A)=\{0\}$. Indeed, we will see later that its second derivative is $\partial^2 f(x)=A^{\top} A$ and that strict convexity is implied by the eigenvalues of $A^{\top} A$ being strictly positive. The eigenvalues of $A^{\top} A$ being strictly positive is equivalent to $\operatorname{ker}\left(A^{\top} A\right)=\{0\}$ (no vanishing eigenvalue), and $A^{\top} A z=0$ implies $\left\langle A^{\top} A z, z\right\rangle=\|A z\|^2=0$, i.e. $z \in \operatorname{ker}(A)$.
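
The criterion of Example 2 is easy to test numerically: the sketch below compares the smallest eigenvalue of $A^{\top} A$ for a matrix with independent columns and for one with a duplicated column. The matrices and helper name are hypothetical, chosen only to illustrate the equivalence with $\operatorname{ker}(A)=\{0\}$.

```python
import numpy as np

rng = np.random.default_rng(3)


def smallest_hessian_eigenvalue(A):
    """Smallest eigenvalue of the Hessian A^T A of f(x) = 1/2 * ||A x - y||^2."""
    return float(np.linalg.eigvalsh(A.T @ A)[0])


# Hypothetical over-determined matrix with independent columns: ker(A) = {0}.
A_full = rng.standard_normal((20, 4))

# Same shape, but the last column duplicates the first, so ker(A) != {0}.
A_deficient = A_full.copy()
A_deficient[:, 3] = A_deficient[:, 0]

lam_full = smallest_hessian_eigenvalue(A_full)
lam_deficient = smallest_hessian_eigenvalue(A_deficient)

rank_full = int(np.linalg.matrix_rank(A_full))
rank_deficient = int(np.linalg.matrix_rank(A_deficient))
```

With full column rank the smallest eigenvalue is strictly positive and the loss is strictly convex; with the duplicated column it vanishes up to rounding, matching the loss of uniqueness of the minimizer.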

