Posted on May 31, 2022 by statistics-lab

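Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written, which is referred to as natural language. It is a component of artificial intelligence (AI).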

WHAT IS IMBALANCED CLASSIFICATION

Imbalanced classification involves datasets whose classes are unevenly represented. For example, suppose that class A has $99 \%$ of the data and class B has $1 \%$. Which classification algorithm would you use? Unfortunately, many classification algorithms don’t work well with this type of imbalanced dataset, because they tend to favor the majority class. Here is a list of several well-known techniques for handling imbalanced datasets:

Random resampling rebalances the class distribution.
Random oversampling duplicates data in the minority class.
Random undersampling deletes examples from the majority class.
SMOTE synthesizes new samples for the minority class.

Random resampling transforms the training dataset into a new dataset, which can be effective for imbalanced classification problems.

The random undersampling technique removes samples from the dataset, and involves the following:

randomly remove samples from majority class
can be performed with or without replacement
alleviates imbalance in the dataset
may increase the variance of the classifier
may discard useful or important samples

However, random undersampling does not work well with a dataset that has a $99 \% / 1 \%$ split into two classes: balancing the classes would mean discarding roughly $98 \%$ of the data. More generally, undersampling can result in losing information that is useful for a model.
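The two random resampling strategies listed earlier can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the `seed` parameter, and the toy 99/1 dataset are made up for this example, and undersampling is done without replacement. The imbalanced-learn package offers maintained versions of both ideas.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect the samples that belong to each class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_undersample(samples, labels, seed=0):
    """Randomly keep (without replacement) as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    out = [(s, y) for y, g in groups.items() for s in rng.sample(g, target)]
    return [s for s, _ in out], [y for _, y in out]


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    out = []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out.extend((s, y) for s in g + extra)
    return [s for s, _ in out], [y for _, y in out]


# Hypothetical dataset with a 99% / 1% class split.
samples = list(range(100))
labels = ["A"] * 99 + ["B"]

_, under_labels = random_undersample(samples, labels)
_, over_labels = random_oversample(samples, labels)
print(Counter(under_labels))  # class sizes after undersampling
print(Counter(over_labels))   # class sizes after oversampling
```

With this toy split, undersampling leaves only two samples in total, which makes the information loss described above concrete.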

Instead of random undersampling, another approach involves generating new samples for the minority class. The first technique is random oversampling, which duplicates existing examples from the minority class.
There is another technique that often works better than simple duplication, which involves the following:

synthesize new examples from the minority class
a type of data augmentation for tabular data
this technique can be very effective

A well-known technique of this kind is SMOTE, which performs data augmentation (i.e., synthesizing new data samples) before you train a classification algorithm. SMOTE was originally developed by means of the kNN algorithm (other options are available), and it can be an effective technique for handling imbalanced classes.

Yet another option to consider is the Python package imbalanced-learn in the scikit-learn-contrib project. This project provides various resampling techniques for datasets that exhibit class imbalance. More details are available online:
https://github.com/scikit-learn-contrib/imbalanced-learn.

WHAT IS SMOTE

SMOTE is a technique for synthesizing new samples for a dataset. This technique is based on linear interpolation:

Step 1: Select samples that are close in the feature space.
Step 2: Draw a line between the samples in the feature space.
Step 3: Draw a new sample at a point along that line.

A more detailed explanation of the SMOTE algorithm is as follows:

Select a random sample “a” from the minority class.
Find the $k$ nearest neighbors of that sample within the minority class.
Select a random neighbor “b” from those nearest neighbors.
Create a line segment “L” that connects “a” and “b.”
Randomly select one or more points “c” on L; each such point is a new synthetic sample.

If need be, you can repeat this process for the other $(k-1)$ nearest neighbors to distribute the synthetic samples more evenly among the nearest neighbors.
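The steps above can be sketched as follows. This is a simplified illustration of the interpolation idea rather than the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the four toy minority points are assumptions made for this example, and distances are plain Euclidean distances.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Select a random minority sample "a".
        a = rng.choice(minority)
        # Find the k nearest minority-class neighbors of "a" (excluding "a" itself).
        others = [p for p in minority if p is not a]
        neighbors = sorted(others, key=lambda p: math.dist(a, p))[:k]
        # Select a random neighbor "b".
        b = rng.choice(neighbors)
        # Pick a random point "c" on the segment between "a" and "b".
        t = rng.random()
        c = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
        synthetic.append(c)
    return synthetic


# Hypothetical minority-class samples in a 2D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the region already spanned by the minority class.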

The original SMOTE algorithm is based on the kNN algorithm, and it has been extended in various ways, such as replacing kNN with SVM. A list of SMOTE extensions is shown as follows:

selective synthetic sample generation
Borderline-SMOTE (kNN)
Borderline-SMOTE (SVM)
Adaptive Synthetic Sampling (ADASYN)

ANALYZING CLASSIFIERS

This section is marked “optional” because its contents pertain to machine learning classifiers, which are not the focus of this book. However, it’s still worthwhile to glance through the material, or perhaps return to this section after you have a basic understanding of machine learning classifiers.

Several well-known techniques are available for analyzing the quality of machine learning classifiers. Two techniques are LIME and ANOVA, both of which are discussed in the following subsections.

LIME is an acronym for Local Interpretable Model-Agnostic Explanations. LIME is a model-agnostic technique that can be used with machine learning models. In LIME, you make small random changes to data samples and then observe the manner in which predictions change (or not). In other words, the approach involves changing the input (slightly) and then observing what happens to the output.

By way of analogy, consider food inspectors who test for bacteria in truckloads of perishable food. Clearly, it’s infeasible to test every food item in a truck (or a train car), so inspectors perform “spot checks” that involve testing randomly selected items. In an analogous fashion, LIME makes small changes to input data in random locations and then analyzes the changes in the associated output values.

However, there are two caveats to keep in mind when you use LIME with input data for a given model:

The actual changes to input values are model-specific.
This technique works on input that is interpretable.

Examples of interpretable input include machine learning classifiers (such as trees and random forests) and NLP techniques such as BoW (Bag of Words). Non-interpretable input involves “dense” data, such as a word embedding (which is a vector of floating point numbers).

You could also substitute your model with another model that involves interpretable data, but then you need to evaluate how accurate the approximation is to the original model.
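The perturb-and-observe idea behind LIME can be illustrated with a small sketch. This is not the LIME algorithm itself, which samples perturbations jointly, weights them by proximity to the original sample, and fits an interpretable surrogate model. The sketch only perturbs one feature at a time and fits a least-squares slope. The `black_box` model, the function name `local_sensitivity`, and all parameter values are hypothetical.

```python
import random


def black_box(x):
    """Hypothetical model whose output depends mostly on the first feature."""
    return 3.0 * x[0] + 0.1 * x[1]


def local_sensitivity(predict, sample, scale=0.1, n_trials=200, seed=0):
    """Estimate, per feature, how strongly small input changes move the prediction."""
    rng = random.Random(seed)
    base = predict(sample)
    slopes = []
    for j in range(len(sample)):
        num = 0.0
        den = 0.0
        for _ in range(n_trials):
            # Make a small random change to feature j only.
            delta = rng.gauss(0.0, scale)
            perturbed = list(sample)
            perturbed[j] += delta
            # Observe how the prediction changes.
            change = predict(perturbed) - base
            num += delta * change
            den += delta * delta
        # Least-squares slope of prediction change versus perturbation.
        slopes.append(num / den)
    return slopes


slopes = local_sensitivity(black_box, [1.0, 2.0])
print(slopes)
```

For this linear toy model the estimated slopes recover the model coefficients; for a nonlinear model they would only describe the behavior near the chosen sample, which is the “local” part of the acronym.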

MISSING DATA, ANOMALIES, AND OUTLIERS

Missing Data

How you decide to handle missing data depends on the specific dataset. Here are some techniques that are relevant when handling missing data (the first three are manual techniques; the remaining four are algorithms, which are more commonly used to detect anomalous values than to fill in missing ones):

replace missing data with the mean/median/mode value
infer (“impute”) the value for missing data
delete rows with missing data
isolation forest (tree-based algorithm)
minimum covariance determinant
local outlier factor
one-class SVM (Support Vector Machines)
In general, replacing a missing numeric value with zero is a risky choice: this value is obviously incorrect if the values of a feature are between 1,000 and 5,000. For a feature that has numeric values, replacing a missing value with the average value is better than using zero (unless the average equals zero); also consider using the median value, which is less affected by extreme values. For categorical data, consider using the mode to replace a missing value.
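As a minimal sketch of the mean/median/mode replacement described above, the following uses only Python’s `statistics` module. The function name `impute`, the use of `None` to mark a missing value, and the sample data are assumptions made for this illustration.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical numeric feature with values between 1,000 and 5,000.
prices = [1000, 2000, None, 5000, 4000]
# Hypothetical categorical feature.
colors = ["red", "blue", None, "red"]

print(impute(prices, "mean"))
print(impute(prices, "median"))
print(impute(colors, "mode"))
```

Filling the gap in `prices` with zero would place a value far outside the observed range, whereas the mean and the median both stay inside it.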

If you are not confident that you can impute a “reasonable” value, consider dropping the row with the missing value. You can then train one model on the dataset that contains the imputed value and another on the dataset from which the row was deleted, and compare the results.

One problem that can arise after removing rows with missing values is that the resulting dataset is too small. In this case, consider using SMOTE, which is discussed later in this chapter, in order to generate synthetic data.

Anomalies and Outliers

In simplified terms, an outlier is an abnormal data value that is outside the range of “normal” values. For example, a person’s height in centimeters is typically between 30 centimeters and 250 centimeters. Hence, a data point (e.g., a row of data in a spreadsheet) with a height of 5 centimeters or a height of 500 centimeters is an outlier. The consequences of these outlier values are unlikely to involve a significant financial or physical loss (though they could adversely affect the accuracy of a trained model).

Anomalies are also outside the “normal” range of values (just like outliers), and they are typically more problematic than outliers: anomalies can have more severe consequences than outliers. For example, consider the scenario in which someone who lives in California suddenly makes a credit card purchase in New York. If the person is on vacation (or a business trip), then the purchase is an outlier (it’s outside the typical purchasing pattern), but it’s not an issue. However, if that person was in California when the credit card purchase was made, then it’s most likely to be credit card fraud, as well as an anomaly.

Unfortunately, there is no simple way to decide how to deal with anomalies and outliers in a dataset. Although you can drop rows that contain outliers, keep in mind that doing so might deprive the dataset, and therefore the trained model, of valuable information. You can try modifying the data values (as described in the next section), but again, this might lead to erroneous inferences in the trained model. Another possibility is to train a model with the dataset that contains anomalies and outliers, and then train a model with a dataset from which the anomalies and outliers have been removed. Compare the two results and see if you can infer anything meaningful regarding the anomalies and outliers.

Outlier Detection

Although the decision to keep or drop outliers is your decision to make, there are some techniques available that help you detect outliers in a dataset. This section contains a short list of some techniques, along with a very brief description and links for additional information.

Trimming is perhaps the simplest technique (apart from dropping outliers by hand); it involves removing rows whose feature value lies in the upper $5 \%$ or the lower $5 \%$ of the values. Winsorizing the data is an improvement over trimming because it keeps every row: set the values in the top $5 \%$ equal to the value at the 95th percentile, and set the values in the bottom $5 \%$ equal to the value at the 5th percentile.
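The difference between trimming and winsorizing can be shown with a short sketch. The helper names and the sample data are assumptions made for this example, and the percentile is computed by linear interpolation between sorted values, which is only one of several common percentile definitions.

```python
def percentile(values, p):
    """Percentile by linear interpolation between the sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    frac = pos - low
    return ordered[low] + (ordered[high] - ordered[low]) * frac


def trim(values, lower=5, upper=95):
    """Drop values that fall below the lower or above the upper percentile."""
    lo, hi = percentile(values, lower), percentile(values, upper)
    return [v for v in values if lo <= v <= hi]


def winsorize(values, lower=5, upper=95):
    """Clamp values to the lower and upper percentiles instead of dropping them."""
    lo, hi = percentile(values, lower), percentile(values, upper)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical feature: the values 1..19 plus one extreme value at each end.
data = [-100] + list(range(1, 20)) + [500]

trimmed = trim(data)
winsorized = winsorize(data)
print(len(data), len(trimmed), len(winsorized))
print(min(winsorized), max(winsorized))
```

Trimming shortens the dataset, whereas winsorizing keeps the same number of rows and only pulls the extreme values back to the percentile boundaries.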

The Minimum Covariance Determinant is a covariance-based technique, and a Python-based code sample that uses this technique is available online:
https://scikit-learn.org/stable/modules/outlier_detection.html.

The Local Outlier Factor (LOF) technique is an unsupervised technique that calculates a local anomaly score via the kNN (k Nearest Neighbor) algorithm. Documentation and short code samples that use LOF are available online:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html.

Two other techniques involve the Huber and the Ridge classes, both of which are included as part of scikit-learn. The Huber loss is less sensitive to outliers because it is quadratic for small errors but linear for large errors, similar to the MAE (Mean Absolute Error). A code sample that compares Huber and Ridge is available online:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_huber_vs_ridge.html.
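To see why a loss that grows linearly is less sensitive to outliers, here is a minimal sketch of the textbook Huber loss next to the squared loss. The threshold name `delta` and its default value are assumptions for this illustration; this is not a reproduction of how scikit-learn parameterizes its Huber regressor.

```python
def squared_loss(residual):
    """Half the squared error, which grows quadratically with the residual."""
    return 0.5 * residual * residual


def huber_loss(residual, delta=1.0):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


# Compare the two losses on a small residual and on an outlier-sized residual.
for residual in (0.5, 10.0):
    print(residual, squared_loss(residual), huber_loss(residual))
```

For a small residual the two losses agree, but for the large residual the Huber loss is far smaller, so a single outlier pulls the fitted model much less.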

You can also explore the Theil-Sen estimator and RANSAC, which are “robust” against outliers:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_theilsen.html and
https://en.wikipedia.org/wiki/Random_sample_consensus.

Four algorithms for outlier detection are discussed at the following site:
https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html.

One other scenario involves “local” outliers. For example, suppose that you use kMeans (or some other clustering algorithm) and determine that a value is an outlier with respect to one of the clusters. While this value is not necessarily an “absolute” outlier, detecting such a value might be important for your use case.

Scaling Numeric Data via Standardization

The standardization technique involves finding the mean mu and the standard deviation sigma, and then mapping each value $x_i$ to $(x_i - \mathrm{mu}) / \mathrm{sigma}$. Recall the following formulas:
$\mathrm{mu} = [\mathrm{SUM}(x_i)] / n$
$\mathrm{variance}(x) = [\mathrm{SUM}(x_i - \mathrm{mu})^2] / n$
$\mathrm{sigma} = \mathrm{sqrt}(\mathrm{variance})$
As a simple illustration of standardization, suppose that the random variable $x$ has the values $\{-1, 0, 1\}$. Then mu and sigma are calculated as follows:
$\mathrm{mu} = [\mathrm{SUM}(x_i)] / n = (-1 + 0 + 1) / 3 = 0$
$\mathrm{variance} = [\mathrm{SUM}(x_i - \mathrm{mu})^2] / n$
$= [(-1-0)^2 + (0-0)^2 + (1-0)^2] / 3$
$= 2 / 3$
$\mathrm{sigma} = \mathrm{sqrt}(2 / 3) = 0.816$ (approximate value)
Hence, the standardization of $\{-1, 0, 1\}$ is $\{-1 / 0.816, 0 / 0.816, 1 / 0.816\}$, which is approximately the set of values $\{-1.2247, 0, 1.2247\}$ (computed with the unrounded value of sigma).
As another example, suppose that the random variable $x$ has the values $\{-6, 0, 6\}$. Then mu and sigma are calculated as follows:
$\mathrm{mu} = [\mathrm{SUM}(x_i)] / n = (-6 + 0 + 6) / 3 = 0$
$\mathrm{variance} = [\mathrm{SUM}(x_i - \mathrm{mu})^2] / n$
$= [(-6-0)^2 + (0-0)^2 + (6-0)^2] / 3$
$= 72 / 3$
$= 24$
$\mathrm{sigma} = \mathrm{sqrt}(24) = 4.899$ (approximate value)

Hence, the standardization of $\{-6, 0, 6\}$ is $\{-6 / 4.899, 0 / 4.899, 6 / 4.899\}$, which in turn equals the set of values $\{-1.2247, 0, 1.2247\}$.

In the preceding two examples, the mean equals 0 in both cases, but the variance and standard deviation are significantly different. The normalization of a set of values always produces a set of numbers between 0 and 1.

However, the standardization of a set of values can generate numbers that are less than $-1$ or greater than 1; this occurs for every value $x_i$ whose distance $|\mathrm{mu} - x_i|$ from the mean exceeds sigma. In the first example, the largest such distance equals 1, whereas sigma is $0.816$, and therefore the largest standardized value is greater than 1.
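The two worked examples can be checked with a short sketch that follows the formulas in this section (dividing by $n$, i.e., the population variance). The function name `standardize` is an assumption made for this illustration.

```python
import math


def standardize(values):
    """Map each x to (x - mu) / sigma, using the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    sigma = math.sqrt(variance)
    return [(x - mu) / sigma for x in values]


first = standardize([-1, 0, 1])
second = standardize([-6, 0, 6])
print([round(z, 4) for z in first])
print([round(z, 4) for z in second])
```

Both inputs produce the same standardized values, because the second set is just the first set multiplied by 6 and standardization removes that difference in scale.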

What to Look for in Categorical Data

This section contains various suggestions for handling inconsistent data values, and you can determine which ones to adopt based on any additional factors that are relevant to your particular task. For example, consider dropping columns that have very low cardinality (equal to or close to 1), as well as numeric columns with zero or very low variance.

Next, check the contents of categorical columns for inconsistent spellings or errors. A good example pertains to the gender category, which can consist of a combination of the following values:
male
Male
female
Female
m
f
M
F

The preceding categorical values for gender can be replaced with two categorical values (unless you have a valid reason to retain some of the other values). Moreover, if you are training a model whose analysis involves a single gender, then you need to determine which rows (if any) of a dataset must be excluded. Also check categorical data columns for redundant or missing white spaces.

Check for columns whose values have multiple data types, such as a numeric column in which some values are stored as numbers and others as strings or objects.
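A minimal sketch of these cleanup steps follows. The mapping table, the function names, and the sample values are assumptions made for this illustration; a real dataset may contain spellings that this mapping does not cover, in which case the value is returned unchanged so that it can be inspected.

```python
# Hypothetical mapping from inconsistent spellings to two canonical values.
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}


def clean_gender(value):
    """Strip stray white space, ignore case, and map to a canonical value."""
    key = value.strip().lower()
    return GENDER_MAP.get(key, value)


def to_number(value):
    """Convert numbers stored as strings to floats; leave real numbers untouched."""
    if isinstance(value, str):
        return float(value.strip())
    return value


raw_gender = ["male", "Male ", "female", "Female", "m", "f", "M", "F"]
cleaned_gender = [clean_gender(v) for v in raw_gender]
print(cleaned_gender)

# A column that mixes numbers and numeric strings.
mixed_column = [1, "2", " 3.5 "]
numeric_column = [to_number(v) for v in mixed_column]
print(numeric_column)
```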

Mapping Categorical Data to Numeric Values

Character data is often called categorical data, examples of which include people’s names, home or work addresses, and email addresses. Many types of categorical data involve short lists of values. For example, the days of the week and the months in a year involve seven and twelve distinct values, respectively. Notice that the days of the week have a relationship: For example, each day has a previous day and a next day. However, the colors of an automobile are independent of each other: the color red is not “better” or “worse” than the color blue.

There are several well-known techniques for mapping categorical values to a set of numeric values. A simple example where you need to perform this conversion involves the gender feature in the Titanic dataset. This feature is one of the relevant features for training a machine learning model. The gender feature has $\{M, F\}$ as its set of possible values. As you will see later in this chapter, Pandas makes it very easy to convert the set of values $\{M, F\}$ to the set of values $\{0, 1\}$.

Another mapping technique involves mapping a set of categorical values to a set of consecutive integer values. For example, the set {Red, Green, Blue} can be mapped to the set of integers $\{0, 1, 2\}$. The set {Male, Female} can be mapped to the set of integers $\{0, 1\}$. The days of the week can be mapped to $\{0, 1, 2, 3, 4, 5, 6\}$. Note that the first day of the week depends on the country: In some cases it’s Sunday, and in other cases it’s Monday.

Another technique is called one-hot encoding, which converts each value to a vector (check Wikipedia if you need a refresher regarding vectors). Thus, {Male, Female} can be represented by the vectors $[1,0]$ and $[0,1]$, and the colors {Red, Green, Blue} can be represented by the vectors $[1,0,0]$, $[0,1,0]$, and $[0,0,1]$.

If you vertically “line up” the two vectors for gender, they form a $2 \times 2$ identity matrix, and doing the same for the colors will form a $3 \times 3$ identity matrix, as shown here:
$[1,0,0]$
$[0,1,0]$
$[0,0,1]$
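Both mappings from this section, consecutive integers and one-hot vectors, can be sketched in plain Python. The function names are assumptions made for this illustration, and the integer assigned to each category simply follows the order in which the categories are listed.

```python
def integer_encode(categories):
    """Map each category to a consecutive integer, in the order given."""
    return {category: index for index, category in enumerate(categories)}


def one_hot_encode(categories):
    """Map each category to a vector with a single 1 at that category's position."""
    n = len(categories)
    return {
        category: [1 if i == j else 0 for j in range(n)]
        for i, category in enumerate(categories)
    }


colors = ["Red", "Green", "Blue"]
color_to_int = integer_encode(colors)
color_to_vec = one_hot_encode(colors)
print(color_to_int)

# Lining up the one-hot vectors row by row gives the identity matrix.
for color in colors:
    print(color_to_vec[color])
```

One-hot vectors avoid implying an order among the categories, which suits independent values such as colors, whereas the integer mapping fits values with a natural order such as the days of the week.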

PREPARING DATASETS

Discrete Data Versus Continuous Data

As a simple rule of thumb: discrete data is a set of values that can be counted, whereas continuous data must be measured. Discrete data can reasonably fit in a drop-down list of values, but there is no exact value for making such a determination. One person might think that a list of 500 values is discrete, whereas another person might think it’s continuous.

For example, the list of provinces of Canada and the list of states of the United States are discrete data values, but is the same true for the number of countries in the world (roughly 200) or for the number of languages in the world (more than 7,000)?

Values for temperature, humidity, and barometric pressure are considered continuous. Currency is also treated as continuous, even though there is a measurable difference between two consecutive values. The smallest unit of currency for U.S. currency is one penny, which is 1/100th of a dollar (accounting-based measurements use the “mil,” which is 1/1,000th of a dollar).

Continuous data types can have subtle differences. For example, someone who is 200 centimeters tall is twice as tall as someone who is 100 centimeters tall; the same is true for 100 kilograms versus 50 kilograms. However, temperature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit, because the zero point of the Fahrenheit scale is arbitrary.

Furthermore, keep in mind that the meaning of the word “continuous” in mathematics is not necessarily the same as continuous in machine learning. In the former, a continuous variable (let’s say in the 2D Euclidean plane) can have an uncountably infinite number of values. In the latter, a feature in a dataset that can have more values than can be reasonably displayed in a drop-down list is treated as though it’s a continuous variable.

For instance, values for stock prices are discrete: they must differ by at least a penny (or some other minimal unit of currency), which is to say, it’s meaningless to say that the stock price changes by one-millionth of a penny. However, since there are so many possible stock values, it’s treated as a continuous variable. The same comments apply to car mileage, ambient temperature, and barometric pressure.

“Binning” Continuous Data

Binning refers to subdividing a set of values into multiple intervals, and then treating all the numbers in the same interval as though they had the same value.

As a simple example, suppose that a feature in a dataset contains the ages of people. The range of values is approximately between 0 and 120, and we could bin them into 12 equal intervals, where each consists of 10 values: 0 through 9, 10 through 19, 20 through 29, and so forth.

However, partitioning the values of people’s ages as described in the preceding paragraph can be problematic. Suppose that person A, person B, and person C are 29, 30, and 39, respectively. Then person A and person B are probably more similar to each other than person B and person C, but because of the way in which the ages are partitioned, B is classified as closer to C than to A. In fact, binning can increase Type I errors (false positive) and Type II errors (false negative), as discussed in this blog post (along with some alternatives to binning):
https://medium.com/@peterflom/why-binning-continuous-data-is-almost-always-a-mistake-ad0b3a1d141f.
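The age example can be reproduced with a few lines of code. This is a minimal sketch in which the function name `age_bin` and the fixed bin width of 10 are assumptions taken from the example above.

```python
def age_bin(age, width=10):
    """Return the (start, end) of the fixed-width bin that contains the age."""
    start = (age // width) * width
    return (start, start + width - 1)


ages = {"A": 29, "B": 30, "C": 39}
bins = {person: age_bin(age) for person, age in ages.items()}
print(bins)

# B shares a bin with C even though B is much closer in age to A.
same_bin_b_c = bins["B"] == bins["C"]
gap_a_b = abs(ages["A"] - ages["B"])
gap_b_c = abs(ages["B"] - ages["C"])
print(same_bin_b_c, gap_a_b, gap_b_c)
```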

As another example, using quartiles is even more coarse-grained than the earlier age-related binning example. The issue with binning pertains to the consequences of classifying people in different bins, even though they are in close proximity to each other. For instance, some people struggle financially because they earn a meager wage, and they are disqualified from financial assistance because their salary is higher than the cutoff point for receiving any assistance.

机器学习代写|自然语言处理代写NLP代考|Scaling Numeric Data via Normalization

A range of values can vary significantly, and it’s important to note that they often need to be scaled to a smaller range, such as values in the range $[-1,1]$ or $[0,1]$, which you can do via the tanh function or the sigmoid function, respectively.

For example, measuring a person’s height in meters involves a range of values between 0.50 meters and 2.5 meters (in the vast majority of cases), whereas measuring height in centimeters gives values between 50 centimeters and 250 centimeters: these two units differ by a factor of 100. A person’s weight in kilograms generally varies between 5 kilograms and 200 kilograms, whereas the same weight measured in grams is larger by a factor of 1,000. Distances between objects can be measured in meters or in kilometers, which also differ by a factor of 1,000.

In general, use units of measure so that the data values in multiple features belong to a similar range of values. In fact, some machine learning algorithms require scaled data, often in the range $[0,1]$ or $[-1,1]$. In addition to the tanh and sigmoid functions, there are other techniques for scaling data, such as standardizing data (think Gaussian distribution) and normalizing data (linearly scaled so that the new range of values is in $[0,1]$).
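As a rough illustration of these techniques, the sketch below standardizes a small list of heights and then squashes the results with the sigmoid and tanh functions. The data values and the helper names `sigmoid` and `standardize` are assumptions made for this example, and the population standard deviation is used; the sketch does not handle a feature whose values are all identical.

```python
import math
import statistics

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def standardize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

# Hypothetical heights in centimeters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)
squashed_01 = [sigmoid(z) for z in z_scores]     # values in (0, 1)
squashed_pm1 = [math.tanh(z) for z in z_scores]  # values in (-1, 1)
```

Note that sigmoid and tanh are nonlinear, so they compress values far from zero, whereas standardization and linear normalization preserve the relative spacing of the values.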

The following examples involve a floating point variable $x$ with different ranges of values that will be scaled so that the new values are in the interval $[0,1]$.

Example 1: If the values of $x$ are in the range $[0,2]$, then $x / 2$ is in the range $[0,1]$.
Example 2: If the values of $x$ are in the range $[3,6]$, then $x-3$ is in the range $[0,3]$, and $(x-3) / 3$ is in the range $[0,1]$.
Example 3: If the values of $x$ are in the range $[-10,20]$, then $x+10$ is in the range $[0,30]$, and $(x+10) / 30$ is in the range $[0,1]$.
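The three examples follow one pattern: subtract the lower bound, then divide by the width of the range. A minimal sketch of that pattern, where the function name `min_max_scale` is an assumption for this example:

```python
def min_max_scale(x, low, high):
    """Linearly map x from [low, high] onto [0, 1]."""
    return (x - low) / (high - low)

# Endpoints and midpoint of each range from Examples 1 through 3.
example_1 = [min_max_scale(x, 0, 2) for x in (0, 1, 2)]
example_2 = [min_max_scale(x, 3, 6) for x in (3, 4.5, 6)]
example_3 = [min_max_scale(x, -10, 20) for x in (-10, 5, 20)]
```

In each case the lower bound maps to 0, the upper bound maps to 1, and the midpoint maps to 0.5. In practice `low` and `high` are usually taken from the minimum and maximum of the training data.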



机器学习代写|自然语言处理代写NLP代考|Working with Data

Posted on May 31, 2022 by statistics-lab


机器学习代写|自然语言处理代写NLP代考|WHAT ARE DATASETS

In simple terms, a dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point,” and each column is called a “feature.” A dataset can be in any form: CSV (comma separated values), TSV (tab separated values), Excel spreadsheet, a table in an RDBMS (Relational Database Management System), a document in a NoSQL database, or the output from a Web service. Someone needs to analyze the dataset to determine which features are the most important and which features can be safely ignored in order to train a model with the given dataset.

A dataset can vary from very small (a couple of features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain, then you might struggle to determine the most important features in a large dataset. In this situation, you might need a domain expert who understands the importance of the features, their interdependencies (if any), and whether the data values for the features are valid. In addition, there are algorithms (called dimensionality reduction algorithms) that can help you determine the most important features. For example, PCA (Principal Component Analysis) is one such algorithm, which is discussed in more detail later in this chapter.
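To make the row and column terminology concrete, here is a small sketch that reads a CSV dataset with Python's standard `csv` module. The CSV content and its column names are invented for illustration; a real dataset would normally be read from a file.

```python
import csv
import io

# Hypothetical CSV content: 3 data points (rows) and 3 features (columns).
raw_csv = """age,height_cm,city
34,172.5,Toronto
51,165.0,Chicago
28,180.2,Denver
"""

reader = csv.DictReader(io.StringIO(raw_csv))
rows = list(reader)           # each row is one data point
features = reader.fieldnames  # each column name is one feature
```

Every value arrives as a string (for example, `rows[0]["age"]` is `"34"`), so converting features to suitable types is part of the preprocessing discussed next.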

机器学习代写|自然语言处理代写NLP代考|Data Preprocessing

Data preprocessing is the initial step of validating the contents of a dataset. It involves making decisions about missing and incorrect data values, and includes tasks such as:

dealing with missing data values
cleaning “noisy” text-based data
removing HTML tags
removing emoticons
dealing with emojis/emoticons
filtering data
grouping data
handling currency and date formats (i18n)
Cleaning data is an important initial task that involves removing unwanted data as well as handling missing data. In the case of text-based data, you might need to remove HTML tags, punctuation, and so forth. In the case of numeric data, it’s less likely (though still possible) that alphabetic characters are mixed together with numeric data. However, a dataset with numeric features might have incorrect values or missing values (discussed later). In addition, calculating the minimum, maximum, mean, median, and standard deviation of the values of a feature obviously pertains only to numeric values.
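Two of the cleaning tasks above, removing HTML tags and handling missing numeric values, can be sketched with the standard library. The sample strings and the helper names `strip_html` and `parse_numbers` are assumptions for this example, and the regular expression is a simplification: a real HTML parser is more robust for messy markup.

```python
import html
import re
import statistics

TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    """Remove tags with a simple regex, then decode entities such as &amp;."""
    return html.unescape(TAG_PATTERN.sub("", text)).strip()

def parse_numbers(raw_values):
    """Convert strings to floats; empty or non-numeric entries become None."""
    parsed = []
    for raw in raw_values:
        try:
            parsed.append(float(raw))
        except (TypeError, ValueError):
            parsed.append(None)
    return parsed

clean_text = strip_html("<p>Great <b>value</b> &amp; fast shipping</p>")

# Hypothetical price column with a missing value and an invalid value.
prices = parse_numbers(["10.5", "", "12.0", "n/a", "9.5"])
present = [p for p in prices if p is not None]

# Summary statistics apply only to the numeric values that are present.
summary = {
    "min": min(present),
    "max": max(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
}
```

Here missing entries are simply skipped when computing the statistics; whether to drop, skip, or fill in missing values is a decision that depends on the dataset.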
After the preprocessing step is completed, data wrangling is performed, which refers to transforming data into a new format. You might have to combine data from multiple sources into a single dataset. For example, you might need to convert between different units of measurement (such as date formats or currency values) so that the data values can be represented in a consistent manner in a dataset.
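A small sketch of this kind of wrangling: two hypothetical sources record the same information with different date formats and different length units, and both are converted to one representation before being combined. The field names, the list of accepted formats, and the helper `to_iso_date` are all assumptions for this example.

```python
from datetime import datetime

# Hypothetical input formats, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def to_iso_date(text):
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

# Source A uses ISO dates and meters; source B uses US dates and centimeters.
source_a = [{"date": "2022-05-31", "length_m": 2.5}]
source_b = [{"date": "05/31/2022", "length_cm": 250.0}]

combined = [
    {"date": to_iso_date(r["date"]), "length_m": r["length_m"]}
    for r in source_a
] + [
    {"date": to_iso_date(r["date"]), "length_m": r["length_cm"] / 100.0}
    for r in source_b
]
```

A format such as `05/06/2022` is ambiguous between month-first and day-first conventions, so the expected format of each source has to be known in advance rather than guessed from the data.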

Currency and date values are part of i18n (internationalization), whereas l10n (localization) targets a specific nationality, language, or region. Hardcoded values (such as text strings) can be stored as resource strings in a file that’s often called a resource bundle, where each string is referenced via a code. Each language has its own resource bundle.
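The resource bundle idea can be sketched with plain dictionaries, one per language, where each string is looked up by its code. The bundle contents and the `lookup` function are hypothetical; real projects typically store bundles in separate files and load them through an i18n library.

```python
# Hypothetical resource bundles: one dict per language, keyed by string code.
BUNDLES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "fr": {"greeting": "Bonjour", "farewell": "Au revoir"},
}

def lookup(code, language, default_language="en"):
    """Return the string for code in language, falling back to the default."""
    bundle = BUNDLES.get(language, {})
    if code in bundle:
        return bundle[code]
    return BUNDLES[default_language][code]

french_greeting = lookup("greeting", "fr")
fallback_greeting = lookup("greeting", "de")  # no German bundle, so English
```

Because the code refers to strings only by their codes, adding a language means adding a bundle rather than changing the program.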

机器学习代写|自然语言处理代写NLP代考|DATA TYPES

Explicit data types exist in many programming languages such as C, C++, Java, and TypeScript. Some programming languages, such as JavaScript and awk, do not require initializing variables with an explicit type: the type of a variable is inferred dynamically via an implicit type system (i.e., one that is not directly exposed to a developer).

In machine learning, datasets can contain features that have different types of data, such as a combination of one or more of the following:

numeric data (integer/floating point and discrete/continuous)
character/categorical data (different languages)
date-related data (different formats)
currency data (different formats)
binary data (yes/no, 0/1, and so forth)
nominal data (multiple unrelated values)
ordinal data (multiple and related values)
Consider a dataset that contains real estate data, which can have as many as thirty columns (or even more), often with the following features:
the number of bedrooms in a house: a numeric value and a discrete value
the number of square feet: a numeric value and (probably) a continuous value
the name of the city: character data
the construction date: a date value
the selling price: a currency value and probably a continuous value
the “for sale” status: binary data (either “yes” or “no”)
An example of nominal data is the seasons in a year: although many countries have four distinct seasons, some countries have only two distinct seasons.
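One way to picture these feature types is a single typed record for the real estate example. This is an illustrative sketch only: the class name `Listing`, the field names, the sample values, and the `CONDITION_RANK` mapping are assumptions, not part of the dataset described above.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass
class Listing:
    bedrooms: int            # numeric, discrete
    square_feet: float       # numeric, treated as continuous
    city: str                # character data
    construction_date: date  # date value
    selling_price: Decimal   # currency value
    for_sale: bool           # binary data

listing = Listing(
    bedrooms=3,
    square_feet=1850.5,
    city="Denver",
    construction_date=date(1998, 6, 15),
    selling_price=Decimal("425000.00"),
    for_sale=True,
)

# Ordinal data has a meaningful order, so it can be mapped to ranks;
# nominal data (such as city names) has no such order.
CONDITION_RANK = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
```

`Decimal` is used for the price to avoid binary floating point rounding in currency amounts; a model would usually still receive the price as a floating point feature after scaling.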

