May 2022 - Statistics Writing, Q&A, and Tutoring

Monthly archive: May 2022

JAPANESE GRAMMAR

Posted on May 31, 2022 by statistics-lab


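Natural language processing (NLP) refers to the ability of a computer program to understand human language as it is spoken and written, which is known as natural language. It is a component of artificial intelligence (AI).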


Japanese Postpositions

Instead of prepositions, Japanese uses postpositions (which can occur multiple times in a sentence). Here are some common Japanese postpositions, written in Romaji:

Ka (a marker for a question)
Wa (the topic of a sentence)
Ga (the subject of a sentence)
O (direct object)
To (can mean “with” and “and”)
Ni (physical motion toward something)
E (toward something)
The particle ka at the end of a sentence in Japanese indicates a question. A simple example of ka is the Romaji sentence Nan desu ka, which means “What is it?”

An example of wa is the following sentence: Watashi wa Nihon jin desu, which means “As for me, I’m Japanese.” By contrast, the sentence Watashi ga Nihon jin desu means “It is I (not somebody else) who is Japanese.”
As you can see, Japanese makes a distinction between the topic of a sentence (marked with wa) and the subject of a sentence (marked with ga). A Japanese sentence can contain both particles wa and ga, with the following twist: if a negative fact is expressed about the noun that precedes ga, then ga is replaced with wa and the main verb is written in the negative form. For example, the English sentence “I still have not studied Kanji” is written in Romaji as follows:
Watashi wa kanji wa mada benkyou shite imasen.
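
As a rough illustration of how the particles listed above could be located in a Romaji sentence, here is a minimal Python sketch. The dictionary glosses follow the list in this section; the function name tag_particles and the whitespace-based tokenization are assumptions made for this example, and a real system would need a proper Japanese tokenizer, because short Romaji strings such as "to" or "e" can also occur as parts of ordinary words.

```python
# Glosses follow the list of postpositions above; the lookup approach
# itself is an illustrative assumption, not a real tokenizer.
PARTICLES = {
    "ka": "question marker",
    "wa": "topic marker",
    "ga": "subject marker",
    "o": "direct object marker",
    "to": "and (among other uses)",
    "ni": "motion toward something",
    "e": "toward something",
}

def tag_particles(romaji_sentence):
    """Return (position, token, gloss) for each token that is a known particle."""
    tokens = romaji_sentence.lower().rstrip(".?!").split()
    return [(i, tok, PARTICLES[tok]) for i, tok in enumerate(tokens) if tok in PARTICLES]

print(tag_particles("Watashi wa Nihon jin desu"))
print(tag_particles("Watashi ga Nihon jin desu"))
print(tag_particles("Watashi wa kanji wa mada benkyou shite imasen."))
```

The third call reports wa twice, which matches the sentence in which ga has been replaced by a second wa.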

Ambiguity in Japanese Sentences

Since Japanese does not pluralize nouns, the same word is used for singular as well as plural, so contextual information is required to determine the exact meaning of a Japanese sentence. As a simple illustration, which is discussed in more detail later in this chapter under the topic of tokenization, here is a Japanese sentence written in Romaji, followed by Hiragana and Kanji (the second and third sentences are from Google Translate):
Watashi wa tomodachi ni hon o agemashita
わたし は ともだち に ほん を あげました
友達に本をあげた
The preceding sentence can mean any of the following, and the correct interpretation depends on the context of a conversation:

I gave a book to a friend.
I gave a book to friends.
I gave books to a friend.
I gave books to friends.
Moreover, the context for the words “friend” and “friends” in the Japanese sentence is also ambiguous: they do not indicate whose friends (mine, yours, his, or hers). In fact, the following Japanese sentence is also grammatically correct and ambiguous:
Tomodachi ni hon o agemashita
The preceding sentence does not specify who gave a book (or books) to a friend (or friends), but its context will be clear during a conversation. Incidentally, Japanese people often omit the subject pronoun (unless the sentence becomes ambiguous), so it’s more common to see the second sentence (i.e., without Watashi wa) instead of the first Romaji sentence.
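
The four readings listed earlier arise because two nouns are each unmarked for number. The following Python sketch simply enumerates those combinations; the function name readings and the English templates are assumptions for illustration, and the code does no actual translation.

```python
from itertools import product

def readings():
    """Enumerate the English readings of 'tomodachi ni hon o agemashita'.

    Japanese leaves number unmarked on both nouns, so every
    singular/plural combination is a valid interpretation.
    """
    book_forms = ["a book", "books"]        # hon
    friend_forms = ["a friend", "friends"]  # tomodachi
    return [f"I gave {book} to {friend}."
            for book, friend in product(book_forms, friend_forms)]

for sentence in readings():
    print(sentence)
```

With two ambiguous nouns there are 2 x 2 = 4 readings; each additional unmarked noun would double the count, which is one reason context is needed to pick an interpretation.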

Contrast the earlier Japanese sentence with its counterpart in the Romance languages Italian, Spanish, Portuguese, and French, as well as in German:

Italian: Ho dato un libro al mio amico.
Spanish: [Yo] Le di un libro a mi amigo.
Portuguese: Eu dei um livro para meu amigo.
French: J’ai donné un livre à mon ami.
German: Ich habe ein Buch dem Freund gegeben.
Notice that the Italian and French sentences use a compound verb whose two parts are consecutive (adjacent), whereas German uses a compound verb in which the second part (the past participle) is at the end of the sentence. However, the Spanish and Portuguese sentences use the simple past (the preterit) form of the verb “to give.”

Japanese Nominalization

Nominalizers convert verbs (or even entire sentences) into a noun. Nominalizers resemble a “that” clause in English, and they are useful when speaking about an action as a noun. Japanese has two nominalizers: no and koto ga.

The nominalizer の (no) is required with verbs of perception, such as 見る (to see) and 聞く (to listen). For example, the following sentence means “I love listening to music”; it is written in Romaji in the first sentence, followed by a second sentence that contains a mixture of Kanji and Hiragana:
Watashi wa ongaku o kiku no ga daisuki desu
私は音楽を聞くのが大好きです
The next three sentences all mean “He loves reading a newspaper,” written in Romaji and then Hiragana and Kanji:
Kare wa shimbun o yomu no ga daisuki desu
かれ は しんぶん を よむ の が だいすき です
彼は新聞を読むのが大好きです
The koto ga nominalizer, which is the other Japanese nominalizer, is used in sentences of the form “have you ever …” For example, the following sentence means “Have you (ever) been in Japan?”
日本にいたことがありますか



THE COMPLEXITY OF NATURAL LANGUAGES

Posted on May 31, 2022 by statistics-lab


Word Order in Sentences

As mentioned previously, German and Slavic languages allow for a rearrangement of the words in sentences because those languages support declension, which involves modifying the endings of articles and adjectives in accordance with the grammatical function of those words in a sentence (such as the subject, direct object, and indirect object). Those word endings are loosely comparable to prepositions in English, and sometimes they have the same spelling for different grammatical functions. For example, in German, the article den precedes a masculine noun that is a direct object and also a plural noun that is an indirect object: ambiguity can occur if the singular masculine noun has the same spelling in its plural form.

By contrast, English depends on word order, yet ambiguity can still arise in sentences that we have learned to parse correctly without any conscious effort.

Groucho Marx often incorporated ambiguous sentences in his dialogues, such as the following paraphrased examples:

“This morning I shot an elephant in my pajamas. How he got into my pajamas I have no idea.”

“In America, a woman gives birth to a child every fifteen minutes. Somebody needs to find that woman and stop her.”

Now consider the following pair of sentences involving a boy, a mountain, and a telescope:
I saw the boy on the mountain with the telescope.
I saw the boy with the telescope on the mountain.
Human speakers typically interpret both English sentences as having the same meaning; however, arriving at the same interpretation is less obvious from the standpoint of a pure NLP task. Why does the ambiguity in the preceding example not arise in Russian? The reason is simple: the preposition “with” is associated with the instrumental case in Russian, whereas “on” governs a different case, and therefore the nouns have suffixes that indicate the distinction.

Languages and Regional Accents

Accents, slang, and dialects have some common features, but there can be some significant differences. Accents involve modifying the standard pronunciation of words, which can vary significantly in different parts of the same country.

One interesting phenomenon pertains to the southern region of some countries (in the northern hemisphere), which tend to have a more “relaxed” pronunciation compared to the northern region of that country. For example, some people in the southeastern United States speak with a so-called “drawl,” whereas newscasters will often speak with a midwestern pronunciation, which is considered a neutral pronunciation. The same is true of people in Tokyo, who often speak Japanese with a “flat” pronunciation (which is also true of Japanese newscasters on NHK), versus people from the Kansai region (Kyoto, Kobe, and Osaka) of Japan, who vary the tone and emphasis of Japanese words.

Regional accents can also involve modifying the meaning of words in ways that are specific to the region in question. For example, Texans will say “I’m fixing to graduate this year” whereas people from other parts of the United States would say “going” instead of “fixing.” In France, Parisians are unlikely to say Il faut fatiguer la salade (“it’s necessary to toss the salad”), whereas this sentence is much more commonplace in southern France. (The English word “fatigue” is derived from the French verb fatiguer.)

What about Verbs?

Verbs exist in every written language, and they undergo conjugation that reflects their tense and mood in a sentence. Such languages have an overlapping set of verb tenses, but there are differences. For instance, Portuguese has a future perfect subjunctive, as does Spanish (but it’s almost never used in spoken form), whereas these verb forms do not exist in English. English verb tenses (in the indicative mood) can include:

present
present perfect
present progressive
present perfect progressive
preterite (simple past)
past perfect
past progressive
past perfect progressive
future tense
future perfect
future progressive
future perfect progressive (does not exist in Italian)

Here are some examples of English sentences that illustrate (most of) the preceding verb forms:

I read a book.
I have read a book.
I am reading a book.
I have been reading a book.
I read a book.
I had read a book.
I had been reading a book.
I will read a book.
I will have read a book.
I will be reading a book.
At 6 p.m., I will have been reading a book for 3 hours.
Verb moods can be indicative (as shown in the preceding list), subjunctive (discussed soon), and conditional (“I would go but I have work to do”). In English, subjunctive verb forms can include the present subjunctive (“I insist that he do the task”), the past subjunctive (“If I were you”), and the pluperfect subjunctive (“Had I but known …”). Interestingly, Portuguese also provides a future perfect subjunctive verb form; Spanish also has this verb form but it’s never used in conversation.

Interestingly (from a linguistic perspective, at least), there are modern languages, such as Mandarin, that have only one verb tense: they rely on other words in a sentence (such as time adverbs or aspect particles) to convey the time frame. Such languages would express the present, the past, and the future in a form that is comparable to the following:

“I read a book now.”
“I read a book yesterday.”
“I read a book tomorrow.”
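
To make the idea concrete, the following Python sketch infers the time frame of such sentences from a time word rather than from the verb form. The word list TIME_WORDS and the function name time_frame are assumptions chosen for this toy example; they cover only the three sentences above and ignore aspect particles entirely.

```python
# Toy lookup: only the three time words used in the examples above.
TIME_WORDS = {"now": "present", "yesterday": "past", "tomorrow": "future"}

def time_frame(sentence):
    """Return the time frame signaled by the first known time word, if any."""
    for token in sentence.lower().split():
        word = token.strip('.,!?"“”')  # drop surrounding punctuation and quotes
        if word in TIME_WORDS:
            return TIME_WORDS[word]
    return "unknown"

for s in ["I read a book now.", "I read a book yesterday.", "I read a book tomorrow."]:
    print(s, "->", time_frame(s))
```

The verb is identical in all three inputs, so the time word alone determines the result.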



Peak Usage of Some Languages

Posted on May 31, 2022 by statistics-lab


As you might have surmised, different languages have been in an influential position during the past 2,000 years. If you trace the popularity and influence of Indo-European and Semitic languages, you will find periods of time with varying degrees of influence involving multiple languages, including Hebrew, Greek, Latin, Arabic, French, and English.

Latin is an Indo-European language (its alphabet was apparently derived from the Etruscan and Greek alphabets), and during the 1st century AD, Latin became a mainstream language. In addition, the Romance languages are derived from Latin. Today Latin is considered a dead language in the sense that it’s not actively spoken on a daily basis by large numbers of people. The same is true of Sanskrit, which is a very old language from India.

During the Roman Empire, Latin and Greek were the official languages for administrative as well as military activities. In addition, Latin was an important language for diplomacy among countries for many centuries after the fall of the Roman Empire.

You might be surprised to know that Arabic was the lingua franca throughout the Mediterranean during the 10th and 11th centuries AD. As another example, French was spoken in many parts of Europe during the 18th century, including by the Russian aristocracy.

Today English appears to be in its ascendancy in terms of the number of native English speakers as well as the number of people who speak English as a second (or third or fourth) language. Although Mandarin is a widely spoken Asian language, English is the lingua franca for commerce as well as technology: virtually every computer language is based on English.

Languages and Slang

The existence of slang words is interesting and perhaps inevitable: they seem to flourish in every human language. Sometimes slang words are used for obfuscation so that only members of an “in group” understand the modified meaning of those words. Slang words can also be a combination of existing words, new words (but not officially recognized), and short-hand expressions. Slang can also “invert” the meaning of words (“bad” instead of “good”), which can be specific to an age group, minority, or region. In addition, slang can assign an entirely unrelated meaning to a standard word (e.g., the slang terms “that’s dope,” “that’s sick,” and “the bomb”).

Slang words can also be specific to an age group to prevent communication with members of different age groups. For example, Japanese teens can communicate with each other by reversing the order of the syllables in a word, which renders those “words” incomprehensible to adults. The inversion of syllables is far more complex than “pig Latin,” in which the first letter of a word is shifted to the end of the word, followed by the syllable “ay.” For example, “East Bay” (an actual region of the San Francisco Bay Area) is humorously said to be pig Latin for “beast,” because shifting the “b” to the end and appending “ay” yields “east-bay.”
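
The pig Latin rule as described here (move the first letter to the end and append “ay”) is simple enough to express in a few lines of Python. This is a minimal sketch of that simplified rule only; the function name pig_latin is an assumption, and fuller versions of the game treat consonant clusters and vowel-initial words differently.

```python
def pig_latin(word):
    """Apply the simplified rule: first letter moves to the end, then 'ay' is added."""
    if not word:
        return word
    return word[1:] + word[0] + "ay"

print(pig_latin("beast"))
print(pig_latin("pig"))
```

Applied to “beast,” the rule produces a string that reads as “East Bay,” which is the source of the joke.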

Teenagers also use acronyms (perhaps as another form of slang) when sending text messages to each other. For example, the acronym “aos” means “adult over shoulder.” The acronym “bos” has several different meanings, including “brother over shoulder” and “boyfriend over shoulder.”

The slang terms that you use with your peers invariably simplify communication with others in your in-group, sometimes accompanied by specialized interpretations of words (such as reversing their meaning). A simple example is the word zanahoria, which is the Spanish word for carrot. In colloquial speech in Venezuela, calling someone a zanahoria means that the person is very conservative and as “straight” as a carrot.

Slang enables people to be creative and also playfully break the rules of language. Both slang and colloquial speech simplify formal language and rarely (if ever) introduce greater complexity in alternate speech rules.

Perhaps that’s the reason that slang and colloquial speech cannot be controlled or regulated by anyone (or by any language committee): like water, they are fluid and adapt to the preferences of their speakers.

One more observation: while slang can be viewed as a creative by-product of standard speech, there is a reverse effect that can occur in certain situations. For example, you have probably noticed how influential subgenres are eventually absorbed (perhaps only partially) into mainstream culture: witness how commercials eventually incorporated a “softened” form of rap music and its rhythm in commercials for personal products. There’s a certain irony in hearing “Stairway to Heaven” as elevator music.

Another interesting concept is a “meme” (which includes Internet memes) in popular culture, which refers to something with humorous content. While slang words are often used to exclude people, a meme often attempts to communicate a particular sentiment. One such meme is “OK Boomer,” which some people view as a derogatory remark that’s sometimes expressed in a snarky manner, and much less often interpreted as a humorous term. Although language dialects can also involve regional accents and slang, they also have more distinct characteristics, as discussed in the next section.



NLP Concepts

Posted on May 31, 2022 by statistics-lab


THE ORIGIN OF LANGUAGES

Someone once remarked that “the origin of language is an enigma,” which is viscerally appealing because it has at least a kernel of truth. Although there are multiple theories that attempt to explain how and why languages developed, none of them has attained universal consensus. Nevertheless, there is no doubt that humans have far surpassed all other species in terms of language development.

There is also the question of how the vocabulary of a language is formed, which can reflect the confluence of multiple factors, as well as the question of how meaning arises in a language. According to Ludwig Wittgenstein (1953), who was an influential philosopher in many fields, language derives its meaning from use.

One theory about the evolution of language in humans asserts that the need for communication between humans makes language a necessity. Another explanation is that language is influenced by the task of creating complex tools, because the latter requires a precise sequence of steps, which ultimately spurred the development of languages.

Without delving into their details, the following list contains some theories that have been proposed regarding language development. Keep in mind that they vary in terms of their support in the academic community:

Strong Minimalist Thesis
The FlintKnapper Theory
The Sapir-Whorf Hypothesis
Universal Grammar (Noam Chomsky)
The Strong Minimalist Thesis (SMT) asserts that language is based on something called the hierarchical syntactic structure. The FlintKnapper Theory asserts that the ability to create complex tools involved an intricate sequence of steps, which in turn necessitated communication between people. In simplified terms, the Sapir-Whorf Hypothesis (also called the linguistic relativity hypothesis, which is a slightly weaker form) posits that the language we speak influences how we think. Consider how our physical environment can influence our spoken language: Eskimos have several words to describe snow, whereas people in some parts of the Middle East have never seen a snow storm.

Language Fluency

As mentioned in the previous section, human infants are capable of producing the sounds of any language, given enough opportunity to imitate those sounds. They tend to lose some of that capacity as they become older, which might explain why some adults speak another language with an accent (of course, there are plenty of exceptions).

Interestingly, babies respond favorably to the sound of vowel-rich “Parentese” and a study in 2018 suggested that babies prefer the sound of other babies instead of their mother:
https://eurekalert.org/pub_releases/2018-05/asoa-ftm042618.php
https://getpocket.com/explore/item/babies-prefer-the-sounds-of-otherbabies-to-the-cooing-of-their-parents

There are two interesting cases in which people can acquire native-level speech capability. The first case is intuitive: people who have been raised in a bilingual (or multilingual) environment tend to have a greater capacity for learning how to speak other languages with native-level (or near-native-level) speech. Second, people who speak phonetic languages have an advantage when they study another phonetic language, especially one that is in their language group, because they already know how to pronounce the majority of vowel sounds. There are also languages whose pronunciation can be a challenge for practically every non-native speaker. For example, letters that have a guttural sound (such as those in Dutch, German, and Arabic), the glottal stop (most noticeable in Arabic), and the letter “ain” in Arabic are generally more challenging to pronounce for native speakers of Romance languages and some Asian languages.

To some extent, the non-phonetic nature of the English language might explain why some monolingual native English speakers struggle with learning to speak other languages with native-level speech. Perhaps the closest language to English (in terms of cadence) is Dutch, and people from Holland can often speak native-level English. This tends to be true of Swedes and Danes as well, whose languages are Germanic, but not necessarily true of Germans, who can speak grammatically perfect English but sometimes speak it with an accent.

Perhaps somewhat ironically, sometimes accents can impart a sort of cachet, such as speaking with a British or Australian accent in the United States. Indeed, a French accent can also add a certain je-ne-sais-quoi to a speaker in various parts of the United States.

Major Language Groups

There are more than 140 language families, and the six largest language families (based on language count) are listed here:

Niger-Congo
Austronesian
Trans-New Guinea
Sino-Tibetan
Indo-European
Afro-Asiatic
English belongs to the Indo-European group, Mandarin belongs to the Sino-Tibetan group, and Arabic belongs to the Afro-Asiatic group. According to Wikipedia, Indo-European languages comprise almost 600 languages, including most of the languages in Europe, the northern Indian subcontinent, and the Iranian plateau. Almost half the world speaks an Indo-European language as a native language, which is more than for any of the other language groups listed in the introduction of this section. Indo-European has several major language subgroups, including the Germanic, Slavic, and Romance languages. The preceding information is from the following Wikipedia link:
https://en.wikipedia.org/wiki/List_of_language_families
As of 2019, the top five languages spoken in the world, counting both native speakers and second-language speakers, are as follows:
English: 1.268 billion
Mandarin: 1.120 billion
Hindi: 637.3 million
Spanish: 537.9 million
French: 276.6 million
The preceding information is from the following Wikipedia link:
https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers

Many factors can influence the expansion of a given language into multiple countries, such as commerce, economic factors, technological influence, and warfare, thereby resulting in the absorption of new words by another language. Somewhat intuitively, countries with a common border influence each other’s language, sometimes resulting in new hybrid languages.


WHAT IS IMBALANCED CLASSIFICATION

Posted on May 31, 2022 by statistics-lab

Imbalanced classification involves datasets with imbalanced classes. For example, suppose that class A has 99% of the data and class B has 1%. Which classification algorithm would you use? Unfortunately, many classification algorithms don’t work well with this type of imbalanced dataset: a model that always predicts class A is 99% accurate, yet it never detects class B. Here is a list of several well-known techniques for handling imbalanced datasets:

Random resampling rebalances the class distribution.
Random oversampling duplicates data in the minority class.
Random undersampling deletes examples from the majority class.
SMOTE synthesizes new examples in the minority class.
Random resampling transforms the training dataset into a new dataset with a more balanced class distribution, which can be effective for imbalanced classification problems.

The random undersampling technique removes samples from the dataset, and involves the following:

randomly remove samples from majority class
can be performed with or without replacement
alleviates imbalance in the dataset
may increase the variance of the classifier
may discard useful or important samples
However, random undersampling does not work well with a dataset that has a 99%/1% split between two classes, because balancing the classes would mean discarding almost all of the majority class. More generally, undersampling can result in losing information that is useful for a model.

Instead of random undersampling, another approach involves adding samples to the minority class. The simplest such technique is random oversampling, which duplicates randomly chosen examples from the minority class.
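Random undersampling and random oversampling can be sketched in a few lines. The following is a minimal illustration that uses only the Python standard library; the function names, the toy dataset, and the parameters `n_keep` and `n_extra` are assumptions made for this example and do not come from any particular library.

```python
import random
from collections import Counter

def random_undersample(samples, labels, majority, n_keep, seed=0):
    """Keep n_keep randomly chosen majority-class rows (sampled without replacement)."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = set(rng.sample(majority_idx, n_keep))
    idx = [i for i, y in enumerate(labels) if y != majority or i in keep]
    return [samples[i] for i in idx], [labels[i] for i in idx]

def random_oversample(samples, labels, minority, n_extra, seed=0):
    """Append n_extra duplicates of randomly chosen minority-class rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    return [samples[i] for i in idx], [labels[i] for i in idx]

# Toy dataset with the 99%/1% split mentioned in the text.
X = [[float(i)] for i in range(100)]
y = ["A"] * 99 + ["B"]

X_under, y_under = random_undersample(X, y, majority="A", n_keep=10)
X_over, y_over = random_oversample(X, y, minority="B", n_extra=98)

under_counts = Counter(y_under)
over_counts = Counter(y_over)
print(under_counts)
print(over_counts)
```

Note that with a single minority example, oversampling simply repeats that one row 98 more times, which is why duplication alone adds no new information.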
A second technique, which often works better than duplication, involves the following:

synthesize (generate) new examples from the minority class
a type of data augmentation for tabular data
this technique can be very effective
The best-known technique of this kind is SMOTE (Synthetic Minority Oversampling Technique), which performs data augmentation (i.e., synthesizing new data samples) before you use a classification algorithm. SMOTE was initially developed by means of the kNN algorithm (other options are available), and it can be an effective technique for handling imbalanced classes.

Yet another option to consider is the Python package imbalanced-learn in the scikit-learn-contrib project. This package provides various resampling techniques for datasets that exhibit class imbalance. More details are available online:
https://github.com/scikit-learn-contrib/imbalanced-learn

WHAT IS SMOTE

SMOTE is a technique for synthesizing new samples for a dataset. This technique is based on linear interpolation:

Step 1: Select samples that are close in the feature space.
Step 2: Draw a line between the samples in the feature space.
Step 3: Draw a new sample at a point along that line.
A more detailed explanation of the SMOTE algorithm is as follows:
Select a random sample “a” from the minority class.
Find the k nearest neighbors of that sample within the minority class.
Select a random neighbor “b” from those nearest neighbors.
Create a line “L” that connects “a” and “b.”
Randomly select one or more points “c” on line L.
If need be, you can repeat this process for the other (k-1) nearest neighbors to distribute the synthetic values more evenly among the nearest neighbors.
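The steps above can be sketched as follows. This is a simplified illustration of the interpolation idea under stated assumptions, not the reference SMOTE implementation: the function name `smote_sample`, the toy points, and the use of Euclidean distance are all choices made for this example.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Return (a, b, c), where c is a synthetic point on the segment from a to b."""
    if rng is None:
        rng = random.Random(0)
    a = rng.choice(minority)                          # random minority sample "a"
    others = [p for p in minority if p is not a]
    # k nearest neighbors of "a" (Euclidean distance) within the minority class
    neighbors = sorted(others, key=lambda p: math.dist(a, p))[:k]
    b = rng.choice(neighbors)                         # random neighbor "b"
    t = rng.random()                                  # position along the line L
    c = [ai + t * (bi - ai) for ai, bi in zip(a, b)]  # new point "c" on L
    return a, b, c

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
for a, b, c in synthetic:
    print(a, b, c)
```

Each synthetic point lies between an existing minority sample and one of its neighbors, so the new data stays inside the region already occupied by the minority class.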

The initial SMOTE algorithm is based on the kNN algorithm, and it has been extended in various ways, such as replacing kNN with SVM. A list of SMOTE extensions is shown as follows:

selective synthetic sample generation
Borderline-SMOTE (kNN)
Borderline-SMOTE (SVM)
Adaptive Synthetic Sampling (ADASYN)

ANALYZING CLASSIFIERS

This section is marked “optional” because its contents pertain to machine learning classifiers, which are not the focus of this book. However, it’s still worthwhile to glance through the material, or perhaps return to this section after you have a basic understanding of machine learning classifiers.

Several well-known techniques are available for analyzing the quality of machine learning classifiers. Two techniques are LIME and ANOVA, both of which are discussed in the following subsections.

LIME is an acronym for Local Interpretable Model-Agnostic Explanations. LIME is a model-agnostic technique that can be used with machine learning models. In LIME, you make small random changes to data samples and then observe the manner in which predictions change (or not). The approach involves changing the input (slightly) and then observing what happens to the output.

By way of analogy, consider food inspectors who test for bacteria in truckloads of perishable food. Clearly, it’s infeasible to test every food item in a truck (or a train car), so inspectors perform “spot checks” that involve testing randomly selected items. In an analogous fashion, LIME makes small changes to input data in random locations and then analyzes the changes in the associated output values.

However, there are two caveats to keep in mind when you use LIME with input data for a given model:

The actual changes to input values are model-specific.
This technique works on input that is interpretable.
Examples of interpretable models include machine learning classifiers such as decision trees (and, to a lesser extent, random forests), and examples of interpretable input include NLP representations such as BoW (Bag of Words), where each feature corresponds to a word. Non-interpretable input involves “dense” data, such as a word embedding (which is a vector of floating point numbers).

You could also substitute your model with another model that involves interpretable data, but then you need to evaluate how accurate the approximation is to the original model.
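The perturb-and-observe idea behind LIME can be illustrated with a bag-of-words input. The sketch below is a simplification and not the LIME algorithm itself (which fits a weighted local surrogate model): it randomly drops words and compares the average prediction when each word is kept versus dropped. The function `black_box` is a hypothetical stand-in for a trained classifier, and all names and numbers are assumptions for this example.

```python
import random

POSITIVE = {"great"}
NEGATIVE = {"boring"}

def black_box(words):
    """Hypothetical classifier: returns a 'positive sentiment' score."""
    score = 0.5
    score += 0.2 * sum(w in POSITIVE for w in words)
    score -= 0.2 * sum(w in NEGATIVE for w in words)
    return score

def average(values):
    return sum(values) / len(values)

def word_effects(words, n_samples=500, seed=0):
    """Estimate each word's effect as mean(score | kept) - mean(score | dropped)."""
    rng = random.Random(seed)
    kept = {w: [] for w in words}
    dropped = {w: [] for w in words}
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]   # randomly keep or drop each word
        score = black_box([w for w, keep in zip(words, mask) if keep])
        for w, keep in zip(words, mask):
            (kept if keep else dropped)[w].append(score)
    return {w: average(kept[w]) - average(dropped[w]) for w in words}

sentence = "this movie had great acting but boring music".split()
effects = word_effects(sentence)
for word, effect in sorted(effects.items(), key=lambda item: item[1]):
    print(f"{word:>8}: {effect:+.3f}")
```

Words whose removal changes the score the most receive the largest effects, which is the kind of per-feature explanation that works because each bag-of-words feature is interpretable.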


MISSING DATA, ANOMALIES, AND OUTLIERS

Posted on May 31, 2022 by statistics-lab

Missing Data

How you decide to handle missing data depends on the specific dataset. Here are some ways to handle missing data (the first three techniques are manual techniques, and the other techniques are algorithms):

replace missing data with the mean/median/mode value
infer (“impute”) the value for missing data
delete rows with missing data
isolation forest (tree-based algorithm)
minimum covariance determinant
local outlier factor
one-class SVM (Support Vector Machines)
In general, replacing a missing numeric value with zero is a risky choice: this value is obviously incorrect if the values of a feature are between 1,000 and 5,000. For a feature that has numeric values, replacing a missing value with the mean is usually better than replacing it with zero (unless the mean equals zero); also consider using the median, which is less sensitive to extreme values. For categorical data, consider using the mode to replace a missing value.
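As a small illustration of mean, median, and mode imputation, the following sketch uses Python’s standard `statistics` module. The helper name `impute`, the use of `None` to mark a missing value, and the sample data are assumptions for this example.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

prices = [1000, 2000, None, 5000, 2000]   # numeric feature with one missing value
colors = ["red", "blue", None, "red"]     # categorical feature with one missing value

prices_mean = impute(prices, "mean")
prices_median = impute(prices, "median")
colors_mode = impute(colors, "mode")

print(prices_mean)
print(prices_median)
print(colors_mode)
```

Here the single large value (5000) pulls the mean above the median, which is one reason the median can be the safer choice for skewed features.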

If you are not confident that you can impute a “reasonable” value, consider dropping the row with the missing value. You can also train one model on the dataset with the imputed value and another model on the dataset with the row deleted, and then compare the results.

One problem that can arise after removing rows with missing values is that the resulting dataset is too small. In this case, consider using SMOTE, which is discussed later in this chapter, in order to generate synthetic data.

Anomalies and Outliers

In simplified terms, an outlier is an abnormal data value that is outside the range of “normal” values. For example, a person’s height in centimeters is typically between 30 centimeters and 250 centimeters. Hence, a data point (e.g., a row of data in a spreadsheet) with a height of 5 centimeters or a height of 500 centimeters is an outlier. The consequences of these outlier values are unlikely to involve a significant financial or physical loss (though they could adversely affect the accuracy of a trained model).

Anomalies are also outside the “normal” range of values (just like outliers), and they are typically more problematic than outliers: anomalies can have more severe consequences than outliers. For example, consider the scenario in which someone who lives in California suddenly makes a credit card purchase in New York. If the person is on vacation (or a business trip), then the purchase is an outlier (it’s outside the typical purchasing pattern), but it’s not an issue. However, if that person was in California when the credit card purchase was made, then it’s most likely credit card fraud, as well as an anomaly.

Unfortunately, there is no simple way to decide how to deal with anomalies and outliers in a dataset. Although you can drop rows that contain outliers, keep in mind that doing so might deprive the dataset (and therefore the trained model) of valuable information. You can try modifying the data values (as described later), but again, this might lead to erroneous inferences in the trained model. Another possibility is to train one model with the dataset that contains anomalies and outliers, and then train a second model with a dataset from which the anomalies and outliers have been removed. Compare the two results and see if you can infer anything meaningful regarding the anomalies and outliers.

Outlier Detection

Although the decision to keep or drop outliers is your decision to make, there are some techniques available that help you detect outliers in a dataset. This section contains a short list of some techniques, along with a very brief description and links for additional information.

Perhaps trimming is the simplest technique (apart from dropping outliers), which involves removing rows whose feature value is in the upper 5% range or the lower 5% range. Winsorizing the data is an improvement over trimming: instead of removing rows, set the values in the top 5% range equal to the value at the 95th percentile, and set the values in the bottom 5% range equal to the value at the 5th percentile.
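Trimming and winsorizing can be sketched as follows. This minimal version works by count (the lowest and highest 5% of the sorted values) rather than by an interpolated percentile, which is one of several reasonable conventions; the function names, the `fraction` parameter, and the sample heights are assumptions for this example.

```python
def trim(values, fraction=0.05):
    """Drop the lowest and highest `fraction` of the values."""
    k = int(len(values) * fraction)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def winsorize(values, fraction=0.05):
    """Clamp the lowest and highest `fraction` of the values to the nearest retained value."""
    k = int(len(values) * fraction)
    ordered = sorted(values)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return [min(max(v, low), high) for v in values]

# 20 heights in centimeters, including two implausible values (5 and 500).
heights = [5] + list(range(151, 169)) + [500]

trimmed = trim(heights)
winsorized = winsorize(heights)

print(len(trimmed), min(trimmed), max(trimmed))
print(len(winsorized), winsorized[0], winsorized[-1])
```

Trimming shrinks the dataset from 20 rows to 18, whereas winsorizing keeps all 20 rows and only replaces the two extreme values.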

The Minimum Covariance Determinant is a covariance-based technique, and a Python-based code sample that uses this technique is available online:
https://scikit-learn.org/stable/modules/outlier_detection.html
The Local Outlier Factor (LOF) technique is an unsupervised technique that calculates a local anomaly score by comparing the density around a data point with the density around its k nearest neighbors (kNN). Documentation and short code samples that use LOF are available online:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Two other techniques involve the HuberRegressor and Ridge classes, both of which are included in scikit-learn. The Huber loss is less sensitive to outliers than the squared error because it is quadratic only for small residuals and grows linearly for large residuals, similar to the MAE (Mean Absolute Error). A code sample that compares Huber and Ridge is available online:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_huber_vs_ridge.html
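To see why the Huber loss is less sensitive to outliers, it helps to compare it with the squared error on a small and a large residual. The sketch below uses the standard textbook definition of the Huber loss; the threshold name `delta` and its default value of 1.0 are assumptions for this example and are not taken from scikit-learn.

```python
def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond that threshold."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def squared_loss(residual):
    """Half the squared error, scaled to match the Huber loss near zero."""
    return 0.5 * residual * residual

small, large = 0.5, 10.0
print(huber_loss(small), squared_loss(small))   # identical for a small residual
print(huber_loss(large), squared_loss(large))   # Huber grows far more slowly
```

For the residual of 10, the squared loss is 50 while the Huber loss is 9.5, so a single outlier pulls a Huber-based fit much less than it pulls a least-squares fit.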

You can also explore the Theil-Sen estimator and RANSAC, which are “robust” against outliers:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_theilsen.html and
https://en.wikipedia.org/wiki/Random_sample_consensus
Four algorithms for outlier detection are discussed at the following site:
https://www.kdnuggets.com/2018/12/four-techniques-outlier-detection.html

One other scenario involves “local” outliers. For example, suppose that you use kMeans (or some other clustering algorithm) and determine that a value is an outlier with respect to one of the clusters. While this value is not necessarily an “absolute” outlier, detecting such a value might be important for your use case.


Scaling Numeric Data via Standardization

Posted on May 31, 2022 by statistics-lab

The standardization technique involves finding the mean $\mu$ (mu) and the standard deviation $\sigma$ (sigma), and then mapping each value $x_i$ to $(x_i-\mu)/\sigma$. Recall the following formulas:
$$
\mu=\frac{1}{n} \sum_{i=1}^{n} x_i
$$
$$
\operatorname{variance}(x)=\frac{1}{n} \sum_{i=1}^{n}\left(x_i-\mu\right)^2
$$
$$
\sigma=\sqrt{\operatorname{variance}(x)}
$$
As a simple illustration of standardization, suppose that the random variable $x$ has the values $\{-1,0,1\}$. Then $\mu$ and $\sigma$ are calculated as follows:
$$
\mu=\frac{-1+0+1}{3}=0
$$
$$
\operatorname{variance}(x)=\frac{(-1-0)^2+(0-0)^2+(1-0)^2}{3}=\frac{2}{3}
$$
$$
\sigma=\sqrt{2/3} \approx 0.816
$$
Hence, the standardization of $\{-1,0,1\}$ is $\{-1/0.816,\ 0/0.816,\ 1/0.816\}$, which is approximately $\{-1.2247,0,1.2247\}$ when the unrounded value of $\sigma$ is used.
As another example, suppose that the random variable $x$ has the values $\{-6,0,6\}$. Then $\mu$ and $\sigma$ are calculated as follows:
$$
\mu=\frac{-6+0+6}{3}=0
$$
$$
\operatorname{variance}(x)=\frac{(-6-0)^2+(0-0)^2+(6-0)^2}{3}=\frac{72}{3}=24
$$
$$
\sigma=\sqrt{24} \approx 4.899
$$

Hence, the standardization of $\{-6,0,6\}$ is $\{-6/4.899,\ 0/4.899,\ 6/4.899\}$, which in turn is approximately $\{-1.2247,0,1.2247\}$.
In the preceding two examples, the mean equals 0 in both cases, but the variance and standard deviation are significantly different. Nevertheless, the standardized values are identical, because the second set of values is simply six times the first set. The normalization of a set of values always produces a set of numbers between 0 and 1.

However, the standardization of a set of values can generate numbers that are less than $-1$ or greater than 1; this occurs for every value $x_i$ whose distance $|x_i-\mu|$ from the mean is greater than $\sigma$. In the first example, the largest distance equals 1, whereas $\sigma$ is approximately 0.816, and therefore the largest standardized value is greater than 1.
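The two worked examples can be checked with a short script. This sketch uses the population standard deviation (division by n, matching the variance formula in this section) from Python’s `statistics` module; the function name `standardize` is an assumption for this example.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Map each x to (x - mu) / sigma, using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)   # divides by n, not by n - 1
    return [(x - mu) / sigma for x in values]

z_small = standardize([-1, 0, 1])
z_large = standardize([-6, 0, 6])

print([round(z, 4) for z in z_small])
print([round(z, 4) for z in z_large])
```

Both lists print the same standardized values, which confirms that standardization removes the difference in scale between the two datasets.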

What to Look for in Categorical Data

This section contains various suggestions for handling inconsistent data values, and you can determine which ones to adopt based on any additional factors that are relevant to your particular task. For example, consider dropping columns that have very low cardinality (equal to or close to 1), as well as numeric columns with zero or very low variance.

Next, check the contents of categorical columns for inconsistent spellings or errors. A good example pertains to the gender category, which can consist of a combination of the following values:
male
Male
female
Female
m
f
M
F
The preceding categorical values for gender can be replaced with two categorical values (unless you have a valid reason to retain some of the other values). Moreover, if you are training a model whose analysis involves a single gender, then you need to determine which rows (if any) of a dataset must be excluded. Also check categorical data columns for redundant or missing white spaces.

Check for columns whose values have multiple data types, such as a numeric column in which some values are stored as numbers and others are stored as strings or objects.
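The clean-up steps in this section (inconsistent spellings, stray white space, and mixed data types) can be sketched as follows. The mapping table, the helper names, and the decision to map unrecognized values to `None` are assumptions for this example; your dataset may call for different choices.

```python
# Map every known spelling variant to one of two canonical values.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def clean_gender(value):
    """Strip white space, ignore case, and return 'M', 'F', or None if unrecognized."""
    return GENDER_MAP.get(value.strip().lower())

def to_number(value):
    """Coerce a numeral stored as a string (e.g., ' 4 ') to a float."""
    return None if value is None else float(value)

raw_gender = ["male", "Male ", " female", "Female", "m", "f", "M", "F", "unknown"]
cleaned_gender = [clean_gender(v) for v in raw_gender]

mixed_column = [1, "2", 3.5, " 4 "]
numeric_column = [to_number(v) for v in mixed_column]

print(cleaned_gender)
print(numeric_column)
```

Returning `None` for unrecognized values makes the remaining problem rows easy to find and review, rather than silently assigning them to a category.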

Mapping Categorical Data to Numeric Values

Character data is often called categorical data, examples of which include people’s names, home or work addresses, and email addresses. Many types of categorical data involve short lists of values. For example, the days of the week and the months in a year involve seven and twelve distinct values, respectively. Notice that the days of the week have a relationship: For example, each day has a previous day and a next day. However, the colors of an automobile are independent of each other: the color red is not “better” or “worse” than the color blue.

There are several well-known techniques for mapping categorical values to a set of numeric values. A simple example where you need to perform this conversion involves the gender feature in the Titanic dataset. This feature is one of the relevant features for training a machine learning model. The gender feature has {M, F} as its set of possible values. As you will see later in this chapter, Pandas makes it very easy to convert the set of values {M, F} to the set of values {0, 1}.

Another mapping technique involves mapping a set of categorical values to a set of consecutive integer values. For example, the set {Red, Green, Blue} can be mapped to the set of integers {0, 1, 2}. The set {Male, Female} can be mapped to the set of integers {0, 1}. The days of the week can be mapped to {0, 1, 2, 3, 4, 5, 6}. Note that the first day of the week depends on the country: in some cases it’s Sunday, and in other cases it’s Monday.

Another technique is called one-hot encoding, which converts each value to a vector (check Wikipedia if you need a refresher regarding vectors). Thus, {Male, Female} can be represented by the vectors $[1,0]$ and $[0,1]$, and the colors {Red, Green, Blue} can be represented by the vectors $[1,0,0]$, $[0,1,0]$, and $[0,0,1]$.

If you vertically “line up” the two vectors for gender, they form a $2 \times 2$ identity matrix, and doing the same for the colors will form a $3 \times 3$ identity matrix, as shown here:
$[1,0,0]$
$[0,1,0]$
$[0,0,1]$
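Both mappings (consecutive integers and one-hot vectors) can be written in a few lines of plain Python. The function names below are assumptions for this example, and categories are numbered in order of first appearance so that the result matches the {Red, Green, Blue} to {0, 1, 2} mapping in the text.

```python
def fit_categories(values):
    """Return the distinct categories in order of first appearance."""
    return list(dict.fromkeys(values))

def label_encode(values, categories):
    """Map each category to its integer position in `categories`."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]

def one_hot_encode(values, categories):
    """Map each category to a vector with a single 1 at its position."""
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if j == index[v] else 0 for j in range(len(categories))]
            for v in values]

colors = ["Red", "Green", "Blue", "Green"]
categories = fit_categories(colors)
codes = label_encode(colors, categories)
vectors = one_hot_encode(colors, categories)

print(categories)
print(codes)
for row in vectors:
    print(row)
```

Integer codes suggest an ordering (Blue > Green > Red) that does not really exist for colors, which is the usual reason to prefer one-hot vectors for unordered categories.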


PREPARING DATASETS

Posted on May 31, 2022 by statistics-lab

Discrete Data Versus Continuous Data

As a simple rule of thumb: discrete data is a set of values that can be counted, whereas continuous data must be measured. Discrete data can reasonably fit in a drop-down list of values, but there is no exact value for making such a determination. One person might think that a list of 500 values is discrete, whereas another person might think it’s continuous.

For example, the list of provinces of Canada and the list of states of the United States are discrete data values, but is the same true for the number of countries in the world (roughly 200) or for the number of languages in the world (more than 7,000)?

Values for temperature, humidity, and barometric pressure are considered continuous. Currency is also treated as continuous, even though there is a measurable difference between two consecutive values. The smallest unit of U.S. currency is one penny, which is 1/100th of a dollar (accounting-based measurements use the “mil,” which is 1/1,000th of a dollar).
Continuous data types can have subtle differences. For example, someone who is 200 centimeters tall is twice as tall as someone who is 100 centimeters tall; the same is true for 100 kilograms versus 50 kilograms. However, temperature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit.
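The temperature example can be made concrete by converting to Kelvin, an absolute scale on which ratios are meaningful. The following sketch uses the standard Fahrenheit-to-Kelvin conversion; the function name is an assumption for this example.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins."""
    return (f - 32) * 5 / 9 + 273.15

hot = fahrenheit_to_kelvin(80)
cool = fahrenheit_to_kelvin(40)
ratio = hot / cool

print(round(hot, 2), round(cool, 2), round(ratio, 3))
```

On the absolute scale the ratio is only about 1.08, not 2, because the zero point of the Fahrenheit scale is arbitrary, unlike the zero point of height or weight.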

Furthermore, keep in mind that the meaning of the word “continuous” in mathematics is not necessarily the same as continuous in machine learning. In the former, a continuous variable (let’s say in the 2D Euclidean plane) can have an uncountably infinite number of values. A feature in a dataset that can have more values than can be reasonably displayed in a drop-down list is treated as though it’s a continuous variable.

For instance, values for stock prices are discrete: they must differ by at least a penny (or some other minimal unit of currency), which is to say, it’s meaningless to say that the stock price changes by one-millionth of a penny. However, since there are so many possible stock values, it’s treated as a continuous variable. The same comments apply to car mileage, ambient temperature, and barometric pressure.

“Binning” Continuous Data

Binning refers to subdividing a set of values into multiple intervals, and then treating all the numbers in the same interval as though they had the same value.

As a simple example, suppose that a feature in a dataset contains the ages of people. The range of values is approximately between 0 and 120, and we could bin them into 12 equal intervals, where each consists of 10 values: 0 through 9, 10 through 19, 20 through 29, and so forth.

However, partitioning the values of people’s ages as described in the preceding paragraph can be problematic. Suppose that person A, person B, and person C are 29,30 , and 39 , respectively. Then person $A$ and person $B$ are probably more similar to each other than person $B$ and person C, but because of the way in which the ages are partitioned, $B$ is classified as closer to $C$ than to A. In fact, binning can increase Type I errors (false positive) and Type II errors (false negative), as discussed in this blog post (along with some alternatives to binning):
https://medium.com/@peterflom/why-binning-continuous-data-is-almostalways-a-mistake-ad0b3ald141f.
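The age example above can be reproduced in a few lines. This is a minimal sketch, assuming fixed-width bins of 10 years; the helper name `age_bin` is hypothetical.

```python
def age_bin(age, width=10):
    """Return the index of the fixed-width bin that contains age."""
    return age // width

ages = {"A": 29, "B": 30, "C": 39}
bins = {person: age_bin(age) for person, age in ages.items()}
print(bins)  # {'A': 2, 'B': 3, 'C': 3}

# The raw ages say B is much closer to A; the bins say B and C are the same.
gap_ab = abs(ages["A"] - ages["B"])
gap_bc = abs(ages["B"] - ages["C"])
print(gap_ab, gap_bc)  # 1 9
```

After binning, the one-year gap between A and B becomes a difference of a whole bin, while the nine-year gap between B and C disappears.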

As another example, using quartiles is even more coarse-grained than the earlier age-related binning example. The issue with binning pertains to the consequences of classifying people in different bins, even though they are in close proximity to each other. For instance, some people struggle financially because they earn a meager wage, and they are disqualified from financial assistance because their salary is higher than the cutoff point for receiving any assistance.

机器学习代写|自然语言处理代写NLP代考|Scaling Numeric Data via Normalization

A range of values can vary significantly, and it’s important to note that they often need to be scaled to a smaller range, such as values in the range $[-1,1]$ or $[0,1]$, which you can do via the tanh function or the sigmoid function, respectively.

For example, measuring a person’s height in terms of meters involves a range of values between 0.50 meters and 2.5 meters (in the vast majority of cases), whereas measuring height in terms of centimeters ranges between 50 centimeters and 250 centimeters: these two units differ by a factor of 100. A person’s weight in kilograms generally varies between 5 kilograms and 200 kilograms, whereas measuring weight in grams differs by a factor of 1,000. Distances between objects can be measured in meters or in kilometers, which also differ by a factor of 1,000.

In general, use units of measure so that the data values in multiple features belong to a similar range of values. In fact, some machine learning algorithms require scaled data, often in the range of $[0,1]$ or $[-1,1]$. In addition to the tanh and sigmoid function, there are other techniques for scaling data, such as standardizing data (think Gaussian distribution) and normalizing data (linearly scaled so that the new range of values is in $[0,1]$ ).
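As a rough illustration of the techniques named above, the following sketch applies tanh, sigmoid, and standardization to a small list of made-up values. The sample values and the helper name `sigmoid` are assumptions for this example only; standardization here uses the population standard deviation.

```python
import math
import statistics

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

values = [-3.0, -1.0, 0.0, 2.0, 5.0]  # made-up sample data

squashed_tanh = [math.tanh(v) for v in values]   # each result lies in (-1, 1)
squashed_sigmoid = [sigmoid(v) for v in values]  # each result lies in (0, 1)

# Standardizing: subtract the mean, then divide by the standard deviation,
# which yields values with mean 0 and standard deviation 1.
mean = statistics.mean(values)
std = statistics.pstdev(values)
standardized = [(v - mean) / std for v in values]

print(squashed_tanh)
print(squashed_sigmoid)
print(standardized)
```

Note that tanh and sigmoid are nonlinear, so they compress large values more strongly than small ones, whereas standardization preserves the relative spacing of the data.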

The following examples involve a floating point variable $x$ with different ranges of values that will be scaled so that the new values are in the interval $[0,1]$.

Example 1: If the values of $x$ are in the range $[0,2]$, then $x / 2$ is in the range $[0,1]$.
Example 2: If the values of $x$ are in the range $[3,6]$, then $x-3$ is in the range $[0,3]$, and $(x-3) / 3$ is in the range $[0,1]$.
Example 3: If the values of $x$ are in the range $[-10,20]$, then $x+10$ is in the range $[0,30]$, and $(x+10) / 30$ is in the range of $[0,1]$.
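The three examples follow one pattern: subtract the lower bound, then divide by the width of the range. A minimal sketch of that pattern follows; the function name `min_max_scale` is an assumption, not something defined in the text.

```python
def min_max_scale(x, lo, hi):
    """Linearly map x from the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)

print(min_max_scale(1.0, 0, 2))       # 0.5  (Example 1: x / 2)
print(min_max_scale(6.0, 3, 6))       # 1.0  (Example 2: (x - 3) / 3)
print(min_max_scale(-10.0, -10, 20))  # 0.0  (Example 3: (x + 10) / 30)
print(min_max_scale(5.0, -10, 20))    # 0.5
```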

机器学习代写|tensorflow代写|Polynomial model|Using regression for call-center volume prediction

Posted on 2022年5月31日2022年5月31日 by statistics-lab

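TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used for a range of tasks but has a particular focus on the training and inference of deep neural networks.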

机器学习代写|tensorflow代写|Cleaning the data for regression

First, download this data (a set of phone calls from the summer of 2014 to the New York City 311 service) from http://mng.bz/P16w. Kaggle has other 311 datasets, but you’ll use this particular data due to its interesting properties. The calls are formatted as a comma-separated values (CSV) file that has several interesting features, including the following:

A unique call identifier showing the date when the call was created
The location and ZIP code of the reported incident or information request
The specific action that the agent on the call took to resolve the issue
What borough (such as the Bronx or Queens) the call was made from
The status of the call
This dataset contains a lot of useful information for machine learning, but for purposes of this exercise, you care only about the call-creation date. Create a new file named 311.py. Then write a function to read each line in the CSV file, detect the week number, and sum the call counts by week.

Your code will need to deal with some messiness in this data file. First, you aggregate individual calls, sometimes hundreds in a single day, into a seven-day (weekly) bin, as identified by the bucket variable in listing 4.1. The freq (short for frequency) variable holds the number of calls per week and per year. If the 311 CSV contains more than a year’s worth of data (as other 311 CSVs that you can find on Kaggle do), adapt your code to allow selecting the year of calls to train on. The result of the code in listing 4.1 is a freq dictionary whose values are the number of calls, indexed by year and by week number via the period variable. The t.tm_year variable holds the parsed year that results from passing the call-creation-time value (indexed in the CSV by date_idx, an integer defining the column number where the date field is located) and the date_parse format string to the strptime (string parse time) function in Python’s time library. The date_parse format string is a pattern that defines how the date appears as text in the CSV, so that Python knows how to convert it to a structured time value.
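To make the parsing step concrete, here is a small sketch of how `time.strptime` turns one date string into a year and a weekly bucket. The sample timestamp, the `date_parse` pattern, and the `(year, period)` dictionary key are assumptions chosen for illustration; the actual CSV may format its dates differently, and listing 4.1 is not reproduced here.

```python
import time

# Hypothetical format string and sample value; adjust to match the real CSV.
date_parse = "%m/%d/%Y %H:%M:%S"
created = "07/15/2014 13:45:00"

t = time.strptime(created, date_parse)  # returns a time.struct_time

bucket = 7                              # seven days per bin
period = int(t.tm_yday / bucket)        # week-sized bin within the year

# Count the call under its (year, period) key; this key layout is an assumption.
freq = {}
freq[(t.tm_year, period)] = freq.get((t.tm_year, period), 0) + 1

print(t.tm_year, t.tm_yday, period)  # 2014 196 28
print(freq)                          # {(2014, 28): 1}
```

Here `tm_yday` is the day of the year (1 through 366), so dividing by the bucket size and truncating gives the index of the weekly bin.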

机器学习代写|tensorflow代写|What’s in a bell curve? Predicting Gaussian distributions

A bell or normal curve is a common term for data that fits a normal distribution. The largest Y values of the data occur in the middle, at the mean X value of the distribution of points, and the smaller Y values occur at the head and tail X values of the distribution. We also call this a Gaussian distribution, after the famous German mathematician Carl Friedrich Gauss, who was responsible for the Gaussian function that describes the normal distribution.

We can use the NumPy method np.random.normal to generate random points sampled from the normal distribution in Python. The following equation shows the Gaussian function that underlies this distribution:
$$
e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}}
$$
The equation includes the parameters $\mu$ (pronounced mu) and $\sigma$ (pronounced sigma), where mu is the mean and sigma is the standard deviation of the distribution. Mu and sigma are the parameters of the model, and as you have seen, TensorFlow will learn the appropriate values for these parameters as part of training a model.

To convince yourself that you can use these parameters to generate bell curves, you can type the code snippet in listing 4.3 into a file named gaussian.py and then run it to produce the plot that follows it. The code in listing 4.3 produces the bell curve visualizations shown in figure 4.4. Note that I selected values of mu between -1 and 2, so you should see the center points of the curves at those values in figure 4.4. I also selected standard deviations (sigma) between 1 and 3, so the widths of the curves should correspond to those values, inclusive. The code plots 120 linearly spaced points with X values between -3 and 3 and Y values between 0 and 1 that fit the normal distribution according to mu and sigma, and the output should look like figure 4.4.
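Listing 4.3 itself is not reproduced here. As a stand-in, the following plain-Python sketch (no plotting, and not the book's code) evaluates the Gaussian function at 120 linearly spaced points between -3 and 3 for a few (mu, sigma) pairs. The specific pairs and the helper name `gaussian` are assumptions chosen within the ranges mentioned above.

```python
import math

def gaussian(x, mu, sig):
    """Unnormalized Gaussian: equals 1 at x == mu and decays toward 0."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sig ** 2))

n = 120
xs = [-3.0 + 6.0 * i / (n - 1) for i in range(n)]  # 120 points in [-3, 3]

curves = {}
for mu, sig in [(-1.0, 1.0), (0.0, 2.0), (2.0, 3.0)]:  # illustrative pairs
    ys = [gaussian(x, mu, sig) for x in xs]
    curves[(mu, sig)] = ys
    peak_x = xs[ys.index(max(ys))]  # grid point where the curve is highest
    print("mu:", mu, "sigma:", sig, "peak near x =", round(peak_x, 2))
```

Each curve peaks at the grid point nearest its mu, and a larger sigma produces a wider, flatter curve over the same X range.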

机器学习代写|tensorflow代写|Training your call prediction regressor

Now you are ready to use TensorFlow to fit your NYC 311 data to this model. It’s probably clear by looking at the curves that they seem to comport naturally with the 311 data, especially if TensorFlow can figure out the values of mu that put the center point of the curve near spring and summer and that have a fairly large call volume, as well as the sigma value that approximates the best standard deviation.

Listing 4.4 sets up the TensorFlow training session, associated hyperparameters, learning rate, and number of training epochs. I’m using a fairly large step for the learning rate so that TensorFlow can appropriately scan the values of mu and sig by taking big enough steps before settling down. The number of epochs (5,000) gives the algorithm enough training steps to settle on optimal values. In local testing on my laptop, these hyperparameters arrived at strong accuracy (99%) and took less than a minute. But I could have chosen other hyperparameters, such as a learning rate of 0.5, and given the training process more steps (epochs). Part of the fun of machine learning is hyperparameter tuning, which is more art than science, though techniques such as meta-learning and algorithms such as HyperOpt may ease this process in the future. A full discussion of hyperparameter tuning is beyond the scope of this chapter, but an online search should yield thousands of relevant introductions.

When the hyperparameters are set up, define the placeholders X and Y, which will be used for the input week number and the associated number of calls (normalized), respectively. Earlier, I mentioned normalizing the Y values and creating the ny_train variable in listing 4.2 to ease learning. The reason is that the model Gaussian function that we are attempting to learn has Y values only between 0 and 1 because of the exponential. The model function defines the Gaussian model to learn, with the associated variables mu and sig initialized arbitrarily to 1. The cost function is defined as the L2 norm, and the training uses gradient descent. After training your regressor for 5,000 epochs, the final steps in listing 4.4 print the learned values for mu and sig.
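Listing 4.4 is not shown here, and the sketch below is not the TensorFlow code it contains. It swaps in plain Python with hand-derived gradients to illustrate the same idea: start mu and sig at 1, then use gradient descent on a squared-error cost to recover the parameters of a Gaussian. The synthetic data, the learning rate of 0.1, and all variable names are assumptions for this illustration.

```python
import math

def model(x, mu, sig):
    """Gaussian model whose outputs lie in (0, 1]."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sig ** 2))

# Synthetic, noise-free stand-in for the normalized weekly call counts.
true_mu, true_sig = 2.0, 1.5
xs = [-3.0 + 9.0 * i / 49 for i in range(50)]
ys = [model(x, true_mu, true_sig) for x in xs]

learning_rate = 0.1     # illustrative value, not taken from listing 4.4
training_epochs = 5000
mu, sig = 1.0, 1.0      # arbitrary initial values, as described in the text

for _ in range(training_epochs):
    grad_mu = 0.0
    grad_sig = 0.0
    for x, y in zip(xs, ys):
        pred = model(x, mu, sig)
        err = pred - y
        # Chain rule on the squared error (pred - y)**2.
        grad_mu += 2.0 * err * pred * (x - mu) / sig ** 2
        grad_sig += 2.0 * err * pred * (x - mu) ** 2 / sig ** 3
    # Step against the mean gradient.
    mu -= learning_rate * grad_mu / len(xs)
    sig -= learning_rate * grad_sig / len(xs)

print("learned mu:", round(mu, 3), "learned sig:", round(sig, 3))
```

Because this data is generated without noise from the same model family, the learned values should land close to the generating parameters; real call counts would leave some residual error.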

机器学习代写|tensorflow代写|Polynomial model

Posted on 2022年5月31日2022年5月31日 by statistics-lab

机器学习代写|tensorflow代写|Polynomial model

Linear models may be an intuitive first guess, but real-world correlations are rarely so simple. The trajectory of a missile through space, for example, is curved relative to the observer on Earth. Wi-Fi signal strength degrades with an inverse square law. The change in height of a flower over its lifetime certainly isn’t linear.

When data points appear to form smooth curves rather than straight lines, you need to change your regression model from a straight line to something else. One such approach is to use a polynomial model. A polynomial is a generalization of a linear function. The $n$th degree polynomial looks like the following:
$$
f(x)=w_{n} x^{n}+\ldots+w_{1} x+w_{0}
$$
NOTE When $n=1$, a polynomial is simply a linear equation $f(x)=w_{1} x+w_{0}$.
Consider the scatter plot in figure 3.10, showing the input on the x-axis and the output on the y-axis. As you can tell, a straight line is insufficient to describe all the data. A polynomial function is a more flexible generalization of a linear function.
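To see how the polynomial generalizes the linear case, here is a minimal sketch that evaluates $f(x)$ from a list of coefficients. The function name `poly` and the coefficient ordering (`w[0]` is the constant term) are assumptions made for this example.

```python
def poly(x, w):
    """Evaluate w[0] + w[1]*x + ... + w[n]*x**n using Horner's rule."""
    result = 0.0
    for coeff in reversed(w):
        result = result * x + coeff
    return result

# n = 1: the polynomial is just the line f(x) = 3x + 1.
print(poly(2.0, [1.0, 3.0]))       # 7.0

# n = 2: adding a w_2 term bends the line into a curve.
print(poly(2.0, [1.0, 3.0, 0.5]))  # 9.0
```

With two coefficients the function is exactly a straight line; each extra coefficient adds a higher power of x and therefore more flexibility.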

机器学习代写|tensorflow代写|Regularization

Don’t be fooled by the wonderful flexibility of polynomials, as shown in section 3.3. Just because higher-order polynomials are extensions of lower ones doesn’t mean that you should always prefer the more flexible model.

In the real world, raw data rarely forms a smooth curve mimicking a polynomial. Suppose that you’re plotting house prices over time. The data likely will contain fluctuations. The goal of regression is to represent the complexity in a simple mathematical equation. If your model is too flexible, the model may be overcomplicating its interpretation of the input.

Take, for example, the data presented in figure 3.12. You try to fit an eighth-degree polynomial into points that appear to follow the equation $y=x^{2}$. This process fails miserably, as the algorithm tries its best to update the nine coefficients of the polynomial.

To influence the learning algorithm to produce a smaller coefficient vector (let’s call it $w$), you add a penalty on the size of $w$ to the loss term. To control how significantly you want to weigh the penalty term, you multiply the penalty by a constant non-negative number, $\lambda$, as follows:
$$
\operatorname{Cost}(X, Y)=\operatorname{Loss}(X, Y)+\lambda\|w\|
$$
If $\lambda$ is set to 0, regularization isn’t in play. As you set $\lambda$ to larger and larger values, parameters with larger norms will be heavily penalized. The choice of norm varies case by case, but parameters are typically measured by their L1 or L2 norm. Simply put, regularization reduces some of the flexibility of the otherwise easily tangled model.
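The effect of the regularization weight can be sketched numerically. The example below assumes the penalty is the L2 norm of the coefficient vector; the function names, the fixed loss value, and the sample coefficients are hypothetical.

```python
import math

def l2_norm(w):
    """Euclidean (L2) norm of the coefficient vector."""
    return math.sqrt(sum(c * c for c in w))

def regularized_cost(loss, w, lam):
    """Loss plus the norm penalty, weighted by the non-negative constant lam."""
    return loss + lam * l2_norm(w)

w = [3.0, 4.0]  # sample coefficients; their L2 norm is 5.0
print(regularized_cost(1.0, w, 0.0))  # 1.0 (regularization is off)
print(regularized_cost(1.0, w, 0.5))  # 3.5 (the penalty adds 0.5 * 5.0)
```

With the weight at 0 the cost equals the plain loss; as the weight grows, coefficient vectors with large norms become increasingly expensive, which pushes the learner toward smaller coefficients.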

To figure out which value of the regularization parameter $\lambda$ performs best, you must split your dataset into two disjoint sets. About 70% of the randomly chosen input/output pairs will make up the training dataset; the remaining 30% will be used for testing. You’ll use the function provided in listing 3.4 for splitting the dataset.
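Listing 3.4 is not reproduced here. As a rough idea of what such a split involves, the following sketch shuffles the input/output pairs and cuts them at the 70% mark. The function name `split_dataset`, its parameters, and the fixed seed are assumptions, not the listing's actual code.

```python
import random

def split_dataset(pairs, train_ratio=0.7, seed=0):
    """Shuffle the pairs, then split them into disjoint train and test lists."""
    shuffled = list(pairs)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable results
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

pairs = [(x, x * x) for x in range(10)]    # toy data following y = x^2
train, test = split_dataset(pairs)
print(len(train), len(test))  # 7 3
```

Every pair ends up in exactly one of the two lists, so the model is never tested on data it was trained on.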

机器学习代写|tensorflow代写|Application of linear regression

Running linear regression on fake data is like buying a new car and never driving it. This awesome machinery begs to manifest itself in the real world! Fortunately, many datasets are available online to test your newfound knowledge of regression:

The University of Massachusetts Amherst supplies small datasets of various types at https://scholarworks.umass.edu/data.
Kaggle provides all types of large-scale data for machine-learning competitions at https://www.kaggle.com/datasets.
Data.gov (https://catalog.data.gov) is an open data initiative by the US government that contains many interesting and practical datasets.

A good number of datasets contain dates. You can find a dataset of all phone calls to the 311 nonemergency line in Los Angeles, California, for example, at https://www.dropbox.com/s/naw774olqkve7sc/311.csv?dl=0. A good feature to track could be the frequency of calls per day, week, or month. For convenience, listing 3.6 allows you to obtain a weekly frequency count of data items.

import csv
import time

def read(filename, date_idx, date_parse, year, bucket=7):
    days_in_year = 365
    # Set up the initial frequency map: one zeroed counter per period.
    freq = {}
    for period in range(0, int(days_in_year / bucket)):
        freq[period] = 0
    # Read the data and aggregate the count per period.
    with open(filename, newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        next(csvreader)  # skip the header row
        for row in csvreader:
            if row[date_idx] == '':
                continue  # ignore rows that have no date
            t = time.strptime(row[date_idx], date_parse)
            if t.tm_year == year and t.tm_yday < (days_in_year - 1):
                freq[int(t.tm_yday / bucket)] += 1
    return freq
This code gives you the training data for linear regression. The freq variable is a dictionary that maps a period (such as a week) to a frequency count. A year has 52 weeks, so you’ll have 52 data points if you leave bucket=7 as is.
