Machine Learning | COMP5328

Posted on July 24, 2023; updated August 25, 2023, by statistics-lab

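Machine learning is exciting. It is fun, challenging, creative, and intellectually stimulating. It also makes money for companies, autonomously handles large volumes of tasks, and removes the burden of monotonous work from people who would rather be doing something else.

Machine learning is also very complex. With thousands of algorithms, hundreds of open source packages, and practitioners who need skills ranging from data engineering (DE) to advanced statistical analysis and visualization, the work required of a professional ML practitioner is truly daunting. Adding to this complexity is the need to work cross-functionally with a wide range of specialists, subject-matter experts (SMEs), and business unit groups, communicating and collaborating on both the nature of the problem being solved and the output of the ML-backed solution.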

Have you met your data?

What I mean by meeting isn’t the brief and polite nod of acknowledgment when passing your data on the way to refill your coffee. Nor is it the 30-second rushed, socially awkward introduction at a tradeshow meetup. Instead, the meeting that you should be having with your data is more like an hours-long private conversation in a quiet, well-furnished speakeasy over a bottle of Macallan Rare Cask, sharing insights and delving into the nuances of what embodies the two of you as dram after silken dram caresses your digestive tracts: really and truly getting to know it.
TIP Before writing a single line of code, even for experimentation, make sure you have the data needed to answer the basic nature of the problem in the simplest way possible (an if/else statement). If you don’t have it, see if you can get it. If you can’t get it, move on to something you can solve.
As an example of the dangers of a mere passing casual rendezvous with data being used for problem solving, let’s pretend that we both work at a content provider company. Because of the nature of the business model at our little company, our content is listed on the internet behind a timed paywall. For the first few articles that are read, no ads are shown, content is free to view, and the interaction experience is bereft of interruptions. After a set number of articles, an increasingly obnoxious series of pop-ups and disruptions is presented to coerce a subscription registration from the reader.

The prior state of the system was set by a basic heuristic controlled through the counting of article pages that the end user had seen. Realizing that this would potentially be off-putting for someone browsing during their first session on the platform, this was then adjusted to look at session length and an estimate of how many lines of each article had been read. As time went on, this seemingly simple rule set became so unwieldy and complex that the web team asked our DS team to build something that could predict on a per-user level the type and frequency of disruptions that would maximize subscription rates.
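To make the heuristic concrete, here is a minimal sketch of what such a rule set might look like. Everything in it is an assumption made for illustration: the `SessionStats` fields, the thresholds, and the three disruption levels do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Hypothetical per-user counters; the field names are illustrative only."""
    articles_viewed: int
    session_minutes: float
    est_lines_read: int


def disruption_level(stats: SessionStats, free_articles: int = 3) -> str:
    """Rule-based baseline deciding how aggressively to interrupt the reader."""
    # The first few articles are free of interruptions.
    if stats.articles_viewed <= free_articles:
        return "none"
    # A long, engaged session gets a softer prompt (thresholds are made up).
    if stats.session_minutes > 20 and stats.est_lines_read > 200:
        return "banner"
    return "popup"


print(disruption_level(SessionStats(2, 5.0, 40)))     # none
print(disruption_level(SessionStats(6, 30.0, 450)))   # banner
print(disruption_level(SessionStats(6, 4.0, 60)))     # popup
```

Each new business rule adds another branch, which is how a rule set like this grows unwieldy. Note that every input it reads must also be available wherever the decision is made.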

We spend a few months, mostly using the prior work that was built to support the heuristics approach, having the data engineering team create mirrored ETL processes of the data structures and manipulation logic that the frontend team has been using to generate decision data. With the data available in the data lake, we proceed to build a highly effective and accurate model that seems to perform exceptionally well on all of our holdout tests.

Make sure you have the data

This example might seem a bit silly, but I’ve seen this situation play out dozens of times. Being unable to get at the right data for model serving is a common problem.
I’ve seen teams work with a manually extracted dataset (a one-time extract), build a truly remarkable solution with that data, and when ready to release the project to production, realize at the 11th hour that the process for building that one-time extract required entirely manual actions by a DE team. The necessary data to make the solution effective was siloed off in a production infrastructure that the DS and DE teams had no ability to access. Figure 14.2 shows a rather familiar sight that I’ve borne witness to far too many times.

With no infrastructure present to bring the data into a usable form for predictions, as shown in figure 14.2, an entire project needs to be created for the DE team to build the ETL needed to materialize the data in a scheduled manner. Depending on the complexity of the data sources, this could take a while. Building hardened production-grade ETL jobs that pull from multiple production relational databases and in-memory key-value stores is not a trivial reconciliation act, after all. Delays like this could lead (and have led) to project abandonment, regardless of the predictive capabilities of the DS portion of the solution.

This problem of complex ETL job creation becomes even more challenging if the predictions need to be conducted online. At that point, it’s not a question of the DE team working to get ETL processes running; rather, disparate groups in the engineering organization will have to accumulate the data into a single place in order to generate the collection of attributes that can be fed into a REST API request to the ML service.

This entire problem is solvable, though. During the time of EDA, the DS team should be evaluating the nature of the data generation, asking pointed questions to the data warehousing team:

- Can the data be condensed to the fewest possible tables to reduce costs?
- What is the team’s priority for fixing these sources if something breaks down?
- Can I access this data from both the training and serving layers?
- Will querying this data for serving meet the project SLA?


Machine Learning | COMP5318

Posted on July 24, 2023; updated August 25, 2023, by statistics-lab

Unintentional obfuscation: Could you read this if you didn’t write it?

A rather unique form of ML hubris materializes in the form of code development practices. Sometimes malicious, many times driven by ego (and a desire to be revered), but mostly due to inexperience and fear, this particular destructive activity takes shape through the creation of unintelligibly complex code.

For our scenario, let’s take a look at a common and somewhat simplistic task: recasting data types to support feature-engineering tasks. In this journey of comparative examples, we’ll take a dataset whose features (and the target field) need to have their types modified to support the pipeline-enabled processing stages to build a model. This problem, at its most simplistic implementation, is shown in the next listing.

From this relatively simple and imperative-style implementation of casting fields in a DataFrame, we’ll look at examples of obfuscation and discuss the impacts that each might have for something as seemingly simple as this use case.
NOTE In the next section, we’ll look at bad habits that some ML engineers have when writing code. Listing 13.3, it must be mentioned, is not intended to be disparaging in its approach and implementation. There is nothing wrong with an imperative approach when building ML code bases (provided the code base doesn’t have tight coupling requiring dozens of edits if one column changes). It becomes a problem only when the complexity of the solution makes modifying imperative code a burden. If the project is simple enough, stick with simpler code. You’ll thank yourself for the simplicity when you need to modify it and add new features.
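The listing referred to above is not reproduced on this page. As a stand-in, the following is a minimal sketch of imperative type casting, assuming a small dataset whose values all arrive as strings. It uses plain dictionaries rather than a DataFrame so that it runs without extra libraries. The column names and values are hypothetical.

```python
# Hypothetical raw rows, e.g. as read from a CSV where every value is a string.
raw_rows = [
    {"age": "34", "income": "51000.5", "is_subscriber": "true", "label": "1"},
    {"age": "27", "income": "38250.0", "is_subscriber": "false", "label": "0"},
]


def cast_row(row: dict) -> dict:
    """Imperative casting: one explicit statement per column."""
    return {
        "age": int(row["age"]),
        "income": float(row["income"]),
        "is_subscriber": row["is_subscriber"].lower() == "true",
        "label": int(row["label"]),  # the target field is cast as well
    }


typed_rows = [cast_row(r) for r in raw_rows]
print(typed_rows[0])
```

The code is repetitive, but anyone can see at a glance which column becomes which type, and changing one column means editing one line.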

The flavors of obfuscation

This section progresses through a sliding scale of complexity, with code examples that become progressively less intelligible, more complex, and increasingly harder to maintain. We’ll analyze bad habits of some developers to aid you in identifying these coding patterns and to call them out for what they are: crippling to productivity and absolutely requiring refactoring to be maintainable.

If you find yourself going down one of these rabbit holes, these examples can serve as a reminder to not follow these patterns. But before we get to the examples, let’s look at the personas that I’ve seen with respect to development habits, shown in figure 13.3.

These personas are not meant to identify a particular person, but rather to describe traits that a DS may go through during their journey of becoming a better developer. A nearly overwhelming number of people I’ve met (as well as myself) started off writing code as the Hacker. We’d find ourselves stuck on a problem that we’d never encountered before and instantly move to search online for a solution, copy someone’s code, and if it worked, move on. (I’m not saying that looking on the internet or in books for information is a bad thing; even the most experienced developers do this quite frequently.)

As coding experience becomes deeper, some may lean toward one of the other three coding styles or, if they’re mentored properly, move directly to the center region. Some people have something to prove, usually only to themselves, as most people just want their peers to write the sort of code that comes from a Good Samaritan developer. Others may feel that the least number of lines of code is an effective development strategy, though they’re sacrificing legibility, extensibility, and testability in the process. Figure 13.4 shows the patterns that I’ve come across (and personally experienced).

This circuitous path leads to increasingly complex and unnecessarily complicated implementations before landing on the pinnacle of wisdom-fueled experience. The best we can hope for while making this journey is to have the ability to recognize and learn the better path: specifically, that the simplest solution to a problem (that still meets the requirements of the task) is always the best way to solve it.


Clarifying correlation vs. causation

Posted on July 7, 2023 by statistics-lab

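Machine learning is a field of study devoted to understanding and building methods that "learn," that is, methods that use data to improve performance on some set of tasks. Machine learning algorithms build a model from sample data (known as training data) in order to make predictions or decisions without being explicitly programmed to do so. They are used in a wide variety of applications, such as medicine, email filtering, speech recognition, and computer vision, where developing conventional algorithms to perform the needed tasks is difficult or infeasible.

Machine learning programs can perform tasks without being explicitly programmed. The computer learns from the data provided so that it can carry out certain tasks. For simple tasks, it is possible to write an algorithm that tells the machine how to execute every step needed to solve the problem at hand; no learning is needed on the computer's part. For more advanced tasks, manually creating the needed algorithm can be a challenge for a human. In practice, it can be more effective to help the machine develop its own algorithm than to have human programmers specify every needed step.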

An important part of presenting model results to a business unit is to be clear about the differences between correlation and causation. If there is even a slight chance of business leaders inferring a causal relationship from anything that you are showing them, it’s best to have this chat.

Correlation is simply the relationship or association that observed variables have to one another. It does not imply any meaning apart from the existence of this relationship. This concept is inherently counterintuitive to laypersons who are not involved in analyzing data. Making reductionist conclusions that “seem to make sense” about the data relationships in an analysis is effectively how our brains are wired.

For example, we could collect sales data for ice cream trucks and sales of mittens, both aggregated by week of year and country. We could calculate a strong negative correlation between the two (ice cream sales go up as mitten sales go down, and vice versa). Most people would chuckle at a conclusion of causality: “Well, if we want to sell more ice cream, we need to reduce our supply of mittens!”

What a layperson might instantly state from such a silly example is, “Well, people buy mittens when it’s cold and ice cream when it’s hot.” This is an attempt at defining causation. Based on this negative correlation in the observed data, we definitely can’t make such an inference regarding causation. We have no way of knowing what actually influenced the effect of purchasing ice cream or mittens on an individual basis (per observation).
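The negative correlation described above can be computed directly. The sketch below implements the Pearson correlation coefficient, r = cov(x, y) / (σ_x σ_y), on weekly sales figures that are entirely made up for illustration. The coefficient only measures how the two series move together. It carries no information about why.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly unit sales, for illustration only.
ice_cream = [120, 150, 200, 260, 310, 280, 190, 130]
mittens = [90, 70, 40, 20, 10, 15, 50, 85]

r = pearson(ice_cream, mittens)
print(f"r = {r:.2f}")  # strongly negative for these numbers
```

A value near -1 says the two series move in opposite directions. It says nothing about whether either one drives the other or whether a third factor drives both.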

If we were to introduce an additional confounding variable to this analysis (outside temperature), we might find additional confirmation of our spurious conclusion. However, this ignores the complexity of what drives decisions to purchase. As an example, see figure 11.7.

It’s clear that a relationship is present. As temperature increases, ice cream sales increase as well. The relationship being exhibited is fairly strong. But can we infer anything other than the fact that there is a relationship?

Let’s look at another plot. Figure 11.8 shows an additional observational data point that we could put into a model to aid in predicting whether someone might want to buy our ice cream.

Leveraging A/B testing for attribution calculations

In the previous section, we established the importance of attribution measurement. For our ice cream coupon model, we defined a methodology to split our customer base into different cohort segments to minimize latent variable influence. We’ve defined why it’s so critical to evaluate the success criteria of our implementation based on business metrics associated with what we’re trying to improve (our revenue).

Armed with this understanding, how do we go about calculating the impact? How can we make an adjudication that is mathematically sound and provides an irrefutable assessment of something as complex as a model’s impact on the business?
A/B testing 101
Now that we have defined our cohorts by using a simple percentile-based RFM segmentation (the three groups that we assigned to customers in section 11.1.1), we’re ready to conduct random stratified sampling of our customers to determine which coupon experience they will get.

The control group will be getting the pre-ML treatment of a generic coupon being sent to their inbox on Mondays at 8 a.m. PST. The test group will be getting the targeted content and delivery timing.
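One way to sketch the random stratified sampling described above is to shuffle customers within each cohort and assign a fixed fraction of each cohort to the test group. The cohort labels, the 50/50 split, and the `stratified_assign` helper are assumptions made for this example; the text does not specify them.

```python
import random
from collections import defaultdict


def stratified_assign(customers, test_fraction=0.5, seed=42):
    """Assign each (customer_id, cohort) pair to 'test' or 'control',
    keeping the test fraction the same inside every cohort."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_cohort = defaultdict(list)
    for customer_id, cohort in customers:
        by_cohort[cohort].append(customer_id)

    assignment = {}
    for cohort, ids in by_cohort.items():
        rng.shuffle(ids)  # randomize order within the stratum
        n_test = round(len(ids) * test_fraction)
        for i, customer_id in enumerate(ids):
            assignment[customer_id] = "test" if i < n_test else "control"
    return assignment


# Hypothetical customers spread over three RFM cohorts.
cohorts = ["high"] * 4 + ["medium"] * 6 + ["low"] * 10
customers = [(f"c{i}", cohort) for i, cohort in enumerate(cohorts)]
groups = stratified_assign(customers)

for name in ("high", "medium", "low"):
    n_test = sum(1 for cid, c in customers if c == name and groups[cid] == "test")
    print(name, "test customers:", n_test)
```

Stratifying this way keeps the cohort mix identical in both groups, so a difference in revenue between test and control is less likely to be an artifact of one group holding more high-value customers.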
NOTE Although simultaneously releasing multiple elements of a project that are all significant departures from the control conditions may seem counterintuitive for hypothesis testing (and it is confounding to a causal relationship), most companies are (wisely) willing to forego scientific accuracy of evaluations in the interest of getting a solution out into the world as soon as possible. If you’re ever faced with this supposed violation of statistical standards, my best advice is this: keep patiently quiet and realize that you can do variation tests later by changing aspects of the implementation in further A/B tests to determine causal impacts to the different aspects of your solution. When it’s time to release a solution, it’s often much more worthwhile to release the best possible solution first and then analyze components later.
Within a short period after production release, people typically want to see plots illustrating the impact as soon as the data starts rolling in. Many line charts will be created, aggregating business parameter results based on the control and test group. Before letting everyone go hog wild with making fancy charts, a few critical aspects of the hypothesis test need to be defined to make it a successful adjudication.


Use of global mutable objects

Posted on July 7, 2023 by statistics-lab

Continuing our exploration of our new team’s existing code base, we’re tackling another new feature to be added. This one adds completely new functionality. In the process of developing it, we realize that a large portion of the necessary logic for our branch already exists and we simply need to reuse a few methods and a function. What we fail to see is that the function uses a declaration of a globally scoped variable. When running our tests for our branch in isolation (through unit tests), everything works exactly as intended. However, the integration test of the entire code base produces a nonsensical result.

After hours of searching through the code and walking through debugging traces, we find that the global variable the function relies on had changed after the function’s first use, rendering our second use of it completely incorrect. We were burned by mutation.
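A minimal sketch of the failure mode just described, assuming a module-level list that a helper function modifies in place. The names `FEATURE_FIELDS` and `fields_for_model` are hypothetical.

```python
# Module-level (global) mutable object shared by every caller.
FEATURE_FIELDS = ["age", "income"]


def fields_for_model(extra_field):
    """Appends to the global list in place, so each call changes shared state."""
    FEATURE_FIELDS.append(extra_field)
    return FEATURE_FIELDS


first = list(fields_for_model("tenure"))   # snapshot of the first result
second = list(fields_for_model("region"))  # snapshot of the second result

print(first)   # ['age', 'income', 'tenure']
print(second)  # ['age', 'income', 'tenure', 'region']
```

A unit test that calls the function once sees exactly what it expects. Only a second caller, as in an integration test, discovers that the first call's field is still there.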
How mutability can burn you
Recognizing how dangerous mutability is can be a bit tricky. Overuse of mutating values, shifting state, and overwriting of data can take many forms, but the end result is typically the same: an incredibly complicated series of bugs. These bugs can manifest themselves in different ways: Heisenbugs seemingly disappear when you’re trying to investigate them, and Mandelbugs are so complex and nondeterministic that they seem to be as complex as a fractal. Refactoring code bases that are riddled with mutation is nontrivial, and many times it’s simply easier to start over from scratch to fix the design flaws.
Issues with mutation and side effects typically don’t rear their heads until long after the initial MVP of a project. Later, in the development process or after a production release, flawed code bases relying on mutability and side effects start to break apart at the seams. Figure 10.3 shows an example of the nuances between different languages and their execution environments and why mutability concerns might not be as apparent, depending on which languages you’re familiar with.

For simplicity’s sake, let’s say that we’re trying to keep track of some fields to include in separate vectors used in an ensemble modeling problem. The following listing shows a simple function that contains a default value within the function signature’s parameters which, when used a single time, will provide the expected functionality.
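The listing itself is not reproduced on this page. The sketch below shows the general shape of such a function in Python, with a hypothetical name and hypothetical field names: a list used as a default argument value.

```python
def add_field(field, fields=[]):  # the default list is created once, at definition time
    """Append a field name to a list of fields and return the list."""
    fields.append(field)
    return fields


first = add_field("age")
print(first)            # ['age']  (a single use behaves as expected)

second = add_field("income")
print(second)           # ['age', 'income']  (the earlier value leaked in)
print(first is second)  # True: both names refer to the same default list
```

Because the default list lives as long as the function object does, every call that omits `fields` keeps appending to the same list.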

Encapsulation to prevent mutable side effects

Had we known that Python evaluates a default argument once, when the function is defined, and that a mutable default such as a list therefore persists across calls, we could have anticipated this behavior. Rather than relying on a default argument to keep calls isolated from one another, we should have initialized this function with a state that could be checked against.

By performing this simple state validation, we are letting the interpreter know that in order to satisfy the logic, a new object needs to be created to store the new list of values. The proper implementation for checking on instance state in Python for collection mutation is shown in the following listing.
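One common Python idiom for the state check described above is a `None` sentinel: the default is `None`, and a fresh list is created inside the function whenever no list was passed. This is a sketch with hypothetical names, not the book's listing.

```python
def add_field(field, fields=None):
    """Append a field name to a list of fields and return the list."""
    if fields is None:
        fields = []  # a new list is created on every call that omits `fields`
    fields.append(field)
    return fields


first = add_field("age")
second = add_field("income")

print(first)            # ['age']
print(second)           # ['income']
print(first is second)  # False: each call got its own list
```

A list that the caller passes in explicitly is still modified in place. If that is undesirable, copying it first (for example with `list(fields)`) keeps the caller's object untouched.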

Seemingly small issues like this can create endless headaches for the person (or team) implementing a project. Typically, these sorts of problems are developed early on, showing no issues while the modules are being built out. Even simple unit tests that validate this functionality in isolation will appear to be functioning correctly.

It is typically toward the midpoint of an MVP that issues involving mutability begin to rear their ugly heads. As greater complexity is built out, functions and classes may be utilized multiple times (which is a desired pattern in development), and if not implemented properly, what was seeming to work just fine before now results in difficult-to-troubleshoot bugs.
PRO TIP It’s best to become familiar with the way your development language handles objects, primitives, and collections. Knowing these core nuances of the language will give you the tools necessary to guide your development in a way that won’t create more work and frustration for you throughout the process.
A note on encapsulation
Throughout this book, you’ll see multiple references to me beating a dead horse about using functions in favor of declarative code. You’ll also notice references to favoring classes and methods over functions. This is all due to the overwhelming benefits that come with using encapsulation (and abstraction, but that’s another story discussed elsewhere in the text).
Encapsulating code has two primary benefits:

Restricting end-user access to internal protected functionality, state, or data

Enforcing execution of logic on a bundle of the data being passed in and the logic contained within the method
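Both benefits can be illustrated with a small class. This is a sketch under assumptions: the `FieldRegistry` name and its validation rules are invented for the example. Note that Python's leading-underscore convention signals "internal" rather than enforcing it.

```python
class FieldRegistry:
    """Keeps a list of field names behind a small, validated interface."""

    def __init__(self, fields=None):
        # Copy the input so outside code cannot mutate our internal state.
        self._fields = list(fields) if fields is not None else []

    def add(self, field):
        """All additions pass through the same validation logic."""
        if not isinstance(field, str) or not field:
            raise ValueError("field name must be a non-empty string")
        if field not in self._fields:
            self._fields.append(field)

    @property
    def fields(self):
        """Read-only view: callers receive an immutable tuple."""
        return tuple(self._fields)


source = ["age"]
registry = FieldRegistry(source)
registry.add("income")
registry.add("income")  # duplicates are ignored
source.append("leaked")  # does not affect the registry

print(registry.fields)  # ('age', 'income')
```

The state can change only through `add`, so the validation logic always runs together with the data it guards.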


Walls of text

Posted on July 7, 2023 by statistics-lab

If there was one thing that I learned relatively early in my career as a data scientist, it was that I truly hate debugging. It wasn’t the act of tracking down a bug in my code that frustrated me; rather, it was the process that I had to go through to figure out what went wrong in what I was telling the computer to do.

Like many DS practitioners at the start of their career, when I began working on solving problems with software, I would write a lot of declarative code. I wrote my solutions much in the way that I logically thought about the problem (“I pull my data, then I do some statistical tests, then I make a decision, then I manipulate the data, then I put it in a vector, then into a model …”). This materialized as a long list of actions that flowed directly, one into another. What this programming model meant in the final product was a massive wall of code with no separation or isolation of actions, let alone encapsulation.
Finding the needle in the haystack for any errors in code written in that manner is an exercise in pure, unadulterated torture. The architecture of the code was not conducive to allowing me to figure out which of the hundreds of steps contained therein was causing an issue.

Troubleshooting walls of text (WoT, pronounced What?!) is an exercise in patience that bears few parallels in depth and requisite effort. If you’re the original author of such a display of code, it’s an annoying endeavor (you have no one to hate other than yourself for creating the monstrosity), depressing activity (see prior comment), and time-consuming slog that can be so easily avoided, provided you know how, what, and where to isolate elements within your ML code.

If written by someone else, and you’re the unfortunate heir to the code base, I extend to you my condolences and a hearty “Welcome to the club.” Perhaps a worthy expenditure of your time after fixing the code base would be to mentor the author, provide them with an ample reading list, and help them to never produce such rage-inducing code again.

To have a frame of reference for our discussion, let’s take a look at what one of these WoTs could look like. While the examples in this section are rather simplistic, the intention is to imagine what a complete end-to-end ML project would look like in this format, without having to read through hundreds of lines. (I imagine that you wouldn’t like to flip through dozens of pages of code in a printed book.)

Considerations for monolithic scripts

Aside from being hard to read, listing 9.1’s biggest flaw is that it’s monolithic. Although it is a script, the principles of WoT development can apply to both functions and methods within classes. This example comes from a notebook, which increasingly is the declarative vehicle used to execute ML code, but the concept applies in a general sense.

Having too much logic within the bounds of an execution encapsulation creates problems (since this is a script run in a notebook, the entire code is one encapsulated block). I invite you to think about these issues through the following questions:

What would it look like if you had to insert new functionality in this block of code?
Would it be easy to test if your changes are correct?
What if the code threw an exception?
How would you go about figuring out what went wrong with the code from an exception being thrown?
What if the structure of the data changed? How would you go about updating the code to reflect those changes?
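
One way to make these questions easier to answer is to split the work into small functions that each do one thing. The sketch below is hypothetical: it is not listing 9.1, and the names `load_values`, `remove_outliers`, `summarize`, and `run_pipeline` are invented purely for illustration. The point is only that each step can be tested, replaced, or debugged in isolation, and that a stack trace would name the step that failed.

```python
from statistics import mean, stdev


def load_values(raw_rows):
    """Parse raw string rows into floats, skipping blank rows."""
    return [float(row) for row in raw_rows if row.strip()]


def remove_outliers(values, max_sigma=3.0):
    """Drop values more than max_sigma standard deviations from the mean."""
    if len(values) < 2:
        return list(values)
    center, spread = mean(values), stdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - center) <= max_sigma * spread]


def summarize(values):
    """Return the statistics that a later modeling step would consume."""
    return {"count": len(values), "mean": mean(values), "stdev": stdev(values)}


def run_pipeline(raw_rows):
    """Compose the isolated steps; each one can be tested on its own."""
    return summarize(remove_outliers(load_values(raw_rows)))
```

If the structure of the data changed, only `load_values` would need to be touched in this arrangement, and a unit test on that one function would confirm the fix.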

Before we get into answering some of these questions, let’s look at what this code actually does. Because of the confusing variable names, dense coding structure, and tight coupling of references, we would have to run it to figure out what it’s doing. The next listing shows the first aspect of listing 9.1.

计算机代写|机器学习代写machine learning代考|Logging: Code, metrics, and results

Posted on June 21, 2023 by statistics-lab

计算机代写|机器学习代写machine learning代考|Logging: Code, metrics, and results

Chapters 2 and 3 covered the critical importance of communication about modeling activities, both to the business and among a team of fellow data scientists. Being able to not only show our project solutions, but also keep a provenance history for reference, is just as important to the project’s success as the algorithms used to solve it, if not more so.
For the forecasting project that we’ve been covering through the last few chapters, the ML aspect of the solution isn’t particularly complex, but the magnitude of the problem is. With thousands of airports to model (which, in turn, means thousands of models to tune and keep track of), handling communication and having a reference for historical data for each execution of the project code is a daunting task.

What happens when, after running our forecasting project in production, a member of the business unit team wants an explanation as to why a particular forecast was so far off from the eventual reality of the data that is collected? This is a common question at companies that rely on ML predictions to inform the actions they take in running the business. If a black swan event occurs and the business asks why the modeled forecast didn’t foresee it, the very last thing you want to deal with is trying to regenerate what the model might have forecasted at a certain point in time in order to explain why unpredictable events cannot be modeled.
NOTE A black swan event is an unforeseeable and often catastrophic event that changes the nature of acquired data. While rare, such events can have disastrous effects on models, businesses, and entire industries. Some recent black swan events include the September 11th terrorist attacks, the financial collapse of 2008, and the Covid-19 pandemic. Because these events are far-reaching and entirely unpredictable, their impact on models can be absolutely devastating. The term “black swan” was coined and popularized in reference to data and business in the book The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb (Random House, 2007).
To solve these intractable issues that ML practitioners have historically had to deal with, MLflow was created. The aspect of MLflow that we’re going to look at in this section is the Tracking API, which gives us a place to record all of our tuning iterations, the metrics from each model’s tuning runs, and pre-generated visualizations that can be easily retrieved and referenced from a unified graphical user interface (GUI).

计算机代写|机器学习代写machine learning代考|MLflow tracking

Let’s look at what is going on with the two Spark-based implementations from chapter 7 (section 7.2) as they pertain to MLflow logging. In the code examples shown in that chapter, the MLflow context was initialized in two distinct places.
In the first approach, using SparkTrials as the state-management object (running on the driver), the MLflow context was placed as a wrapper around the entire tuning run within the function run_tuning(). This is the preferred method of orchestrating the tracking of runs when using SparkTrials, because a parent run’s individual child runs can then be associated easily for querying, both from within the tracking server’s GUI and through REST API requests to the tracking server that involve filter predicates.

Figure 8.1 shows a graphical representation of this code when interacting with MLflow’s tracking server. The code records not only the metadata of the parent encapsulating run, but also the per-iteration logging that occurs from the workers as each hyperparameter evaluation happens.

When looking at the actual code manifestation within the MLflow tracking server’s GUI, we can see the results of this parent-child relationship, shown in figure 8.2.

Conversely, the approach used for the pandas_udf implementation is slightly different. In chapter 7’s listing 7.10, each individual iteration that Hyperopt executes requires the creation of a new experiment. Since there is no parent-child relationship to group the data together, custom naming and tagging are required to allow for searchability within the GUI and, more important for production-capable code, through the REST API. The overview of the logging mechanics for this alternative (and more scalable) implementation for this use case of thousands of models is shown in figure 8.3.
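
To make the idea of custom naming and tagging concrete, here is a minimal sketch. It is not the code from listing 7.10; the function names, the tag keys (`airport`, `series`, `model_type`, `run_date`), and the naming pattern are all assumptions made for illustration. The sketch uses only the standard library: it builds a run name and a tag dictionary of the kind that could be handed to `mlflow.start_run(run_name=..., tags=...)`, and a filter string in the `tags.<key> = '<value>'` form that MLflow's run search accepts.

```python
from datetime import date


def build_run_metadata(airport, series_name, run_date, model_type):
    """Return a (run_name, tags) pair for one model's tuning run.

    The naming pattern and tag keys are hypothetical conventions.
    """
    run_name = f"{airport}_{series_name}_{run_date.isoformat()}"
    tags = {
        "airport": airport,
        "series": series_name,
        "model_type": model_type,
        "run_date": run_date.isoformat(),
    }
    return run_name, tags


def build_filter(**tag_values):
    """Build a filter string such as: tags.airport = 'JFK' and tags.series = '...'."""
    clauses = [f"tags.{key} = '{value}'" for key, value in sorted(tag_values.items())]
    return " and ".join(clauses)
```

With a consistent scheme like this, finding every run for a single airport becomes one filtered query rather than a manual search through thousands of entries. Which tags are worth recording depends on how the runs will later be queried, so the set above should be treated as a starting point only.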

计算机代写|机器学习代写machine learning代考|Choosing the right tech for the platform and the team

Posted on June 21, 2023 by statistics-lab

计算机代写|机器学习代写machine learning代考|Choosing the right tech for the platform and the team

The forecasting scenario we’ve been walking through, when executed in a virtual machine (VM) container and running automated tuning optimization and forecasting for a single airport, worked quite well. We got fairly good results for each airport. By using Hyperopt, we also managed to eliminate the unmaintainable burden of manually tuning each model. While impressive, it doesn’t change the fact that we’re not looking to forecast passengers at just a single airport. We need to create forecasts for thousands of airports.

Figure 7.9 shows what we’ve built, in terms of wall-clock time, in our efforts thus far. The synchronous nature of each airport’s models (in a for loop) and Hyperopt’s Bayesian optimizer (also a serial loop) means that we’re waiting for models to be built one by one, each next step waiting on the previous to be completed, as we discussed in section 7.1.2.
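
A back-of-the-envelope calculation shows why nested serial loops dominate wall-clock time. The sketch below is an idealized estimate only: the function names are invented for illustration, and it ignores scheduling overhead, data transfer, and uneven model-fitting times.

```python
import math


def serial_wall_clock_hours(n_airports, n_trials, seconds_per_fit):
    """Every fit waits for the previous one, so the costs simply multiply."""
    return n_airports * n_trials * seconds_per_fit / 3600.0


def parallel_wall_clock_hours(n_airports, n_trials, seconds_per_fit, n_workers):
    """Idealized estimate when airports are spread evenly across n_workers."""
    airports_per_worker = math.ceil(n_airports / n_workers)
    return airports_per_worker * n_trials * seconds_per_fit / 3600.0
```

As a purely hypothetical input, 2,000 airports with 200 tuning trials each at 2 seconds per fit comes to roughly 222 hours of serial execution; spread evenly over 50 workers, the same idealized estimate drops to under 5 hours.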

This problem of ML at scale, as shown in this diagram, is a stumbling block for many teams, mostly because of complexity, time, and cost (and is one of the primary reasons why projects of this scale are frequently cancelled). Solutions exist for these scalability issues in ML project work; each involves stepping away from the realm of serial execution and moving into the world of distributed computing, asynchronous computing, or a mixture of both paradigms.

The standard structured code approach for most Python ML tasks is to execute in a serial fashion. Whether it be a list comprehension, a lambda, or a for (or while) loop, ML is steeped in sequential execution. This approach can be a benefit, as it reduces memory pressure for the many algorithms that have a high memory requirement, particularly those that use recursion. But this approach can also be a handicap, as it takes much longer to execute, since each subsequent task waits for the previous one to complete.
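
The difference between the two execution styles can be sketched with the standard library alone. In the example below, `tune_airport` is a hypothetical stand-in that merely sleeps; it is not the project's tuning code. Threads help here only because the simulated work is waiting rather than computing. CPU-bound model fitting in Python generally needs separate processes or a cluster, which is where the distributed approach comes in.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tune_airport(airport):
    """Stand-in for a tuning run; sleeps to simulate work that waits."""
    time.sleep(0.05)
    return airport, "done"


def run_serial(airports):
    """Each airport waits for the previous one to finish."""
    return [tune_airport(airport) for airport in airports]


def run_concurrent(airports, max_workers=4):
    """Airports are handled by a pool of workers; map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tune_airport, airports))
```

Both functions return the same results in the same order; only the elapsed time differs, because the concurrent version overlaps the waiting.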

We will discuss concurrency in ML briefly in section 7.4 and in more depth in later chapters (both safe and unsafe ways of doing it). For now, with the issue of scalability with respect to wall-clock time for our project, we need to look into a distributed approach to this problem in order to explore our search spaces faster for each airport. It is at this point that we stray from the world of our single-threaded VM approach and move into the distributed computing world of Apache Spark.

计算机代写|机器学习代写machine learning代考|Why Spark?

Why use Spark? In a word: speed.
For the problem that we’re dealing with here, forecasting each month the passenger expectations at each major airport in the United States, we’re not bound by SLAs that are measured in minutes or hours, but we still need to think about the amount of time it takes to run our forecasting. There are multiple reasons for this, chiefly:

Time: If we’re building this job as a monolithic modeling event, any failure in an extremely long-running job will require a restart (imagine the job failing when it was 99% complete, after running for 11 days straight).
Stability: We want to be very careful about object references within our job and ensure that we don’t create a memory leak that could cause the job to fail.
Risk: Keeping machines dedicated to extremely long-running jobs (even in cloud providers) risks platform issues that could bring down the job.
Cost: Regardless of where your virtual machines are running, someone is paying the bill for them.

When we focus on tackling these high-risk factors, distributed computing offers a compelling alternative to serial looped execution, not only because of cost, but mostly because of the speed of execution. Should any issues arise (problems in the job itself, unforeseen issues with the data, or problems with the underlying hardware that the VMs are running on), the dramatically reduced execution time of our forecasting job gives us the flexibility to get the job up and running again, with predicted values returning much faster.
A brief note on Spark
Spark is a large topic, a monumentally large ecosystem, and an actively contributed-to open source distributed computing platform based on the Java Virtual Machine (JVM). Because this isn’t a book about Spark per se, I won’t go too deep into the inner workings of it.

Several notable books have been written on the subject, and I recommend reading them if you are inclined to learn more about the technology: Learning Spark by Jules Damji et al. (O’Reilly, 2020), Spark: The Definitive Guide by Bill Chambers and Matei Zaharia (O’Reilly, 2018), and Spark in Action by Jean-Georges Perrin (Manning, 2020).

Suffice it to say, in this book, we will explore how to effectively utilize Spark to perform ML tasks. Many examples from this point forward are focused on leveraging the power of the platform to perform large-scale ML (both training and inference).

In the current section, the coverage of how Spark works for these examples is relatively high level; we focus instead on how we can use it to solve our problems.

计算机代写|机器学习代写machine learning代考|Running quick forecasting tests

Posted on June 21, 2023 by statistics-lab

计算机代写|机器学习代写machine learning代考|Running quick forecasting tests

The rapid testing phase is by far the most critical aspect of prototyping to get right. As mentioned in this chapter’s introduction, it is imperative to strive for the middle ground between not testing enough of the various approaches to determine the tuning sensitivity of each algorithm, and spending inordinate amounts of time building a full MVP solution for each approach. Since time is the most important aspect of this phase, we need to be efficient while making an informed decision about which approach shows the most promise in solving the problem in a robust manner.

Freshly armed with useful and standardized utility functions, each team can work on its respective approaches, rapidly testing to find the most promising model. The team has agreed that the airports under consideration for modeling tests are JFK, EWR, and LGA (each team needs to test its model and tuning paradigms on the same datasets so a fair evaluation of each approach can occur).
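
A fair comparison across teams also depends on everyone scoring forecasts the same way. The function below is a hypothetical example of such a shared utility (the name `score_forecast` and the choice of metrics are assumptions, not the book's code). It computes mean absolute error, root mean squared error, and mean absolute percentage error for one holdout period.

```python
import math


def score_forecast(actual, predicted):
    """Return MAE, RMSE, and MAPE (in percent) for one forecast."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mae": mae, "rmse": rmse, "mape": mape}
```

If each team calls the same function on the same holdout data for JFK, EWR, and LGA, differences in the scores reflect the modeling approach rather than differences in how the error was measured.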

Let’s take a look at what the teams will be doing with the different model approaches during rapid testing, what decisions will be made about the approaches, and how the teams can quickly pivot if they find that the approach is going nowhere. The exploratory phase is going to not only uncover nuances of each algorithm but also illuminate aspects of the project that might not have been realized during the preparatory phase (covered in chapter 5). It’s important to remember that this is to be expected and that during this rapid testing phase, the teams should be in frequent communication with one another when they discover these problems (see the following sidebar for tips on effectively managing these discoveries).

计算机代写|机器学习代写machine learning代考|WAIT A MINUTE . . . HOW ARE WE GOING TO CREATE A VALIDATION DATASET?

One group drew the proverbial short straw in the model-testing phase with a forecasting approach to research and test that isn’t particularly well understood by the team. Someone on the team found mention of using a VAR to model multiple time series together (multivariate endogenous series modeling), and thus, this group sets out to research what this algorithm is all about and how to use it.
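
For orientation, the standard textbook form of a vector autoregression of order $p$ is given below. This is the general definition, not anything specific to this project.

$$
\mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \dots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t
$$

Here $\mathbf{y}_t$ is a vector holding the value of each of the $k$ series at time $t$, $\mathbf{c}$ is a vector of intercepts, each $A_i$ is a $k \times k$ coefficient matrix, and $\boldsymbol{\varepsilon}_t$ is a vector of error terms. Every series is regressed on the lagged values of all the series, which is what "endogenous" refers to: each series helps to explain the others.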

The first thing that they do is run a search for “vector autoregression,” which results in a massive wall of formulaic theory analysis and mathematical proofs centered primarily around macro-econometrics research and natural sciences utilizations of the model. That’s interesting, but not particularly useful if they want to test out the applications of this model to the data quickly. They next find the statsmodels API documentation for the model.

The team members quickly realize that they haven’t thought about standardizing one common function yet: the split methodology. For most supervised ML problems, they’ve always used pandas split methodologies through DataFrame slicing or utilizing the high-level random split APIs, which use a random seed to select rows for training and test datasets. However, for forecasting, they realize that they haven’t had to do datetime splitting in quite some time and need a deterministic and chronological split method to get accurate forecast validation holdout data. Since the dataset has an index set from the ingestion function’s formatting of the DataFrame, they could probably craft a relatively simple splitting function based on the index position. What they come up with is in the following listing.
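
The listing itself is not reproduced in this excerpt. As a rough illustration of the idea, and not the team's actual function, a deterministic chronological split can be sketched with the standard library as follows. The name `chronological_split` and the `(date, value)` record layout are assumptions. For a pandas DataFrame already sorted by its date index, positional slicing with `df.iloc[:-n]` and `df.iloc[-n:]` would serve the same purpose.

```python
from datetime import date


def chronological_split(records, holdout_points):
    """Split (date, value) records into train and test sets by time.

    The most recent holdout_points records become the test set, so no
    future observation can leak into training.
    """
    if not 0 < holdout_points < len(records):
        raise ValueError("holdout_points must be between 1 and len(records) - 1")
    ordered = sorted(records, key=lambda record: record[0])
    return ordered[:-holdout_points], ordered[-holdout_points:]
```

Unlike a random split, this always returns the same partition for the same input, and every test observation is later than every training observation.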

计算机代写|机器学习代写machine learning代考|Performing data analysis

Posted on June 8, 2023 by statistics-lab

计算机代写|机器学习代写machine learning代考|Performing data analysis

In the course of researching possible solutions, a lot of people seem to find trend visualizations pretty helpful. Not only does this activity prepare baseline visualizations of the data for the broader business unit team that will be the consumers of the project solution, but it can also help minimize unforeseen issues with the data that might otherwise be uncovered much later in the project; these issues could require a complete rework of the solution (and potentially a cancellation of the project if the rework is too expensive from a time and resources perspective). To minimize the risk of finding out too late about a serious flaw in the data, we’re going to build a few analytics visualizations.

Based on the initial raw data visualization built in listing 5.1 (and shown in figure 5.3), we notice a great deal of noise in the dataset. That much noise can obscure the general trend line, so let’s start by applying a smoothing function to the raw data trend for the domestic passengers at JFK. The script that we’re going to be executing is in the following listing, utilizing basic matplotlib visualizations.

Running this code in our Jupyter notebook will generate the plot shown in figure 5.5. Note how the general trend of the data looks when smoothed and realize that a definite step function occurs around 2002. Also note that the stddev varies widely during different time periods. After 2008, the variance becomes much broader than it had been historically.
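
The listing referred to above is not reproduced in this excerpt. As a rough sketch of the kind of smoothing it describes, the function below computes a trailing rolling mean and rolling standard deviation with the standard library. The name `rolling_stats` and the trailing-window choice are assumptions. With pandas, `Series.rolling(window).mean()` and `Series.rolling(window).std()` produce the same kind of output.

```python
from statistics import mean, stdev


def rolling_stats(values, window):
    """Return trailing-window means and sample standard deviations.

    Positions before the first full window are filled with None.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    means, stds = [], []
    for i in range(len(values)):
        if i + 1 < window:
            means.append(None)
            stds.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            means.append(mean(chunk))
            stds.append(stdev(chunk))
    return means, stds
```

Plotting the rolling mean against the raw series is what exposes features such as a step change, and the rolling standard deviation shows where the variability of the series widens.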

The trend is OK, and somewhat useful for understanding the potential problems that might arise from building training and validation datasets that don’t reflect the trend change. (Specifically, we can see what might happen if we train up to the year 2000 and expect that a model will accurately predict from 2000 to 2015.)

During the research and planning phase, however, we found a great many mentions of stationarity in time series and how certain model types can really struggle with predicting a nonstationary trend. We should take a look at what that is all about.

For this, we’re going to use an augmented Dickey-Fuller stationarity test, provided in the statsmodels module. This test will inform us of whether we need to provide stationarity adjustments to the time series for particular models that are incapable of handling nonstationary data. If the test comes back with a value indicating that the time series is stationary, essentially all models can use the raw data with no transformations applied to it. However, if the data is nonstationary, extra work will be required. The script to run this test for the JFK domestic passengers series is shown next.
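
The script itself is not reproduced in this excerpt. For reference, the regression behind the augmented Dickey-Fuller test, in its general textbook form with a constant and a linear trend, is:

$$
\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \, \Delta y_{t-i} + \varepsilon_t
$$

The null hypothesis is $\gamma = 0$ (the series has a unit root and is nonstationary), and the alternative is $\gamma < 0$. A small p-value is therefore evidence against the unit root, that is, evidence that the series is stationary; a large p-value means nonstationarity cannot be ruled out. In statsmodels, the test is available as `adfuller` in `statsmodels.tsa.stattools`, which returns the test statistic, the p-value, and critical values, among other outputs.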

计算机代写|机器学习代写machine learning代考|HOW CLEAN IS OUR DATA?

Data cleanliness issues are one of the prime reasons for an MVP extending much longer than was promised to a business. Identifying bad data points is crucial not only for the purposes of modeling training effectiveness, but also to help tell a story to the business about why certain outputs of the model might be less than accurate at times. Building a series of visualizations that can communicate the complexities of latent factors, data-quality issues, and other unforeseen elements that can affect the solution can serve as a powerful tool during discussions with the project’s business unit.

One of the most important points that we’ll have to explain about the forecasting from this project is that it will not, and cannot, be an infallible system. Many unknowns remain in our dataset: elements that influence the trend but are too complex to track, too expensive to model, or nearly impossible to predict, and that would need to feed into the algorithm. For the case of univariate time-series models, nothing is going into the model other than the trending data itself. In the case of more complex implementations, such as windowed approaches and deep learning models like long short-term memory (LSTM) recurrent neural networks (RNNs), even though we can create vectors that contain much more information, we don’t always have the capability or the time to collate all of the features that could influence the trend.

To aid in having this conversation, we can take a look at a simple method of identifying outlier values that are dramatically different from what we would otherwise expect from a seasonally influenced trend. A relatively easy way to do this with series data is to use a differencing function on the sorted data. This can be accomplished as shown in the following listing.
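
The listing is not reproduced in this excerpt. One possible version of the idea, offered as a sketch under stated assumptions rather than as the book's method, differences each value against the value `lag` periods earlier and flags the points whose difference is unusually far from the typical difference. The function name, the default lag of 12 (for monthly data with yearly seasonality), and the sigma threshold are all assumptions.

```python
from statistics import mean, stdev


def flag_outliers_by_differencing(values, lag=12, max_sigma=3.0):
    """Return indices whose lagged difference is far from the typical one.

    values must already be sorted chronologically.
    """
    if len(values) < lag + 2:
        raise ValueError("need at least lag + 2 values")
    indices = range(lag, len(values))
    diffs = [values[i] - values[i - lag] for i in indices]
    center, spread = mean(diffs), stdev(diffs)
    if spread == 0:
        return []
    return [i for i, d in zip(indices, diffs) if abs(d - center) > max_sigma * spread]
```

Note that a single anomalous point is usually flagged twice: once when it is compared with the earlier value, and again when a later value is compared with it. Large outliers also inflate the standard deviation used for the threshold, so the value of `max_sigma` needs to be chosen with the data in view.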

计算机代写|机器学习代写machine learning代考|Perform basic research and planning

Posted on June 8, 2023 by statistics-lab

计算机代写|机器学习代写machine learning代考|Perform basic research and planning

The first thing that the team members are going to do, once they get back to their desks after the planning meeting, is look at the data available. Since we’re a peanut manufacturer, and not in any partnership with major airlines, we’re not going to get ticket sales forecasting data. We certainly don’t have time to build web scrapers to attempt to see flight capacity for each airport (nor would anyone want to do this who has ever attempted to build a scraper before). What we do have available, though, is historic passenger capacity that the airport transit authorities provide freely.

We know from figure 5.2 that one of the first actions that we should be doing to understand the nature of the data is to visualize it and run a few statistical analyses we have available. Most people would simply load the data into their local computer’s environment and begin working in a notebook.

This is a recipe for disaster, though. A default Python environment that is running on the main operating system of your primary computer is anything but pristine. To minimize the amount of time wasted on struggling with a development environment (and help prepare for a smooth transition to the development phase later), we need to create a clean environment for our testing. For guidance on getting started with Docker and Anaconda to create a development environment for the code listings in this chapter and all subsequent chapters, see appendix B at the end of this book.

Now that we have an isolated environment (with persistence of the notebook storage location on the container mapped to a local filesystem location), we can get the sample data into this location and create a new notebook for experimentation.

计算机代写|机器学习代写machine learning代考|RESEARCH PHASE

Now that we know some of the concerns with the data (it’s highly seasonal, with trends influenced by latent factors that are wholly unknown to us), we can start researching. Let’s pretend for a moment that no one on the team has ever done time-series forecasting. Where, without the benefit of expert knowledge on the team, should research begin?

Internet searches are a great place to start, but most search results show blog posts of people offering forecasting solutions that involve a great deal of hand-waving and glossing over of the complexities involved in building out a full solution. Whitepapers can be informative but generally don’t focus on the applications of the algorithms that they’re covering. Lastly, script examples from Getting Started guides for different APIs are wonderful for seeing the mechanics of the API signature but are intentionally simplistic to serve as nothing more than a basic starting point, as the name indicates.

So, what should we be looking at to figure out how to predict future months of passenger demand at airports? The short answer is books. Quite a few great ones exist on time-series forecasting. In-depth blogs can help as well, but they should be used exclusively as an initial approach to the problem at hand, rather than as a repository from which to directly copy code.
NOTE The seminal work Time Series Analysis by G. E. P. Box and G. M. Jenkins (Holden-Day, 1970) is widely considered the foundation of all modern time-series forecasting models. The Box-Jenkins methodologies are the basis for nearly all forecasting implementations today.
