Machine learning | COMP5328

Posted on August 11, 2023 (updated August 28, 2023) by statistics-lab

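Machine learning is exciting. It is fun, challenging, creative, and intellectually stimulating. It also makes money for companies, autonomously handles large volumes of tasks, and removes the burden of monotonous work from people who would rather be doing something else.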

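Machine learning is also extremely complex. With thousands of algorithms, hundreds of open source packages, and the need for practitioners to have skills ranging from data engineering (DE) to advanced statistical analysis and visualization, the work required of a professional ML practitioner is genuinely daunting. Adding to this complexity is the need to work cross-functionally with a broad range of specialists, subject-matter experts (SMEs), and business unit groups, communicating and collaborating on the nature of the problem being solved and the output of the ML-backed solution.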

Biased testing

Internal testing is easy. Well, easier than the alternatives. It’s painless (if the model works properly). It’s what we typically think of when we’re qualifying the results of a project. The process typically involves the following:

Generating predictions on new (unseen to the modeling process) data
Analyzing the distribution and statistical properties of the new predictions
Taking random samples of predictions and making qualitative judgments of them
Running handcrafted sample data (or their own accounts, if applicable) through the model

The first two elements in this list are valid for qualification of model effectiveness. They are wholly void of bias and should be done. The latter two, on the other hand, are dangerous. The final one is the more dangerous of them.
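The second check in the list above, analyzing the distribution and statistical properties of new predictions, could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the two score lists are made-up values, and `summarize` and `mean_shift_in_stdevs` are hypothetical helper names rather than functions from any library.

```python
import statistics


def summarize(predictions):
    """Return basic distribution statistics for a list of numeric predictions."""
    ordered = sorted(predictions)
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


def mean_shift_in_stdevs(reference, new):
    """Distance of the new mean from the reference mean, in reference stdevs."""
    ref = summarize(reference)
    return (statistics.fmean(new) - ref["mean"]) / ref["stdev"]


# Hypothetical scores: predictions seen during validation vs. predictions
# generated on data the modeling process never saw.
validation_scores = [0.42, 0.51, 0.48, 0.55, 0.47, 0.50, 0.53, 0.45]
unseen_scores = [0.44, 0.49, 0.52, 0.46, 0.54, 0.48, 0.50, 0.47]

shift = mean_shift_in_stdevs(validation_scores, unseen_scores)
print(summarize(unseen_scores))
print(f"mean shift: {shift:+.2f} reference standard deviations")
```

A large shift between the two summaries would be a reason to look closer before anyone starts judging individual predictions by eye.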

In our music playlist generator system scenario, let’s say that the DS team members are all fans of classical music. Throughout their qualitative verifications, they’ve been checking to see the relative quality of the playlist generator for the field of music that they are most familiar with: classical music. To perform these validations, they’ve been generating listening history of their favorite pieces, adjusting the implementation to fine-tune the results, and iterating on the validation process.

When they are fully satisfied that the solution captures thematically and tonally similar pieces of music with a nearly uncanny level of sophistication, they ask a colleague what they think. The results for the DS team (Ben and Julie) as well as for their data warehouse engineer friend Connor are shown in figure 15.10.

Dogfooding

A far more thorough approach than Ben and Julie’s first attempt would have been to canvass people at the company. Instead of keeping the evaluation internal to the team, where a limited exposure to genres hampers their ability to qualitatively measure the effectiveness of the project, they could ask for help. They could ask around and see if people at the company might be interested in taking a look at how their own accounts and usage would be impacted by the changes the DS team is introducing. Figure 15.11 illustrates how this could work for this scenario.

Dogfooding, in the broadest sense, is consuming the results of your own product. The term refers to opening up functionality that is being developed so that everyone at a company can use it, find out how to break it, provide feedback on how it’s broken, and collectively work toward building a better product. All of this happens across a broad range of perspectives, drawing on the experience and knowledge of many employees from all departments.

However, as you can see in figure 15.11, the evaluation still contains bias. An internal user who uses the company’s product is likely not a typical user. Depending on their job function, they may be using their account to validate functionality in the product, use it for demonstrations, or simply interact with the product more because of an employee benefit associated with it.

In addition to the potentially spurious information contained within the listening history of employees, the other form of bias is that people like what they like. They also don’t like what they don’t like. Subjective responses to something as emotionally charged as music preferences add an incredible amount of bias due to the nature of being a member of the human race. Knowing that these predictions are based on their listening history and that it is their own company’s product, internal users evaluating their own profiles will generally be more critical than a typical user if they find something that they don’t like (in stark contrast to the builder bias that the DS team would experience).

While dogfooding is certainly preferable to evaluating a solution’s quality within the confines of the DS team, it’s still not ideal, mostly because of these inherent biases that exist.

Machine learning | COMP5318

Posted on August 11, 2023 (updated August 28, 2023) by statistics-lab

Process over technology

The success of a feature store implementation is not in the specific technology used to implement it. The benefit is in the actions it enables a company to take with its calculated and standardized feature data.

Let’s briefly examine an ideal process for a company that needs to update the definition of its revenue metric. For such a broadly defined term, the concept of revenue at a company can be interpreted in many ways, depending on the end-use case, the department concerned with the usage of that data, and the level of accounting standards applied to the definition for those use cases.

A marketing group, for instance, may be interested in gross revenue for measuring the success rate of advertising campaigns. The DE group may define multiple variations of revenue to handle the needs of different groups within the company. The DS team may be looking at a windowed aggregation of any column in the data warehouse that has the words “sales,” “revenue,” or “cost” in it to create feature data. The BI team might have a more sophisticated set of definitions that appeal to a broader set of analytics use cases.

Changing the logic behind such a key business metric can have far-reaching impacts on an organization if every group is responsible for its own definition. The likelihood of each group changing its references in every query, code base, report, and model that it is responsible for is marginal. Fragmenting the definition of such an important metric across departments is problematic enough on its own. Creating multiple versions of the defining characteristics within each group is a recipe for complete chaos. With no established standard for how key business metrics are defined, groups within a company are effectively no longer speaking on even terms when evaluating the results and outputs from one another.

Regardless of the technology stack used to store the data for consumption, having a process built around change management for critical features can guarantee a frictionless and resilient data migration. Figure 15.4 illustrates such a process.
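One way to picture a shared, change-managed metric definition is a small registry in which every update to a metric becomes a new version instead of a silent edit. The sketch below is only an illustration of that idea: `MetricRegistry`, `MetricVersion`, the `revenue` definitions, and the order fields are all hypothetical, and a real feature store would add review, ownership, and persistence on top.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MetricVersion:
    version: int
    description: str
    compute: Callable[[dict], float]


class MetricRegistry:
    """Keeps every definition of a metric so consumers can pin or upgrade."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[MetricVersion]] = {}

    def register(self, name: str, description: str,
                 compute: Callable[[dict], float]) -> MetricVersion:
        # Each registration appends a new version; nothing is overwritten.
        history = self._versions.setdefault(name, [])
        entry = MetricVersion(len(history) + 1, description, compute)
        history.append(entry)
        return entry

    def latest(self, name: str) -> MetricVersion:
        return self._versions[name][-1]

    def get(self, name: str, version: int) -> MetricVersion:
        return self._versions[name][version - 1]


registry = MetricRegistry()
registry.register("revenue", "gross sales", lambda o: o["gross_sales"])
registry.register("revenue", "gross sales net of refunds",
                  lambda o: o["gross_sales"] - o["refunds"])

order = {"gross_sales": 120.0, "refunds": 20.0}
print(registry.latest("revenue").compute(order))   # current definition
print(registry.get("revenue", 1).compute(order))   # pinned older definition
```

Because every team reads the metric from the same place, a definition change is visible to all consumers at once, and anyone who still needs the old logic has to pin it explicitly.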

The dangers of a data silo

Data silos are deceptively dangerous. Isolating data in a walled-off, private location that is accessible only to a certain select group of individuals stifles the productivity of other teams, causes a large amount of duplicated effort throughout an organization, and frequently (in my experience of seeing them, at least) leads to esoteric data definitions that, in their isolation, depart wildly from the generally accepted view of a metric for the rest of the company.

It may seem like a really great thing when an ML team is granted a database of its own or an entire cloud object store bucket to empower the team to be self-service. The seemingly geologically scaled time spent for the DE or warehousing team to load required datasets disappears. The team members are fully masters of their domain, able to load, consume, and generate data with impunity. This can definitely be a good thing, provided that clear and soundly defined processes govern the management of this technology.

But clean or dirty, an internal-use-only data storage stack is a silo, the contents squirreled away from the outside world. These silos can generate more problems than they solve.

To show how a data silo can be disadvantageous, let’s imagine that we work at a company that builds dog parks. Our latest ML project is a bit of a moon shot, working with counterfactual simulations (causal modeling) to determine which amenities would be most valuable to our customers at different proposed construction sites. The goal is to figure out how to maximize the perceived quality and value of the proposed parks while minimizing our company’s investment costs.

To build such a solution, we have to get data on all of the registered dog parks in the country. We also need demographic data associated with the localities of these dog parks. Since the company’s data lake contains no data sources that have this information, we have to source it ourselves. Naturally, we put all of this information in our own environment, thinking it will be far faster than waiting for the DE team’s backlog to clear enough to get around to working on it.

After a few months, questions begin to arise about some of the contracts that the company has bid on in certain locales. The business operations team is curious about why so many custom paw-activated watering fountains are being ordered as part of some of these construction inventories. As the analysts begin to dig into the data available in the data lake, they can’t make sense of why the recommendations for certain contracts consistently include these incredibly expensive components.

Machine learning | STAT3888

Posted on August 7, 2023 by statistics-lab

Model interpretability

Let’s suppose that we’re working on a problem designed to control forest fires. The organization that we work for can stage equipment, personnel, and services to locations within a large national park system in order to mitigate the chances of wildfires growing out of control. To make logistics as efficient as possible, we’ve been tasked with building a solution that can identify risks of fire outbreaks by grid coordinates. We have several years of data, sensor data from each location, and a history of fire-burn area for each grid position.

After building the model and providing the predictions as a service to the logistics team, questions arise about the model’s predictions. The logistics team members notice that certain predictions don’t align with their tribal knowledge of having dealt with fire seasons, voicing concerns about addressing predicted calamities with the feature data that they’re exposed to.

They’ve begun to doubt the solution. They’re asking questions. They’re convinced that something strange is going on and they’d like to know why their services and personnel are being told to cover a grid coordinate in a month that, as far as they can remember, has never had a fire break out.

How can we tackle this situation? How can we run simulations of our feature vector for the prediction through our model and tell them conclusively why the model predicted what it did? Specifically, how can we implement explainable artificial intelligence (XAI) on our model with the minimum amount of effort?

When planning out a project, particularly for a business-critical use case, a frequently overlooked aspect is model explainability. Some industries and companies are the exception to this rule, because of either legal requirements or corporate policies, but for most groups that I’ve interacted with, interpretability is an afterthought.

I understand the reticence that most teams have in considering tacking on XAI functionality to a project. During the course of EDA, model tuning, and QA validation, the DS team generally understands the behavior of the model quite well. Implementing XAI may seem redundant.

By the time you need to explain how or why a model predicted what it did, you’re generally in a panic situation that is already time-constrained. By implementing XAI processes with straightforward open source packages, this panicked and chaotic scramble to explain the functionality of a solution can be avoided.

Shapley additive explanations

One of the more well-known and thoroughly proven XAI implementations for Python is the shap package, written and maintained by Scott Lundberg. This implementation is fully documented in detail in the 2017 NeurIPS paper “A Unified Approach to Interpreting Model Predictions” by Lundberg and Su-In Lee.

At the core of the algorithm is game theory. Essentially, when we’re thinking of features that go into a training dataset, what is the effect on the model’s predictions for each feature? As with players in a team sport, if a match is the model itself and the features involved in training are the players, what is the effect on the match if one player is substituted for another? How one player’s influence changes the outcome of the game is the basic question that shap is attempting to answer.
FOUNDATION
The principle behind shap involves estimating the contribution of each feature from the training dataset upon the model. According to the original paper, calculating the true contribution (the exact Shapley value) requires evaluating all permutations for each row of the dataset for inclusion and exclusion of the source row’s feature, creating different coalitions of feature groupings.

For instance, if we have three features ($a$, $b$, and $c$; original features denoted with the subscript $i$), with replacement features from the dataset denoted with the subscript $j$ (for example, $a_j$), the coalitions to test for evaluating feature $b$ are as follows:
$$
\left(a_i, b_i, c_j\right),\left(a_i, b_j, c_j\right),\left(a_i, b_j, c_i\right),\left(a_j, b_i, c_j\right),\left(a_j, b_j, c_i\right)
$$
These coalitions of features are run through the model to retrieve a prediction. The resulting prediction is then differenced from the original row’s prediction (and an absolute value taken of the difference). This process is repeated for each feature, resulting in a feature-value contribution score when a weighted average is applied to each delta grouping per feature.

It should come as no surprise that this isn’t a very scalable solution. As the feature count increases and the training dataset’s row count increases, the computational complexity of this approach quickly becomes untenable. Thankfully, another solution is far more scalable: the approximate Shapley estimation.
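To make the cost of the exact calculation concrete, here is a minimal sketch of the textbook Shapley formula for a single row, assuming a single background row supplies the replacement values. It is not the shap package's implementation, and it uses signed differences rather than the absolute differences mentioned above; `shapley_values`, `toy_model`, and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, row, background):
    """Exact Shapley values for one row, using one background row.

    For feature i:
    phi_i = sum over S subset of N minus {i} of
            |S|! (n - |S| - 1)! / n! * (v(S + {i}) - v(S))
    where v(S) is the prediction with features in S taken from `row`
    and all remaining features taken from `background`.
    """
    n = len(row)

    def value(coalition):
        mixed = [row[k] if k in coalition else background[k] for k in range(n)]
        return model(mixed)

    contributions = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        phi = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = set(subset)
                phi += weight * (value(without_i | {i}) - value(without_i))
        contributions.append(phi)
    return contributions


def toy_model(features):
    # Hypothetical model with an interaction between features a and c.
    a, b, c = features
    return 2.0 * a + 3.0 * b + a * c


row = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, row, background)
print(phi)
print(sum(phi), toy_model(row) - toy_model(background))
```

The contributions sum to the difference between the row's prediction and the background prediction, and the interaction term is split evenly between `a` and `c`. Each feature requires evaluating every subset of the other features, so the number of model calls grows exponentially with the feature count, which is why the exact approach stops being practical so quickly.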

Machine learning | COMP4318

Posted on August 7, 2023 by statistics-lab

Leaning heavily on prior art

We could use nearly any of the comical examples from table 15.1 to illustrate the first rule in creating fallback plans. Instead, let’s use an actual example from my own personal history.

I once worked on a project that had to deal with a manufacturing recipe. The goal of this recipe was to set a rotation speed on a ludicrously expensive piece of equipment while a material was dripped onto it. The speed of this unit needed to be adjusted periodically throughout the day as the temperature and humidity changed the viscosity of the material being dripped onto the product. Keeping this piece of equipment running optimally was my job; there were many dozens of these stations in the machine and many types of chemicals.

As has happened so many times in my career, I got really tired of doing a repetitive task. I figured there had to be some way to automate the spin speed of these units so I wouldn’t have to stand at the control station and adjust them every hour or so. Thinking myself rather clever, I wired up a few sensors to a microcontroller, programmed the programmable logic controller to receive the inputs from my little controller, wrote a simple program that would adjust the chuck speed according to the temperature and humidity in the room, and activated the system.

Everything went well, I thought, for the first few hours. I had programmed a simple regression formula into the microcontroller, checked my math, and even tested it on an otherwise broken piece of equipment. It all seemed pretty solid.

It wasn’t until around 3 a.m. that my pager (yes, it was that long ago) started going off. By the time I made it to the factory 20 minutes later, I realized that I had caused an overspeed condition in every single spin chuck system. They stopped. The rest of the liquid dosing system did not. As the chilly breeze struck the back of my head, and I looked out at the open bay doors letting in the $27^{\circ} \mathrm{F}$ night air, I realized my error.
I didn’t have a fallback condition. The regression line, taking in the ambient temperature, tried to compensate for the untested range of data (the viscosity curve wasn’t actually linear at that range), and took a chuck that normally rotated at around 2,800 RPM and tried to instruct it to spin at 15,000 RPM.
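The missing fallback condition can be illustrated with a small guardrail around a regression formula. This is a sketch under stated assumptions, not the original controller code: only the 2,800 RPM nominal speed and the 27 °F reading come from the story, while the slope, intercept, tested temperature range, and safe speed limits are made-up numbers, and `SpinSpeedPolicy` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinSpeedPolicy:
    slope_rpm_per_degree: float   # hypothetical regression coefficient
    intercept_rpm: float          # hypothetical regression intercept
    min_tested_temp_f: float      # lowest temperature the fit was validated on
    max_tested_temp_f: float      # highest temperature the fit was validated on
    min_safe_rpm: float
    max_safe_rpm: float
    fallback_rpm: float           # known-good speed used when the model is out of range

    def recommend(self, temperature_f: float) -> float:
        # Outside the validated input range, do not trust the regression at all.
        if not (self.min_tested_temp_f <= temperature_f <= self.max_tested_temp_f):
            return self.fallback_rpm
        predicted = self.slope_rpm_per_degree * temperature_f + self.intercept_rpm
        # Inside the range, still clamp the output to the equipment's safe limits.
        return min(max(predicted, self.min_safe_rpm), self.max_safe_rpm)


policy = SpinSpeedPolicy(
    slope_rpm_per_degree=-50.0,
    intercept_rpm=6300.0,
    min_tested_temp_f=60.0,
    max_tested_temp_f=80.0,
    min_safe_rpm=2400.0,
    max_safe_rpm=3200.0,
    fallback_rpm=2800.0,
)

print(policy.recommend(70.0))  # inside the tested range: regression output
print(policy.recommend(27.0))  # open bay doors at night: fallback speed
```

The two checks cover different failures: the input check refuses to extrapolate, and the output clamp limits the damage if the formula misbehaves inside the range it was tested on.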

I spent the next four days and three nights cleaning up lacquer from the inside of that machine. By the time I was finished, the lead engineer took me aside and handed me a massive three-ring binder and told me to “read it before playing any more games.” (I’m paraphrasing. I can’t put into print what he said to me.) The book was filled with the materials science analysis of each chemical that the machine was using. It had the exact viscosity curves that I could have used. It had information on maximum spin speeds for deposition.

Cold-start woes

For certain types of ML projects, model prediction failures are not only frequent, but also expected. For solutions that require a historical context of existing data to function properly, the absence of historical data prevents the model from making a prediction. The data simply isn’t available to pass through the model. Known as the cold-start problem, this is a critical aspect of solution design and architecture for any project dealing with temporally associated data.

As an example, let’s imagine that we run a dog-grooming business. Our fleets of mobile bathing stations scour the suburbs of North America, offering all manner of services to dogs at their homes. Appointments and service selection are handled through an app interface. When booking a visit, the clients select from hundreds of options and prepay for the services through the app no later than a day before the visit.

To increase our customers’ satisfaction (and increase our revenue), we employ a service recommendation interface on the app. This model queries the customer’s historical visits, finds products that might be relevant for them, and indicates additional services that the dog might enjoy. For this recommender to function correctly, the customer’s service history needs to be present during service selection.

This isn’t much of a stretch for anyone to conceptualize. A model without data to process isn’t particularly useful. With no history available, the model clearly has no data from which to infer additional services that could be recommended for bundling into the appointment.

What’s needed to serve something to the end user is a cold-start solution. An easy implementation for this use case is to generate a collection of the most frequently ordered services globally. If the model doesn’t have enough data to provide a prediction, this popularity-based services aggregation can be served in its place. At that point, the app IFrame element will at least have something in it (instead of showing an empty collection) and the user experience won’t be broken by seeing an empty box.
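The popularity-based fallback described above might look something like the following sketch. The order data, the service names, the `min_history` threshold, and the stand-in `model` callable are all hypothetical; the point is only the branching between the recommender and the global top-services list.

```python
from collections import Counter


def most_popular_services(all_orders, top_n=3):
    """Return the top_n most frequently ordered services across all customers."""
    counts = Counter(service for order in all_orders for service in order)
    return [service for service, _ in counts.most_common(top_n)]


def recommend_services(customer_history, model, popular_fallback, min_history=1):
    """Use the model when there is enough history, otherwise the popular list."""
    if len(customer_history) < min_history:
        return popular_fallback
    return model(customer_history)


# Hypothetical historical orders, one list of services per appointment.
all_orders = [
    ["bath", "nail trim"],
    ["bath", "teeth cleaning"],
    ["bath", "nail trim", "deshedding"],
]
popular = most_popular_services(all_orders, top_n=2)


def stub_model(history):
    # Stand-in for the real recommender; always suggests the same add-on.
    return ["deshedding"]


print(recommend_services([], stub_model, popular))                 # new customer
print(recommend_services([["bath"]], stub_model, popular))         # returning customer
```

Computing the popular list ahead of time, for example on a schedule, keeps the fallback path cheap at request time.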

Machine learning | Clarifying correlation vs. causation

Posted on July 7, 2023 by statistics-lab

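Machine learning is a field of study devoted to understanding and building methods that “learn”, that is, methods that use data to improve performance on some set of tasks. Machine learning algorithms build a model based on sample data (known as training data) in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are widely used in applications such as medicine, email filtering, speech recognition, and computer vision, where developing conventional algorithms to perform the required tasks is difficult or infeasible.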

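Machine learning programs can perform tasks without being explicitly programmed to do so. The approach involves computers learning from provided data so that they can carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms that tell the machine how to execute every step needed to solve the problem at hand; on the computer’s part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm rather than having human programmers specify every needed step.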

Clarifying correlation vs. causation

An important part of presenting model results to a business unit is to be clear about the differences between correlation and causation. If there is even a slight chance of business leaders inferring a causal relationship from anything that you are showing them, it’s best to have this chat.

Correlation is simply the relationship or association that observed variables have to one another. It does not imply any meaning apart from the existence of this relationship. This concept is inherently counterintuitive to laypersons who are not involved in analyzing data. Making reductionist conclusions that “seem to make sense” about the data relationships in an analysis is effectively how our brains are wired.

For example, we could collect sales data for ice cream trucks and sales of mittens, both aggregated by week of year and country. We could calculate a strong negative correlation between the two (ice cream sales go up as mitten sales decrease, and vice versa). Most people would chuckle at a conclusion of causality: “Well, if we want to sell more ice cream, we need to reduce our supply of mittens!”
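A minimal sketch of that correlation calculation follows, assuming made-up weekly sales figures. `pearson_r` is a hypothetical helper that implements the standard Pearson coefficient by hand so that the example needs no third-party packages.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical weekly unit sales for the same eight weeks.
ice_cream_sales = [120, 150, 310, 480, 520, 300, 140, 110]
mitten_sales = [400, 350, 160, 40, 30, 170, 360, 420]

r = pearson_r(ice_cream_sales, mitten_sales)
print(f"correlation: {r:.3f}")
```

The coefficient comes out strongly negative for these numbers, but it only describes how the two series move together. Nothing in the calculation says why either product sells.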

What a layperson might instantly state from such a silly example is, “Well, people buy mittens when it’s cold and ice cream when it’s hot.” This is an attempt at defining causation. Based on this negative correlation in the observed data, we definitely can’t make such an inference regarding causation. We have no way of knowing what actually influenced the effect of purchasing ice cream or mittens on an individual basis (per observation).

If we were to introduce an additional confounding variable to this analysis (outside temperature), we might find additional confirmation of our spurious conclusion. However, this ignores the complexity of what drives decisions to purchase. As an example, see figure 11.7.

It’s clear that a relationship is present. As temperature increases, ice cream sales increase as well. The relationship being exhibited is fairly strong. But can we infer anything other than the fact that there is a relationship?

Let’s look at another plot. Figure 11.8 shows an additional observational data point that we could put into a model to aid in predicting whether someone might want to buy our ice cream.

Leveraging A/B testing for attribution calculations

In the previous section, we established the importance of attribution measurement. For our ice cream coupon model, we defined a methodology to split our customer base into different cohort segments to minimize latent variable influence. We’ve defined why it’s so critical to evaluate the success criteria of our implementation based on business metrics associated with what we’re trying to improve (our revenue).

Armed with this understanding, how do we go about calculating the impact? How can we make an adjudication that is mathematically sound and provides an irrefutable assessment of something as complex as a model’s impact on the business?
A/B testing 101
Now that we have defined our cohorts by using a simple percentile-based RFM segmentation (the three groups that we assigned to customers in section 11.1.1), we’re ready to conduct random stratified sampling of our customers to determine which coupon experience they will get.

The control group will be getting the pre-ML treatment of a generic coupon being sent to their inbox on Mondays at 8 a.m. PST. The test group will be getting the targeted content and delivery timing.
NOTE Although simultaneously releasing multiple elements of a project that are all significant departures from the control conditions may seem counterintuitive for hypothesis testing (and it does confound any causal relationship), most companies are (wisely) willing to forgo scientific accuracy of evaluations in the interest of getting a solution out into the world as soon as possible. If you’re ever faced with this supposed violation of statistical standards, my best advice is this: keep patiently quiet and realize that you can run variation tests later, changing aspects of the implementation in further A/B tests to determine the causal impact of the different aspects of your solution. When it’s time to release a solution, it’s often much more worthwhile to release the best possible solution first and then analyze components later.
Within a short period after production release, people typically want to see plots illustrating the impact as soon as the data starts rolling in. Many line charts will be created, aggregating business parameter results based on the control and test group. Before letting everyone go hog wild with making fancy charts, a few critical aspects of the hypothesis test need to be defined to make it a successful adjudication.
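The random stratified sampling described in this section can be sketched as follows. This is a minimal illustration under assumptions, not the book's implementation: the function name, the cohort labels, the customer IDs, and the 50/50 split are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(customers, test_fraction=0.5, seed=11):
    """Randomly assign customers to control or test within each cohort.

    `customers` is a list of (customer_id, cohort) pairs. Shuffling inside
    each cohort keeps the test/control ratio the same for every segment.
    """
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for customer_id, cohort in customers:
        by_cohort[cohort].append(customer_id)

    assignment = {}
    for cohort, ids in by_cohort.items():
        shuffled = ids[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        for position, customer_id in enumerate(shuffled):
            assignment[customer_id] = "test" if position < n_test else "control"
    return assignment


# Hypothetical customer base: three RFM cohorts of unequal size.
customers = (
    [(f"h{i}", "high") for i in range(20)]
    + [(f"m{i}", "medium") for i in range(60)]
    + [(f"l{i}", "low") for i in range(120)]
)
groups = stratified_assignment(customers)

for prefix, cohort in (("h", "high"), ("m", "medium"), ("l", "low")):
    in_test = sum(
        1 for cid, grp in groups.items() if cid.startswith(prefix) and grp == "test"
    )
    print(cohort, "customers in test group:", in_test)
```

Assigning within each cohort, rather than over the whole customer list, keeps a small but valuable segment from landing mostly in one group by chance.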



Use of global mutable objects

Posted on July 7, 2023 by statistics-lab

Continuing our exploration of our new team’s existing code base, we’re tackling another new feature to be added. This one adds completely new functionality. In the process of developing it, we realize that a large portion of the necessary logic for our branch already exists and we simply need to reuse a few methods and a function. What we fail to see is that the function uses a declaration of a globally scoped variable. When running our tests for our branch in isolation (through unit tests), everything works exactly as intended. However, the integration test of the entire code base produces a nonsensical result.

After hours of searching through the code, walking through debugging traces, we find that the state of the function that we were using actually changed from its first usage, and the global variable that the function was using actually changed, rendering our second use of it completely incorrect. We were burned by mutation.
How mutability can burn you
Recognizing how dangerous mutability is can be a bit tricky. Overuse of mutating values, shifting state, and overwriting of data can take many forms, but the end result is typically the same: an incredibly complicated series of bugs. These bugs can manifest themselves in different ways: Heisenbugs seemingly disappear when you’re trying to investigate them, and Mandelbugs are so complex and nondeterministic that they seem to be as complex as a fractal. Refactoring code bases that are riddled with mutation is nontrivial, and many times it’s simply easier to start over from scratch to fix the design flaws.
Issues with mutation and side effects typically don’t rear their heads until long after the initial MVP of a project. Later in the development process, or after a production release, flawed code bases that rely on mutability and side effects start to come apart at the seams. Figure 10.3 shows an example of the nuances between different languages and their execution environments and why mutability concerns might not be as apparent, depending on which languages you’re familiar with.

For simplicity’s sake, let’s say that we’re trying to keep track of some fields to include in separate vectors used in an ensemble modeling problem. The following listing shows a simple function that contains a default value within the function signature’s parameters which, when used a single time, will provide the expected functionality.
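The listing itself is not reproduced in this excerpt. As a stand-in, here is a minimal sketch of the kind of function described; the name `add_field` and the field names are hypothetical. The default list in the signature is created once, when the function is defined, so every call that omits the argument shares it.

```python
def add_field(field, fields=[]):  # the default list is built once, at definition time
    """Append a field name to a list of fields and return that list."""
    fields.append(field)
    return fields


first_vector = add_field("temperature")
print(first_vector)  # ['temperature'], as expected for a single use

second_vector = add_field("humidity")
print(second_vector)  # ['temperature', 'humidity'], the first call leaked in
print(first_vector is second_vector)  # True: both names point at the shared default
```

A unit test that calls the function once passes, which is why this kind of bug tends to surface only when the function is reused elsewhere in the code base.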

Encapsulation to prevent mutable side effects

Had we known that Python evaluates a default argument once, when the function is defined, and that the function object keeps that same mutable list between calls, we could have anticipated this behavior. Instead of relying on a default argument to keep each call isolated, we should have given the function an initial state that could be checked against.

By performing this simple state validation, we are letting the interpreter know that in order to satisfy the logic, a new object needs to be created to store the new list of values. The proper implementation for checking on instance state in Python for collection mutation is shown in the following listing.
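The listing is not reproduced in this excerpt. A minimal sketch of the state check described above, again with the hypothetical name `add_field`, uses `None` as the default and builds a fresh list inside the function body:

```python
def add_field(field, fields=None):
    """Return a new list containing the existing fields plus `field`."""
    # None is the sentinel for "no list supplied". Either way a fresh list
    # is built on every call, so no state survives between calls and a
    # caller's own list is never modified in place.
    fields = [] if fields is None else list(fields)
    fields.append(field)
    return fields


print(add_field("temperature"))  # ['temperature']
print(add_field("humidity"))  # ['humidity']

existing = ["temperature"]
print(add_field("humidity", existing))  # ['temperature', 'humidity']
print(existing)  # ['temperature']
```

Copying the caller's list is a design choice made for this sketch. Mutating the passed-in list would also avoid the shared-default bug, but it would reintroduce a side effect on the caller's data.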

Seemingly small issues like this can create endless headaches for the person (or team) implementing a project. Typically, these sorts of problems are introduced early on and show no symptoms while the modules are being built out. Even simple unit tests that validate this functionality in isolation will appear to pass.

It is typically toward the midpoint of an MVP that issues involving mutability begin to rear their ugly heads. As greater complexity is built out, functions and classes may be used multiple times (which is a desired pattern in development), and if they are not implemented properly, what seemed to work just fine before now results in difficult-to-troubleshoot bugs.
PRO TIP It’s best to become familiar with the way your development language handles objects, primitives, and collections. Knowing these core nuances of the language will give you the tools necessary to guide your development in a way that won’t create more work and frustration for you throughout the process.
A note on encapsulation
Throughout this book, you’ll see multiple references to me beating a dead horse about using functions in favor of declarative code. You’ll also notice references to favoring classes and methods over functions. This is all due to the overwhelming benefits that come with using encapsulation (and abstraction, but that’s another story discussed elsewhere in the text).
Encapsulating code has two primary benefits:

Restricting end-user access to internal protected functionality, state, or data

Enforcing execution of logic on a bundle of the data being passed in and the logic contained within the method
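Both benefits can be illustrated with a small class. This is a sketch under assumptions: `FieldRegistry` and its field names are hypothetical and do not come from the book. Note that the leading underscore in Python is a convention that signals internal state; the language does not enforce it.

```python
class FieldRegistry:
    """Keeps the list of model fields behind a small, validated interface."""

    def __init__(self, allowed_fields):
        self._allowed = frozenset(allowed_fields)  # internal: valid names
        self._fields = []  # internal: fields selected so far

    def add(self, field):
        """Validate and record a field; the logic always runs with the data."""
        if field not in self._allowed:
            raise ValueError(f"unknown field: {field!r}")
        if field not in self._fields:
            self._fields.append(field)

    @property
    def fields(self):
        """Immutable snapshot, so callers cannot alter the internal list."""
        return tuple(self._fields)


registry = FieldRegistry(["temperature", "humidity", "pressure"])
registry.add("temperature")
registry.add("humidity")
registry.add("temperature")  # duplicate, ignored
print(registry.fields)  # ('temperature', 'humidity')
```

Because `fields` hands back a tuple, code elsewhere in the project cannot append to the registry by accident; every change has to go through `add`, where the validation lives.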



Walls of text

Posted on July 7, 2023 by statistics-lab

If there was one thing that I learned relatively early in my career as a data scientist, it was that I truly hate debugging. It wasn’t the act of tracking down a bug in my code that frustrated me; rather, it was the process that I had to go through to figure out what went wrong in what I was telling the computer to do.

Like many DS practitioners at the start of their career, when I began working on solving problems with software, I would write a lot of declarative code. I wrote my solutions much in the way that I logically thought about the problem (“I pull my data, then I do some statistical tests, then I make a decision, then I manipulate the data, then I put it in a vector, then into a model …”). This materialized as a long list of actions that flowed directly, one into another. What this programming model meant in the final product was a massive wall of code with no separation or isolation of actions, let alone encapsulation.
Finding the needle in the haystack for any errors in code written in that manner is an exercise in pure, unadulterated torture. The architecture of the code was not conducive to allowing me to figure out which of the hundreds of steps contained therein was causing an issue.

Troubleshooting walls of text (WoT, pronounced What?!) is an exercise in patience that bears few parallels in depth and requisite effort. If you’re the original author of such a display of code, it’s an annoying endeavor (you have no one to hate other than yourself for creating the monstrosity), a depressing activity (see prior comment), and a time-consuming slog that can be easily avoided, provided you know how, what, and where to isolate elements within your ML code.

If written by someone else, and you’re the unfortunate heir to the code base, I extend to you my condolences and a hearty “Welcome to the club.” Perhaps a worthy expenditure of your time after fixing the code base would be to mentor the author, provide them with an ample reading list, and help them to never produce such rage-inducing code again.

To have a frame of reference for our discussion, let’s take a look at what one of these WoTs could look like. While the examples in this section are rather simplistic, the intention is to imagine what a complete end-to-end ML project would look like in this format, without having to read through hundreds of lines. (I imagine that you wouldn’t like to flip through dozens of pages of code in a printed book.)

Considerations for monolithic scripts

Aside from being hard to read, listing 9.1’s biggest flaw is that it’s monolithic. Although it is a script, the principles of WoT development can apply to both functions and methods within classes. This example comes from a notebook, which increasingly is the declarative vehicle used to execute ML code, but the concept applies in a general sense.

Having too much logic within the bounds of an execution encapsulation creates problems (since this is a script run in a notebook, the entire code is one encapsulated block). I invite you to think about these issues through the following questions:

What would it look like if you had to insert new functionality in this block of code?
Would it be easy to test if your changes are correct?
What if the code threw an exception?
How would you go about figuring out what went wrong with the code from an exception being thrown?
What if the structure of the data changed? How would you go about updating the code to reflect those changes?

Before we get into answering some of these questions, let’s look at what this code actually does. Because of the confusing variable names, dense coding structure, and tight coupling of references, we would have to run it to figure out what it’s doing. The next listing shows the first aspect of listing 9.1.
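The book's listings are not reproduced in this excerpt. As a hypothetical contrast to a monolithic script, and not a rewrite of listing 9.1, the sketch below splits a tiny pipeline into named stages. The stage names, the z-score threshold, and the sample data are all assumptions made for illustration.

```python
import statistics


def load_rows(raw_rows):
    """Parse (label, value) pairs; a bad value raises ValueError in this stage."""
    return [(label, float(value)) for label, value in raw_rows]


def remove_outliers(rows, max_z=3.0):
    """Drop rows whose value lies more than `max_z` standard deviations out."""
    values = [value for _, value in rows]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(rows)
    return [(label, v) for label, v in rows if abs(v - mean) / stdev <= max_z]


def summarize(rows):
    """Reduce the cleaned rows to the numbers a later step would need."""
    values = [value for _, value in rows]
    return {"count": len(values), "mean": statistics.fmean(values)}


def run_pipeline(raw_rows):
    """Compose the stages; each one can also be called and tested on its own."""
    return summarize(remove_outliers(load_rows(raw_rows)))


print(run_pipeline([("a", "1"), ("b", "2"), ("c", "3")]))
```

With this layout, new functionality becomes a new stage, each stage can be tested on its own, and a traceback names the function in which an exception was raised instead of pointing into the middle of one long block.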



Logging: Code, metrics, and results

Posted on June 21, 2023 by statistics-lab

Chapters 2 and 3 covered the critical importance of communication about modeling activities, both to the business and among a team of fellow data scientists. Being able not only to show our project solutions but also to have a provenance history for reference is at least as important to the project’s success as the algorithms used to solve it.
For the forecasting project that we’ve been covering through the last few chapters, the ML aspect of the solution isn’t particularly complex, but the magnitude of the problem is. With thousands of airports to model (which, in turn, means thousands of models to tune and keep track of), handling communication and having a reference for historical data for each execution of the project code is a daunting task.

What happens when, after running our forecasting project in production, a member of the business unit team wants an explanation as to why a particular forecast was so far off from the eventual reality of the data that is collected? This is a common question from many companies that rely on ML predictions to inform the business about actions that should be taken in running the business. The very last thing that you would want to have to deal with if a black swan event occurs and the business is asking questions about why the modeled forecast solution didn’t foresee it, is having to try to regenerate what the model might have forecasted at a certain point in time in order to fully explain how unpredictable events cannot be modeled.
NOTE A black swan event is an unforeseeable and many times catastrophic event that changes the nature of acquired data. While rare, they can have disastrous effects on models, businesses, and entire industries. Some recent black swan events include the September 11th terrorist attacks, the financial collapse of 2008, and the Covid-19 pandemic. Due to the far-reaching and entirely unpredictable nature of these events, the impact to models can be absolutely devastating. The term “black swan” was coined and popularized in reference to data and business in the book The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb (Random House, 2007).
To solve these intractable issues that ML practitioners have had to deal with historically, MLflow was created. The aspect of MLflow that we’re going to look at in this section is the Tracking API, giving us a place to record all of our tuning iterations, our metrics from each model’s tuning runs, and pre-generated visualizations that can be easily retrieved and referenced from a unified graphical user interface (GUI).

MLflow tracking

Let’s look at what is going on with the two Spark-based implementations from chapter 7 (section 7.2) as they pertain to MLflow logging. In the code examples shown in that chapter, the initialization of the context for MLflow was instantiated in two distinct places.
In the first approach, using SparkTrials as the state-management object (running on the driver), the MLflow context was placed as a wrapper around the entire tuning run within the function run_tuning(). This is the preferred method of orchestrating the tracking of runs when using SparkTrials, because a parent run’s individual child runs can then be associated easily for querying, both from within the tracking server’s GUI and through REST API requests to the tracking server that involve filter predicates.

Figure 8.1 shows a graphical representation of this code when interacting with MLflow’s tracking server. The code records not only the metadata of the parent encapsulating run but also the per-iteration logging that occurs from the workers as each hyperparameter evaluation happens.

When looking at the actual code manifestation within the MLflow tracking server’s GUI, we can see the results of this parent-child relationship, shown in figure 8.2.

Conversely, the approach used for the pandas_udf implementation is slightly different. In chapter 7’s listing 7.10, each individual iteration that Hyperopt executes requires the creation of a new experiment. Since there is no child-parent relationship to group the data together, the application of custom naming and tagging is required to allow for searchability within the GUI and, more important for production-capable code, through the REST API. The overview of the logging mechanics for this alternative (and more scalable implementation for this use case of thousands of models) is shown in figure 8.3.
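Neither chapter 7 listing is reproduced in this excerpt. As a rough sketch of the two logging layouts described above, and not the book's code, the functions below use MLflow's fluent tracking calls (`start_run`, `set_tags`, `log_params`, `log_metric`). The function names, tag keys, run names, and the shape of `trials` are assumptions, and the Hyperopt and Spark parts are left out entirely.

```python
def build_run_tags(airport, model_type):
    """Tags that keep an un-nested run searchable (keys are hypothetical)."""
    return {"airport": airport, "model_type": model_type}


def log_nested_tuning(airport, model_type, trials):
    """Parent run wrapping one child run per hyperparameter evaluation.

    `trials` is an iterable of (params_dict, rmse) pairs.
    """
    import mlflow  # imported lazily so build_run_tags works without MLflow

    with mlflow.start_run(run_name=f"tuning_{airport}"):
        mlflow.set_tags(build_run_tags(airport, model_type))
        for index, (params, rmse) in enumerate(trials):
            # nested=True attaches this run to the active parent run
            with mlflow.start_run(run_name=f"trial_{index}", nested=True):
                mlflow.log_params(params)
                mlflow.log_metric("rmse", rmse)


def log_standalone_run(airport, model_type, params, rmse):
    """Single run with no parent; naming and tags carry the grouping."""
    import mlflow

    with mlflow.start_run(run_name=f"{airport}_{model_type}"):
        mlflow.set_tags(build_run_tags(airport, model_type))
        mlflow.log_params(params)
        mlflow.log_metric("rmse", rmse)


print(build_run_tags("JFK", "exp_smoothing"))
```

In the second layout nothing ties related runs together except the tags, so they need to be consistent. A filter such as `tags.airport = 'JFK'` passed to `mlflow.search_runs` is one way to retrieve them later.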



Choosing the right tech for the platform and the team

Posted on June 21, 2023 by statistics-lab

The forecasting scenario we’ve been walking through, when executed in a virtual machine (VM) container and running automated tuning optimization and forecasting for a single airport, worked quite well. We got fairly good results for each airport. By using Hyperopt, we also managed to eliminate the unmaintainable burden of manually tuning each model. While impressive, it doesn’t change the fact that we’re not looking to forecast passengers at just a single airport. We need to create forecasts for thousands of airports.

Figure 7.9 shows what we’ve built, in terms of wall-clock time, in our efforts thus far. The synchronous nature of each airport’s models (in a for loop) and Hyperopt’s Bayesian optimizer (also a serial loop) means that we’re waiting for models to be built one by one, each next step waiting on the previous to be completed, as we discussed in section 7.1.2.

This problem of ML at scale, as shown in this diagram, is a stumbling block for many teams, mostly because of complexity, time, and cost (and is one of the primary reasons why projects of this scale are frequently cancelled). Solutions exist for these scalability issues for ML project work; each involves stepping away from the realm of serial execution and moving into the world of distributed computing, asynchronous computing, or a mixture of both.

The standard structured code approach for most Python ML tasks is to execute in a serial fashion. Whether it be a list comprehension, a lambda, or a for (while) loop, ML is steeped in sequential execution. This approach can be a benefit, as it reduces memory pressure for many algorithms that have a high memory requirement, particularly those that use recursion, which are many. But this approach can also be a handicap, as it takes much longer to execute, since each subsequent task is waiting for the previous to complete.
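The wall-clock cost of the serial pattern can be seen in a small standard-library sketch. This is not Spark and not the book's code: `fit_airport` is a hypothetical stand-in that sleeps instead of tuning a model, and the returned score is a placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fit_airport(airport):
    """Stand-in for tuning one airport's model; sleeps to simulate the work."""
    time.sleep(0.05)
    return airport, 0.0  # placeholder score


airports = ["JFK", "EWR", "LGA", "ORD"]

# Serial: each airport waits for the previous one to finish.
start = time.perf_counter()
serial_results = [fit_airport(airport) for airport in airports]
serial_seconds = time.perf_counter() - start

# Concurrent: the airports are independent, so they can overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent_results = list(pool.map(fit_airport, airports))
concurrent_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Threads help here only because `time.sleep` releases Python's global interpreter lock. Real model fitting is mostly CPU-bound, so it needs separate processes or a distributed engine to see a comparable gain.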

We will discuss concurrency in ML briefly in section 7.4 and in more depth in later chapters (both safe and unsafe ways of doing it). For now, with the issue of scalability with respect to wall-clock time for our project, we need to look into a distributed approach to this problem in order to explore our search spaces faster for each airport. It is at this point that we stray from the world of our single-threaded VM approach and move into the distributed computing world of Apache Spark.

Why Spark?

Why use Spark? In a word: speed.
For the problem that we’re dealing with here, forecasting each month the passenger expectations at each major airport in the United States, we’re not bound by SLAs that are measured in minutes or hours, but we still need to think about the amount of time it takes to run our forecasting. There are multiple reasons for this, chiefly:

Time: If we’re building this job as a monolithic modeling event, any failure in an extremely long-running job will require a restart (imagine the job failing after it was 99% complete, running for 11 days straight).
Stability: We want to be very careful about object references within our job and ensure that we don’t create a memory leak that could cause the job to fail.
Risk: Keeping machines dedicated to extremely long-running jobs (even in cloud providers) risks platform issues that could bring down the job.
Cost: Regardless of where your virtual machines are running, someone is paying the bill for them.

When we focus on tackling these high-risk factors, distributed computing offers a compelling alternative to serial looped execution, not only because of cost, but mostly because of the speed of execution. Were any issues to arise in the job, unforeseen issues with the data, or problems with the underlying hardware that the VMs are running on, these dramatically reduced execution times for our forecasting job will give us flexibility to get the job up and running again with predicted values returning much faster.
A brief note on Spark
Spark is a large topic, a monumentally large ecosystem, and an actively contributed-to open source distributed computing platform based on the Java Virtual Machine (JVM). Because this isn’t a book about Spark per se, I won’t go too deep into the inner workings of it.

Several notable books have been written on the subject, and I recommend reading them if you are inclined to learn more about the technology: Learning Spark by Jules Damji et al. (O’Reilly, 2020), Spark: The Definitive Guide by Bill Chambers and Matei Zaharia (O’Reilly, 2018), and Spark in Action by Jean-Georges Perrin (Manning, 2020).

Suffice it to say, in this book, we will explore how to effectively utilize Spark to perform ML tasks. Many examples from this point forward are focused on leveraging the power of the platform to perform large-scale ML (both training and inference).

For the current section, the information covered is relatively high level with respect to how Spark works for these examples; instead, we focus entirely on how we can use it to solve our problems.



Running quick forecasting tests

Posted on June 21, 2023 by statistics-lab

The rapid testing phase is by far the most critical aspect of prototyping to get right. As mentioned in this chapter’s introduction, it is imperative to strive for the middle ground between not testing enough of the various approaches to determine the tuning sensitivity of each algorithm, and spending inordinate amounts of time building a full MVP solution for each approach. Since time is the most important aspect of this phase, we need to be efficient while making an informed decision about which approach shows the most promise in solving the problem in a robust manner.

Freshly armed with useful and standardized utility functions, each team can work on its respective approaches, rapidly testing to find the most promising model. The team has agreed that the airports under consideration for modeling tests are JFK, EWR, and LGA (each team needs to test its model and tuning paradigms on the same datasets so a fair evaluation of each approach can occur).

Let’s take a look at what the teams will be doing with the different model approaches during rapid testing, what decisions will be made about the approaches, and how the teams can quickly pivot if they find that the approach is going nowhere. The exploratory phase is going to not only uncover nuances of each algorithm but also illuminate aspects of the project that might not have been realized during the preparatory phase (covered in chapter 5). It’s important to remember that this is to be expected and that during this rapid testing phase, the teams should be in frequent communication with one another when they discover these problems (see the following sidebar for tips on effectively managing these discoveries).

WAIT A MINUTE . . . HOW ARE WE GOING TO CREATE A VALIDATION DATASET?

One group drew the proverbial short straw in the model-testing phase with a forecasting approach to research and test that isn’t particularly well understood by the team. Someone on the team found mention of using a VAR to model multiple time series together (multivariate endogenous series modeling), and thus, this group sets out to research what this algorithm is all about and how to use it.

The first thing that they do is run a search for “vector autoregression,” which returns a massive wall of formulaic theory analysis and mathematical proofs centered primarily on macro-econometrics research and applications of the model in the natural sciences. That’s interesting, but not particularly useful if they want to quickly test this model on their data. They next find the statsmodels API documentation for the model.

The team members quickly realize that they haven’t thought about standardizing one common function yet: the split methodology. For most supervised ML problems, they’ve always used pandas split methodologies through DataFrame slicing or utilizing the high-level random split APIs, which use a random seed to select rows for training and test datasets. However, for forecasting, they realize that they haven’t had to do datetime splitting in quite some time and need a deterministic and chronological split method to get accurate forecast validation holdout data. Since the dataset has an index set from the ingestion function’s formatting of the DataFrame, they could probably craft a relatively simple splitting function based on the index position. What they come up with is in the following listing.
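The team's listing is not reproduced in this excerpt. As a stand-in, the following is a minimal sketch of a deterministic, chronological split based on index position. The function name, its signature, the 80/20 default, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def split_by_index_position(data: pd.DataFrame, train_fraction: float = 0.8):
    """Split a datetime-indexed DataFrame into chronological train/test parts.

    Hypothetical helper: the name, signature, and default are assumptions.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort by the datetime index so that row position matches time order.
    ordered = data.sort_index()
    # Rows before the cutoff position are training data; the rest are held out.
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Synthetic monthly values for illustration only (not real passenger counts).
index = pd.date_range("2019-01-01", periods=10, freq="MS")
frame = pd.DataFrame(
    {"JFK": range(10), "EWR": range(10, 20), "LGA": range(20, 30)},
    index=index,
)

train, test = split_by_index_position(frame, train_fraction=0.8)
print(len(train), len(test))
print(train.index.max() < test.index.min())
```

Unlike a seeded random split, this never places a later observation in the training set than any observation in the holdout set, so the validation data always lie in the model's future.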

