Association Between Two Categorical Variables

Posted on April 13, 2022 by statistics-lab

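Analysis of categorical data deals with categorical (attribute) variables and the values they take. Categorical data, also called qualitative data, describe the attributes of things: each observation is classified into one or more categories. For example, an item might be judged good or bad, or responses to a survey might fall into categories such as agree, disagree, or no opinion. Statgraphics includes many procedures for handling such data, including the modeling procedures in its analysis of variance, regression analysis, and statistical process control sections.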

Contingency Tables for Two Categorical Variables

To measure the association between two categorical variables, we use a contingency table that summarizes the (joint) frequencies observed in each category of the variables. For example, as we were first writing this chapter, the race between Hillary Clinton and Barack Obama for the 2008 Democratic presidential candidacy was still undecided and very much in the news. Suppose that we would like to know whether there is an association between voter gender and candidate choice in the Wisconsin Democratic Primary.

In an exit poll of 1,442 Wisconsin voters, $42 \%$ males and $58 \%$ females, it was observed that $67 \%$ of the males and $50 \%$ of the females voted for Obama (CNN Election Center, 2008). Table $4.1$ presents the 2-by-2 ($2 \times 2$) contingency table used to summarize the frequencies for the variables of gender (male or female) and candidate choice (Clinton or Obama).

We use Table $4.1$ to introduce some notation and terminology for contingency tables. First, the total number of categories for the row variable is denoted by $I$, with each category indexed by $i$, while the total number of categories for the column variable is denoted by $J$, with each category indexed by $j$. In our example, Gender has $I=2$ categories (e.g., $i=1$ for Males; $i=2$ for Females) and Candidate has $J=2$ categories (e.g., $j=1$ for Clinton; $j=2$ for Obama). In general, the size of the contingency table is denoted as $I \times J$ (i.e., $2 \times 2$ in our example).

The frequency in each cell of the table, called a joint frequency, is denoted by $n_{ij}$. Each number that appears in boldface in Table $4.1$ is a joint, or cell, frequency. For example, $n_{11}$ in Table $4.1$ represents the number of voters who are male $(i=1)$ and voted for Clinton $(j=1)$, so $n_{11}=200$, while $n_{12}$ in Table $4.1$ represents the number of voters who are male $(i=1)$ and voted for Obama $(j=2)$, so $n_{12}=406$. Taken together, the cell frequencies represent the joint distribution of the two categorical variables. It is important to note that each individual observation can be counted only once, so it must appear in (or be classified into) one and only one cell of the table.

Each frequency appearing in the margins of the table is called a marginal frequency and represents the row or column total for one category of one variable. A marginal frequency for a row is denoted by $n_{i+}$ and a marginal frequency for a column is denoted by $n_{+j}$. The marginal frequencies are shaded in Table 4.1. For example, the row total or marginal frequency for males in Table $4.1$ is $n_{1+}=606$ (and represents the total number of males in the sample), while the marginal frequency for females is $n_{2+}=836$ (and represents the total number of females in the sample). Similarly, $n_{+1}=618$ is the column marginal frequency for Clinton voters and $n_{+2}=824$ is the column marginal frequency for Obama voters. Together, the marginal frequencies for the rows (or columns) represent the marginal distribution of the row (or column) variable. Finally, the overall total number of observations is denoted by $n_{++}$, so in this example $n_{++}=1442$.

Each of the cell frequencies can be converted to a joint proportion (or probability) by dividing the cell frequency by the total number of observations. In the population these cell proportions are denoted by $\pi_{ij}$, whereas in the sample they are denoted by $p_{ij}=n_{ij} / n_{++}$. Similarly, each of the marginal frequencies ($n_{i+}$ or $n_{+j}$) can be converted to a marginal proportion or probability when divided by the total number of observations. For example, from Table 4.1, the joint proportion of voters who are female and voted for Clinton is $p_{21}=n_{21} / n_{++}=418 / 1442=0.29$, and the marginal proportion of voters who voted for Clinton is $p_{+1}=n_{+1} / n_{++}=618 / 1442=0.43$.
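
As a minimal sketch of the conversion just described, the cell counts quoted for Table 4.1 can be turned into marginal totals and proportions in a few lines of Python. The variable names are illustrative assumptions, and the female-Obama count of 418 is not quoted directly in the text; it is inferred here from the stated marginal totals ($836-418$ and $824-406$).

```python
# Cell frequencies n_ij for Table 4.1 as quoted in the text.
# Rows: gender (male, female); columns: candidate (Clinton, Obama).
# Assumption: the female/Obama cell (418) is inferred from the margins.
counts = [
    [200, 406],  # males
    [418, 418],  # females
]

n_total = sum(sum(row) for row in counts)        # n_++
row_totals = [sum(row) for row in counts]        # n_i+
col_totals = [sum(col) for col in zip(*counts)]  # n_+j

# Joint proportions p_ij = n_ij / n_++
joint = [[n / n_total for n in row] for row in counts]
# Marginal proportions p_i+ and p_+j
row_props = [n / n_total for n in row_totals]
col_props = [n / n_total for n in col_totals]

print(n_total, row_totals, col_totals)  # 1442 [606, 836] [618, 824]
print(round(joint[1][0], 2))            # p_21 -> 0.29
print(round(col_props[0], 2))           # p_+1 -> 0.43
```

Because every observation falls in exactly one cell, the joint proportions sum to 1, as do each set of marginal proportions.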

Independence

Just as we typically use the correlation coefficient to evaluate the association between two continuous variables, we use a value called the odds ratio to evaluate the association between two categorical variables. Before we define and discuss the odds ratio, however, we expand a bit on the idea of independence between two variables, which is a key concept in categorical data analysis.

When two categorical variables are independent of each other, they are not associated. For example, if gender and candidate choice are independent variables, then one is not associated with the other, meaning that we would be able to predict candidate choice just as well regardless of whether we knew the voter's gender. Thus, if knowing a voter's gender does not help to predict the candidate chosen by that voter, then there is no relationship between gender and candidate choice and these two variables are independent. Further, if knowing the value (category) of one variable has no effect on predicting the value (category) of the other, then the column probability distribution should be the same in each row and the row probability distribution should be the same in each column. In our example (Table 4.1), this would mean that the overall candidate (column) probability distribution of $43 \%$ ($618 / 1442$) for Clinton and $57 \%$ ($824 / 1442$) for Obama should also be the candidate choice distribution obtained for both males and females. That is, if independence holds, then $43 \%$ of the 606 males would be expected to vote for Clinton and the remaining $57 \%$ of the males would be expected to vote for Obama. Similarly, $43 \%$ of the 836 females would be expected to vote for Clinton and the remaining $57 \%$ would be expected to vote for Obama. This is illustrated in Table $4.2$. Formally, this can be stated as $\pi_{ij} / \pi_{i+}=\pi_{+j}$ for each column $(j=1,2, \ldots, J)$ or $\pi_{ij} / \pi_{+j}=\pi_{i+}$ for each row $(i=1,2, \ldots, I)$. Rearranging either of these formulas, this relationship can also be formally stated as $\pi_{ij}=\pi_{i+} \pi_{+j}$.

In statistical terms, if, in the population, two variables are independent, then their joint probability $\left(\pi_{ij}\right)$ can be determined solely on the basis of the marginal probabilities $\left(\pi_{i+} \pi_{+j}\right)$. As usual, these population parameters can be estimated using sample data. For instance, using our example in Table 4.1, if gender and voting choice were independent, then the probability of a woman voting for Clinton could be obtained from multiplying the probability of a voter being female by the probability of a voter choosing Clinton:
$$
\begin{aligned}
p_{21} &=\left(p_{2+}\right)\left(p_{+1}\right) \\
&=(\text{Proportion of females})(\text{Proportion choosing Clinton}) \\
&=(836 / 1442)(618 / 1442)=(0.58)(0.43)=0.25
\end{aligned}
$$
So, if independence holds, we would expect that $25 \%$ of the 1,442 voters would be females who voted for Clinton, and we could similarly obtain the expected probabilities (and frequencies) for all other cells in the contingency table. This mathematical relationship between the joint and marginal probabilities will not hold if there is an association between the two variables. These computations are further discussed and demonstrated in Section 4.4.
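
The same multiplication can be applied to every cell at once. The following is only a sketch under the independence assumption, using the Table 4.1 counts quoted in this section (with the female-Obama cell of 418 inferred from the margins); the variable names are hypothetical.

```python
# Observed counts from Table 4.1 (rows: male, female; cols: Clinton, Obama).
# Assumption: the female/Obama cell (418) is inferred from the margins.
counts = [[200, 406], [418, 418]]

n_total = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]        # n_i+
col_totals = [sum(col) for col in zip(*counts)]  # n_+j

# Under independence the joint proportion is the product of the margins:
# p_ij = p_i+ * p_+j
expected_props = [
    [(r / n_total) * (c / n_total) for c in col_totals] for r in row_totals
]
# Expected frequency = expected proportion * n_++ (= n_i+ * n_+j / n_++)
expected_counts = [[p * n_total for p in row] for row in expected_props]

for label, obs_row, exp_row in zip(["male", "female"], counts, expected_counts):
    print(label, obs_row, [round(e, 1) for e in exp_row])
```

For females choosing Clinton the expected proportion rounds to 0.25, which corresponds to roughly 358 voters, compared with the 418 actually observed; the gap between observed and expected counts is what a test of independence evaluates.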

Odds Ratio

The odds of an event occurring (sometimes also labeled a “success”, as in Chapter 2 ) are the probability that the event occurs relative to the probability that the event does not occur. For example, if the odds that a student in the United States will graduate from high school are $2.5$, then the probability that the student will graduate is $2.5$ times greater than the probability that the student will not graduate. If the probability that the event occurs in the population is $\pi$, then the odds that the event occurs are
$$
\text{Odds}=\frac{\pi}{1-\pi} \tag{4.1}
$$
Rearranging Equation $4.1$ to solve for the probability, we obtain
$$
\begin{aligned}
\text{Odds} &=\frac{\pi}{1-\pi} \\
\text{Odds}\,(1-\pi) &=\pi \\
\text{Odds}-\text{Odds}\,(\pi) &=\pi \\
\text{Odds} &=\pi+\text{Odds}\,(\pi) \\
\text{Odds} &=\pi(1+\text{Odds}) \\
\frac{\text{Odds}}{1+\text{Odds}} &=\pi
\end{aligned}
$$
In other words, while the odds are expressed in terms of the probability in Equation $4.1$, the probability can be expressed in terms of the odds by the equation
$$
\pi=\frac{\text { Odds }}{1+\text { Odds }}
$$
So, for example, if the odds of graduating from high school are $2.5$, the probability of graduating from high school would be
$$
\pi=\frac{2.5}{1+2.5}=\frac{2.5}{3.5}=0.71
$$
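
The two conversions are inverses of each other, which a short sketch makes easy to check. The function names below are illustrative assumptions, not notation from the text.

```python
def odds_from_prob(prob: float) -> float:
    """Odds = pi / (1 - pi); undefined when pi equals 1."""
    if not 0.0 <= prob < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return prob / (1.0 - prob)


def prob_from_odds(odds: float) -> float:
    """pi = Odds / (1 + Odds), the rearranged form of the odds formula."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


p_grad = prob_from_odds(2.5)
print(round(p_grad, 2))                  # 0.71
print(round(odds_from_prob(p_grad), 6))  # 2.5 (round trip)
```

Note that odds of 1 correspond to a probability of 0.5, so odds above 1 indicate an event that is more likely than not.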

Computer Output: Goodness-of-Fit Example

SAS

The SAS program for obtaining most of the results discussed in Section $3.4$ is provided in Figure $3.12$ and the output is provided in Figure 3.13. In the program (Figure 3.12):

Two variables are specified: proficiency level (prof) and frequency (count). As was the case with a single proportion, the count variable would not be necessary if raw data were analyzed.
The proc freq and weight statements (lines one and two of the procedure) are the same as they were in the previous example (i.e., Section 3.5.1).
The tables statement (third line of proc freq) requests the frequency table for the proficiency categories and includes the option testp=, which is followed by the proportions of the expected frequency distribution.
Note that SAS will automatically compute the expected frequencies based on these expected proportions and the total number of observations (which the program obtains by summing the observed frequencies from the data).
Due to the use of the order=data option in the proc freq statement, the proportions entered after testp= must be specified according to the order of the categories as they appear in the data set.
The output, shown in Figure 3.13, provides:
The frequency table, including the observed frequencies and proportions as well as the expected proportions (as specified in the program).
The chi-squared test output, consisting of the Pearson test statistic (chi-squared = 12351.64) as well as the degrees of freedom $(d f=3)$ and $p$-value $(<.0001)$ of the test.

Other options (and more extensive output) are available when several variables (i.e., two-way tables) are analyzed, as will be discussed in the next chapter.

SPSS

The data are entered in SPSS in the same manner used for data entry in SAS. That is, two variables are entered: the proficiency level (proficiency) and the frequency (count). If raw data were used, the counts would be computed by the program and would not be needed as input. The proficiency categories were labeled under Values in the Variable View tab, such that Advanced $=4$, Proficient $=3$, Basic $=2$, and Minimal $=1$.
To indicate that frequencies rather than raw data are used, we again need to:

Click on the Data menu and select Weight Cases.
In the dialogue box that opens, click on Weight Cases by, then click on the count variable and move it to the frequency variable box, then click OK.
To obtain the chi-squared goodness-of-fit test:
Choose Nonparametric Tests in the Analyze menu and click on One Sample.
This will bring up the same window that was obtained when performing the binomial and score tests, with three file tabs on the top. Select the third file tab, Settings, select the option that is titled Customized Tests, and then select the second button, Compare observed probability to hypothesized (chi-squared test). The expected proportions (i.e., $0.15$, $0.40$, $0.30$, and $0.15$) need to be specified by clicking on the Options button and adding each category (1-4) with its corresponding expected proportion (Relative Frequency).
The syntax is provided in Figure 3.14.
The output that is automatically displayed in the output window is shown in Figure $3.15$. Double clicking this box in the output window will provide the model view output illustrated in Figure 3.16, which includes
A graphic display (bar graph) of the observed and expected frequencies.
The Pearson chi-squared test statistic of $12351.64$, with 3 degrees of freedom and a $p$-value of $.000$ (which implies $p<0.0001$).

R

The R program (in bold) and output for obtaining the goodness-of-fit test are provided in Figure 3.17. The elements of the program are as follows:

Define the variables and save the data (using the data.frame function) to an object called “ch3ex2”. Note that “c” is needed before each variable vector so that these are concatenated in the appropriate order (i.e., each element of one vector corresponds with the element in the other vector occupying the same position).
Use the as.factor function to indicate that the Proficiency variable is categorical.
Define the expected probabilities and save them as “testp”, then define the observed frequencies as the values of the “count” variable and save them as “obs”.
Run the goodness-of-fit test using the chisq.test function.
In addition, Figure $3.18$ shows how to obtain a graph of the observed and expected frequencies in R.
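
Figures 3.12 through 3.18 are not reproduced here, so the following is only a rough Python sketch of the same workflow (define the observed counts, define the expected probabilities, run the test); it is not the SAS, SPSS, or R program from the figures. The observed counts and expected proportions are the values quoted elsewhere in this document, listed in the same category order in both lists, and the function names `chisq_gof` and `chi2_sf_3df` are hypothetical. The p-value helper relies on the closed-form survival function of the chi-squared distribution, which in this form is valid for 3 degrees of freedom only.

```python
import math


def chisq_gof(observed, expected_probs):
    """Pearson goodness-of-fit statistic and its degrees of freedom."""
    if len(observed) != len(expected_probs):
        raise ValueError("observed and expected_probs must align by category")
    if abs(sum(expected_probs) - 1.0) > 1e-9:
        raise ValueError("expected probabilities must sum to 1")
    n = sum(observed)  # total sample size, summed from the counts
    expected = [n * p for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# The two lists must use the same category order, just as testp= must
# follow the order of the categories in the SAS data set.
obs = [18644, 32269, 10039, 10757]
testp = [0.15, 0.40, 0.30, 0.15]

stat, df = chisq_gof(obs, testp)
p_value = chi2_sf_3df(stat)
print(round(stat, 2), df, p_value < 0.0001)  # 12351.64 3 True
```

With a statistic this large the computed tail probability underflows to zero in floating point, which is consistent with the reported $p<.0001$.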

Confidence Intervals for a Single Proportion

While hypothesis testing gives an indication as to whether the observed proportion is consistent with a population proportion of interest (specified under $H_{0}$ ), a confidence interval for the population proportion provides information on the possible value of the “true” proportion in the population of interest. In general, the formula for a confidence interval is
$$
\text{Statistic} \pm(\text{Critical value})(\text{Standard error})
$$
where the statistic is the sample estimate of the parameter, the critical value depends on the level of confidence desired and is obtained from the sampling distribution of the statistic, and the standard error is the standard deviation of the sampling distribution of the statistic. For example, in constructing a confidence interval for the population mean, one would typically use the sample mean as the statistic, a value from the t-distribution (with $n-1$ degrees of freedom at the desired confidence level) as the critical value, and $s / \sqrt{n}$ for the standard error (where $s$ is the sample standard deviation).

Generalizing this to a confidence interval for a proportion, the sample proportion, $p$, is used as the statistic and the critical value is obtained from the standard normal distribution (e.g., $z=1.96$ for a $95 \%$ confidence interval). To compute the standard error, we refer back to the Wald approach, which uses $\sqrt{p(1-p) / n}$ as the standard error of a proportion. Therefore, the confidence interval for a proportion is computed using
$$
p \pm z_{\alpha / 2} \sqrt{p(1-p) / n},
$$
where $z_{\alpha / 2}$ is the critical value from the standard normal distribution for a $100(1-\alpha) \%$ confidence level.

For example, to construct a $95 \%$ confidence interval using our sample (where $n=10$ and $k=7$), we have $p=0.7$ and $\alpha=0.05$, so $z_{\alpha / 2}=1.96$ and $\sqrt{p(1-p) / n}=0.145$. Therefore, the $95 \%$ confidence interval for our example is
$$
0.7 \pm 1.96(0.145)=0.7 \pm 0.284=[0.416,0.984]
$$
Based on this result, we can be $95 \%$ confident that the proportion of students in the population (from which the sample was obtained) who are proficient in mathematics is somewhere between (approximately) $42 \%$ and $98 \%$. This is a very large range due to the fact that we have a very small sample (and, thus, a relatively large standard error). As we mentioned previously in our discussion of the Wald test, it is somewhat unreliable to compute the standard error based on the sample proportion, especially when the sample size is small.
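
A minimal sketch of this Wald interval, assuming the critical value 1.96 is supplied directly rather than looked up; the function name `wald_ci` is an illustrative choice.

```python
import math


def wald_ci(successes: int, n: int, z: float = 1.96):
    """Wald confidence interval: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    margin = z * standard_error
    return p - margin, p + margin


low, high = wald_ci(7, 10)
print(round(low, 3), round(high, 3))  # 0.416 0.984
```

Nothing in the formula keeps the limits inside $[0, 1]$, which is one practical symptom of the small-sample unreliability mentioned above: with a sample proportion close to 0 or 1 and a small $n$, a limit can fall outside the range of valid proportions.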

The value added to and subtracted from the sample proportion (e.g., 0.284) is called the margin of error. The larger it is, the wider the confidence interval and the less precise our estimate. In our example earlier, the margin of error is over $28 \%$ due to our small sample. When we are designing a study, if we wish to aim for a certain margin of error for our estimate (as is done, for example, in polling research), we can "work backward" and solve for the sample size needed. That is, the sample size needed for a given margin of error, $M E$, is:
$$
n=\frac{p(1-p)}{(M E / z)^{2}}
$$
where $n$ is the sample size, $p$ is the sample proportion, $M E$ is the desired margin of error, and $z$ is the critical value corresponding to the desired confidence level. For example, suppose that we wanted our estimate to be accurate to within $2 \%$, with $95 \%$ confidence. The sample size needed to achieve this, given a proportion of $0.7$, would be
$$
n=\frac{p(1-p)}{(M E / z)^{2}}=\frac{0.7(0.3)}{(0.02 / 1.96)^{2}}=2017
$$
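
The sample-size formula can be wrapped in a small helper, sketched below. Rounding up to the next whole observation is an assumption made here (a fractional sample is not possible); it agrees with the 2017 reported in the text.

```python
import math


def sample_size_for_margin(p: float, margin: float, z: float = 1.96) -> int:
    """n = p(1-p) / (ME / z)^2, rounded up to a whole observation."""
    if not 0.0 < p < 1.0 or margin <= 0.0:
        raise ValueError("need 0 < p < 1 and a positive margin of error")
    return math.ceil(p * (1 - p) / (margin / z) ** 2)


print(sample_size_for_margin(0.7, 0.02))  # 2017
print(sample_size_for_margin(0.5, 0.02))  # 2401
```

The second call illustrates that $p(1-p)$ is largest at $p=0.5$, so using 0.5 gives the most conservative (largest) sample size when no prior estimate of the proportion is available.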
Note that for a confidence interval we do not have the option of replacing $p$ with $\pi_{0}$ for estimating the standard error (as we did with the score test) because a confidence interval does not involve the specification of a null hypothesis. Although there are other available methods for computing confidence intervals for a proportion, they are beyond the scope of this book. We refer the interested reader to Agresti (2007), who suggests using the hypothesis testing formula to essentially "work backward" and solve for a confidence interval. Other alternatives include the Agresti-Coull confidence interval, which is an approximation of the method that uses the hypothesis testing formula (Agresti & Coull, 1998), and the $F$ distribution method (Collett, 1991; Leemis & Trivedi, 1996), which provides exact confidence limits for the binomial proportion. The latter (confidence interval computed by the $F$ distribution method) can be obtained from SAS.

Goodness-of-Fit: Comparing Distributions for a Single

In the previous sections we discussed a variable (proficiency in mathematics) that took on only two values (yes or no) because it was measured in a dichotomous manner. While the methods discussed so far are appropriate for such dichotomous variables, when a categorical variable consists of more than two categories it may be necessary to evaluate several proportions. For example, the Wisconsin Department of Public Instruction (2006b) uses four categories to measure mathematics proficiency: advanced, proficient, basic, and minimal. To determine if there has been a change in the proficiency classification of Wisconsin students after a year of implementing an intensive program designed to increase student proficiency in mathematics, a test can be performed that compares the expected and observed frequency distributions. This test is called the chi-squared $\left(\chi^{2}\right)$ goodness-of-fit test because it tests whether the observed data "fit" with expectations. The null hypothesis of this test states that the expected and observed frequency distributions are the same, so a rejection of this null hypothesis indicates that the observed frequencies exhibit significant departures from the expected frequencies.

For example, suppose that the values in the second column of Table $3.3$ (expected proportions) represent the proportion of Wisconsin 10th-grade students in each of the four proficiency classifications in 2005. If there has been no change in the proficiency distribution, these would constitute the proportions expected in 2006 as well. Suppose further that the last column of Table $3.3$ represents (approximately) the observed mathematics proficiency classifications (frequencies) for 71,709 10th-grade Wisconsin students in 2006 (Wisconsin Department of Public Instruction, 2006a). Using these data, we may wish to determine whether there has been a change in the proficiency level distribution from 2005 to 2006.
The Pearson chi-squared test statistic for comparing two frequency distributions is
$$
X^{2}=\sum_{\text{all categories}} \frac{(\text{observed frequency}-\text{expected frequency})^{2}}{\text{expected frequency}}=\sum_{i=1}^{c} \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}
$$
where $O_{i}$ represents the observed frequency in the $i^{\text{th}}$ category and $E_{i}$ represents the expected frequency in the $i^{\text{th}}$ category. This $X^{2}$ test statistic follows a $\chi^{2}$ distribution with $c-1$ degrees of freedom, where $c$ is the total number of categories. The reason for this is that only $c-1$ category frequencies can vary "freely" for a given total sample size, $n$, because the frequencies across the $c$ categories must add up to the total sample size. Therefore, once $c-1$ frequencies are known, the last category frequency can be determined by subtracting those frequencies from the total sample size.

The expected frequencies are specified by the null hypothesis; that is, they are the frequencies one expects to observe if the null hypothesis is true. In our example, the null hypothesis would state that there is no change in frequencies between 2005 and 2006, so the two probability distributions should be the same. Therefore, if the null hypothesis is true, the 2006 probabilities would follow the 2005 probabilities across the proficiency categories. Because the test statistic uses frequencies rather than proportions, we must convert the 2005 proportions in the second column of Table $3.3$ to frequencies based on the total of 71,709 students. These values are shown in the third column of Table $3.3$, under expected frequency (for example, $15 \%$ of 71,709 is $10,756.35$). Thus, we can test whether the frequency distributions are the same or different by comparing the last two columns of Table $3.3$ using a goodness-of-fit test. The test statistic comparing these two frequency distributions is
$$
\begin{aligned}
X^{2} &=\frac{(18644-10756.35)^{2}}{10756.35}+\frac{(32269-28683.6)^{2}}{28683.6}+\frac{(10039-21512.7)^{2}}{21512.7}+\frac{(10757-10756.35)^{2}}{10756.35} \\
&=5784.027+448.1688+6119.445+0=12351.64 .
\end{aligned}
$$
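
To see where each term of this sum comes from, the per-category contributions can be tabulated. The sketch below uses the observed counts from the equation above; the expected proportions 0.15, 0.40, 0.30, and 0.15 are taken to be those of Table 3.3 in the same category order, since they reproduce the expected frequencies shown in the equation. Table 3.3 itself is not reproduced here, so the categories are left unlabeled and the variable names are illustrative.

```python
# Observed 2006 frequencies, in the order used in the worked equation.
observed = [18644, 32269, 10039, 10757]
# Assumed expected (2005) proportions, same category order.
expected_props = [0.15, 0.40, 0.30, 0.15]

n = sum(observed)                                # 71709 students
expected = [n * p for p in expected_props]       # E_i = n * proportion
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

x2 = sum(contributions)                          # Pearson X^2
df = len(observed) - 1                           # c - 1 free categories

for o, e, c in zip(observed, expected, contributions):
    print(f"{o:>6d} {e:>10.2f} {c:>10.3f}")
print(round(x2, 2), df)                          # 12351.64 3
```

The first and third categories account for nearly all of the statistic, while the fourth contributes essentially nothing because its observed count (10,757) almost equals its expected count (10,756.35).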

Computer Output: Single Proportion Example

As you examine the (annotated) output in Figure 3.5, you may wish to refer back to and compare the results summarized in Table $3.2$ and discussed in Section 3.3. The output (Figure 3.5) provides:

- The frequency table.
- The “Chi-Squared Test for Equal Proportions”, which is not the likelihood ratio test but rather the squared version of the score test.
- The “Binomial Proportion for prof = yes” section, which includes the hypothesis tests and confidence intervals for the proportion of students who are proficient, as discussed in this chapter. Specifically, the following are provided in this part of the output:
  - The proportion of yes responses (i.e., the sample estimate of $0.7$).
  - The ASE (which stands for the asymptotic standard error) of the proportion, $0.1449$, computed using the sample proportion (i.e., as in the Wald test).
  - The $95 \%$ confidence interval, with limits $0.416$ and $0.984$, which is computed using the ASE.
  - The exact $95 \%$ confidence interval limits $(0.3475, 0.9333)$, which are based on the $F$ distribution method referred to in Section 3.3.
  - The results of the test of the null hypothesis $H_{0}: \pi=0.8$. Specifically:
    - The “ASE under $H_{0}$” of $0.1265$ refers to the asymptotic standard error computed by replacing the sample proportion $(p)$ with the null hypothesis value $\left(\pi_{0}\right)$, which is the standard error used by the score test.
    - The “$Z$ test-statistic” provided in this part of the output $(-0.79)$ is based on the score test, as are the $p$-values that follow it ($0.2146$ and $0.4292$ for one- and two-tailed tests, respectively).
    - Finally, results of the exact test (using the binomial distribution probabilities) are provided in the form of one- and two-tailed $p$-values ($0.3222$ and $0.6444$, respectively).
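The two standard errors and the ASE-based interval listed above can be reproduced by hand. The following is a minimal Python sketch, not the program that produced Figure 3.5: the variable names are illustrative, the sample values ($n=10$ with 7 "yes" responses) come from the running example, and the critical value is rounded to 1.96.

```python
import math

n, k = 10, 7      # sample size and number of "yes" responses
pi0 = 0.8         # proportion specified under the null hypothesis
z_crit = 1.96     # standard normal critical value for 95% confidence (rounded)

p = k / n
ase = math.sqrt(p * (1 - p) / n)          # ASE from the sample proportion (Wald)
ase_h0 = math.sqrt(pi0 * (1 - pi0) / n)   # ASE under H0 (score test)

# Wald-type 95% confidence limits built from the ASE
lower = p - z_crit * ase
upper = p + z_crit * ase

print(f"ASE = {ase:.4f}, ASE under H0 = {ase_h0:.4f}")
print(f"95% confidence limits: {lower:.3f}, {upper:.3f}")
```

The exact ($F$ distribution based) limits are not reproduced here because they need quantiles that the Python standard library does not provide.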



Hypothesis Testing Using the Normal Approximation

When the sample size (i.e., value of $n$ ) is relatively large, the binomial distribution can be approximated by the normal distribution so test statistics can be constructed and evaluated using the familiar standard normal distribution. Specifically, the normal approximation can be used when both $n \pi \geq 5$ and $n(1-\pi) \geq 5$. In our example, $n \pi=10(0.8)=8>5$ but $n(1-\pi)=10(0.2)=2<5$ so the normal approximation may not be accurate. Nonetheless, we will proceed with our example to illustrate this method.
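As a quick illustration, the rule of thumb can be written as a small helper. The function name `normal_approximation_ok` and its `threshold` parameter are assumptions made for this sketch; only the cutoff of 5 comes from the text.

```python
def normal_approximation_ok(n, pi, threshold=5):
    """Rule of thumb: both n*pi and n*(1 - pi) should be at least `threshold`.

    A tiny tolerance guards against floating-point rounding at the boundary.
    """
    tolerance = 1e-9
    return (n * pi >= threshold - tolerance
            and n * (1 - pi) >= threshold - tolerance)


print(normal_approximation_ok(10, 0.8))    # False: n(1 - pi) = 2 < 5
print(normal_approximation_ok(100, 0.8))   # True: 80 and 20 both exceed 5
```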

The normal approximation test statistic is very similar to the test statistic used in comparing a sample mean against a hypothesized population mean when the dependent variable is continuous. Specifically, a $z$-statistic is constructed using the usual formula
$$
z=\frac{\text { Estimate }-\text { Parameter }}{\text { Standard error }} .
$$
In testing a mean, the estimate is the sample mean, the parameter is the population mean specified under the null hypothesis, and the standard error is the standard deviation of the appropriate sampling distribution. In our case, the estimate is the sample proportion $(p)$ and the parameter is the population proportion specified under the null hypothesis (denoted by $\pi_{0}$).

To compute the standard error, recall from Chapter 2 (Section $2.5$) that the variance for the distribution of frequencies that follow the binomial distribution is $\sigma^{2}=n \pi(1-\pi)$. Because a proportion is a frequency divided by the sample size, $n$, we can use the properties of linear transformations to determine that the variance of the distribution of proportions is equal to the variance of the distribution of frequencies divided by $n^{2}$, or $\sigma^{2} / n^{2}=n \pi(1-\pi) / n^{2}=\pi(1-\pi) / n$.

We can use the sample to estimate the variance by $p(1-p) / n$. The standard error can thus be estimated from sample information using $\sqrt{p(1-p) / n}$, and the test statistic for testing the null hypothesis $H_{0}: \pi=\pi_{0}$ is
$$
z=\frac{p-\pi_{0}}{\sqrt{p(1-p) / n}}
$$
This test statistic follows a standard normal distribution when the sample size is large. For our example, to test $H_{0}: \pi=0.8$,
$$
z=\frac{p-\pi_{0}}{\sqrt{p(1-p) / n}}=\frac{0.7-0.8}{\sqrt{(0.7)(0.3) / 10}}=\frac{-0.1}{0.145}=-0.69 .
$$
We can use the standard normal distribution to find the $p$-value for this test; that is
$$
P(z \leq-0.69)=0.245
$$
so the two-tailed $p$-value is $2(0.245)=0.49$. Thus, the test statistic in our example does not lead to rejection of $H_{0}$ at the $0.05$ significance level. Therefore, as we do not have sufficient evidence to reject the null hypothesis, we conclude that the sample estimate of $0.7$ is consistent with the notion that $80 \%$ of the students in the population are proficient in math. This procedure is called the Wald test.
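The Wald computation can be sketched in Python as follows. This is an illustrative sketch under the values of the running example; the helper `standard_normal_cdf` is a name chosen here, built on the standard library's error function and not on any statistics package.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


n, p, pi0 = 10, 0.7, 0.8

se_wald = math.sqrt(p * (1 - p) / n)    # standard error from the sample proportion
z_wald = (p - pi0) / se_wald            # Wald test statistic
p_lower = standard_normal_cdf(z_wald)   # lower-tailed p-value
p_two = 2 * min(p_lower, 1 - p_lower)   # two-tailed p-value

print(f"z = {z_wald:.2f}, one-tailed p = {p_lower:.3f}, two-tailed p = {p_two:.2f}")
```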

One of the drawbacks to the Wald test is that it relies on the estimate of the population proportion (i.e., $p$ ) to compute the standard error, and this could lead to unreliable values for the standard error, especially when the estimate is based on a small sample. A variation on this test, which uses the null hypothesis proportion $\pi_{0}$ (instead of $p$ ) to compute the standard error, is called the score test. Using the score test, the test statistic becomes
$$
z=\frac{p-\pi_{0}}{\sqrt{\pi_{0}\left(1-\pi_{0}\right) / n}}
$$
and it too follows a standard normal distribution when the sample size is large. For our example,
$$
z=\frac{p-\pi_{0}}{\sqrt{\pi_{0}\left(1-\pi_{0}\right) / n}}=\frac{0.7-0.8}{\sqrt{(0.8)(0.2) / 10}}=\frac{-0.1}{0.126}=-0.79 .
$$
In this case, the $p$-value (using the standard normal distribution) is
$$
P(z \leq-0.79)=0.215
$$

so the two-tailed $p$-value is $2(0.215)=0.43$ and our conclusions do not change: We do not reject the null hypothesis based on this result.

A drawback to the score test is that, by using the null hypothesis value (i.e., $\pi_{0}$) for the standard error, it presumes that this is the value of the population proportion; yet, this may not be a valid assumption and might even seem somewhat counterintuitive when we reject the null hypothesis. Thus, this method is accurate only to the extent that $\pi_{0}$ is a good estimate of the true population proportion (just as the Wald test is accurate only when the sample proportion, $p$, is a good estimate of the true population proportion).

In general, and thus applicable to both the Wald and score tests, squaring the value of $z$ produces a test statistic that follows the $\chi^{2}$ (chi-squared) distribution with 1 degree of freedom. That is, the $p$-value from the $\chi^{2}$ test (with 1 degree of freedom) is equivalent to that from the two-tailed $z$-test. Another general hypothesis testing method that utilizes the $\chi^{2}$ distribution, the likelihood ratio method, will be used in different contexts throughout the book and is introduced next.
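The score test and the relationship between $z$ and $z^{2}$ can be illustrated numerically. In this sketch the helper names are illustrative, and the 1-degree-of-freedom $\chi^{2}$ tail is computed with the complementary error function. That formula is itself an expression of the identity being described, so the comparison is a numerical illustration and not an independent proof.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def chi_square_1df_upper_tail(x):
    """P(chi-squared with 1 df >= x), written with the complementary error function."""
    return math.erfc(math.sqrt(x / 2))


n, p, pi0 = 10, 0.7, 0.8

se_score = math.sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
z_score = (p - pi0) / se_score              # score test statistic

p_two_z = 2 * standard_normal_cdf(-abs(z_score))    # two-tailed z-test p-value
p_chi2 = chi_square_1df_upper_tail(z_score ** 2)    # p-value from z squared

print(f"z = {z_score:.2f}, z^2 = {z_score ** 2:.3f}")
print(f"two-tailed z p-value = {p_two_z:.4f}, chi-squared p-value = {p_chi2:.4f}")
```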

Hypothesis Testing Using the Likelihood Ratio Method

The likelihood ratio method compares the likelihood (probability) of the observed data obtained using the proportion specified under the null hypothesis to the likelihood of the observed data obtained using the observed sample estimate. Larger discrepancies between these likelihoods indicate less agreement between the observed data and the null hypothesis, and should thus lead to a rejection of the null hypothesis.

The likelihood obtained under the null hypothesis is denoted by $L_{0}$ and the likelihood obtained using the sample estimate is denoted by $L_{1}$. The ratio $L_{0} / L_{1}$ represents the likelihood ratio. If $L_{1}$ (the likelihood obtained from the observed data) is much larger than $L_{0}$ (the likelihood under $H_{0}$ ), the likelihood ratio will be much smaller than 1 and will indicate that the data provide evidence against the null hypothesis. The likelihood ratio test statistic is obtained by taking the natural logarithm (ln) of the likelihood ratio and multiplying it by $-2$. Specifically, the test statistic is
$$
G^{2}=-2 \ln \left(\frac{L_{0}}{L_{1}}\right)=-2\left[\ln \left(L_{0}\right)-\ln \left(L_{1}\right)\right]
$$
Figure $3.3$ illustrates the natural logarithm function by plotting values of a random variable, $X$, on the horizontal axis against values of its natural logarithm, $\ln (X)$, on the vertical axis. Note that the natural log of $X$ will be negative when the value of $X$ is less than 1, positive when the value of $X$ is greater than 1, and 0 when the value of $X$ is equal to 1. Therefore, when the two likelihoods ($L_{0}$ and $L_{1}$) are equivalent, the likelihood ratio will be 1 and the $G^{2}$ test statistic will be 0. As the likelihood computed from the data $\left(L_{1}\right)$ becomes larger relative to the likelihood under the null hypothesis $\left(L_{0}\right)$, the likelihood ratio will become smaller than 1, its (natural) log will become more negative, and the test statistic will become more positive. Thus, a larger (more positive) $G^{2}$ test statistic indicates stronger evidence against $H_{0}$, as is typically the case with test statistics.

In fact, under $H_{0}$ and with reasonably large samples, the $G^{2}$ test statistic follows a $\chi^{2}$ distribution with degrees of freedom $(d f)$ equal to the number of parameters restricted under $H_{0}$ (i.e., $d f=1$ in the case of a single proportion). Because the $\chi^{2}$ distribution consists of squared (i.e., positive) values, it can only be used to test two-tailed hypotheses. In other words, the $p$-value obtained from this test is based on a two-tailed alternative.

For our example, with $n=10$ and $k=7$, $L_{0}$ is the likelihood of the observed data, $P(Y=7)$, computed using the binomial distribution with the probability parameter $(\pi)$ specified under the null hypothesis $\left(H_{0}: \pi=0.8\right)$:
$$
L_{0}=P(Y=7)=\left(\begin{array}{c}
10 \\
7
\end{array}\right) 0.8^{7}(1-0.8)^{(10-7)}=0.201 .
$$
Similarly, $L_{1}$ is the likelihood of the observed data given the data-based estimate (of $0.7$) for the probability parameter:
$$
L_{1}=P(Y=7)=\left(\begin{array}{c}
10 \\
7
\end{array}\right) 0.7^{7}(1-0.7)^{(10-7)}=0.267 .
$$
Thus, the test statistic is
$$
G^{2}=-2 \ln \left(\frac{L_{0}}{L_{1}}\right)=-2 \ln (0.201 / 0.267)=-2 \ln (0.753)=0.567
$$
The critical value of a $\chi^{2}$ distribution with 1 degree of freedom at the $0.05$ significance level is $3.84$ (see Appendix), so this test statistic does not exceed the critical value and the null hypothesis is not rejected using this two-tailed test. We can also obtain the $p$-value of this test using a $\chi^{2}$ calculator or various software programs: $P\left(\chi_{1}^{2} \geq 0.567\right)=0.45$.
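The likelihood ratio computation can be sketched in Python as below. The function name `binomial_likelihood` is illustrative, and the $\chi^{2}$ tail with 1 degree of freedom is obtained from the complementary error function. Because the code keeps full precision for $L_{0}$ and $L_{1}$, it gives $G^{2} \approx 0.563$, slightly below the 0.567 obtained from the rounded likelihoods; the $p$-value still rounds to 0.45.

```python
import math


def binomial_likelihood(k, n, pi):
    """Binomial probability of k successes in n trials with success probability pi."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)


n, k, pi0 = 10, 7, 0.8

l0 = binomial_likelihood(k, n, pi0)      # likelihood under H0
l1 = binomial_likelihood(k, n, k / n)    # likelihood at the sample proportion
g2 = -2 * math.log(l0 / l1)              # likelihood ratio test statistic

# Upper-tail probability of a chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(g2 / 2))

print(f"L0 = {l0:.3f}, L1 = {l1:.3f}, G^2 = {g2:.3f}, p = {p_value:.2f}")
```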

Summary of Test Results

We discussed several approaches to null hypothesis testing for a single proportion: the binomial (exact) test, the Wald test, the score test, and the likelihood ratio test. The exact test is typically used for small samples, when the normal approximation may not be valid. For large samples, the Wald and score tests differ only in how they compute the standard error, and the likelihood ratio test is generally considered more accurate than either the Wald or score test (this will be discussed further in Chapter 6). A summary of the test results for our example using these various approaches is presented in Table $3.2$, and the computing section at the end of this chapter (Section 3.5) shows how to obtain the results using computer software.



Proportions, Estimation, and Goodness-of-Fit

Maximum Likelihood Estimation: A Single Proportion

In estimating a population parameter (e.g., a population proportion), we use information from the sample to compute a statistic (e.g., a sample proportion) that optimally represents the parameter in some way. The term maximum likelihood estimate refers to the value of the parameter that is most probable, given the sample data, according to the appropriate underlying probability distribution.

To demonstrate this estimation procedure with a computationally simple example, suppose that we select a random sample of 10 students from the population of all students in the United States and record whether each student is proficient (a “success”, in the terminology of Chapter 2) or not proficient in mathematics. Here the proficiency outcome for each student is a Bernoulli trial, and there are $n=10$ such trials, so the appropriate underlying distribution for this process is the binomial. Recall (from Chapter 2) that the binomial probability of $k$ successes in $n$ independent “trials” is computed as
$$
P(Y=k)=\left(\begin{array}{l}
n \\
k
\end{array}\right) \pi^{k}(1-\pi)^{(n-k)} . \tag{3.1}
$$
Using Equation 3.1, suppose that in our example 4 of the 10 students were proficient in mathematics. The probability would thus be computed by substituting $n=10$ and $k=4$ into Equation 3.1, so
$$
P(Y=4)=\left(\begin{array}{c}
10 \\
4
\end{array}\right) \pi^{4}(1-\pi)^{(10-4)} .
$$
We can now evaluate this probability using different values of $\pi$, and the maximum likelihood estimate is the value of $\pi$ at which the probability (likelihood) is highest (maximized). For example, if $\pi=0.3$, the probability of 4 (out of the 10) students being proficient is
$$
\begin{aligned}
P(Y=4) &=\left(\begin{array}{c}
10 \\
4
\end{array}\right)(0.3)^{4}(1-0.3)^{(10-4)} \\
&=\frac{(10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1)}{(4 \times 3 \times 2 \times 1)(6 \times 5 \times 4 \times 3 \times 2 \times 1)}(0.3)^{4}(0.7)^{6} \\
&=210(0.0081)(0.1176)=0.20 .
\end{aligned}
$$
Similarly, if $\pi=0.4$, the probability is
$$
P(Y=4)=\left(\begin{array}{c}
10 \\
4
\end{array}\right)(0.4)^{4}(1-0.4)^{(10-4)}=210(0.4)^{4}(0.6)^{6}=0.25 .
$$
The probabilities for the full range of possible $\pi$ values are shown in Figure 3.1, which demonstrates that the value of $\pi$ that maximizes the probability (or likelihood) in our example is $0.40$. This means that the value of $0.40$ is an ideal estimate of $\pi$ in the sense that it is most probable, or likely, given the observed data. In fact, the maximum likelihood estimate of a proportion is equal to the sample proportion, computed as $p=k / n=4 / 10=0.40$.

In general, the maximum likelihood estimation method is an approach to obtaining sample estimates that is useful in a variety of contexts as well as in cases where a simple computation does not necessarily provide an ideal estimate. We will use the concept of maximum likelihood estimation throughout this book.
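The idea of evaluating the likelihood over a range of $\pi$ values and picking the largest can be sketched with a simple grid search. This is only an illustration: the grid spacing of 0.01 and the function name `binomial_probability` are choices made for this sketch, and a grid search is a stand-in for the exact calculus-based maximization.

```python
import math


def binomial_probability(k, n, pi):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)


n, k = 10, 4   # 4 of the 10 sampled students are proficient

# Candidate values of pi: 0.00, 0.01, ..., 1.00
grid = [i / 100 for i in range(101)]
likelihoods = [binomial_probability(k, n, pi) for pi in grid]

# The maximum likelihood estimate is the grid value with the largest likelihood
mle = grid[likelihoods.index(max(likelihoods))]

print(f"P(Y=4 | pi=0.3) = {binomial_probability(k, n, 0.3):.2f}")
print(f"P(Y=4 | pi=0.4) = {binomial_probability(k, n, 0.4):.2f}")
print(f"Grid-search maximum likelihood estimate: {mle}")
```

The grid maximum coincides with the sample proportion $k/n = 0.4$, as the text states.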

So far, we have discussed the concept of maximum likelihood estimation and shown that we can use the sample proportion, $p=k / n$, to obtain the maximum likelihood estimate (MLE) of the population proportion, $\pi$. This is akin to what is done with more familiar parameters, such as the population mean, where the MLE is the sample mean and it is assumed that responses follow an underlying normal distribution. The inferential step, in the case of the population mean, involves testing whether the sample mean differs from what it is hypothesized to be in the population and constructing a confidence interval for the value of the population mean based on its sample estimate. Similarly, in our example we can infer whether the proportion of students found to be proficient in mathematics in the sample differs from the proportion of students hypothesized to be proficient in mathematics in the population. We can also construct a confidence interval for the proportion of students proficient in mathematics in the population based on the estimate obtained from the sample. We now turn to a discussion of inferential procedures for a proportion.

Hypothesis Testing for a Single Proportion

In testing a null hypothesis for a single population mean, where the variable of interest is continuous, a test statistic is constructed and evaluated against the probabilities of the normal distribution. In the case of testing a null hypothesis for a single population proportion, however, the variable of interest is discrete and several hypothesis-testing methods are available. We will discuss methods that use the probabilities from the binomial distribution as well as methods that use a continuous distribution to approximate the binomial distribution. Computer programs and output for illustrative examples are provided at the end of the chapter.

Hypothesis Testing Using the Binomial Distribution

Recall from Chapter 2 that the probability of any dichotomous outcome (i.e., the number of successes, $k$ ) can be computed using the binomial probability distribution. For example, suppose that a goal for the rate of mathematics proficiency is set at $80 \%$, and in a random sample of 10 students $70 \%$ of students were found to be proficient in mathematics. In this case, we may wish to test whether the proportion of students who are proficient in mathematics in the population is significantly different than the goal of $80 \%$. In other words, we would like to know whether our obtained sample proportion of $0.7$ is significantly lower than $0.8$. To do so, we would test the null hypothesis $H_{0}: \pi=0.8$ against the (one-sided, in this case) alternative $H_{1}: \pi<0.8$.

In this example, using our sample of $n=10$ students, the probability of each outcome $(k=0,1, \ldots, 10)$ can be computed under the null hypothesis (where $\pi=0.8$) using the binomial distribution:
$$
P(Y=k)=\left(\begin{array}{c}
10 \\
k
\end{array}\right) 0.8^{k}(1-0.8)^{(10-k)} .
$$
The resulting probabilities (which make up the null distribution) are shown in Table $3.1$ and Figure 3.2. Using the conventional significance level of $\alpha=0.05$, any result that is in the lowest $5 \%$ of the null distribution would lead to rejection of $H_{0}$. From the cumulative probabilities in Table 3.1, which indicate the sum of the probabilities up to and including a given value of $k$, we can see that the lowest $5 \%$ of the distribution consists of the $k$ values 0 through 5. For values of $k$ above 5, the cumulative probability is greater than $5 \%$. Because our sample result of $p=0.7$ translates to $k=7$ when $n=10$, we can see that this result is not in the lowest $5 \%$ of the distribution and does not provide sufficient evidence for rejecting $H_{0}$. In other words, our result is not sufficiently unusual under the null distribution and we cannot reject the null hypothesis that $\pi=0.8$. To put it another way, the sample result of $p=0.7$ is sufficiently consistent (or not inconsistent) with the notion that $80 \%$ of the students in the population (represented by the sample) are indeed proficient in mathematics. On the other hand, if we had obtained a sample proportion of $p=0.5$ (i.e., $k=5$), our result would have been in the lowest $5 \%$ of the distribution and we would have rejected the null hypothesis that $\pi=0.8$. In this case we would have concluded that, based on our sample proportion, it would be unlikely that $80 \%$ of the students in the population are proficient in mathematics.
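The null distribution and its cumulative probabilities can be tabulated directly. The sketch below is illustrative and does not reproduce Table 3.1 itself; the names `probabilities`, `cumulative`, and `rejection_region` are chosen here for clarity.

```python
import math
from itertools import accumulate


def binomial_probability(k, n, pi):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)


n, pi0, alpha = 10, 0.8, 0.05

# Null distribution: P(Y = k) for k = 0, ..., n when pi = 0.8
probabilities = [binomial_probability(k, n, pi0) for k in range(n + 1)]
# Cumulative probabilities: P(Y <= k)
cumulative = list(accumulate(probabilities))

# Outcomes in the lowest 5% of the null distribution
rejection_region = [k for k in range(n + 1) if cumulative[k] <= alpha]

for k in range(n + 1):
    print(f"k = {k:2d}  P(Y = k) = {probabilities[k]:.4f}  P(Y <= k) = {cumulative[k]:.4f}")
print("Rejection region:", rejection_region)
```

The rejection region comes out as $k = 0$ through $5$, so the observed $k=7$ falls outside it while $k=5$ falls inside.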

We can also compute $p$-values for these tests using the null distribution probabilities and observing that, if the null hypothesis were true, the lower-tailed probability of obtaining a result at least as extreme as the sample result of $p=0.7$ (or $k=7$ ) is
$$
P(Y=0)+P(Y=1)+P(Y=2)+\cdots+P(Y=7)=0.322 \text {, }
$$
which is also the cumulative probability (see Table 3.1) corresponding to $k=7$. To conduct a two-tailed test, in which the alternative hypothesis is $H_{1}: \pi \neq 0.8$, the one-tailed $p$-value would typically be doubled. In our example, the two-tailed $p$-value would thus be $2(0.322)=0.644$.
Note that if only $50 \%$ of the students in our sample were found to be proficient in mathematics, then the lower-tailed probability of obtaining a result at least as extreme as the sample result of $p=0.5$ (or $k=5$ ) would be
$$
P(Y=0)+P(Y=1)+P(Y=2)+\cdots+P(Y=5)=0.032,
$$
which is also the cumulative probability (see Table 3.1) corresponding to $k=5$. Alternatively, the two-tailed $p$-value for this result would be $2(0.032)=0.064$.
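The exact $p$-values are lower-tail sums of binomial probabilities, which a short Python sketch can reproduce. The function names are illustrative. The two-tailed value follows the doubling convention described above, capped at 1 as a safeguard that is an addition of this sketch.

```python
import math


def binomial_probability(k, n, pi):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)


def lower_tail_p_value(k, n, pi0):
    """P(Y <= k) under H0: the one-tailed (lower) exact p-value."""
    return sum(binomial_probability(j, n, pi0) for j in range(k + 1))


n, pi0 = 10, 0.8

p_k7 = lower_tail_p_value(7, n, pi0)
p_k5 = lower_tail_p_value(5, n, pi0)

# Two-tailed p-values by doubling (capped at 1)
two_tailed_k7 = min(1.0, 2 * p_k7)
two_tailed_k5 = min(1.0, 2 * p_k5)

print(f"k = 7: one-tailed = {p_k7:.4f}, two-tailed = {two_tailed_k7:.4f}")
print(f"k = 5: one-tailed = {p_k5:.4f}, two-tailed = {two_tailed_k5:.4f}")
```

For $k=7$ this gives about 0.3222 and 0.6444. For $k=5$ the lower-tail sum evaluates to about 0.0328, which the text reports as 0.032.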

There are two main drawbacks to using this method for hypothesis testing. First, if the number of observations (or trials) is large, the procedure requires computing and summing a large number of probabilities. In such a case, approximate methods work just as well, and these are discussed in the following sections. Second, the $p$-values obtained from this method are typically a bit too high, and the test is thus overly conservative; this means that when the significance level is set at $0.05$ and the null hypothesis is true, the null hypothesis is rejected not $5 \%$ of the time (as would be expected) but less than $5 \%$ of the time (Agresti, 2007). Methods that adjust the $p$-value so that it is more accurate are beyond the scope of this book but are discussed in Agresti (2007) as well as Agresti and Coull (1998).
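The conservativeness of the exact test can be illustrated by computing how often a lower-tailed test at $\alpha=0.05$ would actually reject a true null hypothesis. The sketch below is an illustration for the one-sided case only; the function name `actual_rejection_rate` and the extra sample sizes 20 and 50 are hypothetical choices, not values from the text.

```python
import math


def binomial_probability(k, n, pi):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * pi ** k * (1 - pi) ** (n - k)


def actual_rejection_rate(n, pi0, alpha=0.05):
    """Probability, when H0 is true, that the lower-tailed exact test rejects.

    The test rejects for every k whose cumulative probability P(Y <= k) is at
    most alpha, so the actual rate is the largest such cumulative probability.
    """
    rate = 0.0
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binomial_probability(k, n, pi0)
        if cumulative > alpha:
            break
        rate = cumulative   # k still belongs to the rejection region
    return rate


for n in (10, 20, 50):
    print(f"n = {n:2d}: actual rejection rate = {actual_rejection_rate(n, 0.8):.4f}")
```

For $n=10$ the rejection region is $k \leq 5$, so the actual rate is about 0.033 instead of the nominal 0.05. Because the distribution is discrete, the rate can never exceed the nominal level.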

属性数据分析

统计代写|属性数据分析作业代写analysis of categorical data代考|Maximum Likelihood Estimation: A Single Proportion

在估计总体参数（例如，总体比例）时，我们使用来自样本的信息来计算以某种方式最佳地表示参数的统计量（例如，样本比例）。术语最大似然估计是指在给定样本数据的情况下，根据适当的潜在概率分布最可能的参数值。

为了用一个计算简单的例子来演示这个估计过程，假设我们从美国所有学生中随机抽取 10 名学生作为样本，并记录每个学生是否精通（第 2 章的术语中的“成功”） ) 或不精通数学。这里每个学生的熟练程度结果是一个伯努利试验，并且有n=10这样的试验，所以适当的基础分布

因为这个过程是二项式的。回想一下（从第 2 章中），到成功n独立的“试验”计算为
磷(是=到)=(n 到)圆周率到(1−圆周率)(n−到).
使用公式 3.1，假设在我们的示例中，10 名学生中有 4 名精通数学。因此，概率将通过代入来计算n=10和到=4进入方程 3.1，所以
磷(是=4)=(10 4)圆周率4(1−圆周率)(10−4)
我们现在可以使用不同的值来评估这个概率圆周率，最大似然估计是圆周率概率（可能性）最高（最大化）。例如，如果圆周率=0.3，有 4 名（10 名）学生精通的概率为
磷(是=4)=(10 4)(0.3)4(1−0.3)(10−4) =(10×9×8×7×6×5×4×3×2×1)(4×3×2×1)(6×5×4×3×2×1)(0.3)4(0.7)6 =210(0.0081)(0.1176)=0.20.
同样，如果圆周率=0.4，概率为
磷(是=4)=(10 4)(0.4)4(1−0.4)(10−4)=210(0.4)4(0.6)6=0.25.
全部可能的概率圆周率值如图 3.1 所示，这表明圆周率在我们的例子中最大化概率（或可能性）

是0.40. 这意味着0.40是一个理想的估计圆周率从某种意义上说，鉴于观察到的数据，这是最有可能的。事实上，一个比例的最大似然估计等于样本比例，计算为p=到/n=4/10=0.40.
一般来说，最大似然估计方法是一种获得样本估计的方法，该方法在各种情况下以及在简单计算不一定提供理想估计的情况下都很有用。我们将在本书中使用最大似然估计的概念。

到目前为止，我们已经讨论了最大似然估计的概念，并表明我们可以使用样本比例，p=到/n，以获得总体比例的最大似然估计（MLE），圆周率. 这类似于使用更熟悉的参数（例如总体均值）所做的事情，其中 MLE 是样本均值，并且假设响应遵循潜在的正态分布。在总体均值的情况下，推断步骤涉及测试样本均值是否与总体中的假设值不同，并根据其样本估计构建总体均值的置信区间。同样，在我们的例子中，我们可以推断出样本中被认为精通数学的学生比例是否不同于总体中被假设为精通数学的学生比例。我们还可以根据从样本中获得的估计，为总体中精通数学的学生的比例构建置信区间。我们现在转向讨论比例的推理过程。

统计代写|属性数据分析作业代写analysis of categorical data代考|Hypothesis Testing for a Single Proportion

在对单个总体均值进行零假设检验时，其中感兴趣的变量是连续的，将构建检验统计量并根据正态分布的概率进行评估。然而，在为单一总体比例检验零假设的情况下，感兴趣的变量是离散的，并且有几种假设检验方法可用。我们将讨论使用二项式分布概率的方法以及使用连续分布来近似二项式分布的方法。本章末尾提供了用于说明性示例的计算机程序和输出。

统计代写|属性数据分析作业代写analysis of categorical data代考|Hypothesis Testing Using the Binomial Distribution

回想一下第 2 章中任何二分结果的概率（即成功的次数，到) 可以使用二项式概率分布来计算。例如，假设数学熟练程度的目标设定为80%，并且在 10 个学生的随机样本中70%的学生被发现精通数学。在这种情况下，我们不妨检验一下，数学精通的学生在人群中的比例是否与目标有显着差异。80%. 换句话说，我们想知道我们获得的样本比例是否0.7明显低于0.8. 为此，我们将检验原假设H0:圆周率=0.8反对（单方面，在这种情况下）替代方案H1:圆周率<0.8.

在这个例子中，使用我们的样本n=10学生，每个结果的概率(到=0,1,…,10)可以在零假设下计算（其中圆周率=0.8)使用二项分布：
磷(是=到)=(10 到)0.8到(1−0.8)(10−到)结果概率（构成零分布）如表所示3.1和图 3.2。使用常规显着性水平一种=0.05, 任何最低的结果5%零分布将导致拒绝H0. 来自表 3.1 中的累积概率，它表示直到并包括给定值的概率总和到，我们可以看到最低5%的分布包括到值 0 到 5 。对于值到大于 5 ，累积概率大于5%. 因为我们的样本结果p=0.7翻译成到=7什么时候n=10，我们可以看到这个结果并不是最低的5%的分布，并没有提供足够的证据拒绝H0. 换句话说，我们的结果在零分布下不是足够不寻常的，我们不能拒绝零假设圆周率=0.8. 换句话说，样本结果p=0.7与以下概念充分一致（或不不一致）80%人口中的学生（由样本代表）确实精通数学。另一方面，如果我们获得了样本比例p=0.5（IE，到=5)，我们的结果将是最低的5%的分布，我们会拒绝原假设

圆周率=0.8. 在这种情况下，我们会得出结论，根据我们的样本比例，不太可能80%人口中的学生精通数学。

我们还可以计算p- 使用零分布概率的这些检验的值，并观察到，如果零假设为真，则获得至少与样本结果一样极端的结果的下尾概率p=0.7（或者到=7）是
磷(是=0)+磷(是=1)+磷(是=2)+⋯+磷(是=7)=0.322,
这也是对应于的累积概率（见表 3.1）到=7. 进行双尾检验，其中备择假设是H1:圆周率≠0.8, 单尾p-value 通常会翻倍。在我们的示例中，双尾p-value 因此将是2(0.322)=0.644.
请注意，如果只有50%我们样本中的学生被发现精通数学，那么获得结果的低尾概率至少与样本结果一样极端p=0.5（或者到=5）将会
磷(是=0)+磷(是=1)+磷(是=2)+⋯+磷(是=5)=0.032,
这也是对应于的累积概率（见表 3.1）到=5. 或者，双尾p- 这个结果的值是2(0.032)=0.064.

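There are two main drawbacks to using this method for hypothesis testing. First, if the number of observations (or trials) is large, the procedure requires computing and summing a large number of probabilities. In such cases, approximate methods work just as well, and these are discussed in the following sections. Second, the $p$-values obtained from this method are typically somewhat too large, so the test is overly conservative; this means that when the significance level is set at 0.05 and the null hypothesis is true, the null hypothesis is rejected not 5% of the time (as would be expected) but less than 5% of the time (Agresti, 2007). Methods for adjusting the $p$-value to make it more accurate are beyond the scope of this book but are discussed in Agresti (2007) as well as Agresti and Coull (1998).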
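The conservativeness of the exact test can be seen directly in the running example ($n=10$, $\pi=0.8$, lower-tailed test at the 0.05 level). The sketch below finds the rejection region and the probability of falling in it when the null hypothesis is true; the helper name `binom_cdf` is an assumption made for illustration.

```python
from math import comb

def binom_cdf(k, n, pi):
    """Cumulative binomial probability P(Y <= k)."""
    return sum(comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

n, pi0, alpha = 10, 0.8, 0.05

# Largest k whose lower-tail probability under H0 does not exceed alpha;
# the test rejects H0 for any observed count at or below this value.
critical_k = max(k for k in range(n + 1) if binom_cdf(k, n, pi0) <= alpha)

# Probability of rejecting a true H0 (the actual size of the test).
actual_size = binom_cdf(critical_k, n, pi0)

print(critical_k, round(actual_size, 4))
```

Because the counts are discrete, the actual rejection rate cannot be set to exactly 0.05; here it falls below the nominal level.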


统计代写|属性数据分析作业代写analysis of categorical data代考|The Poisson Distribution

Posted on 2022年4月13日2022年4月13日 by statistics-lab


统计代写|属性数据分析作业代写analysis of categorical data代考|The Poisson Distribution

The Poisson distribution is similar to the binomial distribution in that both distributions are used to model count data that vary randomly over time. In fact, as the number of trials gets larger (i.e., $n \rightarrow \infty$) while the expected number of successes, $n \pi$, remains fixed (so that the probability of success on any single trial becomes small), the binomial distribution converges to the Poisson distribution. The major difference between the binomial and the Poisson distributions is that for the binomial distribution the number of observations (trials) is fixed, whereas for the Poisson distribution the number of observations is not fixed but rather the period of time in which the observations occur must be fixed. In other words, all eligible cases are studied when categorical data arise from the binomial distribution, whereas only those cases with a particular outcome in a fixed time interval are studied when data arise from the Poisson distribution.

For example, suppose that a researcher was interested in studying the number of accidents at a particular highway interchange in a 24-hour period. To study this phenomenon using the binomial distribution, the researcher would have to know the total number of cars (i.e., $n$) that had traveled through the particular interchange in a 24-hour period as well as the number of cars that had been involved in an accident at the particular interchange in this 24-hour period (which, when divided by $n$, provides $\pi$ or $p$). This is because the binomial distribution assumes that there are two possible outcomes to the phenomenon: success or failure. On the other hand, to study this phenomenon using the Poisson distribution, the researcher would only need to know the mean number of cars that had been involved in an accident at the particular interchange in a 24-hour period, which is arguably much easier to obtain. For this reason, the Poisson distribution is often used when the probability of success is very small.

In general, if $\lambda$ equals the number of successes expected to occur in a fixed interval of time, then the probability of observing $k$ successes in that time interval can be expressed by
$$
P(Y=k)=\frac{e^{-\lambda} \lambda^{k}}{k !}
$$
For example, suppose that over the last 50 years the average number of suicide attempts that occurred at a particular correctional facility each year is approximately 2.5. The Poisson distribution can be used to determine the probability that five suicide attempts will occur at this particular correctional facility in the next year. Specifically,
$$
P(Y=5)=\frac{e^{-\lambda} \lambda^{5}}{5 !}=\frac{e^{-2.5} 2.5^{5}}{5 !}=\frac{(0.082)(2.5)^{5}}{5(4)(3)(2)(1)}=0.067 .
$$
Similarly, the probabilities for a variety of outcomes can be examined to determine the most likely number of suicide attempts that will occur at this particular correctional facility in the next year. For example,
$$
\begin{aligned}
P(1 \text { suicide attempt will occur }) &=\frac{e^{-2.5} 2.5^{1}}{1 !}=\frac{(0.082)(2.5)}{1}=0.205, \\
P(2 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{2}}{2 !}=\frac{(0.082)(2.5)^{2}}{2(1)}=0.256, \\
P(3 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{3}}{3 !}=\frac{(0.082)(2.5)^{3}}{3(2)(1)}=0.214, \\
P(4 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{4}}{4 !}=\frac{(0.082)(2.5)^{4}}{4(3)(2)(1)}=0.134, \\
P(5 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{5}}{5 !}=\frac{(0.082)(2.5)^{5}}{5(4)(3)(2)(1)}=0.067, \\
P(6 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{6}}{6 !}=\frac{(0.082)(2.5)^{6}}{6(5)(4)(3)(2)(1)}=0.028, \\
P(7 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{7}}{7 !}=\frac{(0.082)(2.5)^{7}}{7(6)(5)(4)(3)(2)(1)}=0.010, \\
P(8 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{8}}{8 !}=\frac{(0.082)(2.5)^{8}}{8(7)(6)(5)(4)(3)(2)(1)}=0.003, \text { and } \\
P(9 \text { suicide attempts will occur }) &=\frac{e^{-2.5} 2.5^{9}}{9 !}=\frac{(0.082)(2.5)^{9}}{9(8)(7)(6)(5)(4)(3)(2)(1)}<0.001 .
\end{aligned}
$$
Figure 2.1 depicts this distribution graphically, and Figure 2.2 depicts a comparable distribution if the average number of suicide attempts in a year had been equal to only 1. Comparing the two figures, note that the expected (i.e., average) number of suicide attempts in any given year at this particular correctional facility, assuming that the number of attempts follows the Poisson distribution, is greater in Figure 2.1 than in Figure 2.2, as would be expected. Moreover, the likelihood of a high number of suicide attempts (e.g., four or more) is much lower in Figure 2.2 than in Figure 2.1, as would be expected.

Figure 2.3 illustrates a comparable distribution where the average number of suicide attempts in a year is equal to 5. Notice that Figure 2.3 is somewhat reminiscent of a normal distribution. In fact, as $\lambda$ gets larger, the Poisson distribution tends to more closely resemble the normal distribution and, for sufficiently large $\lambda$, the normal distribution is an excellent approximation to the Poisson distribution (Cheng, 1949).
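The Poisson probabilities for $\lambda=2.5$ can also be computed directly from the formula. The following is a minimal sketch using only the standard library; the function name `poisson_pmf` is an assumption made for illustration. Since it does not round $e^{-2.5}$ to 0.082 before multiplying, the third decimal place can differ slightly from a hand calculation.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected in the interval."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.5  # average number of events per year, as in the example

# Probabilities for 0 through 9 events in the next year.
probs = {k: poisson_pmf(k, lam) for k in range(10)}
for k, p in probs.items():
    print(k, round(p, 3))

# The most likely count is the one with the largest probability.
most_likely = max(probs, key=probs.get)
print("most likely count:", most_likely)
```

Changing `lam` to 1 or 5 gives the distributions that Figures 2.2 and 2.3 are described as showing.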

统计代写|属性数据分析作业代写analysis of categorical data代考|Summary

In this chapter, we discussed various probability distributions that are often used to model discrete random variables. The distribution that is most appropriate for a given situation depends on the random process that is modeled and the parameters that are needed or available in that situation. A summary of the distributions we discussed is provided in Table 2.3.

In the next chapter, we make use of some of these distributions for inferential procedures; specifically, we will discuss how to estimate and test hypotheses about a population proportion based on the information obtained from a random sample.

统计代写|属性数据分析作业代写analysis of categorical data代考|Problems

2.1 A divorce lawyer must choose 5 out of 25 people to sit on the jury that is to help decide how much alimony should be paid to his client, the ex-husband of a wealthy business woman. As luck would have it, 12 of the possible candidates are very bitter about having to pay alimony to their ex-spouses. If the lawyer were to choose jury members at random, what is the probability that none of the five jury members he chooses are bitter about having to pay alimony?

2.2 At Learn More School, 15 of the 20 students in second grade are proficient in reading.
a. If the principal of the school were to randomly select two second-grade students to represent the school in a poetry reading contest, what is the probability that both of the students chosen will be proficient in reading?
b. What is the probability that only one of the two students selected will be proficient in reading?
c. If two students are selected, what is the expected number of students that are proficient in reading?

2.3 Suppose there are 48 Republican senators and 52 Democrat senators in the United States Senate and the president of the United States must appoint a special committee of 6 senators to study the issues related to poverty in the United States. If the special committee is appointed by randomly selecting senators, what is the probability that half of the committee consists of Republican senators and half of the committee consists of Democrat senators?

2.4 The CEO of a toy company would like to hire a vice president of sales and marketing. Only 2 of the 5 qualified applicants are female, and the CEO would really like to hire a female VP if at all possible to increase the diversity of his administrative cabinet. If he randomly chooses an applicant from the pool, what is the probability that the applicant chosen will be a female?

2.5 Suppose that the principal of Learn More School from Problem 2.2 is only able to choose one second-grade student to represent the school in a poetry contest. If she randomly selects a student, what is the probability that the student will be proficient in reading?

2.6 Researchers at the Food Institute have determined that $67 \%$ of women tend to crave sweets over other alternatives. If 10 women are randomly sampled from across the country, what is the probability that only 3 of the women sampled will report craving sweets over other alternatives?

2.7 For a multiple-choice test item with four response options, the probability of obtaining the correct answer by simply guessing is 0.25. If a student simply guessed on all 20 items in a multiple-choice test:
a. What is the probability that the student would obtain the correct answers to 15 of the 20 items?
b. What is the expected number of items the student would answer correctly?

2.8 The probability that an entering college freshman will obtain his or her degree in four years is 0.4. What is the probability that at least one out of five admitted freshmen will graduate in four years?

2.9 An owner of a boutique store knows that $45 \%$ of the customers who enter her store will make purchases that total less than \$200, $15 \%$ of the customers will make purchases that total more than \$200, and $40 \%$ of the customers will simply be browsing. If five customers enter her store on a particular afternoon, what is the probability that exactly two customers will make a purchase that totals less than \$200 and exactly one customer will make a purchase that totals more than \$200?

2.10 On average, 10 people enter a particular bookstore every 5 minutes.
a. What is the probability that only four people enter the bookstore in a 5-minute interval?
b. What is the probability that eight people enter the bookstore in a 5-minute interval?

2.11 Telephone calls are received by a college switchboard at the rate of four calls every 3 minutes. What is the probability of obtaining five calls in a 3-minute interval?

2.12 Provide a substantive illustration of a situation that would require the use of each of the five probability distributions described in this chapter.



统计代写|属性数据分析作业代写analysis of categorical data代考|The Bernoulli Distribution

Posted on 2022年4月13日2022年4月13日 by statistics-lab


统计代写|属性数据分析作业代写analysis of categorical data代考|The Bernoulli Distribution

The Bernoulli distribution is perhaps the simplest probability distribution for discrete variables and can be thought of as the building block for more complicated probability distributions used with categorical data. A discrete variable that comes from a Bernoulli distribution can only take on one of two values, such as pass/fail or proficient/not proficient. This distribution can be used to determine the probability of success (i.e., the outcome of interest) if only one draw is made from a finite population. Therefore, this distribution is a special case of the hypergeometric distribution with $n=1$.

Using the example presented earlier, if $n=1$ then only one applicant is to be selected for admission. Because this candidate either is or is not a minority applicant, the number of successes, $k$, can only be 0 or 1. In this case, to compute the probability of a success $(k=1)$, Equation 2.1 can be simplified as follows:
$$
P(Y=k=1)=\frac{\left(\begin{array}{c}
m \\
k
\end{array}\right)\left(\begin{array}{c}
N-m \\
n-k
\end{array}\right)}{\left(\begin{array}{c}
N \\
n
\end{array}\right)}=\frac{\left(\begin{array}{c}
m \\
1
\end{array}\right)\left(\begin{array}{c}
N-m \\
0
\end{array}\right)}{\left(\begin{array}{c}
N \\
1
\end{array}\right)}=\frac{\left(\frac{m !}{(m-1) !(1 !)}\right)\left(\frac{(N-m) !}{(N-m) !(0 !)}\right)}{\left(\frac{N !}{(N-1) !(1 !)}\right)}=\frac{m}{N} .
$$
Therefore, the probability of selecting a minority applicant from the pool, which can be thought of as the probability of a single success, is simply the number of minority candidates divided by the total number of candidates. In other words, it simply represents the proportion of minority candidates in the candidate pool. This probability is typically denoted by $\pi$ in the population (or $p$ in the sample), and is equal to $\pi=3 / 15=0.2$ in this example. Using the laws of probability, the probability of not selecting a minority candidate from the candidate pool can be obtained by $1-\pi$, which in our example is $1-0.2=0.8$. The mean, or expected value, for this distribution is simply $\mu=\pi$ and the variance for this distribution can be calculated using the formula $\sigma^{2}=\pi(1-\pi)$, which is $0.2(0.8)=0.16$ for the example presented.
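The reduction of the hypergeometric probability to $m / N$ when $n=1$ can be verified with exact fractions. This is a minimal sketch: the function name `hypergeom_pmf` is an illustrative assumption, and the values $N=15$ and $m=3$ are taken from the $\pi=3 / 15$ example above.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(k successes in n draws without replacement from N items, m of which are successes)."""
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))

N, m = 15, 3   # 15 applicants, 3 of them minority applicants

# With a single draw (n = 1), the probability of a success is m / N.
pi = hypergeom_pmf(1, N, m, n=1)

# Bernoulli mean and variance for that success probability.
mean = pi
variance = pi * (1 - pi)

print(pi, float(pi))      # success probability as a fraction and a decimal
print(float(variance))    # pi * (1 - pi)
```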

统计代写|属性数据分析作业代写analysis of categorical data代考|The Binomial Distribution

Categorical variables that follow the binomial distribution can only take on one of two possible outcomes, as is the case with variables that follow the Bernoulli distribution. However, whereas the Bernoulli distribution only deals with one trial, outcome, or event, the binomial distribution deals with multiple trials (denoted by $n$ ). Therefore, the binomial distribution can be thought of as an extension of the Bernoulli distribution and can also be considered to be akin to the hypergeometric distribution.

Like the hypergeometric distribution, the binomial distribution can be used with discrete variables when trying to determine the number of successes in a sequence of $n$ trials that are drawn from a finite population. Unlike the hypergeometric distribution, where sampling is conducted without replacement, the events or trials are independent when they follow a binomial distribution. With the hypergeometric distribution there is a dependency among the events considered because sampling is done without replacement; that is, the probability of success changes from one trial to the next depending on the earlier selections, and this change can be dramatic if the population is small. With the binomial distribution, the probability of success does not change in subsequent trials because the probability of success for each trial is independent of previous or subsequent trials; in other words, sampling is conducted with replacement.

For example, to draw 2 applicants from a total of 15, sampling without replacement proceeds such that once the first applicant has been selected for admission there are only 14 remaining applicants from which to select the second admission. Therefore, the probability of success for the second selection differs from the probability of success for the first selection. On the other hand, sampling with replacement proceeds such that each selection is considered to be drawn from the pool of all 15 individuals. In other words, the first selection is not removed from the total sample before the second selection is made, and each selection has the same probability, thereby making each selection independent of all other selections. In this case, it is possible that the same applicant will be selected on both draws, which is conceptually nonsensical. However, this is very unlikely with large populations, which is why the binomial distribution is often used with large populations even though conceptually the hypergeometric distribution may be more appropriate. In fact, the binomial distribution is an excellent approximation to the hypergeometric distribution if the size of the population is relatively large compared to the number of cases that are to be sampled from the population (Sheskin, 2007).
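The difference between sampling with and without replacement, and the way it shrinks as the population grows, can be illustrated numerically. In this sketch the pool of 15 applicants with 3 minority applicants follows the earlier example, while the larger pool of 1,500 applicants with the same 20% share is a hypothetical value chosen only for comparison; the function names are likewise assumptions.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """Sampling without replacement: k successes in n draws from N items with m successes."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

def binom_pmf(k, n, pi):
    """Sampling with replacement: k successes in n independent trials."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# Probability that both of 2 selected applicants are minority applicants.
small_pool = hypergeom_pmf(2, N=15, m=3, n=2)      # pool from the example
large_pool = hypergeom_pmf(2, N=1500, m=300, n=2)  # hypothetical larger pool
with_replacement = binom_pmf(2, n=2, pi=0.2)

print(round(small_pool, 4), round(large_pool, 4), round(with_replacement, 4))
```

With the small pool the two approaches give noticeably different answers; with the larger pool the hypergeometric value is much closer to the binomial one.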

In addition, the binomial distribution is more appropriate when selections are truly independent of each other. For example, suppose the probability that a female applicant will be admitted to an engineering program (a success) is $0.8$ across all such programs. In this case, to determine the probability that 3 out of 15 (or, more generally, $k$ out of $n$ ) engineering programs would admit a female applicant, the number of trials would be $n=15$ and the trials would be considered independent given that each program’s admission decision is not influenced by any other program’s decision.

In general, for a series of $n$ independent trials, each resulting in only one of two particular outcomes, where $\pi$ is the probability of “success” and $1-\pi$ is the probability of “failure”, the probability of observing $k$ “successes” can be expressed by:
$$
P(Y=k)=\left(\begin{array}{l}
n \\
k
\end{array}\right) \pi^{k}(1-\pi)^{(n-k)}
$$
For example, suppose that the probability of being proficient in mathematics (a success) is 0.7 and three students are chosen at random from a particular school. The binomial distribution can be used, for example, to determine the probability that of the three randomly selected students $(n=3)$, all three are proficient in mathematics $(k=3)$:
$$
P(Y=3)=\left(\begin{array}{l}
3 \\
3
\end{array}\right) 0.7^{3}(0.3)^{(3-3)}=0.7^{3}=0.343
$$

Further, the probabilities of all possible outcomes can be computed to determine the most likely number of students that will be found to be proficient. Specifically:

The probability that none of the three students is proficient
$$
=P(Y=0)=\left(\begin{array}{l}
3 \\
0
\end{array}\right) 0.7^{0}(0.3)^{(3-0)}=0.3^{3}=0.027 .
$$
The probability that one of the three students is proficient
$$
=P(Y=1)=\left(\begin{array}{l}
3 \\
1
\end{array}\right) 0.7^{1}(0.3)^{(3-1)}=3(0.7)^{1}(0.3)^{2}=0.189 .
$$
The probability that two of the three students are proficient
$$
=P(Y=2)=\left(\begin{array}{l}
3 \\
2
\end{array}\right) 0.7^{2}(0.3)^{(3-2)}=3(0.7)^{2}(0.3)^{1}=0.441 .
$$
Therefore, it is most likely that the number of mathematically proficient students will be two. Note that, since the preceding computations exhausted all possible outcomes, their probabilities should and do sum to one.

The mean and variance of the binomial distribution are expressed by $\mu=n \pi$ and $\sigma^{2}=n \pi(1-\pi)$, respectively. For our example, the mean of the distribution is $\mu=3(0.7)=2.1$, so the expected number of students found to be proficient in mathematics is 2.1, which is comparable to the information that was obtained when the probabilities were calculated directly. Moreover, the variance of the distribution is $\sigma^{2}=3(0.7)(0.3)=0.63$, and the standard deviation is $\sigma=\sqrt{0.63}=0.794$.
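The binomial probabilities, mean, and variance for the three-student example can be recomputed from the definitions. This is a minimal sketch; `binom_pmf` is an illustrative helper name, and the mean and variance are computed from the probabilities themselves rather than from the shortcut formulas so that the two can be compared.

```python
from math import comb, sqrt

def binom_pmf(k, n, pi):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

n, pi = 3, 0.7

# Probabilities of 0, 1, 2, and 3 proficient students.
pmf = [binom_pmf(k, n, pi) for k in range(n + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print([round(p, 3) for p in pmf])
print(round(sum(pmf), 3), round(mean, 2), round(variance, 2), round(sqrt(variance), 3))
```

The results agree with $n \pi$ and $n \pi(1-\pi)$ up to floating-point rounding.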

统计代写|属性数据分析作业代写analysis of categorical data代考|The Multinomial Distribution

In general, for $n$ trials with $I$ possible outcomes, where $\pi_{i}$ is the probability of the $i^{\text {th }}$ outcome, the (joint) probability that $Y_{1}=k_{1}, Y_{2}=k_{2}, \ldots$, and $Y_{I}=k_{I}$ can be expressed by
$$
P\left(Y_{1}=k_{1}, Y_{2}=k_{2}, \cdots, \text { and } Y_{I}=k_{I}\right)=\frac{n !}{k_{1} ! k_{2} ! \cdots k_{I} !} \pi_{1}^{k_{1}} \pi_{2}^{k_{2}} \cdots \pi_{I}^{k_{I}} .
$$
Note that the sum of probabilities is
$$
\sum_{i=1}^{I} \pi_{i}=1
$$
and the sum of outcome frequencies is
$$
\sum_{i=1}^{I} k_{i}=n .
$$
For example, suppose that the probability of being a minimal reader is $0.12$, the probability of being a basic reader is $0.23$, the probability of being a proficient reader is $0.47$, and the probability of being an advanced reader is $0.18$. If five students are chosen at random from a particular classroom, the multinomial distribution can be used to determine the probability that one student selected at random is a minimal reader, one student is a basic reader, two students are proficient readers, and one student is an advanced reader:
$$
\begin{aligned}
&P\left(Y_{1}=1, Y_{2}=1, Y_{3}=2, \text { and } Y_{4}=1\right) \\
&=\frac{5 !}{(1 !)(1 !)(2 !)(1 !)}(0.12)^{1}(0.23)^{1}(0.47)^{2}(0.18)^{1} \\
&=\frac{5(4)(3)(2)(1)}{2}(0.12)(0.23)(0.47)(0.47)(0.18)=0.066 .
\end{aligned}
$$
There are 56 distinct outcomes in this example, that is, combinations of category counts $k_{1}, k_{2}, k_{3}, k_{4}$ that sum to five (e.g., all five students are advanced, one student is advanced and the other four students are proficient, and so on), so we do not enumerate all possible outcomes here. Nonetheless, the multinomial distribution could be used to determine the probability of any of these outcomes in a similar manner.

There are $I$ means and variances for the multinomial distribution, each dealing with a particular outcome, $i$. Specifically, for the $i^{\text {th }}$ outcome, the mean can be expressed by $\mu_{i}=n \pi_{i}$ and the variance by $\sigma_{i}^{2}=n \pi_{i}\left(1-\pi_{i}\right)$. Table 2.2 depicts these descriptive statistics for each of the four proficiency classifications in the previous example.
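The multinomial probability for the reading example, together with the category means and variances, can be computed as follows. This is a minimal sketch using only the standard library (`math.prod` requires Python 3.8 or later); the function name `multinomial_pmf` and the variable names are illustrative assumptions.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Joint probability of the given category counts for the given category probabilities."""
    n = sum(counts)
    coef = factorial(n)
    for k in counts:
        coef //= factorial(k)   # n! / (k_1! k_2! ... k_I!), exact at every step
    return coef * prod(p**k for p, k in zip(probs, counts))

# Minimal, basic, proficient, advanced readers (probabilities from the example).
probs = [0.12, 0.23, 0.47, 0.18]
n = 5

p = multinomial_pmf([1, 1, 2, 1], probs)
means = [n * q for q in probs]                # n * pi_i
variances = [n * q * (1 - q) for q in probs]  # n * pi_i * (1 - pi_i)

print(round(p, 3))
print([round(mu, 2) for mu in means])
print([round(v, 4) for v in variances])
```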



统计代写|属性数据分析作业代写analysis of categorical data代考|Probability Distributions

Posted on 2022年4月13日2022年4月13日 by statistics-lab


统计代写|属性数据分析作业代写analysis of categorical data代考|Probability Distributions for Categorical Variables

We begin with a simple example, where we suppose that the population of interest is a fifth-grade class at a middle school consisting of 50 students: 10 females and 40 males. In this case, if a teacher randomly selected one student from the class, the teacher would be more likely to choose a male student than a female student. In fact, because the exact number of male and female students in the population is known, the exact probability of randomly selecting a male or female student can be determined. Specifically, the probability of any particular outcome is defined as the number of ways a particular outcome can occur out of the total number of possible outcomes; therefore, the probability of randomly selecting a male student in this example is $40 / 50=0.8$, and the probability of randomly selecting a female student is $10 / 50=0.2$.

However, contrary to the example with the population of fifth-grade students, it is atypical to know the exact specifications (i.e., distribution) of the population. The goal of inferential statistical procedures is to make inferences about the population from observed sample data, not the other way around. This is accomplished by considering a value that is obtained as the result of some experiment or data collection activity to be only one possible outcome out of many different outcomes that may have occurred; that is, this value is a variable because it can vary across different studies or experiments. For example, if 10 students were randomly selected from the fifth grade discussed earlier, the proportion of males in that sample of 10 students could be used to infer the proportion of males in the larger group or population (of all 50 students). If another random sample of 10 students was obtained, the proportion of males may not be equal to the proportion in the first sample; in that sense, the proportion of males is a variable. Random variable is a term that is used to describe the possible outcomes that a particular variable may take on. It does not describe the actual outcome itself and cannot be assigned a value, but is rather used to convey the fact that the outcome obtained was the result of some underlying random process. A probability distribution is a mathematical function that links the actual outcome obtained from the result of an experiment or data collection activity (e.g., a random sample) to the probability of its occurrence.
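The idea that a sample proportion is itself a variable can be illustrated by drawing repeated samples from the class of 50 students. The sketch below is a small simulation under stated assumptions: the seed, the number of repeated samples (5), and the variable names are arbitrary choices made only for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is reproducible

# The class of 50 students from the example: 40 males and 10 females.
population = ["M"] * 40 + ["F"] * 10
population_proportion = population.count("M") / len(population)

# Draw several random samples of 10 students (without replacement)
# and record the proportion of males in each sample.
sample_proportions = [
    random.sample(population, 10).count("M") / 10 for _ in range(5)
]

print(population_proportion)
print(sample_proportions)
```

The population proportion is fixed, whereas the sample proportions generally differ from one sample to the next.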

Most methods that deal with continuous dependent variables make the assumption that the values obtained are random observations that come from a normal distribution. In other words, when the dependent variable is continuous, it is assumed that a normal distribution is the underlying random process in the population from which the variable was obtained. However, there are many other probability distributions and, when the dependent variable is categorical, it can no longer be assumed to have been obtained from a population that is normally distributed. The purpose of this chapter is to describe several probability distributions that are assumed to underlie the population from which categorical data are obtained.

统计代写|属性数据分析作业代写analysis of categorical data代考|Frequency Distribution Tables for Discrete Variables

A discrete variable is a variable that can take on only a finite (or countably infinite) number of distinct values. Categorical data almost always consist of discrete variables. One way to summarize data of this type is to construct a frequency distribution table, which depicts the number of responses in each category of the measurement scale as well as the probability of occurrence of a particular response category and the percentage of responses in each category. In fact, a frequency distribution table is a specific example of a probability distribution. For example, suppose a random sample of individuals in the United States were asked to identify their political affiliation using a 7-point response scale that ranged from extremely liberal to extremely conservative, with higher values reflecting a more liberal political affiliation. Table 2.1 is a frequency distribution table summarizing the (hypothetical) responses.

Note that the probabilities depicted in the table are also the proportions, computed by simply dividing the frequency of responses in a particular category by the total number of respondents (e.g., the proportion of those who are extremely liberal is $30 / 1443=.021$), and the percentages depicted in the table can be obtained by multiplying these values by 100 (e.g., the percentage of those who are extremely liberal is $(100)(.021)=2.1 \%$). More formally, if the frequency is denoted by $f$ and the total number of respondents is denoted by $N$, then the probability or proportion is $\frac{f}{N}$ and the percentage is $100\left(\frac{f}{N}\right) \%$. Note also that the frequency can be obtained from the proportion (or probability) by $f=N \times \text{proportion}$.
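The conversions between frequency, proportion, and percentage can be sketched in a few lines of Python. Only the values $f=30$ and $N=1443$ come from the text; the helper name `summarize` and the three-category counts are hypothetical and do not reproduce Table 2.1.

```python
def summarize(frequencies):
    """Return {category: (f, proportion, percentage)} for a dict of category counts."""
    total = sum(frequencies.values())  # N, the total number of respondents
    return {
        category: (f, f / total, 100 * f / total)
        for category, f in frequencies.items()
    }

# The single value quoted in the text: f = 30 out of N = 1443.
f, N = 30, 1443
proportion = f / N
print(round(proportion, 3))        # 0.021
print(round(100 * proportion, 1))  # 2.1
print(round(N * proportion))       # 30, recovering the frequency

# Hypothetical three-category counts, purely for illustration (not Table 2.1).
table = summarize({"agree": 12, "disagree": 6, "no opinion": 2})
for category, (freq, prop, pct) in table.items():
    print(f"{category:<12}{freq:>4}{prop:>8.3f}{pct:>7.1f}%")
```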

A frequency distribution table such as the one depicted in Table 2.1 summarizes the data obtained so that a researcher can easily determine, for example, that respondents were most likely to consider their political affiliation to be moderate and least likely to consider themselves to be extremely liberal. Yet, how might these data be summarized more succinctly?
Descriptive statistics, such as the mean and standard deviation, can be used to summarize discrete variables just as they can for continuous variables, although the manner in which these descriptive statistics are computed differs with categorical data. In addition, because it is no longer appropriate to assume that a normal distribution is the underlying random mechanism that produces the categorical responses in a population, distributions appropriate to categorical data must be used for inferential statistics with these data (just as the normal distribution is commonly the appropriate distribution used for inferential statistics with continuous data). The two most common probability distributions assumed to underlie responses in the population when data are categorical are the binomial distribution and the Poisson distribution, although there are also other distributions that are appropriate for categorical data. The remainder of this chapter is devoted to introducing and describing common probability distributions appropriate for categorical data.

统计代写|属性数据分析作业代写analysis of categorical data代考|The Hypergeometric Distribution

The hypergeometric distribution can be used with discrete variables to determine the probability of obtaining a given number of “successes” in a sequence of $n$ draws from a finite population where sampling is conducted without replacement. It should be noted that “success” is simply a label for the occurrence of a particular event of interest. For example, suppose the admissions committee at a medical college has 15 qualified applicants, 3 of whom are minority applicants, and can only admit 2 new students. In this case, we might define success as the admission of a minority applicant, and the hypergeometric distribution could then be used to determine the probability of admitting at least one minority student if the admissions committee were to randomly select two new students from the 15 qualified applicants.

In general, if $Y$ denotes the number of successes obtained (e.g., minority students selected for admission), $m$ denotes the number of possible successes in the population (e.g., minority applicants), $N$ denotes the total size of the finite population (e.g., number of qualified applicants), and $n$ denotes the total number of draws (e.g., number of applicants to be admitted), then the probability that $Y=k$ (where $k$ is a specific number of successes) can be expressed by
$$
P(Y=k)=\frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} \tag{2.1}
$$

Here, the expression $\binom{m}{k}$ (read "$m$ choose $k$") denotes the number of ways in which $k$ individuals (or objects) can be selected from a total of $m$ individuals (or objects). This is computed as
$$
\binom{m}{k}=\frac{m !}{k !(m-k) !}=\frac{m(m-1)(m-2) \cdots(1)}{[k(k-1)(k-2) \cdots(1)][(m-k)(m-k-1)(m-k-2) \cdots(1)]}
$$
Note that $1!$ and $0!$ (1 factorial and 0 factorial, respectively) are both defined to be equal to 1, and any other factorial is the product of all integers from the number in the factorial down to 1. For example, $m !=m(m-1)(m-2) \cdots 1$.
The general formulation in Equation 2.1 can be conceptualized using the theory of combinatorics. The number of ways to select or choose $n$ objects (e.g., applicants) from the total number of objects, $N$, is $\binom{N}{n}$. Similarly, the number of ways to choose $k$ minority applicants from the total number of minority applicants, $m$, is $\binom{m}{k}$. For each possible manner of choosing $k$ minority applicants from the total of $m$ minority applicants, there are $\binom{N-m}{n-k}$ possible ways to select the remaining $n-k$ applicants from the $N-m$ non-minority applicants, so that $n$ applicants are admitted in total. Therefore, the number of ways to form a sample consisting of exactly $k$ minority applicants is
$$
\binom{m}{k}\binom{N-m}{n-k}
$$
Further, because each of the $\binom{N}{n}$ possible samples is equally likely, the probability of selecting exactly $k$ minority applicants is
$$
\frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}
$$
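The hypergeometric probability can be checked numerically for the admissions example ($N=15$, $m=3$, $n=2$). The following is a minimal sketch using only Python's standard library; the function name `hypergeom_pmf` and its argument order are assumptions made for this illustration.

```python
from math import comb

def hypergeom_pmf(k, m, N, n):
    """P(Y = k): probability of k successes in n draws without replacement
    from a population of N objects, m of which are successes."""
    if k < 0 or k > n or k > m or n - k > N - m:
        return 0.0  # impossible outcomes have probability zero
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# Admissions example: N = 15 applicants, m = 3 minority applicants, n = 2 admitted.
N, m, n = 15, 3, 2
pmf = {k: hypergeom_pmf(k, m, N, n) for k in range(n + 1)}
p_at_least_one = 1 - pmf[0]

for k, p in pmf.items():
    print(f"P(Y = {k}) = {p:.4f}")
print(f"P(Y >= 1) = {p_at_least_one:.4f}")
```

Under these assumptions, $P(Y=0)=66/105$, $P(Y=1)=36/105$, and $P(Y=2)=3/105$, so the probability of admitting at least one minority applicant is $39/105 \approx 0.371$.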

统计代写|属性数据分析作业代写analysis of categorical data代考|Organization of This Book

Posted on April 13, 2022 by statistics-lab

Given the fact that most of the groundwork for categorical data analysis was developed in the early part of the 20th century, the procedures presented in this book are relatively new. Indeed, it was not until the middle of the 20th century that strong theoretical advances were made in the field, and clearly there is still more work to be done. In this book, we chose to present a few of the more widely used analytic procedures for categorical data in great detail, rather than inundating the reader with all of the models that can be used for categorical data and their associated nuances. The primary goal of this book is to help social scientists develop a conceptual understanding of the categorical data analytic techniques presented. Therefore, while extensive training in mathematics will certainly be of benefit to the reader, a lack of it should not prevent students and researchers from understanding and applying these statistical procedures. This is accomplished by utilizing examples that are reflective of realistic applications of data analytic techniques in the social sciences, and by emphasizing specific research questions that can be addressed by each analytic procedure.

This book begins by introducing the reader to the different types of distributions that most often underpin categorical variables. This is followed by a discussion of the estimation procedures and goodness-of-fit tests that are used with the subsequent categorical data analytical procedures. Procedures designed to analyze the relationship between two categorical variables are then presented, followed by a discussion of procedures designed to analyze the relationships among three categorical variables. The second half of the book presents models for categorical data, beginning with an overview of the generalized linear model. Specific applications of the generalized linear model are then presented in chapters on log-linear models, binomial logistic regression models, and multinomial logistic regression models.

统计代写|属性数据分析作业代写analysis of categorical data代考|Summary

In this chapter, we introduced the reader to the types of research questions that can be addressed with statistical procedures designed to analyze categorical data. We gave a brief history on the development of these procedures, discussed scales of measurement, and provided the readers with the organizational structure of this book. In the next chapter, we turn to the different distributions that are assumed to underlie categorical data.
Problems
1.1 Indicate the scale of measurement used for each of the following variables and explain your answer by describing the probable scale:
a. Sense of belongingness, as measured by a 20-item scale.
b. Satisfaction with life, as measured by a 1-item scale.
c. Level of education, as measured by a demographic question with five categories.
1.2 Indicate the scale of measurement used for each of the following variables and explain your answer by describing the probable scale:
a. Self-efficacy, as measured by a 10-item scale.
b. Race, as measured by a demographic question with six categories.
c. Income, as measured by yearly gross income.
1.3 For each of the following research scenarios, identify the dependent and independent variables (or indicate if not applicable) as well as the scale of measurement used for each variable. Explain your answers by describing the scale that might have been used to measure each variable.
a. A researcher would like to determine if boys are more likely than girls to be proficient in mathematics.
b. A researcher would like to determine if people in a committed relationship are more likely to be satisfied with life than those who are not in a committed relationship.
c. A researcher is interested in whether females tend to have lower self-esteem, in terms of body image, than males.
d. A researcher is interested in predicting religious affiliation from level of education.
1.4 For each of the following research scenarios, identify the dependent and independent variables (or indicate if not applicable) as well as the scale of measurement used for each variable. Explain your answers by describing the scale that might have been used to measure each variable.
a. A researcher would like to determine if people living in the United States are more likely to be obese than people living in France.
b. A researcher would like to determine if the cholesterol levels of men who suffered a heart attack are higher than the cholesterol levels of women who suffered a heart attack.
c. A researcher is interested in whether gender is related to political party affiliation.
d. A researcher is interested in the relationship between amount of sleep and grade point average for high school students.

1.5 Determine whether procedures for analyzing categorical data are needed to address each of the following research questions. Provide a rationale for each of your answers by identifying the dependent and independent variables as well as the scale that might have been used to measure each variable.
a. A researcher would like to determine whether a respondent will vote for the Republican or Democratic candidate in the US presidential election based on the respondent’s annual income.
b. A researcher would like to determine whether respondents who vote for the Republican candidate in the US presidential election have a different annual income than those who vote for the Democratic candidate.
c. A researcher would like to determine whether males who have suffered a heart attack have higher fat content in their diets than males who have not suffered a heart attack in the past six months.
d. A researcher would like to predict whether a man will suffer a heart attack in the next six months based on the fat content in his diet.
1.6 Determine whether procedures for analyzing categorical data are needed to address each of the following research questions. Provide a rationale for each of your answers by identifying the dependent and independent variables as well as the scale that might have been used to measure each variable.
a. A researcher would like to determine whether a student will complete high school based on the student’s grade point average.
b. A researcher would like to determine whether students who complete high school have a different grade point average than students who do not complete high school.
c. A researcher would like to determine whether the families of students who attend college have a higher annual income than the families of students who do not attend college.
d. A researcher would like to determine whether a student will attend college based on his or her family’s annual income.
1.7 Determine whether procedures for analyzing categorical data are needed to address each of the following research questions. Indicate what analytic procedure (e.g., ANOVA, regression) you would use for those cases that do not require categorical methods, and provide a rationale for each of your answers.
a. A researcher would like to determine if scores on the verbal section of the SAT can be used to predict whether students are proficient in reading on a state-mandated test administered in 12th grade.
b. A researcher is interested in whether income differs by gender.
c. A researcher is interested in whether level of education can be used to predict income.
d. A researcher is interested in the relationship between political party affiliation and gender.
1.8 Provide a substantive research question that would need to be addressed using procedures for categorical data analysis. Be sure to specify how the dependent and independent variables would be measured, and identify the scales of measurement used for these variables.

统计代写|属性数据分析作业代写analysis of categorical data代考|Introduction and Overview

Posted on April 13, 2022 (updated October 17, 2023) by statistics-lab

统计代写|属性数据分析作业代写analysis of categorical data代考|What Is Categorical Data Analysis?

Categorical data arises whenever a variable is measured on a scale that simply classifies respondents into a limited number of groups or categories. For example, respondents’ race, gender, marital status, and political affiliation are categorical variables that are often of interest to researchers in the social sciences. In addition to distinguishing a variable as either categorical (qualitative) or continuous (quantitative), variables can also be classified as either independent or dependent. The term independent refers to a variable that is experimentally manipulated (e.g., the treatment group each person is assigned to) but is also often applied to a variable that is used to predict another variable even if it cannot be externally manipulated (e.g., socioeconomic status). The term dependent refers to a variable that is of primary interest as an outcome or response variable; for example, the outcome of a treatment (based on treatment group) or the educational achievement level (predicted from socioeconomic status) can be considered dependent variables. Introductory statistics courses may give the impression that categorical variables can only be used as independent variables; this is likely because the analytic procedures typically learned in these courses assume that the dependent variable follows a normal distribution in the population, and this is not the case for categorical variables. Nonetheless, treating categorical variables exclusively as independent variables can ultimately restrict the types of research questions posed by social science researchers.

For example, suppose you wanted to determine whether charter schools differed in any substantial way from non-charter schools based on the demographics of the school (e.g., location: urban, suburban, or rural; type: public or private; predominant socioeconomic status of students: low, medium, or high; and so on). You would be unable to study this phenomenon without knowledge of categorical data analytic techniques because all variables involved are categorical. As another example, suppose that a researcher wanted to predict whether a student will graduate from high school based on information such as the student’s attendance record (e.g., number of days in attendance), grade point average (GPA), income of parents, and so on. In this case, a categorical analysis approach would be more appropriate because the research question requires treating graduation status (yes or no) as the dependent variable. Indeed, a naive researcher might decide to use the graduation status as an independent variable, but this approach would not directly address the research question and would ultimately limit the results obtained from the analysis. The purpose of this book is to describe and illustrate analytic procedures that are applicable when the variables of interest are categorical.

统计代写|属性数据分析作业代写analysis of categorical data代考|Scales of Measurement

In general, measurement can be thought of as applying a specific rule to assign numbers to objects or persons for the sole purpose of differentiating between objects or persons on a particular attribute. For example, one might administer an aptitude test to a sample of college students to differentiate the students in terms of their ability. In this case, the specific rule being applied is the administration of the same aptitude test to all students. If one were to use different aptitude tests for different respondents, then the scores could not be compared across students in a meaningful way. For some variables, measurement can often be as precise as we want it to be; for example, we can measure the length of an object to the nearest centimeter, millimeter, or micromillimeter. For the variables typically measured in the social sciences, this is often not the case, and thus the only way to ensure quality measurement is to use instruments with good psychometric properties such as validity and reliability.

Measurement precision is typically defined by the presence or absence of the following four characteristics, presented in order in terms of the level of the information or precision they provide: (1) distinctiveness, (2) magnitude, (3) equal intervals, and (4) absolute zero. A measurement scale has the characteristic of distinctiveness if the numbers assigned to persons or objects simply differ on the property being measured. For example, if one were to assign a 0 to female respondents and a 1 to male respondents, then gender would be measured in a manner that had the characteristic of distinctiveness. A measurement scale has the characteristic of magnitude if the different numbers that are assigned to persons or objects can be ordered in a meaningful way based on their magnitude. For example, if one were to assign a score of 1 to a respondent who was very liberal, 2 to a respondent who was somewhat liberal, 3 to a respondent who was somewhat conservative, and 4 to a respondent who was very conservative, then political affiliation would be measured in a manner that had the characteristic of magnitude or rank ordering. A measurement scale has the characteristic of equal intervals if equivalent differences between two numbers assigned to persons or objects have an equivalent meaning. For example, if one were to consider examinees’ scores from a particular reading test as indicative of reading proficiency, then, assuming that examinees’ scores were created by summing the number of items answered correctly on the test, reading proficiency would be measured in a manner that had the characteristic of equal intervals. Note that a score of 0 on the reading test does not necessarily represent an examinee who has no reading ability. Rather, a score of 0 may simply imply that the test was too difficult, which might be the case if a second-grade student was to be given an eighth-grade reading test. This is an important distinction between measurement scales that have the property of equal intervals and those that have the property of having an absolute zero.

A measurement scale has the characteristic of having an absolute zero if assigning a score of 0 to persons or objects indicates an absence of the attribute being measured. For example, if a score of 0 represents no spelling errors on a spelling exam, then number of spelling errors would be measured in a manner that had the characteristic of having an absolute zero.

Table $1.1$ indicates the four levels of measurement in terms of the four characteristics just described. Nominal measurement possesses only the characteristic of distinctiveness and can be thought of as the least precise form of measurement in the social sciences. Ordinal measurement possesses the characteristics of distinctiveness and magnitude and is a more precise form of measurement than nominal measurement. Interval measurement possesses the characteristics of distinctiveness, magnitude, and equal intervals and is a more precise form of measurement than ordinal measurement. Ratio measurement, which is rarely attained in the social sciences, possesses all four characteristics of distinctiveness, magnitude, equal intervals, and having an absolute zero and is the most precise form of measurement.
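Because each level of measurement adds one characteristic to those of the level below it, the relationship described above can be written down compactly. The sketch below is only an illustration of that cumulative structure; the names `CHARACTERISTICS`, `LEVELS`, `characteristics_of`, and `is_more_precise` are hypothetical and do not come from the text.

```python
# Characteristics in order of increasing information, as listed in the text.
CHARACTERISTICS = ("distinctiveness", "magnitude", "equal intervals", "absolute zero")

# Levels of measurement from least to most precise.
LEVELS = ("nominal", "ordinal", "interval", "ratio")

def characteristics_of(level):
    """Return the characteristics possessed by a given level of measurement."""
    rank = LEVELS.index(level) + 1  # nominal has 1 characteristic, ratio has 4
    return CHARACTERISTICS[:rank]

def is_more_precise(level_a, level_b):
    """True if level_a possesses more characteristics than level_b."""
    return LEVELS.index(level_a) > LEVELS.index(level_b)

# Print a yes/no grid of levels (rows) against characteristics (columns).
for level in LEVELS:
    marks = ["yes" if c in characteristics_of(level) else "no" for c in CHARACTERISTICS]
    print(f"{level:<9}" + "".join(f"{mark:>17}" for mark in marks))
```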

In categorical data analysis, the dependent or response variable, which represents the characteristic or phenomenon that we are trying to explain or predict in the population, is measured using either a nominal scale or an ordinal scale. Methods designed for ordinal variables make use of the natural ordering of the measurement categories, although the way in which we order the categories (i.e., from highest to lowest or from lowest to highest) is usually irrelevant. Methods designed for ordinal variables cannot be used for nominal variables. Methods designed for nominal variables will give the same results regardless of the order in which the categories are listed. While these methods can also be used for ordinal variables, doing so will result in a loss of information (and usually loss of statistical power) because the information about the ordering is lost.

In this book we will focus on methods designed for nominal variables. The analyses we will discuss can be used when all variables are categorical or when just the dependent variable is categorical. The independent or predictor variables (used to predict the dependent variable) can usually be measured using any of the four scales of measurement.

统计代写|属性数据分析作业代写analysis of categorical data代考|A Brief History of Categorical Methods

The early development of analytical methods for categorical data took place at the beginning of the 20th century and was spearheaded by the work of Karl Pearson and G. Udny Yule. As is typically the case when something new is introduced, the development of these procedures was not without controversy. While Pearson argued that categorical variables were simply proxies of continuous variables, Yule argued that categorical variables were inherently discrete (Agresti, 1996). This in turn led the two statisticians to approach the problem of how to summarize the relationship between two categorical variables in vastly different ways. Pearson maintained that the relationship between two categorical variables could be approximated by the underlying continuum, and given his prestige in the statistical community, he was rarely challenged by his peers. Yule, however, challenged Pearson’s approach to the problem and developed a measure to describe the relationship between two categorical variables that did not rely on trying to approximate the underlying continuum (Yule, 1912). Needless to say, Pearson did not take kindly to Yule’s criticism and publicly denounced Yule’s approach, going so far as to say that Yule would have to withdraw his ideas to maintain any credibility as a statistician (Pearson & Heron, 1913). One hundred years later, we have come to realize that both statisticians were partially correct. While some categorical variables, especially those that are measured in an ordinal manner, can be thought of as proxies to variables that are truly continuous, others cannot.

Pearson’s work was also critiqued by R. A. Fisher, who maintained that one of Pearson’s formulas was incorrect (Fisher, 1922). Even though statisticians eventually realized that Fisher was correct, it was difficult for Fisher to get his work published due to Pearson’s reputation in the field (Agresti, 1996). Moreover, while Pearson’s criticisms of Fisher’s work were published (Pearson, 1922), Fisher was unable to get his rebuttals to these criticisms published, ultimately leading him to resign from the Royal Statistical Society (Cowles, 2001). Although Fisher’s scholarly reputation among statisticians today rests primarily on other theoretical work, particularly in the area of ANOVA, he made several contributions to the field of categorical data analysis, not the least of which is his approach to small-sample techniques for analyzing categorical data.


