## Statistics Homework Help | Regression Analysis Assignment and Exam Help | STAT311

## The Independence Assumption and Repeated Measurements

You know what? All the analyses we did on the charitable contributions prior to the subject/indicator variable model were grossly in error because the independence assumption was so badly violated. You may assume, nearly without question, that these 47 taxpayers are independent of one another. But you may not assume that the repeated observations on a given taxpayer are independent. Charitable behavior in different years is similar for a given taxpayer; i.e., the observations are dependent rather than independent. It was wrong for us to assume that there were 470 independent observations in the data set. As you recall, the standard error formula has an $n$ in the denominator, so it makes a big difference whether you use $n=470$ or $n=47$. In particular, all the standard errors for models prior to the analysis above were too small.

Sorry about that! We would have warned you that all those analyses were questionable earlier, but there were other points that we needed to make. Those were all valid points for cases where the observations are independent, so please do not forget what you learned.
But now that you know, please realize that you must consider the dependence issue carefully. You simply cannot, and must not, treat repeated observations as independent. All of the standard errors will be grossly incorrect when you assume independence; the easiest way to understand the issue is to recognize that $n=470$ is quite a bit different from $n=47$.
Confused? Simulation to the rescue! The following R code simulates and analyzes data where there are 3 subjects, with 100 replications on each, and with a strong correlation (similarity) of the data on each subject.

    s = 3      # subjects
    r = 100    # replications within subject
    # X is (nearly) constant within each subject
    X = rnorm(s); X = rep(X, each = r) + rnorm(r*s, 0, .001)
    # a is the subject effect, shared by all replications on a subject
    a = rnorm(s); a = rep(a, each = r)
    e = rnorm(s*r, 0, .001)
    epsilon = a + e
    Y = 0 + 0*X + rnorm(s*r) + epsilon    # Y unrelated to X
    sub = rep(1:s, each = r)
    summary(lm(Y ~ X))                    # Highly significant X effect
    summary(lm(Y ~ X + as.factor(sub)))   # Insignificant X effect
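
A single simulated data set can be lucky or unlucky, so one way to see the problem more clearly is to repeat the simulation many times and count how often each analysis declares the (truly absent) X effect significant. The sketch below is not from the original text: the helper name `one.sim`, the seed, the number of repetitions, and the 0.05 cutoff are illustrative assumptions.

```r
# Sketch (not from the original text): repeat the simulation above many times.
set.seed(12345)        # arbitrary seed, for reproducibility only
s = 3                  # subjects
r = 100                # replications within subject
nsim = 200             # number of simulated data sets (illustrative choice)

# one.sim is a hypothetical helper: it simulates one data set and returns the
# p-values for the X coefficient under the two analyses.
one.sim = function() {
  X = rep(rnorm(s), each = r) + rnorm(r * s, 0, .001)
  a = rep(rnorm(s), each = r)              # subject effect shared by all replications
  epsilon = a + rnorm(s * r, 0, .001)
  Y = 0 + 0 * X + rnorm(s * r) + epsilon   # Y unrelated to X
  sub = rep(1:s, each = r)
  p.naive = summary(lm(Y ~ X))$coefficients["X", "Pr(>|t|)"]
  p.sub   = summary(lm(Y ~ X + as.factor(sub)))$coefficients["X", "Pr(>|t|)"]
  c(naive = p.naive, subject = p.sub)
}

pvals = replicate(nsim, one.sim())         # 2 x nsim matrix of p-values
reject.rate = rowMeans(pvals < 0.05)       # share of data sets with p < 0.05
reject.rate
```

Because Y is generated independently of X, a valid test should reject in roughly 5% of data sets. Under these assumptions one would expect the rate for the analysis that ignores subjects to be far above that, and the rate for the analysis with the subject indicators to be close to it.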

## Predicting Hans’ Graduate GPA: Theory Versus Practice

Hans is applying for graduate school at Calisota Tech University (CTU). He sends CTU his quantitative score on the GRE entrance examination ($X_1=140$), his verbal score on the GRE ($X_2=160$), and his undergraduate GPA ($X_3=2.7$). What would be his final graduate GPA at CTU?

Of course, no one can say. But what we do know, from the Law of Total Variance discussed in Chapter 6, is that the variance of the conditional distribution of $Y=$ final CTU GPA is smaller on average when you consider additional variables. Specifically,
$$\mathrm{E}\left\{\operatorname{Var}\left(Y \mid X_1, X_2, X_3\right)\right\} \leq \mathrm{E}\left\{\operatorname{Var}\left(Y \mid X_1, X_2\right)\right\} \leq \mathrm{E}\left\{\operatorname{Var}\left(Y \mid X_1\right)\right\}$$

Figure 11.1 shows how these inequalities might appear, as they relate to Hans. The variation in potentially observable GPAs among students who are like Hans in that they have GRE Math $=140$ is shown in the top panel. Some of that variation is explained by different verbal abilities among students, and the second panel removes that source of variation by considering GPA variation among students who, like Hans, have GRE Math $=140$ and GRE Verbal $=160$. But some of the remaining variation is explained by general student diligence. Assuming undergraduate GPA is a reasonable measure of such “diligence,” the final panel removes that source of variation by considering GPA variation among students who, like Hans, have GRE Math $=140$, GRE Verbal $=160$, and undergrad GPA $=2.7$. Of course, this could go on and on if additional variables were available, with each additional variable removing a source of variation, leading to distributions with smaller and smaller variances.

The means of the distributions shown in Figure 11.1 are 3.365, 3.5, and 3.44, respectively. If you were to use one of the distributions to predict Hans' GPA, which one would you pick? Clearly, you should pick the one with the smallest variance. His ultimate GPA will be the same number under all three distributions, and since the third distribution has the smallest variance, his GPA will likely be closer to its mean (3.44) than to the other distribution means (3.365 or 3.5).
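
The variance ordering above can be checked numerically on simulated data. The sketch below is an illustration under stated assumptions, not the book's data: the variable values, the sample size, and every coefficient in the data-generating model are hypothetical. When the conditional mean is linear and the conditional variance is constant, the average conditional variance can be estimated by the mean squared residual of the corresponding regression.

```r
# Sketch with simulated (hypothetical) data; none of these numbers come from the text.
set.seed(2024)                       # arbitrary seed
n  = 5000                            # illustrative sample size
X1 = rnorm(n, 150, 8)                # stand-in for GRE quantitative
X2 = rnorm(n, 150, 8)                # stand-in for GRE verbal
X3 = rnorm(n, 3.2, 0.4)              # stand-in for undergraduate GPA
# Hypothetical data-generating model for graduate GPA
Y  = 0.5 + 0.008 * X1 + 0.006 * X2 + 0.3 * X3 + rnorm(n, 0, 0.25)

# Mean squared residual: an estimate of E{Var(Y | predictors)} when the
# conditional mean is linear and the conditional variance is constant
msr = function(fit) mean(resid(fit)^2)

v1 = msr(lm(Y ~ X1))
v2 = msr(lm(Y ~ X1 + X2))
v3 = msr(lm(Y ~ X1 + X2 + X3))
c(v1 = v1, v2 = v2, v3 = v3)         # non-increasing from left to right
```

For nested least-squares fits the mean squared residual can never increase when a predictor is added, which mirrors the population inequality; how much it decreases depends on how informative the added variable is.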

## Piecewise Linear Regression; Regime Analysis

Usually, it makes sense to model $\mathrm{E}(Y \mid X=x)$ as a continuous function of $x$, but there are cases where a discontinuity is needed. For a hypothetical example, suppose people with less than \$250,000 income are taxed at $28\%$, and those with \$250,000 or more are taxed at $34\%$. Then a regression model to predict $Y=$ Charitable Contributions will likely have a discontinuity at $X=250{,}000$, as shown in Figure 10.12.

If you wanted to estimate the model shown in Figure 10.12, you would first create an indicator variable that is 0 for Income $<250$, otherwise 1, like this:

    Ind = ifelse(Income < 250, 0, 1)

Then you would include that variable in a regression model, with interactions, like this:
$$\text { Charity }=\beta_0+\beta_1 \text { Income }+\beta_2 \text { Ind }+\beta_3 \text { Income } \times \text { Ind }+\varepsilon$$
How can you understand this model? Once again, you must separate the model into the various subgroups. There are two models in this example:

Group 1: Income $<250$
$$\begin{aligned} \text{Charity} & =\beta_0+\beta_1 \text{Income}+\beta_2(0)+\beta_3 \text{Income} \times(0)+\varepsilon \\ & =\beta_0+\beta_1 \text{Income}+\varepsilon \end{aligned}$$
Group 2: Income $\geq 250$
$$\begin{aligned} \text{Charity} & =\beta_0+\beta_1 \text{Income}+\beta_2(1)+\beta_3 \text{Income} \times(1)+\varepsilon \\ & =\left(\beta_0+\beta_2\right)+\left(\beta_1+\beta_3\right) \text{Income}+\varepsilon \end{aligned}$$
Thus, $\beta_0$ and $\beta_1$ are the intercept and slope of the model when Income $<250$, while $\left(\beta_0+\beta_2\right)$ and $\left(\beta_1+\beta_3\right)$ are the intercept and slope of the model when Income $\geq 250$.
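
To see how the four coefficients map to the two lines, the following sketch fits the interaction model to simulated data. Everything in it is hypothetical: the income range, the true intercepts and slopes, and the error standard deviation are made-up values, and Income is treated as being in thousands of dollars to match the `Income < 250` coding above.

```r
# Sketch with simulated (hypothetical) data; Income is taken to be in thousands.
set.seed(1)                                   # arbitrary seed
n      = 400
Income = runif(n, 50, 450)
Ind    = ifelse(Income < 250, 0, 1)
# Made-up truth: 2 + 0.02*Income below 250, and 6 + 0.03*Income at or above 250
Charity = ifelse(Ind == 0, 2 + 0.02 * Income, 6 + 0.03 * Income) + rnorm(n, 0, 1)

fit = lm(Charity ~ Income + Ind + Income:Ind)
b   = coef(fit)                               # b0, b1, b2, b3 in the model's order

group1 = c(intercept = unname(b[1]),        slope = unname(b[2]))
group2 = c(intercept = unname(b[1] + b[3]), slope = unname(b[2] + b[4]))
rbind(group1, group2)

# The same two lines come from fitting each income group separately
sep1 = coef(lm(Charity ~ Income, subset = Ind == 0))
sep2 = coef(lm(Charity ~ Income, subset = Ind == 1))
rbind(sep1, sep2)
```

The two printed tables agree because a model with an indicator and its interaction with Income gives each group its own intercept and slope, which is the same fit as two separate regressions. What the single model adds is a common error variance estimate and direct estimates of the jump ($\beta_2$) and the slope change ($\beta_3$).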

## Relationship Between Commodity Price and Commodity Stockpile

The following data set contains government-reported annual numbers for price (Price) and stockpiles (Stocks) of a particular agricultural commodity in an Asian country.

    Comm = read.table("https://raw.githubusercontent.com/andrea2719/URA-DataSets/master/Comm_Price.txt")
    attach(Comm)

Figure 10.13 shows how Stocks and Price have changed over time. Something happened in 2002 to the Stocks variable; perhaps a re-definition of the measurement in response to a policy change.

This abrupt shift in 2002 causes trouble in estimating the relationship between Price and Stocks, which would ordinarily be considered a negative one because of the laws of supply and demand. Figure 10.14 shows the (Stocks, Price) scatter, with data values before 2002 indicated by circles, as well as global and separate least-squares fits.

R code for Figure 10.14

    pch = ifelse(Year < 2002, 1, 2)     # circles before 2002, triangles from 2002 on
    par(mfrow = c(1, 2))
    # Left panel: global least-squares fit
    plot(Stocks, Price, pch = pch)
    abline(lsfit(Stocks, Price))
    # Right panel: separate fits before 2002 (solid) and from 2002 on (dashed)
    plot(Stocks, Price, pch = pch)
    abline(lsfit(Stocks[Year < 2002], Price[Year < 2002]), lty = 1)
    abline(lsfit(Stocks[Year >= 2002], Price[Year >= 2002]), lty = 2)

## Does Location Affect House Price, Controlling for House Size?

Even though the realtors say “location, location, location!”, the observed effects of location on house price might simply be due to the fact that bigger homes tend to be in some locations. After all, square footage is a strong determinant of house price. To compare prices in different locations for homes of the same size, simply add “sqfeet” to the model like this:

    house = read.csv("https://raw.githubusercontent.com/andrea2719/…")   # rest of the URL is missing in the source
    attach(house)
    fit.main = lm(sell ~ location + sqfeet, data = house)
    summary(fit.main)
The results are as follows:

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  25.898669   5.060777   5.118 3.67e-06 ***
    locationB   -21.106407   2.152655  -9.805 6.41e-14 ***
    locationC   -21.431288   3.579304  -5.988 1.43e-07 ***
    locationD   -24.846429   2.574269  -9.652 1.13e-13 ***
    locationE   -27.304759   2.538505 -10.756 1.94e-15 ***
    sqfeet        0.041224   0.002578  15.993  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.638 on 58 degrees of freedom
    Multiple R-squared: 0.874, Adjusted R-squared: 0.8631
    F-statistic: 80.47 on 5 and 58 DF, p-value: < 2.2e-16
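
Each location coefficient is an estimated difference from the omitted baseline location (presumably A) for homes with the same square footage. As a rough worked example using only the estimates printed above, the sketch below computes fitted prices for a home of a given size in the baseline location and in location B. The 2,000 square feet figure and the helper name `pred.price` are illustrative assumptions, and the price units are whatever units `sell` is recorded in.

```r
# Estimates copied from the output above
b0       = 25.898669                  # intercept (baseline location)
b.sqfeet = 0.041224                   # slope for sqfeet
b.loc    = c(B = -21.106407, C = -21.431288, D = -24.846429, E = -27.304759)

# pred.price is a hypothetical helper: fitted value for a given size and location
pred.price = function(sqfeet, location = "baseline") {
  shift = if (location == "baseline") 0 else b.loc[[location]]
  b0 + b.sqfeet * sqfeet + shift
}

size   = 2000                         # illustrative home size, not from the text
p.base = pred.price(size)             # baseline location
p.B    = pred.price(size, "B")        # location B, same size
c(baseline = p.base, B = p.B, difference = p.B - p.base)
```

The difference does not depend on the chosen size: because the model has no location-by-sqfeet interaction, the fitted lines for the locations are parallel, and the gap between the baseline and location B is the `locationB` estimate at every square footage.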

## Full Model versus Restricted Model $F$ Tests

As we have mentioned repeatedly, tests of hypotheses are not the best way to evaluate models and assumptions. However, the $F$ test that was introduced in Chapter 8 is so common in the history of ANOVA, ANCOVA, and regression that we would be remiss not to mention it.

Models such as those shown in Figures 10.7 and 10.6 are often compared by using the $F$ test, which is a test to compare “full” versus “restricted” classical regression models. (For models other than the classical regression model, full/restricted model comparison is more commonly done using the likelihood ratio test, which is used starting in Chapter 12 of this book.)
In the usual regression analysis, a full model typically has the form:
$$Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\ldots+\beta_k X_k+\varepsilon$$
Here, the parameters $\beta_0, \beta_1, \beta_2, \ldots$, and $\beta_k$ are unconstrained; that is, each parameter can possibly take any value whatsoever between $-\infty$ and $\infty$, and the value that one $\beta$ parameter takes is not dependent on (or constrained by) the value that any other $\beta$ parameter takes.
A restricted model is the same model, but with constraints on the parameters. The most common restrictions are constraints such as $\beta_1=\beta_2=0$, although other constraints such as $\beta_2=1$, or $\beta_1-\beta_2=0$, or $\beta_0+15 \beta_2=100$ are also possible.

The separate slope model graphed in Figure 10.7 is a full model relative to the restricted model that constrains all the interaction $\beta$'s to be zero, shown in Figure 10.6. The $F$ test can be used to compare these models. To construct the $F$ test, let $\mathrm{SSE}_{\mathrm{F}}$ denote the error sum of squares in the full model, and let $\mathrm{SSE}_{\mathrm{R}}$ denote the error sum of squares in the restricted model. It is a mathematical fact that
$$\mathrm{SSE}_{\mathrm{F}} \leq \mathrm{SSE}_{\mathrm{R}}$$
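
A short numerical sketch may help fix ideas. It uses simulated data (the group labels, sample size, and coefficients are all hypothetical) to fit a separate-slopes full model and a restricted model without interactions, checks that $\mathrm{SSE}_{\mathrm{F}} \leq \mathrm{SSE}_{\mathrm{R}}$, and computes the usual textbook form of the full-versus-restricted statistic, $F=\{(\mathrm{SSE}_{\mathrm{R}}-\mathrm{SSE}_{\mathrm{F}})/(\mathrm{df}_{\mathrm{R}}-\mathrm{df}_{\mathrm{F}})\}/(\mathrm{SSE}_{\mathrm{F}}/\mathrm{df}_{\mathrm{F}})$, where df denotes error degrees of freedom.

```r
# Sketch with simulated (hypothetical) data
set.seed(10)                                  # arbitrary seed
n = 90
g = factor(rep(c("A", "B", "C"), each = n / 3))
X = rnorm(n)
# Made-up truth: different intercepts, common slope (so the restriction holds)
Y = c(1, 2, 3)[as.integer(g)] + 0.5 * X + rnorm(n)

fit.R = lm(Y ~ X + g)                         # restricted: interaction betas set to 0
fit.F = lm(Y ~ X + g + X:g)                   # full: separate slopes

SSE.R = sum(resid(fit.R)^2); df.R = df.residual(fit.R)
SSE.F = sum(resid(fit.F)^2); df.F = df.residual(fit.F)
SSE.F <= SSE.R                                # TRUE for nested least-squares fits

F.stat = ((SSE.R - SSE.F) / (df.R - df.F)) / (SSE.F / df.F)
p.val  = pf(F.stat, df.R - df.F, df.F, lower.tail = FALSE)
c(F = F.stat, p = p.val)

tab = anova(fit.R, fit.F)                     # R's built-in comparison of the two fits
tab
```

The hand-computed statistic and p-value should match the second row of the `anova()` table. The inequality holds because the full model can always reproduce the restricted fit by setting the extra coefficients to zero, so least squares can only do as well or better.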

