Statistics | Experimental Design | CORRELATION FORM

When the main concern is to decide which variables to include in the model, a very useful transformation of the data is to scale each variable, predictors and dependent variable alike, so that the normal equations can be written in correlation form. This helps us to identify the important variables that should be included in the model, and it also reveals some of the dependencies between the predictor variables.

As usual, we consider the variables to be in deviation form. The correlation coefficient between $x_{1}$ and $x_{2}$ is
$$r_{12}=\frac{s_{12}}{\sqrt{s_{11} s_{22}}}=\frac{\sum x_{1} x_{2}}{\sqrt{s_{11} s_{22}}}$$
If we divide each variable $x_{i}$ by $\sqrt{s_{ii}}$ and denote the result as
$$x_{i}^{*}=x_{i} / \sqrt{s_{ii}}$$
then $x_{i}^{*}$ is said to be in correlation form. Notice that
$$\sum x_{i}^{*}=0$$
$$\sum\left(x_{i}^{*}\right)^{2}=1$$
$$\sum x_{i}^{*} x_{j}^{*}=r_{ij}$$
We have transformed the model from

$$y=\beta_{1} x_{1}+\beta_{2} x_{2}+\varepsilon \quad \text { to } \quad y^{*}=\alpha_{1} x_{1}^{*}+\alpha_{2} x_{2}^{*}+\varepsilon$$
and the normal equations simplify from
$$\begin{aligned} s_{11} b_{1}+s_{12} b_{2} &=s_{y 1} \\ s_{12} b_{1}+s_{22} b_{2} &=s_{y 2} \end{aligned} \quad \text { to } \quad \begin{aligned} a_{1}+r_{12} a_{2} &=r_{y 1} \\ r_{12} a_{1}+a_{2} &=r_{y 2} \end{aligned} \qquad (3.5.3)$$
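
As a rough illustration of the scaling described above, the following sketch puts two predictors and a response into correlation form, checks the three sum identities, and solves the two normal equations in correlation form by Cramer's rule. The data values and the helper names `to_correlation_form` and `dot` are hypothetical and are not taken from the text.

```python
import math


def to_correlation_form(values):
    """Centre a variable, then scale it so that its sum of squares is 1."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]          # deviation form
    s_ii = sum(d * d for d in dev)            # corrected sum of squares
    return [d / math.sqrt(s_ii) for d in dev]


def dot(u, v):
    """Sum of products of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical illustrative data (an assumption, not from the text).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

x1s, x2s, ys = (to_correlation_form(v) for v in (x1, x2, y))

# After scaling, sums of products are correlation coefficients.
r12 = dot(x1s, x2s)
ry1 = dot(ys, x1s)
ry2 = dot(ys, x2s)

# Normal equations in correlation form:
#   a1 + r12 * a2 = ry1
#   r12 * a1 + a2 = ry2
det = 1.0 - r12 ** 2
a1 = (ry1 - r12 * ry2) / det
a2 = (ry2 - r12 * ry1) / det

# With the total sum of squares equal to 1, SSR is also R^2.
r_squared = a1 * ry1 + a2 * ry2

print("r12 =", round(r12, 4))
print("a1, a2 =", round(a1, 4), round(a2, 4))
print("R^2 =", round(r_squared, 4))
```

The coefficient matrix has ones on the diagonal, so how close its determinant $1-r_{12}^{2}$ is to zero shows directly how strongly the two predictors depend on each other.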

Statistics | Experimental Design | VARIABLE SELECTION – ALL POSSIBLE REGRESSIONS

In many situations, researchers know which variables may be included in the predictor model. There is some advantage in reducing the number of predictor variables to form a more parsimonious model. One way to achieve this is to run all possible regressions and to compare statistics such as the coefficient of determination, $R^{2}=\mathrm{SSR}/\mathrm{SST}$.
We will use the heart data of Section 3.5, again relabelling the variables as A through F. With the variables in correlation form, the total sum of squares SST equals 1, so $R^{2}=\mathrm{SSR}$, the sum of squares for regression. This is given for each possible combination of predictor variables in Table 3.6.1.
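
The all-possible-regressions idea can be sketched in a few lines. The example below uses three hypothetical predictors labelled A, B and C with made-up values (not the heart data), puts everything into correlation form, and computes SSR, which then equals $R^{2}$, for every non-empty subset. The helper names and the small Gaussian-elimination solver are assumptions chosen to keep the sketch free of external libraries.

```python
from itertools import combinations
import math


def correlation_form(values):
    """Centre a variable and scale it to unit sum of squares."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    scale = math.sqrt(sum(d * d for d in dev))
    return [d / scale for d in dev]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ssr(y_star, columns):
    """SSR for a regression of y* on the given columns (all in correlation form)."""
    R = [[dot(u, v) for v in columns] for u in columns]   # predictor correlations
    r_y = [dot(y_star, u) for u in columns]               # correlations with y
    a = solve(R, r_y)                                     # normal equations R a = r_y
    return dot(a, r_y)


# Hypothetical data (an assumption, not the heart data of the text).
raw = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 1, 4, 3, 6, 5, 8, 7],
    "C": [5, 3, 6, 2, 7, 1, 8, 4],
}
y_raw = [4.2, 3.8, 8.1, 6.9, 12.3, 10.6, 16.2, 14.1]

X = {name: correlation_form(col) for name, col in raw.items()}
y_star = correlation_form(y_raw)

# R^2 (= SSR in correlation form) for every non-empty subset of predictors.
r2_table = {}
for size in range(1, len(X) + 1):
    for subset in combinations(sorted(X), size):
        r2_table[subset] = ssr(y_star, [X[name] for name in subset])

for subset, r2 in sorted(r2_table.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(subset), round(r2, 4))
```

With $k$ candidate predictors there are $2^{k}-1$ non-empty subsets, so the table grows quickly as $k$ increases.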

To assist the choice of the best subset, C.L. Mallows suggested fitting all possible models and evaluating the statistic
$$C_{p}=S S E_{p} / s^{2}-(n-2 p)$$
Here, $n$ is the number of observations and $p$ is the number of predictor variables in the subset, including a constant term. For each subset, the value of Mallows' statistic can be evaluated from the corresponding value of SSR. The complete set of these statistics is listed in Table 3.6.2. For each subset we use the mean squared error, MSE, of the full model as the estimate $s^{2}$ of the variance.
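
A minimal sketch of this calculation follows, assuming the variables are in correlation form so that SST = 1 and $SSE_{p}=1-SSR_{p}$. The SSR values, the sample size `n = 20` and the function name `mallows_cp` are hypothetical and are not taken from Table 3.6.1 or Table 3.6.2.

```python
def mallows_cp(ssr_p, p, ssr_full, p_full, n, sst=1.0):
    """Mallows' C_p = SSE_p / s^2 - (n - 2p).

    p and p_full count the predictors plus the constant term.
    s^2 is the mean squared error of the full model.
    """
    sse_p = sst - ssr_p
    s2 = (sst - ssr_full) / (n - p_full)
    return sse_p / s2 - (n - 2 * p)


# Hypothetical SSR values for three predictors in correlation form
# (assumptions for illustration only).
n = 20
ssr_table = {
    ("A",): 0.50,
    ("B",): 0.30,
    ("C",): 0.10,
    ("A", "B"): 0.70,
    ("A", "C"): 0.52,
    ("B", "C"): 0.35,
    ("A", "B", "C"): 0.72,
}

full = ("A", "B", "C")
p_full = len(full) + 1          # three predictors plus the constant

cp_table = {
    subset: mallows_cp(value, len(subset) + 1, ssr_table[full], p_full, n)
    for subset, value in ssr_table.items()
}

for subset, cp in sorted(cp_table.items(), key=lambda kv: kv[1]):
    print("".join(subset), "p =", len(subset) + 1, "Cp =", round(cp, 2))
```

Because $s^{2}$ is taken from the full model, the full model always gives $C_{p}=p$ exactly, so the statistic is informative only for the proper subsets.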

Suppose that the true model has q predictor variables.

Statistics | Experimental Design | VARIABLE SELECTION – SEQUENTIAL METHODS

When the number of possible variables in a model is large, it may be impractical to run every possible regression and evaluate Mallows' statistic for each one, even though short cuts can be taken to evaluate such statistics by adding or subtracting terms rather than by evaluating each one from scratch.

Another approach is to add or remove variables sequentially. We have seen that adding a variable will increase SSR, the sum of squares for regression. Using the results of Section 3.4, we can perform an F-test to decide whether the increase in SSR is significant. The first method we consider is that of forward selection.
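
One common way to organise forward selection is sketched below: at each step the candidate giving the largest SSR is tried, and it is kept only if the F statistic for the increase in SSR exceeds a threshold. The exact form of the test in Section 3.4 is not reproduced here, so the F statistic used, the fixed threshold `f_to_enter = 4.0`, the data and all helper names are assumptions for illustration.

```python
import math


def correlation_form(values):
    """Centre a variable and scale it to unit sum of squares."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    scale = math.sqrt(sum(d * d for d in dev))
    return [d / scale for d in dev]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ssr(y_star, columns):
    """SSR for y* regressed on columns, all in correlation form."""
    R = [[dot(u, v) for v in columns] for u in columns]
    r_y = [dot(y_star, u) for u in columns]
    return dot(solve(R, r_y), r_y)


def forward_selection(y_star, X, n, f_to_enter=4.0):
    """Add one variable at a time while the F statistic exceeds f_to_enter."""
    selected, remaining, ssr_current = [], sorted(X), 0.0
    while remaining:
        trial = {v: ssr(y_star, [X[s] for s in selected + [v]]) for v in remaining}
        best = max(trial, key=trial.get)        # largest SSR after adding
        p = len(selected) + 2                   # predictors after adding, plus constant
        if n - p <= 0:
            break
        mse_new = (1.0 - trial[best]) / (n - p)  # SST = 1 in correlation form
        f_stat = (trial[best] - ssr_current) / mse_new
        if f_stat < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        ssr_current = trial[best]
    return selected, ssr_current


# Hypothetical data (an assumption, not from the text).
raw = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 1, 4, 3, 6, 5, 8, 7],
    "C": [5, 3, 6, 2, 7, 1, 8, 4],
}
y_raw = [4.2, 3.8, 8.1, 6.9, 12.3, 10.6, 16.2, 14.1]

X = {name: correlation_form(col) for name, col in raw.items()}
y_star = correlation_form(y_raw)

selected, final_ssr = forward_selection(y_star, X, n=len(y_raw))
print("selected:", selected, "SSR:", round(final_ssr, 4))
```

In practice the threshold would come from an F distribution with the appropriate degrees of freedom rather than a fixed constant; the constant is used here only to keep the sketch dependency-free.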


