Correlation Coefficient and p-Values

The correlation coefficient is a number between $-1$ and 1 that measures how strongly two paired sets of data (such as the height and intelligence of a group of people) are linearly related. The closer it is to 1, the more “confident” we are of a positive linear correlation; the closer it is to $-1$, the more confident we are of a negative linear correlation (which happens when one set of numbers tends to decrease as the other increases, as you might expect if you plotted a person’s age against the number of toys they possess). When the correlation coefficient is close to zero there is little evidence of any linear relationship.
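
As a minimal sketch of how such a coefficient is obtained, the following computes the usual Pearson sample correlation, assuming that is the coefficient meant here. The function name `pearson_r` and the age/toy figures are invented purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson sample correlation of two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: ages of six people and the number of toys each owns
ages = [3, 5, 8, 12, 20, 35]
toys = [30, 28, 20, 10, 3, 1]
r = pearson_r(ages, toys)
print(f"r = {r:.3f}")  # strongly negative: toys decrease as age increases
```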

Confidence in a relationship is formally determined not just by the correlation coefficient but also by the number of pairs in your data. If there are very few pairs then the coefficient needs to be very close to 1 or $-1$ for it to be deemed “statistically significant,” but if there are many pairs then a coefficient closer to 0 can still be considered “highly significant.”
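
The effect of sample size can be seen in a small simulation. This is only a sketch under stated assumptions: both variables are drawn as independent standard normal values (so they are unrelated by construction), and the helper names, the cutoff of 0.5, and the trial count are all arbitrary illustrative choices.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def chance_of_strong_r(n_pairs, cutoff, trials=2000, seed=0):
    """Estimate P(|r| >= cutoff) when the two variables are truly unrelated."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_pairs)]
        ys = [rng.gauss(0, 1) for _ in range(n_pairs)]  # independent of xs
        if abs(pearson_r(xs, ys)) >= cutoff:
            hits += 1
    return hits / trials

few_pairs = chance_of_strong_r(n_pairs=5, cutoff=0.5)
many_pairs = chance_of_strong_r(n_pairs=100, cutoff=0.5)
print(f"5 pairs:   |r| >= 0.5 in {few_pairs:.1%} of unrelated samples")
print(f"100 pairs: |r| >= 0.5 in {many_pairs:.1%} of unrelated samples")
```

With only 5 pairs, a coefficient of 0.5 or more turns up by chance in a large share of unrelated samples, whereas with 100 pairs it essentially never does, which is why the same coefficient can be unremarkable in a small data set and highly significant in a large one.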

The standard method that statisticians use to measure the “significance” of their empirical analyses is the $p$-value. Suppose we are trying to determine whether the relationship between the height and intelligence of people is significant, and we have data consisting of pairs of values (height, intelligence) for a set of people. We start with the “null hypothesis,” which in this case is the statement “height and intelligence of people are unrelated.” The $p$-value is a number between 0 and 1 representing the probability that data like ours (strictly, data showing a relationship at least as strong as the one observed) would have arisen if the null hypothesis were true. In medical trials the null hypothesis is typically of the form “the use of drug X to treat disease Y is no better than not using the drug.”

The calculation of the $p$-value is based on a number of assumptions that are beyond the scope of this discussion, but people who need $p$-values can simply look them up in standard statistical tables (they are also computed automatically in Excel when you run Excel’s regression tool). The tables (or Excel) will tell you, for example, that if there are 100 pairs of data whose correlation coefficient is $0.254$, then the $p$-value is approximately $0.01$. This means that there is roughly a 1 in 100 chance that we would have seen a correlation at least this strong if the variables were unrelated.

A low $p$-value (such as $0.01$) is taken as evidence that the null hypothesis can be “rejected.” Statisticians say that a $p$-value of $0.01$ is “highly significant” or say that “the data is significant at the $0.01$ level.”
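
The tabulated figure can be checked numerically. The sketch below assumes the tables use the standard two-sided $t$-test for a Pearson correlation, with $t = r\sqrt{(n-2)/(1-r^2)}$ on $n-2$ degrees of freedom; the text does not say which test lies behind its tables, so that choice is an assumption. The function names are illustrative, and the tail area is obtained by plain Simpson integration rather than by a statistics library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(r, n, steps=2000):
    """Two-sided p-value for the null hypothesis of no linear correlation.

    Assumes the usual t-test for a Pearson coefficient r from n pairs.
    `steps` must be even (Simpson's rule).
    """
    if abs(r) >= 1:
        return 0.0
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3          # P(0 <= T <= t)
    return max(0.0, 1 - 2 * central_half)  # area in both tails

p = two_sided_p_value(0.254, 100)
print(f"p-value for r = 0.254 with 100 pairs: {p:.4f}")
```

Under these assumptions the result should land close to the $0.01$ quoted above. Calling the same function with far fewer pairs, for example `two_sided_p_value(0.254, 10)`, gives a much larger $p$-value for the same coefficient.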

A competent researcher investigating a hypothesized relationship will set a threshold $p$-value (the significance level) in advance of the empirical study. Typically, values of either $0.01$ or $0.05$ are used. If the data from the study yields a $p$-value below the threshold specified in advance, the researchers will claim that their study is significant, that it enables them to reject the null hypothesis, and that a relationship really exists.

Spurious Correlations

Although the preceding examples illustrate the danger of reading too much into dubious correlations between variables, the relationships we saw there did not arise purely by chance. In each case some additional common factors helped explain the relationship.

But many studies, including unfortunately many taken seriously, result in claims of causal relationships that are almost certainly due to nothing other than pure chance.

Although nobody would seriously take measures to stop Americans drinking beer in order to reduce Japanese child mortality, barely a day goes by when some decision maker or another somewhere in the world takes just as irrational a decision based on correlations that turn out to be just as spurious.

For example, on the day we first happened to be drafting this section (16 March 2009) the media was buzzing with the story that working night shifts resulted in an increased risk of breast cancer. This followed a World Health Organization study, and it prompted the Danish government to make compensation awards to breast cancer sufferers who had worked night shifts. It is impossible to state categorically whether this result really is an example of a purely spurious correlation. But it is actually very simple to demonstrate why and how, if you measure enough things, you will inevitably find a completely spurious correlation in such a study, which you might then wrongly claim is a causal relationship.
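
One way to see this is to simulate studies in which nothing is related to anything. The sketch below is illustrative only: every "factor" and the "outcome" are independent random numbers, each simulated study has 100 subjects and 100 factors (arbitrary choices), and a factor is called significant when $|r| \geq 0.254$, the cutoff quoted earlier for 100 pairs at roughly the $0.01$ level. The function names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def study_finds_spurious_link(rng, n_subjects=100, n_factors=100, r_cutoff=0.254):
    """Simulate one study; True if any unrelated factor looks 'significant'."""
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    for _ in range(n_factors):
        # Each factor is generated independently of the outcome
        factor = [rng.gauss(0, 1) for _ in range(n_subjects)]
        if abs(pearson_r(factor, outcome)) >= r_cutoff:
            return True
    return False

rng = random.Random(2009)
n_studies = 100
hits = sum(study_finds_spurious_link(rng) for _ in range(n_studies))
simulated = hits / n_studies

# If each of 100 independent tests has a 0.01 chance of a false alarm,
# the chance of at least one false alarm is 1 - 0.99**100
analytic = 1 - (1 - 0.01) ** 100

print(f"studies with at least one 'significant' factor: {simulated:.0%}")
print(f"rough analytic estimate: {analytic:.0%}")
```

Even though no factor has any effect, well over half of the simulated studies should report at least one "highly significant" correlation, and the proportion climbs toward certainty as more factors are measured.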

