Posted by: W. E. Poplaski | April 1, 2009

## THE COMPLEAT EXPERIMENTER: 3. Experimentation and Power

THE  COMPLEAT  EXPERIMENTER

~ ♦ ~      ~ ♦ ~      ~ ♦ ~

O, sir, doubt not that experimenting is an art; is it not an art to tease out the native hue of resolution?

~ ♦ ~      ~ ♦ ~      ~ ♦ ~

3.  Experimentation and Power

The question all experimenters face is, “how do I avoid making Type I and Type II errors?”  There are two main ways for experimenters to minimize Type I and Type II errors.

First, she can increase the precision of her measurements; more precise measurements mean a smaller standard deviation in the data. Second, she can increase the sample size. Increasing the sample size does not generally affect the standard deviation, but it does decrease the standard error of the mean (SEM). The SEM is used in calculating confidence intervals: large sample sizes make for narrow confidence intervals around each of the treatment means, which makes it easier to distinguish between them. These websites describe confidence intervals more fully—

http://people.hofstra.edu/Stefan_Waner/RealWorld/finitetopic1/confint.html

http://www.acponline.org/clinical_information/journals_publications/ecp/sepoct01/primerci.pdf
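The effect of sample size on the SEM can be sketched in a few lines of Python (the standard deviation and sample sizes below are hypothetical, chosen only to show the scaling):

```python
import math

def sem(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

def ci_halfwidth(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for the mean."""
    return z * sem(sd, n)

# Quadrupling the sample size halves the SEM, and so halves the CI width:
print(ci_halfwidth(sd=10, n=25))   # 3.92
print(ci_halfwidth(sd=10, n=100))  # 1.96
```

Because the SEM shrinks with the square root of n, each extra gain in precision costs proportionally more observations.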

Power is the ability to correctly identify that the experimental treatment mean is significantly different from the control treatment mean (that is, the ability to avoid a Type II error), and it is expressed as a probability. Increasing the experiment’s sample size increases its power to distinguish between the two means (just as increasing precision did in the ‘sloppy test grades’ example). By convention, 80% power is considered an adequate level for most experiments. This website describes statistical power more fully—

http://www.indiana.edu/~statmath/stat/all/power/power.html#Size

So, what are the practical consequences of power to the experimenter? Imagine this simple experiment—

A scientist is attempting to create a new and improved formula for guppy feed. If she is successful, she strikes gold, because she will make a ton of money selling it. How will she decide whether the new formula is better than the old one? By doing an experiment! She will base her decision on the weights of guppies that have been fed either the standard formula (treatment ‘A’) or the new formula (treatment ‘B’) for two months.

The null hypothesis (HO) is that there is no difference in the mean weights between treatment ‘A’ and ‘B’ guppies. The alternative hypothesis (HAlt) is that the mean weight of ‘B’ guppies is greater than that of ‘A’ guppies.

As is typical for many experiments, she sets the significance level, ‘α’, at 0.05. That means that if she were to repeat the experiment many times when the new formula actually had no effect, she would expect to see a significant weight gain arise by chance only 5% of the time (and the remaining 95% of the time she would correctly conclude there was no real difference). [BTW, α is the probability of making a Type I error, and ‘1 − α’ (in this case, 0.95, or 95%) is the probability of correctly accepting HO when it is true. α is called the ‘significance level’; 1 − α is called the ‘confidence level’. β is the probability of making a Type II error; ‘1 − β’ is termed ‘statistical power’.]
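The meaning of α can be checked by simulation. The sketch below uses hypothetical numbers and a z-test with a known standard deviation for simplicity (a real guppy analysis would more likely use a t-test). Both “treatment” groups are drawn from the same distribution, so HO is true by construction, and we count how often a significant difference appears anyway:

```python
import random
from statistics import NormalDist, mean

random.seed(1)                        # fixed seed for a reproducible run
z_crit = NormalDist().inv_cdf(0.95)   # one-sided cutoff for alpha = 0.05

n, sims = 30, 4000
false_positives = 0
for _ in range(sims):
    # Both groups come from the SAME distribution, so HO is true.
    a = [random.gauss(10.0, 2.0) for _ in range(n)]
    b = [random.gauss(10.0, 2.0) for _ in range(n)]
    se = (2.0**2 / n + 2.0**2 / n) ** 0.5   # SE of the difference (known sd)
    z = (mean(b) - mean(a)) / se
    if z > z_crit:
        false_positives += 1

print(false_positives / sims)         # close to alpha = 0.05
```

The observed false-positive rate hovers near 5%, which is exactly what setting α = 0.05 promises.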

Now, imagine that the experiment is run 1000 times, and suppose we know that for 70%—i.e., 700—of the experiments HO will be the incorrect choice (of course, this is something we cannot know in advance, but we assume it for the sake of argument; some statisticians call this a “God’s eye view” of the situation). Let’s set the power of these 1000 experiments at 80% and the significance level, α, at 0.05. With these settings, we expect 560 of the 700 experiments to correctly reject the null hypothesis (because 0.80 × 700 = 560, where 700 is the number of experiments for which HAlt is actually the correct choice and 0.80 is the fraction of those we can expect to correctly identify). Likewise, we would falsely reject the null hypothesis in about 15 of the other 300 experiments (because 0.05 × 300 = 15, where 300 is the number of experiments for which HO is actually the correct choice and 0.05 is the fraction we can expect to incorrectly identify); these 15 experimental results are termed ‘false positives’ and are Type I errors.

So, for those experiments with positive results (i.e., experiments for which we decide to accept HAlt), the ratio of correct to incorrect decisions is 560:15, or ~37:1 (a 2.6% error rate). However, if the power had been set at, say, 20% instead of 80%, only 140 of the 700 experiments would correctly reject the null hypothesis (0.20 × 700 = 140), and the ratio would become 140:15, or ~9:1 (roughly a 10% error rate).
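The arithmetic above can be wrapped into a small helper function; this is just a sketch of the “God’s eye view” bookkeeping, not a standard library routine:

```python
def positive_result_error_rate(n_experiments, frac_alt_true, power, alpha):
    """Among experiments that reject HO, what fraction are false positives?"""
    n_alt = n_experiments * frac_alt_true    # experiments where HAlt is correct
    n_null = n_experiments - n_alt           # experiments where HO is correct
    true_pos = power * n_alt                 # correct rejections of HO
    false_pos = alpha * n_null               # Type I errors
    return true_pos, false_pos, false_pos / (true_pos + false_pos)

print(positive_result_error_rate(1000, 0.70, 0.80, 0.05))  # roughly (560, 15, 0.026)
print(positive_result_error_rate(1000, 0.70, 0.20, 0.05))  # roughly (140, 15, 0.097)
```

Note that the number of false positives (15) is the same in both runs; only the number of true positives changes with power, which is what drives the error rate among positive results.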

After going through the time and expense of carrying out an experiment, you would hope to always avoid the embarrassment of having a false-positive result (i.e., you would like to have a 0% error rate).  This is especially true in the field of medical diagnostics, where a false-positive might mean informing a patient that they have a disease when they are, in fact, healthy.

Unfortunately, there are trade-offs that make this impossible. For example, to get close to a 0% error rate, you would need an impossibly large sample size.

In order to do an experiment with a manageable sample size you must accept an increase in the probability of making an incorrect decision about the data (all other things being constant). The corollary to this is that you should not use a sample size larger than required for the power needed because it wastes resources.

There is another trade-off involving power. What if your sample size is fixed? How could you then adjust power? You could do it by adjusting α, because power rises and falls along with α. If you relax the significance level (e.g., from α = 0.05 to α = 0.10), then your power will show a corresponding increase. So, you can decrease the probability of making a Type II error by accepting a greater probability of making a Type I error, keeping the sample size constant.
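This trade-off can be illustrated with a power approximation for a one-sided z-test (a sketch assuming a known standard deviation; the effect size and sample size below are hypothetical):

```python
import math
from statistics import NormalDist

def power_one_sided(alpha, effect_size, n):
    """Approximate power of a one-sided z-test for a mean,
    where effect_size = (mean_B - mean_A) / sd."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                 # rejection cutoff under HO
    return 1 - z.cdf(z_crit - effect_size * math.sqrt(n))

# Relaxing alpha from 0.05 to 0.10 raises power, with n held fixed:
for a in (0.05, 0.10):
    print(f"alpha = {a}: power ~ {power_one_sided(a, 0.5, 25):.2f}")
# power is roughly 0.80 at alpha = 0.05 and roughly 0.89 at alpha = 0.10
```

Relaxing the cutoff moves the rejection threshold toward the null mean, so more genuinely different samples clear it, but so do more chance fluctuations.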

Now we need to know, “How does one design an experiment with a particular power in mind?”
