*THE COMPLEAT EXPERIMENTER*

~ ♦ ~ ~ ♦ ~ ~ ♦ ~

O, sir, doubt not that experimenting is an art; is it not an art to tease out the native hue of resolution?

~ ♦ ~ ~ ♦ ~ ~ ♦ ~

3. Experimentation and Power

The question all experimenters face is, “how do I avoid making Type I and Type II errors?” There are two main ways for experimenters to minimize these errors.

First, they can be as precise and accurate as possible in conducting the experiment. That means keeping extraneous variables as constant as possible, being careful in manipulating the explanatory variable, and being careful in measuring the response variable. In this way, the experimenter will prevent extraneous sources of variation from blurring any distinction between the control and experimental treatment means. [To better understand this, imagine a grader who grades papers very sloppily. On three exams your scores appear to be 95, 85, and 75; your friend’s scores appear to be 99, 85, and 71. The two groups of scores don’t appear to be very different from each other (both average 85, and the scores are widely spread). However, when you and your friend correct the grader’s errors, your scores become 92, 87, and 84; your friend’s scores become 83, 82, and 80. After correcting the grading, it becomes clear that there is a difference between your grades and your friend’s. This happens because the standard deviation of each group becomes smaller, so the difference between the two means is no longer swamped by the scatter within each group.]
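The grader example can be checked with a few lines of Python. This is only an illustrative sketch: the scores come from the example above, while the helper name `summarize` and the choice of the sample standard deviation are assumptions made for the illustration.

```python
from statistics import mean, stdev

# Scores from the grader example in the text.
sloppy = {"you": [95, 85, 75], "friend": [99, 85, 71]}
corrected = {"you": [92, 87, 84], "friend": [83, 82, 80]}


def summarize(groups):
    """Return {name: (mean, sample standard deviation)} for each group."""
    return {name: (mean(scores), stdev(scores)) for name, scores in groups.items()}


for label, groups in (("sloppy", sloppy), ("corrected", corrected)):
    for name, (m, sd) in summarize(groups).items():
        print(f"{label:9s} {name:6s} mean={m:5.1f}  sd={sd:4.1f}")
```

With the sloppy grading both means are 85 and the spreads are large; after correction the means separate and both standard deviations shrink.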

Second, they can increase the sample size. Increasing the sample size does not generally affect the standard deviation, but it does decrease the standard error of the mean (SEM), which is the standard deviation divided by the square root of the sample size. The SEM is used in calculating confidence intervals (large sample sizes make for small confidence intervals around each of the treatment means, which make it easier to distinguish between them).

[To learn more about confidence intervals, see—

http://people.hofstra.edu/Stefan_Waner/RealWorld/finitetopic1/confint.html

http://www.acponline.org/clinical_information/journals_publications/ecp/sepoct01/primerci.pdf

http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/ .]
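To see how sample size shrinks the SEM and the confidence interval, here is a minimal sketch. It assumes a hypothetical standard deviation of 10 and uses the normal (z) approximation for the interval rather than the t distribution, which is a simplification for small samples; the function names `sem` and `ci_half_width` are invented for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def sem(sd, n):
    """Standard error of the mean: standard deviation over sqrt(sample size)."""
    return sd / sqrt(n)


def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a confidence interval, using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sem(sd, n)


sd = 10.0  # hypothetical standard deviation, held constant as n grows
for n in (4, 16, 64, 256):
    print(f"n={n:3d}  SEM={sem(sd, n):5.2f}  95% CI = mean +/- {ci_half_width(sd, n):5.2f}")
```

Each fourfold increase in sample size halves the SEM, and the confidence interval narrows with it.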

**Power** is the ability to correctly identify that the experimental treatment mean is significantly different from the control treatment mean (that is, the ability to avoid a Type II error), and is expressed as a probability. Increasing the experiment’s sample size increases its power to distinguish between the two means (just as increasing precision did in the ‘sloppy test grades’ example). By convention, 80% power is considered an adequate level for most experiments. These websites describe statistical power more fully:

http://www.childrensmercy.org/stats/size/power.asp

http://www.socialresearchmethods.net/kb/power.php

http://www.indiana.edu/~statmath/stat/all/power/power.html#Size
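The link between sample size and power can be sketched numerically. The snippet below assumes a one-sided, two-sample z-test with a known, common standard deviation and equal group sizes; the effect size (`delta = 0.5`, `sd = 1.0`) and the function name `power_one_sided` are hypothetical values chosen only to illustrate the idea, not a prescription for any particular experiment.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_one_sided(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test.

    delta: true difference between the treatment means (assumed)
    sd: common standard deviation of both groups (assumed known)
    """
    se_diff = sd * sqrt(2.0 / n_per_group)     # standard error of the difference
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)   # rejection threshold under H_O
    return STD_NORMAL.cdf(delta / se_diff - z_crit)


for n in (10, 25, 50, 100):
    print(f"n per group = {n:3d}  power = {power_one_sided(0.5, 1.0, n):.2f}")
```

Under these assumptions, about 50 subjects per group are needed to reach the conventional 80% power; smaller groups fall well short of it.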

So, what are the practical consequences of power to the experimenter? Imagine this simple experiment—

A scientist is attempting to create a new and improved formula for guppy feed. If she is successful, she strikes gold, because she will make a ton of money selling it. How will she decide if the new formula is better than the old one? By doing an experiment! She will base her decision on the weights of guppies that have been fed either the standard (treatment ‘A’) or new formula (treatment ‘B’) for two months.

The null hypothesis (H_{O}) is that there is no difference in the mean weights between treatment ‘A’ and ‘B’ guppies. The alternative hypothesis (H_{Alt}) is that the mean weight of ‘B’ guppies is greater than that of ‘A’ guppies.

As is typical for many experiments, she sets the significance level, ‘α’, at 0.05. That means that if the new formula were in fact no better than the old one and she repeated the experiment many times, she would expect to see a significant weight gain by chance alone only 5% of the time. (This does not mean that 95% of significant results are caused by her formula; that proportion also depends on the power of the experiment and on how often H_{Alt} is actually true, as the 1000-experiment example that follows shows.) [BTW, α is the probability of making a Type I error when H_{O} is true, and ‘1-α’ (in this case, 0.95 or 95%) is the probability of correctly retaining H_{O}. α is called the ‘significance level’; 1-α is called the ‘confidence level’. β is the probability of making a Type II error when H_{Alt} is true; ‘1-β’ is termed ‘statistical power’.]

Now, imagine that the experiment is run 1000 times, and suppose we know that for 70% (i.e., 700) of the experiments H_{O} will be the *in*correct choice (of course, this is something we cannot know in advance, but we assume it for the sake of argument; some statisticians call this a “God’s eye view” of the situation). Let’s set the power of these 1000 experiments at 80% and the significance level, α, at 0.05. With these settings, we expect that 560 of the 700 experiments would correctly reject the null hypothesis (because 0.80 × 700 = 560; here 700 is the number of experiments for which H_{Alt} is actually the correct choice and 0.80 is the fraction of those we can expect to correctly identify). Likewise, we would falsely reject the null hypothesis in about 15 of the other 300 experiments (because 0.05 × 300 = 15; here 300 is the number of experiments for which H_{O} is actually the correct choice and 0.05 is the fraction that we can expect to incorrectly identify); these 15 experimental results are termed **‘false-positives’** and are Type I errors.

So, for those experiments with positive results (i.e., experiments for which we decide to accept H_{Alt}) the ratio of correct to incorrect decisions is 560:15 or ~37:1 (a 2.6% error rate, since 15 of the 575 positive results are false). However, if the power had been set at, say, 20% instead of 80%, only 140 of the 700 experiments would correctly reject H_{O}, and that ratio becomes 140:15 or ~9:1 (a ~10% error rate).
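The bookkeeping for the 1000 imagined experiments can be reproduced in a short sketch. The numbers are those used in the text; the function name `positive_result_breakdown` is just a label invented for this illustration.

```python
def positive_result_breakdown(n_experiments, frac_alt_true, power, alpha):
    """Return (true positives, false positives, error rate among positives)."""
    n_alt = n_experiments * frac_alt_true   # experiments where H_Alt is correct
    n_null = n_experiments - n_alt          # experiments where H_O is correct
    true_pos = power * n_alt                # correct rejections of H_O
    false_pos = alpha * n_null              # Type I errors
    error_rate = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, error_rate


for power in (0.80, 0.20):
    tp, fp, err = positive_result_breakdown(1000, 0.70, power, 0.05)
    print(f"power={power:.2f}: {tp:.0f} correct vs {fp:.0f} false positives "
          f"(~{tp / fp:.0f}:1, error rate {err:.1%})")
```

Note that the number of false positives stays at 15 in both cases; lowering the power only removes true positives, which is why the error rate among positive results rises.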

After going through the time and expense of carrying out an experiment, you would hope to always avoid the embarrassment of having a false-positive result (i.e., you would like to have a 0% error rate). This is especially true in the field of medical diagnostics, where a false-positive might mean informing a patient that they have a disease when they are, in fact, healthy.

Unfortunately, there are **trade-offs** that make this impossible. For example, to get close to a 0% error rate, you would need an impossibly large sample size.

In order to do an experiment with a manageable sample size you must accept an increase in the probability of making an incorrect decision about the data (all other things being constant). The corollary to this is that you should not use a sample size larger than required for the power needed because it wastes resources.

There is another trade-off involving power. What if your sample size is fixed? How could you then adjust power? You could do it by adjusting α. With the sample size fixed, power rises and falls with α, so β (the probability of a Type II error) and α are inversely related. If you relax the significance level of your test (e.g., from α=0.05 to α=0.10), then your power will show a corresponding increase. So, you can decrease the probability of making a Type II error by accepting a greater probability of making a Type I error, keeping the sample size constant.
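A small numerical sketch makes this trade-off concrete. As a simplifying assumption it uses a one-sided, two-sample z-test with a known common standard deviation; the effect size, standard deviation, and group size below are hypothetical, as is the function name `beta_for_alpha`.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def beta_for_alpha(alpha, delta=0.5, sd=1.0, n_per_group=50):
    """Type II error probability for a one-sided two-sample z-test (assumed design)."""
    se_diff = sd * sqrt(2.0 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    power = STD_NORMAL.cdf(delta / se_diff - z_crit)
    return 1.0 - power


# Sample size is held fixed; only the significance level changes.
for alpha in (0.01, 0.05, 0.10):
    beta = beta_for_alpha(alpha)
    print(f"alpha={alpha:.2f}  beta={beta:.2f}  power={1 - beta:.2f}")
```

As α is relaxed from 0.01 to 0.10, β falls and power rises, even though not a single guppy has been added to the experiment.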

Now we need to know, “How does one design an experiment with a particular power in mind?”

See The Compleat Experimenter: 4. Getting Power.
