Posted by: W. E. Poplaski | August 8, 2009

## THE COMPLEAT EXPERIMENTER: 4. Getting Power

THE COMPLEAT EXPERIMENTER

~ ♦ ~ ~ ♦ ~ ~ ♦ ~

O, sir, doubt not that experimenting is an art;

is it not an art to tease out the native hue of resolution?

~ ♦ ~ ~ ♦ ~ ~ ♦ ~

4. Getting Power

So, how do you design an experiment that has adequate power? Will your experiment be able to ‘see’ a difference, if one exists, between the control and experimental treatments? (Remember, power is the ability to avoid a Type II error or, in other words, to correctly detect a real difference between the control and experimental means.) By convention, most experiments are designed with a power of at least 80%. Adjusting the sample size is usually the easiest way to vary an experiment’s power: larger sample sizes provide more power.

When comparing two independent means, sample size and power can be related through this equation (from, e.g.,  http://www.childrensmercy.org/stats/size/power.asp, or textbooks such as Sokal & Rohlf’s Biometry)—

$$
n \ge \frac{(\sigma_1^2 + \sigma_2^2)\,(Z_{1-\alpha} + Z_{1-\beta})^2}{D^2}
$$

The sample size formula given above yields the sample size needed to achieve both a given confidence level and a given power. From this equation we see that the values of three sets of factors are needed to solve for n (the sample size of each group):

• σ is the standard deviation; the subscripts 1 and 2 refer to the control and experimental treatments, respectively.
• Z is the Z-score; the subscripts (1-α) and (1-β) refer to confidence level and power, respectively.
• D is the smallest difference between the control and experimental means that we wish to detect.

However, note that sample size estimates are only approximations based on assumptions. So, to be conservative, you should treat the estimate as the lower bound for the necessary sample size (that is the reason for the equation’s “≥” sign).
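The formula above can be wrapped in a small helper. This is only a sketch: the function name `sample_size_per_group` and its parameter names are made up for illustration, and the z-scores come from Python's standard `statistics.NormalDist` rather than a printed table, so results can differ slightly from hand calculations that use rounded z-scores.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma1, sigma2, d, alpha=0.05, power=0.80):
    """Smallest whole n per group that satisfies the formula above.

    Assumes normally distributed data and a one-tailed significance level.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # Z(1-alpha)
    z_beta = NormalDist().inv_cdf(power)       # Z(1-beta)
    n_raw = (sigma1**2 + sigma2**2) * (z_alpha + z_beta) ** 2 / d**2
    # The estimate is a lower bound, so round up to the next whole subject.
    return math.ceil(n_raw)


# Made-up inputs purely to show the call: sigma = 2.0 in both groups, D = 1.0.
print(sample_size_per_group(2.0, 2.0, 1.0))  # 50
```

Rounding up, rather than to the nearest integer, matches the “≥” sign in the formula.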

The guppy example will show how this equation can be used.  In that example, a scientist tested a new formulation of guppy feed.

***

Let’s say the scientist believes that her new formula will be commercially successful if guppies using it weigh at least 10% more (on average) than those on the standard feed.  She has spent many years studying guppies and knows that on the standard feed, female guppies typically have a mean weight of 0.3 grams with a standard deviation of 0.03 grams.

So, we can expect that the mean and standard deviation of our control treatment will be 0.3g and 0.03g. We want our experiment to detect the experimental treatment mean as being different from the control when it is at least 10% larger than the control mean—that is, when it is at least 0.33g.  Therefore, D is 0.33g – 0.3g=0.03g. (Concerning σ, in most cases it is reasonable to assume that there is no difference between the  standard deviations of the experimental and control treatments.)

We now have all of the factors of our equation: σ1=0.03g, σ2=0.03g, D=0.03g. Assuming, among other things, that the guppy weights are normally distributed, we obtain the z-scores from a z-distribution table:

Z(1-α)=1.645 and Z(1-β)=0.84 (these are the one-tailed z-scores for 1-α=0.95 and 1-β=0.80, respectively. You can learn about z-scores here).

Solving for n—

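$$
\begin{aligned}
n &\ge \frac{(\sigma_1^2 + \sigma_2^2)\,(Z_{1-\alpha} + Z_{1-\beta})^2}{D^2} \\
n &\ge \frac{(0.03^2 + 0.03^2)\,(1.645 + 0.84)^2}{0.03^2} \\
n &\ge \frac{(0.0018)\,(2.485)^2}{0.03^2} \\
n &\ge \frac{(0.0018)\,(6.1752)}{0.0009} \\
n &\ge \frac{0.0111}{0.0009} \\
n &\ge 12.35
\end{aligned}
$$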

Therefore, if we wish to detect at least a 10% increase in the average weight of guppies using an experiment with a significance level of 0.05 and a power of 0.80, then at least 13 guppies are needed for each group (i.e., control and experimental).
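The arithmetic above is easy to check by machine. The sketch below simply repeats the hand calculation with the same rounded table values (1.645 and 0.84); the variable names are illustrative.

```python
import math

sigma1 = sigma2 = 0.03   # standard deviations (g), assumed equal in both groups
d = 0.03                 # smallest difference to detect (g): 10% of 0.3 g
z_alpha = 1.645          # table z-score for 1 - alpha = 0.95
z_beta = 0.84            # table z-score for 1 - beta = 0.80

variance_sum = sigma1**2 + sigma2**2        # 0.0018
z_sum_squared = (z_alpha + z_beta) ** 2     # about 6.1752
n_raw = variance_sum * z_sum_squared / d**2
n_per_group = math.ceil(n_raw)              # round up to whole guppies

print(f"n >= {n_raw:.2f}, so use at least {n_per_group} guppies per group")
# n >= 12.35, so use at least 13 guppies per group
```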

***

What does that mean?

Imagine that the new formula works, causing the guppies’ weights to increase, on average, at least 10%.  Furthermore, imagine that 100 scientists are independently running identical experiments—each scientist using 13 control guppies on the standard feed and another 13 guppies on the experimental feed.

Because the experiment’s power was set at 80%, we would expect approximately 80 of the 100 scientists to get results that correctly indicate the experimental group is significantly different from the control group (of course, that means approximately 20 scientists will get results incorrectly indicating the two groups are not significantly different!). So, if you were to do the experiment as described (with that sample size of 13 guppies per group), you would have roughly a 20% chance of missing a real difference, i.e., making a Type II Error.
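The "many scientists" thought experiment can be imitated with a quick simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: guppy weights are drawn from normal distributions with the true means 0.30 g and 0.33 g, the standard deviation (0.03 g) is treated as known, and each simulated scientist runs a one-tailed z-test at α = 0.05. The function name `simulate_power` and the number of trials are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def simulate_power(n=13, mu_control=0.30, mu_treat=0.33, sigma=0.03,
                   alpha=0.05, trials=20000, seed=1):
    """Fraction of simulated experiments that detect the true difference."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical value
    std_err = math.sqrt(2 * sigma**2 / n)      # SE of the difference in means
    detections = 0
    for _ in range(trials):
        control = [rng.gauss(mu_control, sigma) for _ in range(n)]
        treated = [rng.gauss(mu_treat, sigma) for _ in range(n)]
        z = (sum(treated) / n - sum(control) / n) / std_err
        if z > z_crit:                         # null hypothesis rejected
            detections += 1
    return detections / trials


if __name__ == "__main__":
    print(f"Estimated power with 13 guppies per group: {simulate_power():.2f}")
```

The estimate should land a little above 0.80, because 13 is rounded up from 12.35. A t-test that estimates the standard deviation from the data would behave slightly differently.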

If this scientist’s analysis leads her to reject the null hypothesis, then a Type II error is no longer possible (because that is an error of failing to reject the null hypothesis, which she has  avoided by her decision) and the experiment’s power is no longer relevant.  However, in that case the question of a Type I error (an error of rejecting the null hypothesis, which is controlled by the experiment’s confidence level) is still relevant.

So, if an experimenter finds the difference between the control and treatment to be significant, one should ask what the significance level (α) of that experiment was. (If the confidence level, 1-α, is 0.95 or higher, then one can be fairly confident those results are not due to chance.)

However, if an experimenter fails to find a significant difference between the control and treatment, then one should ask whether that experiment had enough power to detect a difference. (If the power is at least 0.80, then one can be fairly confident that the treatment does not differ from the control by as much as the difference, D, that the experiment was designed to detect.)
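To ask whether a finished experiment "had enough power", the sample size formula can be rearranged to give power for a given n: solving for Z(1-β) gives Z(1-β) = D·√(n / (σ1² + σ2²)) − Z(1-α). The sketch below applies that rearrangement; it rests on the same normality and one-tailed assumptions as the formula, and the function name `power_for_n` is made up for illustration.

```python
import math
from statistics import NormalDist


def power_for_n(n, sigma1, sigma2, d, alpha=0.05):
    """Approximate power of a two-group comparison with n subjects per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    std_err = math.sqrt((sigma1**2 + sigma2**2) / n)
    z_beta = d / std_err - z_alpha      # rearranged sample size formula
    return NormalDist().cdf(z_beta)     # convert the z-score back to a probability


# Guppy example: compare the planned 13 per group with a much smaller study.
for n in (5, 13):
    print(n, round(power_for_n(n, 0.03, 0.03, 0.03), 2))
```

With only 5 guppies per group the power falls below one half, so a non-significant result from such a study says little about whether the feed works.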

***

What sample size would the scientist need if she wanted to detect at least a 5% difference between the control and experimental means (everything else remaining unchanged from the original calculation)?

Then the only change to the previous calculation is in D. It changes from 0.030 to 0.015 (because D = 0.315 - 0.3 = 0.015, so $D^2 = 0.000225$), and therefore:

n≥ 0.0111/0.000225= 49.3 (at least 50 guppies are needed for each group).

This intuitively makes sense—we need a larger sample size to detect a smaller difference, all other things being the same.

Likewise, if our scientist only needs to detect at least a 20% difference, then D becomes 0.06, and—

n≥ 0.0111/0.0036= 3.1 (a minimum of only 4 guppies is needed for each group).
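Because D is squared in the denominator, halving the difference to be detected roughly quadruples the required sample size. The sketch below tabulates the three cases just discussed, reusing the rounded table z-scores from the hand calculation; the helper name `raw_n` is illustrative.

```python
import math

SIGMA = 0.03          # standard deviation (g), same in both groups
MEAN_CONTROL = 0.3    # mean weight on the standard feed (g)
Z_ALPHA, Z_BETA = 1.645, 0.84   # table z-scores for 0.95 and 0.80


def raw_n(pct_increase):
    """Unrounded n per group needed to detect the given fractional increase."""
    d = pct_increase * MEAN_CONTROL
    return 2 * SIGMA**2 * (Z_ALPHA + Z_BETA) ** 2 / d**2


for pct in (0.05, 0.10, 0.20):
    n = raw_n(pct)
    print(f"{pct:.0%} increase: n >= {n:.1f} -> {math.ceil(n)} per group")
```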

***

You might be thinking that a power of 80% is too small.  You do not like the idea of being wrong 20% of the time, and you ask, “So, how about setting the power to 95%? What sample size would we then need?”

In that case, the only change to the original calculation is that $Z_{1-\beta}$ changes from 0.84 to 1.645, and n becomes:

$$
\begin{aligned}
n &\ge \frac{(0.03^2 + 0.03^2)\,(1.645 + 1.645)^2}{0.03^2} \\
n &\ge \frac{(0.0018)\,(3.29)^2}{0.03^2} \\
n &\ge \frac{(0.0018)\,(10.824)}{0.0009} \\
n &\ge \frac{0.0195}{0.0009} \\
n &\ge 21.6
\end{aligned}
$$

(at least 22 guppies would be needed for each group).

***

And, what sample size would you need if you wanted to detect at least a 5% increase in the average weight with a significance level of 0.05 and a power of 95%? (Try the calculation yourself. You should find that at least 87 guppies are required for each group.)
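The four combinations of difference and power worked through so far can be checked in one pass. This is a sketch using the same rounded table z-scores as the hand calculations; the names `Z_BETA` and `guppies_per_group` are illustrative.

```python
import math

SIGMA = 0.03                          # standard deviation (g), both groups
MEAN_CONTROL = 0.3                    # mean weight on the standard feed (g)
Z_ALPHA = 1.645                       # table z-score for 1 - alpha = 0.95
Z_BETA = {0.80: 0.84, 0.95: 1.645}    # table z-scores for each power level


def guppies_per_group(pct_increase, power):
    """Whole guppies per group for a given fractional increase and power."""
    d = pct_increase * MEAN_CONTROL
    n_raw = 2 * SIGMA**2 * (Z_ALPHA + Z_BETA[power]) ** 2 / d**2
    return math.ceil(n_raw)


for power in (0.80, 0.95):
    for pct in (0.10, 0.05):
        n = guppies_per_group(pct, power)
        print(f"power {power:.0%}, {pct:.0%} increase: {n} per group")
```

Raising the power and shrinking the difference compound each other: the last case needs nearly seven times as many guppies as the original design.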

***

So, what are the take-home messages?

• The smaller the difference you wish to detect between the control and experimental means, the larger the sample size needed to detect that difference as being significant, all other things being equal.
• Increasing the power of an experiment increases the probability of making a correct decision when a real difference exists between the control and experimental treatment means, which is the same as decreasing the probability of making a Type II Error. It is accomplished by either increasing the sample size or increasing α (the latter at the cost of a higher probability of a Type I Error).
• Since large sample sizes detect small differences as being statistically significant, it is important to step back and ask yourself if a statistically significant difference is also a relevant difference.  For example, a large enough sample size will detect even a 1% average increase in weight as being statistically significant.  However, consumers probably would not notice that difference (and therefore not be willing to pay more for the new formula feed).  So, in that case, we would probably not want (or need) our experimental results to indicate that the control and experimental treatment means are significantly different.  In fact, sample sizes that are much larger than needed are often just a waste of resources.
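Two of these points can be put into numbers with the guppy example. The sketch below is an illustration only: it uses z-scores from Python's `statistics.NormalDist` instead of rounded table values, and the helper name `n_per_group` is made up. It shows how many guppies a 1% difference would demand, and how relaxing α lowers the sample size needed for the same power.

```python
import math
from statistics import NormalDist

SIGMA = 0.03          # standard deviation (g), same in both groups
MEAN_CONTROL = 0.3    # mean weight on the standard feed (g)


def n_per_group(pct_increase, alpha=0.05, power=0.80):
    """Whole guppies per group from the sample size formula (one-tailed alpha)."""
    d = pct_increase * MEAN_CONTROL
    z_sum = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return math.ceil(2 * SIGMA**2 * z_sum**2 / d**2)


# A 1% difference is detectable, but only with a very large experiment.
print("1% increase:", n_per_group(0.01), "per group")

# For the same 10% difference and 80% power, a larger alpha needs fewer guppies.
for alpha in (0.05, 0.10):
    print(f"alpha = {alpha}: {n_per_group(0.10, alpha=alpha)} per group")
```

Detecting a 1% difference takes on the order of 1,200 guppies per group, which is the kind of resource cost the last point warns about.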