Methods · May 2026 · 16 min read

Power calculations are not magic — a guide for investigators

Most power calculations are optimistic by design. Here's how to pressure-test yours before the IRB does.

Micah Thornton, MS — Thornton Statistical Consulting


What a power calculation actually says

A power calculation is a statement about the probability that your study will detect a specified effect, if that effect truly exists, given your sample size, your measurement variability, and your chosen significance threshold. That's a lot of conditionals. Every one of them is an assumption, and most of them will be wrong by the time data collection is done.

The standard target — 80% power — means there is a 1-in-5 chance of missing a real effect of the specified size. That is not a small risk. It means that if you ran 100 trials testing a treatment that genuinely works, and each trial was powered to 80%, roughly 20 of them would come up negative. Those are your false negatives, and the 80% target was chosen as a compromise between sample size and the cost of being wrong, not because it represents some intrinsically acceptable miss rate.

Power is not a property of the data. It is a property of the design, calculated before data collection. Once the trial is done, power no longer has a coherent interpretation — what you have is a result, with a confidence interval. "Post-hoc power" calculations, done after the fact to explain a null result, are circular and misleading. Avoid them.

The study is powered to detect a specific effect of a specific size. That effect size is not the truth. It is a planning assumption — and the most important design decision you make, far more consequential than the choice of 80% vs. 90% power.
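To make the arithmetic concrete, here is a minimal sketch of the standard normal-approximation sample size formula for a two-arm comparison of means. The effect size, standard deviation, and thresholds are illustrative placeholders, not recommendations.

```python
import math
from scipy.stats import norm

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm n for a two-arm comparison of means (normal approximation):
    n = 2 * sd^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (sd * z / delta) ** 2)

# Illustrative planning values: 3-point difference, SD of 10, two-sided alpha 0.05
print(n_per_arm(delta=3, sd=10))   # 175 per arm at 80% power
```

Every quantity that goes into that formula is an assumption, which is exactly why the rest of this piece is about where those assumptions go wrong.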

Where the optimism comes in

Power calculations require three inputs: a target effect size, a measure of variability, and an alpha level. Investigators choose the effect size and the alpha outright and must estimate the variability, and the calculation always produces a larger required sample for smaller effects and more variable outcomes. The pressure to keep sample sizes manageable — whether for cost, feasibility, or grant appeal — creates a structural incentive to choose inputs that make the study look achievable.

The most common source of optimism is the effect size. Investigators typically choose from one of three sources: prior literature, pilot data, or the smallest effect that would be "clinically meaningful." All three have problems.

Prior literature overestimates effects

Publication bias means the literature is not a random sample of all effects that were studied. Small positive results get published; small null results get filed. The effects in the literature are drawn from the right tail of the distribution of true effects, inflated by selection. The mean effect in a published literature is reliably larger than the mean effect in a replication study. Planning a sample size based on a published effect without discounting for publication bias is planning for a world the data don't actually live in.

Pilot data are too small to be informative

A pilot study with 20 participants has enormous uncertainty in its effect estimate. The confidence interval around a pilot's effect size is wide enough to include both "no effect" and "very large effect." Using the point estimate from a 20-person pilot to power a 200-person trial is treating sampling error as signal. The pilot tells you about feasibility, recruitment rates, outcome distributions, and measurement variability — not about the true effect size. Use it for the variance estimate. Don't use it for the effect size.

The "clinically meaningful" effect may not be achievable

The minimum clinically important difference (MCID) is the smallest effect that matters to a patient or clinician. Powering to the MCID is the principled approach — you are asking whether the treatment produces an effect large enough to care about. But the MCID is often chosen by the study team, not derived from patient-preference research, and it is frequently set at a level that is just barely feasible within budget. "We powered to the MCID" is the right framing; "we chose the MCID to make the study feasible" is a different thing.

The correct question is not "how large an effect would make this study feasible?" It is "how large an effect would make this treatment worth using?" Those two questions have different answers, and the gap between them is where underpowered literature comes from.

The optimistic assumption table

Here is where optimism most often enters the calculation, and what the honest alternative looks like:

  • Effect size. Optimistic: taken from the largest published estimate, with no publication bias discount. Honest: a conservative estimate with a discount, or a range with sensitivity analysis.
  • Standard deviation. Optimistic: from a homogeneous pilot or a best-case prior study. Honest: from a heterogeneous population similar to your enrollment target.
  • Attrition. Optimistic: 5–10%, because "we'll work hard on retention." Honest: 15–30% for interventions requiring sustained engagement; 10–15% for passive follow-up.
  • Protocol adherence. Optimistic: full compliance assumed in the power model. Honest: estimated non-compliance applied to dilute the effect size in the ITT analysis.
  • Baseline event rate. Optimistic: from a tertiary center with optimal diagnosis and treatment. Honest: from a community setting matching your enrollment sites.
  • Correlation structure. Optimistic: a favorable intraclass correlation assumed for clustered designs. Honest: ICC estimated conservatively from prior cluster data, not from the smallest published estimate.
  • Enrollment rate. Optimistic: the maximum feasible rate, to minimize trial duration. Honest: a rate informed by screen-failure history at your sites, with buffer.

Specifying the target effect: SESOI

The most important concept for honest power planning is the smallest effect size of interest (SESOI): the smallest treatment effect that, if real, would be worth acting on — changing clinical practice, approving a drug, recommending a screening program. This is not the same as the effect you expect to see. It is the threshold below which the treatment is not useful, even if it exists.

Power your study to detect the SESOI, not the effect you hope to find. If the study is powered to detect the SESOI and comes up null, you have learned something informative: if the treatment works at all, it works below the level you care about. That is a meaningful result. If the study is powered to detect a large, optimistic effect and comes up null, you know almost nothing — the null result is consistent with a small-but-real effect you were never able to detect.

Powering to the SESOI typically requires larger sample sizes than powering to an optimistic effect estimate. That is the point. The additional sample size is the cost of getting an informative answer. A study that cannot distinguish "no effect" from "a small but real effect below the detection threshold" is not a study that can inform decisions.
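As a rough illustration of that cost, the sketch below compares per-arm sample sizes for an optimistic standardized effect against a smaller SESOI, using statsmodels' two-sample t-test power solver. The d values are hypothetical planning numbers, not recommendations.

```python
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
# Hypothetical planning values: an optimistic Cohen's d of 0.5 vs. a SESOI of 0.3
for label, d in [("optimistic d = 0.5", 0.5), ("SESOI d = 0.3", 0.3)]:
    n = solver.solve_power(effect_size=d, alpha=0.05, power=0.80, alternative="two-sided")
    print(f"{label}: {n:.0f} per arm")
# Roughly 64 vs. 175 per arm: the honest target nearly triples the study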

Where does the SESOI come from? Several sources:

  • Patient-reported outcome studies that elicit minimum important differences directly from patients
  • Regulatory precedent — the effect size that was considered adequate for approval in similar indications
  • Absolute risk reduction thresholds from health economic analyses (number-needed-to-treat cutoffs)
  • Half the standard deviation of the outcome (the half-SD rule of thumb, equivalent to a Cohen's d of 0.5; weak empirical support, but sometimes used as a starting point)
  • Expert consensus, pre-specified before data collection, documented in the protocol

Whatever source you use, document it in the protocol and the SAP. "We powered to detect a 3-point difference in the WOMAC pain subscale because previous patient preference studies identified 3 points as the minimum important difference in this population" is defensible. "We powered to detect a 3-point difference" with no rationale is not.

Attrition: the most underestimated input

Most power calculations inflate the nominal sample size by a modest percentage to account for dropout — typically 10–15%. Most trials exceed this estimate substantially. Attrition in intervention trials is routinely 20–40%, and in studies requiring sustained engagement (behavioral interventions, long follow-up, burdensome assessment schedules) it can exceed 50%.

When attrition exceeds the planning estimate, the effective sample size drops below the target. If the analysis uses last-observation-carried-forward or complete-case analysis, power drops further. If the analysis uses a mixed model under missing-at-random (the more appropriate modern approach), the power loss is somewhat attenuated — but not eliminated, and only if the MAR assumption is defensible.

The right way to plan for attrition:

  1. Find historical attrition rates from similar studies in similar populations. Look at the full distribution, not just the mean.
  2. Apply the 80th percentile of that distribution, not the median. Your study will have bad luck; plan for it.
  3. Distinguish differential attrition (higher in one arm) from total attrition. Differential attrition is more damaging and harder to recover from analytically.
  4. Inflate the sample size by 1/(1 - attrition rate), not by adding a flat percentage. A 30% attrition rate requires enrolling 1/0.7 = 1.43x your target, not 1.3x (see the sketch below).
  5. Pre-specify the analysis approach for missing data. MMRM under MAR recovers more power than complete-case analysis and requires fewer additional participants.

An attrition adjustment that turns out to be too conservative results in a slightly overpowered study — an acceptable outcome. An attrition adjustment that is too optimistic results in an underpowered study that cannot answer its primary question, which wastes everyone's time and is never an acceptable outcome.
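A minimal sketch of that inflation, using an illustrative completer target and attrition rates, and showing how the common flat-percentage addition understates the enrollment needed:

```python
def enrollment_target(completers, attrition):
    """Enrollment needed when a fraction `attrition` is expected to drop out."""
    return completers / (1 - attrition)

completers_needed = 200   # illustrative analyzable-sample target
for rate in (0.10, 0.20, 0.30, 0.40):
    correct = enrollment_target(completers_needed, rate)
    flat = completers_needed * (1 + rate)   # the common, and wrong, flat inflation
    print(f"attrition {rate:.0%}: enroll {correct:.0f} (flat addition would say {flat:.0f})")
```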

Clustered designs and the ICC tax

Cluster-randomized trials — where intact groups (clinics, schools, households) rather than individuals are assigned to treatment — require a design effect correction that can multiply the required sample size dramatically. The design effect is a function of the intraclass correlation coefficient (ICC): the proportion of outcome variance attributable to the cluster rather than the individual.

The design effect for a cluster-randomized trial is approximately 1 + (m - 1) × ICC, where m is the cluster size. For a clinic-randomized trial with 20 patients per clinic and an ICC of 0.05 — a relatively small clustering effect — the design effect is 1 + 19 × 0.05 = 1.95. You need almost twice the number of participants you would need in an individually-randomized trial. For an ICC of 0.10, the design effect rises to 2.9.

ICC values in the literature are noisy and context-dependent. The same outcome measured in different clinical settings can have ICCs that differ by an order of magnitude. Planning a cluster-randomized trial with an ICC from a study done in a different country, different care setting, or different patient population is a common and costly mistake.

When the ICC is uncertain — which is almost always — run the power calculation at three values: optimistic (low ICC), best-estimate, and conservative (high ICC). Report all three. If the study is not feasible at the conservative ICC, it is not a robustly powered study. The funder and the IRB deserve to know that.
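A short sketch of that three-scenario reporting, assuming an illustrative per-arm sample size from the individually randomized calculation and a cluster size of 20; the ICC values are placeholders for the optimistic, best-estimate, and conservative scenarios.

```python
def design_effect(cluster_size, icc):
    """Variance inflation for a cluster-randomized design: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

n_individual = 350    # illustrative per-arm n from the individually randomized calculation
cluster_size = 20
for label, icc in [("optimistic", 0.02), ("best estimate", 0.05), ("conservative", 0.10)]:
    deff = design_effect(cluster_size, icc)
    print(f"{label:13s} ICC={icc:.2f}  design effect={deff:.2f}  per-arm n={n_individual * deff:.0f}")
```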

Non-inferiority and equivalence margins

Power calculations for non-inferiority trials deserve their own treatment because they invert the usual logic. In a superiority trial, you are trying to detect a positive effect. In a non-inferiority trial, you are trying to rule out a negative effect — specifically, that the new treatment is not worse than the control by more than a pre-specified margin (the non-inferiority margin).

The non-inferiority margin is the most important design decision in a non-inferiority trial. It represents the largest reduction in efficacy that would still be acceptable given the new treatment's potential advantages (fewer side effects, oral vs. injectable, lower cost). Regulators require that the margin be pre-specified, justified, and smaller than the effect that established the active control's efficacy in its pivotal trials.

The M1/M2 framework (from the FDA's non-inferiority guidance, building on ICH E10) structures this: M1 is the full effect of the active control vs. placebo, established from historical data; M2 is the non-inferiority margin itself, the largest loss of M1 that is still clinically acceptable. If the trial must preserve, say, 50% of M1, then M2 is half of M1; requirements to preserve 50–80% of the historical effect are typical.

A non-inferiority trial with a margin that is too large is not evidence of non-inferiority — it is evidence that the sample size was chosen to make the study feasible rather than to answer the clinical question. The FDA and EMA will reject margins that are not anchored in the historical evidence about the active control. "We chose a margin of X because it was achievable with 200 patients" is not a justification.

Sample size for non-inferiority is often larger than for an equivalent superiority trial, because the non-inferiority margin is usually smaller than the effect a superiority trial would be powered to detect, so the confidence interval must be narrow enough to exclude a difference of that size. Plan accordingly.
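For orientation, here is a minimal normal-approximation sketch of the per-arm sample size for a non-inferiority comparison of means, under the common planning assumption that the treatments are truly equal. The margin, standard deviation, and one-sided alpha are illustrative, not regulatory recommendations.

```python
import math
from scipy.stats import norm

def ni_n_per_arm(margin, true_diff, sd, alpha=0.025, power=0.80):
    """Per-arm n for a non-inferiority comparison of means (normal approximation).
    `margin` is the non-inferiority margin; `true_diff` is the assumed true difference
    (0 if the treatments are assumed genuinely equal)."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)    # one-sided test at alpha
    return math.ceil(2 * (sd * z / (margin - true_diff)) ** 2)

# Illustrative: SD of 10, margin of 2.5 points, treatments assumed truly equal
print(ni_n_per_arm(margin=2.5, true_diff=0, sd=10))   # roughly 250 per arm
```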

Sensitivity analyses on the power calculation

A power calculation based on a single set of assumptions gives a false sense of precision. The right approach is to treat the power calculation as a model and run sensitivity analyses on its inputs — exactly the way you would run sensitivity analyses on a primary efficacy model.

At minimum, vary:

  • Effect size: optimistic estimate, best-estimate, conservative estimate (SESOI). Report the sample size for each.
  • Standard deviation: best-estimate ± 25–30%.
  • Attrition: 10%, 20%, 30% (or the range from prior literature).
  • Power level: 80%, 85%, 90%. Show what the additional participants buy you.
  • ICC (for clustered designs): the full plausible range.

Present this as a table or a contour plot (sample size as a function of effect size and standard deviation) rather than a single number. The visual makes clear how sensitive the sample size is to each input, and where the assumptions are load-bearing.
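A small sketch of how such a table can be generated, crossing illustrative effect sizes, standard deviations, and attrition rates with the same normal-approximation formula used for planning. All grid values are placeholders to be replaced with your own plausible ranges.

```python
from scipy.stats import norm

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (sd * z / delta) ** 2

# Illustrative ranges: conservative-to-optimistic effects, SD +/- ~25%, attrition range
deltas = [2.0, 3.0, 4.0]
sds = [8.0, 10.0, 12.0]
attritions = [0.10, 0.20, 0.30]

print("delta    sd   attrition   enroll per arm")
for d in deltas:
    for s in sds:
        for a in attritions:
            enroll = n_per_arm(d, s) / (1 - a)
            print(f"{d:5.1f} {s:5.1f} {a:10.0%} {enroll:15.0f}")
```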

The concept of "assurance" formalizes this: instead of computing power under a point estimate of the effect size, assurance integrates over a prior distribution on the effect size to compute the expected probability of a successful trial. Assurance is almost always lower than the power calculated at the point estimate — usually substantially lower. If your study team has not encountered this concept, it is worth introducing before the protocol is finalized.
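A minimal Monte Carlo sketch of assurance for a two-sample z-test, assuming a hypothetical normal prior on the true effect; the planning values mirror the earlier example and are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def power_two_sample_z(delta, n_per_arm, sd, alpha=0.05):
    """Power of a two-sided, two-sample z-test at a given true difference."""
    se = sd * np.sqrt(2.0 / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - delta / se) + norm.cdf(-z_crit - delta / se)

rng = np.random.default_rng(1)
# Hypothetical prior on the true effect: centered on the planning value of 3, but uncertain
prior_draws = rng.normal(loc=3.0, scale=1.5, size=100_000)

n = 175    # per-arm n giving ~80% power at delta = 3, sd = 10
print(f"power at the point estimate: {power_two_sample_z(3.0, n, 10):.2f}")
print(f"assurance under the prior:   {power_two_sample_z(prior_draws, n, 10).mean():.2f}")
# Assurance comes out noticeably below 0.80
```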

What the IRB and regulators actually check

IRBs are primarily concerned with whether the sample size is large enough to answer the scientific question and small enough to avoid unnecessary participant exposure to risk or burden. A power calculation that is clearly reverse-engineered from a feasible sample size will attract scrutiny. A power calculation with a clear scientific rationale for the effect size will not.

Regulators — FDA, EMA, PMDA — have more exacting expectations for registration trials. The FDA's Division-specific guidance documents often specify expected sample sizes, event counts, or precision requirements for particular indications. The power calculation for a registration trial should be benchmarked against that guidance, not just against published literature.

For FDA submissions, the key questions are:

  • Is the effect size assumption consistent with the pre-specified primary endpoint definition?
  • Is the variance estimate from a comparable population, not a best-case scenario?
  • Is the non-inferiority or superiority margin pre-specified and clinically justified?
  • Is the multiplicity adjustment reflected in the power calculation (for co-primary or hierarchical endpoints)?
  • Is the attrition assumption realistic given the trial duration and the indication?

End-of-Phase 2 meetings with the FDA specifically address the Phase 3 design, including sample size. Using that meeting to get explicit FDA buy-in on the power assumptions is one of the highest-value activities in drug development. A Phase 3 trial launched without FDA concurrence on the design is a trial that may or may not be interpretable as registration-enabling.

Multiplicity and its effect on power

When a trial tests more than one primary or key secondary endpoint, the power calculation must account for the multiplicity adjustment. If you are using a hierarchical testing procedure — testing endpoints in pre-specified order, claiming significance only when all preceding endpoints are significant — the power for the secondary endpoint is the probability of being significant on both the primary and the secondary, which is lower than the power for either one alone.

For a trial with two co-primary endpoints (where both must be significant to claim success), the joint power is the product of the individual powers if the endpoints are independent, higher if they are positively correlated, and lower if the correlation works against you. For two independent co-primary endpoints each powered to 90%, the joint power is 81%. If you need 80% joint power from roughly independent endpoints, you need individual powers of roughly 90%.

Correlation between endpoints is a double-edged sword in co-primary designs. High positive correlation can increase joint power (if one endpoint is large, the other is likely to be too); high negative correlation reduces it. Specify the assumed correlation structure in the power calculation for co-primary or multiple-endpoint trials. It is rarely the case that endpoints are independent, and the independence assumption usually flatters the power estimate.
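The dependence on correlation is easy to check by simulation. The sketch below assumes the two z-statistics are bivariate normal with a given correlation and marginal powers of 90% each; the correlation values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def joint_power(power1, power2, rho, alpha=0.025, n_sim=200_000, seed=7):
    """Probability that both one-sided tests reject, assuming the two z-statistics are
    bivariate normal with correlation rho and the stated marginal powers."""
    rng = np.random.default_rng(seed)
    z_crit = norm.ppf(1 - alpha)
    mu = [z_crit + norm.ppf(power1), z_crit + norm.ppf(power2)]
    z = rng.multivariate_normal(mu, [[1.0, rho], [rho, 1.0]], size=n_sim)
    return np.mean((z[:, 0] > z_crit) & (z[:, 1] > z_crit))

for rho in (0.0, 0.4, 0.8):
    print(f"rho = {rho:.1f}: joint power ~ {joint_power(0.90, 0.90, rho):.2f}")
# About 0.81 under independence, rising toward the marginal 0.90 as rho increases
```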

Adaptive designs and conditional power

Adaptive trials complicate the power calculation in ways that are worth understanding before you decide to pursue one. An adaptive design that allows sample size re-estimation based on an interim analysis can recover power if the nuisance parameters (variance, event rate, ICC) were mis-estimated at design time. But the re-estimation is done under blinding constraints that limit what information can be used, and the re-estimation rule must be pre-specified.

The key concept is conditional power: given the data observed at the interim, what is the probability of achieving a significant result at the final analysis, assuming the true effect is as specified? If conditional power falls below a futility threshold (often 20–30%), the study may be stopped for futility. If the nuisance parameter estimate at the interim suggests the original sample size is insufficient, the study may be extended.
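A minimal sketch of the standard conditional power calculation under the Brownian-motion (drift) formulation, evaluated both under the original design assumption and under the trend observed at the interim. The interim z-statistic and information fraction are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(z_interim, info_frac, theta, alpha=0.05):
    """Conditional power of a two-sided z-test (rejecting in the favorable direction),
    given the interim z-statistic at information fraction `info_frac` and an assumed
    drift `theta`, the expected z-statistic at full information under the assumed effect."""
    z_crit = norm.ppf(1 - alpha / 2)
    numerator = z_interim * sqrt(info_frac) + theta * (1 - info_frac) - z_crit
    return norm.cdf(numerator / sqrt(1 - info_frac))

# Illustrative: halfway through (info_frac = 0.5) with an interim z of 1.0
theta_design = norm.ppf(1 - 0.05 / 2) + norm.ppf(0.80)   # drift assumed by the 80%-power design
theta_trend = 1.0 / sqrt(0.5)                            # drift implied by the current trend
print(f"CP under the design assumption: {conditional_power(1.0, 0.5, theta_design):.2f}")
print(f"CP under the current trend:     {conditional_power(1.0, 0.5, theta_trend):.2f}")
```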

Sample size re-estimation based on the unblinded treatment effect is a different matter. Increasing sample size because the observed effect is smaller than planned can introduce operational bias — unblinded team members may behave differently once they know the direction of the effect. The methods for this (Cui-Hung-Wang, Chen-DeMets-Lan) are established, but they require careful statistical oversight and pre-specification.

Adaptive designs are not a way to rescue an underpowered trial. They require more rigorous planning than fixed designs, more robust operating procedures, and more sophisticated statistical methodology. If your team is considering an adaptive design primarily because the fixed design sample size is unfeasible, that is a warning sign that the effect size or budget assumptions need revisiting, not that adaptive design is the solution.

Common power calculation mistakes

Mistake 1: Using the point estimate from an underpowered pilot.

A pilot study's effect estimate is unreliable. The confidence interval around it is wide. Using the point estimate — especially if it is large — to power the definitive trial is importing uncertainty as if it were information. Use the pilot for the variance estimate and for feasibility data. Use independent literature, patient preference data, or regulatory precedent for the effect size.

Mistake 2: Single-scenario power calculation.

A table showing one sample size under one set of assumptions is not a power justification; it is a number. A table showing sample size across a range of effect sizes and standard deviations is a justification. Any reviewer or funder should be able to see how sensitive the answer is to the key assumptions.

Mistake 3: Forgetting the analysis model when computing power.

Power calculations are often done for a t-test or chi-square test even when the planned analysis is a linear mixed model, a proportional hazards model, or a GEE. These are not equivalent. The mixed model for repeated measures has more power than a t-test on the change score for the same sample size, because it uses all the within-subject data. Using a simple test to compute power and then analyzing with a more powerful model means you enrolled more people than you needed — which is a resource waste and an ethics question. Use the power formula that matches your analysis model.
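When no closed-form formula matches the planned analysis, simulation is the general answer: generate trials under the planning assumptions, analyze each one with the intended model, and count rejections. The sketch below illustrates the principle with a simpler stand-in than an MMRM (ANCOVA adjusting for baseline vs. an unadjusted t-test), under hypothetical effect, SD, and baseline-correlation values; the same template extends to the actual planned model by swapping in its fitting call.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)

def simulated_power(n_per_arm=100, effect=3.0, sd=10.0, baseline_corr=0.6, n_sim=2000):
    """Power by simulation for two analyses of the same trial: an unadjusted t-test
    vs. ANCOVA adjusting for baseline. All planning values are illustrative."""
    hits_ttest = hits_ancova = 0
    for _ in range(n_sim):
        treat = np.repeat([0, 1], n_per_arm)
        baseline = rng.normal(0, sd, size=2 * n_per_arm)
        noise_sd = sd * np.sqrt(1 - baseline_corr ** 2)
        post = effect * treat + baseline_corr * baseline + rng.normal(0, noise_sd, size=2 * n_per_arm)
        hits_ttest += stats.ttest_ind(post[treat == 1], post[treat == 0]).pvalue < 0.05
        X = sm.add_constant(np.column_stack([treat, baseline]))
        hits_ancova += sm.OLS(post, X).fit().pvalues[1] < 0.05   # p-value for the treatment term
    return hits_ttest / n_sim, hits_ancova / n_sim

print(simulated_power())   # the adjusted analysis is more powerful at the same n
```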

Mistake 4: Attrition applied incorrectly.

The inflation factor for attrition is 1 / (1 - dropout rate), not a simple addition. If you need 100 completers and expect 25% dropout, you need to enroll 100 / 0.75 = 133, not 125. For 40% dropout, you need 167, not 140. The error compounds at higher dropout rates and is routinely underestimated in protocol budgets.

Mistake 5: No justification for the chosen power level.

"We powered the study to 80%" is not a justification — it is a statement that you used the convention. Is 80% appropriate? For a study whose results will directly inform regulatory approval, 90% is more defensible. For a Phase 2 signal-finding study, 80% may be appropriate. The power level is a choice about how much risk of missing a real effect is acceptable. State why you chose it.

Mistake 6: The calculation is not aligned with the primary analysis.

Power was calculated for a comparison of means at a single timepoint, but the primary analysis is a mixed model for repeated measures. Or power was calculated assuming equal randomization, but the protocol specifies a 2:1 randomization ratio. Or power was calculated for a superiority hypothesis, but the protocol specifies a non-inferiority objective. Any mismatch between the power calculation and the protocol will surface in peer review or regulatory review, and it will raise questions about which one is authoritative.

Practical power calculation checklist

Before submitting to the IRB or filing a registration trial protocol, verify each of these:

  1. Effect size is justified with a source. Literature with publication bias discount, patient preference data, or regulatory precedent. Not a pilot point estimate.
  2. The SESOI is defined and documented. The smallest effect that would change clinical practice or be worth implementing.
  3. Standard deviation estimate is from a comparable population. Same indication, similar setting, similar measurement protocol.
  4. Attrition estimate is justified and conservatively set. At the 80th percentile of comparable studies, applied as 1/(1 - rate), not a flat addition.
  5. Sensitivity table covers a range of effect sizes, SDs, and attrition rates. Single-scenario tables are not sufficient for a rigorous protocol.
  6. Power formula matches the primary analysis model. Mixed model power for a mixed model analysis, not a t-test proxy.
  7. Multiplicity is reflected in the power calculation. Joint power for co-primary endpoints; adjusted alpha for hierarchical families.
  8. ICC adjustment is applied for clustered designs. Justified from comparable literature, with sensitivity to the ICC estimate.
  9. Non-inferiority margin is clinically justified (if applicable). Anchored in the M1/M2 framework, not chosen for sample size convenience.
  10. The chosen power level is stated and justified. 80%, 85%, 90% — each is a choice about acceptable miss rates. State the rationale.

Bottom line

A power calculation is not a ritual that produces a sample size. It is a model of your study that forces you to commit to what effect you are trying to detect, how variable your outcome is, and how much uncertainty in those estimates you are willing to carry. The quality of the power calculation is a leading indicator of the quality of the study.

Most power calculations are too optimistic because every input has a natural direction toward smaller sample sizes, and the pressures of feasibility, budget, and grant appeal all push in the same direction. The corrective is to be deliberately conservative — not pessimistic, but honest — about each assumption, and to show the sensitivity of the sample size to that conservatism.

An underpowered study is not just a failed study. It is a study that consumed participants' time and risk without producing interpretable results — which is an ethics problem, not just a methods problem. The IRB's job is to ensure that the research burden on participants is justified by the knowledge that will be produced. An optimistic power calculation that leads to an underpowered study fails that test.

If your power calculation tells you a study is not feasible at the effect size that is clinically meaningful, the right answer is usually not to inflate the effect size until it is feasible. It is to redesign the study — change the outcome, extend the follow-up, enrich the population, find a more efficient analysis model, or accept that this question requires a larger investment than the current budget supports.


Need help with your power calculation?

I review and pressure-test power calculations for IRB submissions, grant applications, and registration trial protocols — with a sensitivity analysis and a written justification you can defend to a reviewer.