concept 38 sources

Randomized Controlled Trial

Citations audited: 3 accurate, 35 not yet audited
evidence-based-medicine epidemiology biostatistics
Eras twentieth-century, contemporary
First appearance 1946 (MRC streptomycin trial)

Summary

A randomized controlled trial (RCT) is a study design in which participants are randomly assigned to receive either an experimental intervention or a control condition, so that differences in outcome can be attributed to the intervention itself rather than to pre-existing differences between groups. The design was formalized in medicine in 1946, when Austin Bradford Hill used it to evaluate streptomycin as a treatment for tuberculosis, and it subsequently became the dominant standard of proof in clinical research. Proponents argue that randomization eliminates selection bias and that RCTs provide the clearest possible evidence of cause and effect. Critics have questioned whether populations enrolled in trials resemble real patients, whether commercial pressures distort study design, and whether statistical significance is being confused with clinical importance. The RCT is simultaneously medicine’s most powerful methodological tool and a site of unresolved philosophical dispute.


Historical Development

Precursors: Comparison Without Randomization

The impulse to compare treatments systematically predates the RCT by centuries. James Lind’s 1747 trial of citrus against scurvy aboard HMS Salisbury is routinely cited as a proto-RCT. Lind took twelve scorbutic sailors on the same diet in the same quarters and assigned them in pairs to six different remedies (cider, an electuary of garlic and myrrh, elixir of vitriol, vinegar, sea-water, or two oranges and a lemon daily). The citrus pair were on their feet within two weeks, while the others showed little to no improvement (Griggs, 1981). The design lacked randomization, a formal control arm, and the theoretical apparatus that would later justify such procedures. What Lind demonstrated was the comparative principle: that deciding between treatments requires simultaneous comparison under similar conditions, not sequential observation of unrelated cases.

The nineteenth century produced further systematic comparisons, but without randomization these remained vulnerable to the objection that investigators unconsciously allocated better-prognosis patients to their favored treatment. The problem was recognized by practitioners, but no statistical solution existed until Ronald Fisher’s work in agricultural field trials provided the conceptual foundation.

R. A. Fisher and the Logic of Randomization

Fisher’s work in the 1920s and 1930s established that randomization serves two distinct purposes: it balances confounders, including unknown ones, between groups in expectation, and it provides the probabilistic basis for computing significance tests. Without randomization, significance testing rests on unverifiable assumptions about what other variables might be operating. The RCT design is therefore not merely a pragmatic precaution against bias; it provides the formal justification for the statistical inferences drawn from it.
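
The second purpose, randomization as the basis for the significance test itself, can be illustrated with a randomization (permutation) test. The sketch below is a minimal Python illustration, not a reconstruction of Fisher’s own calculations: the outcome scores, the function name `randomization_test`, and the number of reshuffles are all hypothetical. It repeatedly re-randomizes the group labels and counts how often chance allocation alone produces a difference in means at least as large as the one observed.

```python
import random

def randomization_test(treated, control, n_shuffles=10_000, seed=0):
    """Two-sided randomization test on the difference in group means.

    Returns the observed difference and the share of random re-allocations
    whose absolute difference is at least as large (a Monte Carlo p-value).
    """
    rng = random.Random(seed)
    n_treated = len(treated)
    observed = sum(treated) / n_treated - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_control = len(pooled) - n_treated
    extreme = 0
    for _ in range(n_shuffles):
        # Re-run the allocation: any split of the pooled outcomes was
        # equally likely under the null hypothesis of no treatment effect.
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_treated]) / n_treated
                - sum(pooled[n_treated:]) / n_control)
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_shuffles

# Hypothetical outcome scores for two randomized arms (illustrative only).
treated = [7.1, 6.8, 8.0, 7.5, 6.9, 7.7]
control = [6.2, 6.9, 6.4, 7.0, 6.1, 6.6]

observed_diff, p_value = randomization_test(treated, control)
print(f"observed difference: {observed_diff:.2f}, p = {p_value:.3f}")
```

The p-value here depends only on the fact that allocation was random; no assumption about the distribution of outcomes is needed, which is the sense in which randomization justifies the test.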

The 1946 MRC Streptomycin Trial

The landmark was the Medical Research Council’s 1946 trial of streptomycin for pulmonary tuberculosis, designed by Austin Bradford Hill following Fisher’s principles. The British Medical Journal noted at the time that this was the first randomized controlled trial reported in human subjects, and that it offered “the clearest possible proof” that acute tuberculosis “could be halted by streptomycin” (Porter, 1997). Patients were allocated to streptomycin plus bed rest or to bed rest alone according to a random-number sequence whose assignments were kept in sealed envelopes, a procedural detail that would become standard.

The streptomycin trial became a template not only because of its methodological features but because it arrived at a clear answer to a clinically pressing question at a moment when a new therapeutic class (antibiotics) was creating both urgency and opportunity for rigorous evaluation.

The Rise of Evidence-Based Medicine

From the 1960s onward the RCT spread across clinical specialties. Bernard Fisher’s application of randomization to breast cancer surgery was paradigmatic for the shift outside pharmacology. Fisher demanded a properly randomized trial to answer whether radical mastectomy improved survival over less extensive surgery, applying Neyman-Pearson statistical theory to a question that had been decided for decades by surgical opinion alone (Mukherjee, 2010). The resulting NSABP-04 trial, published in 1981, found statistically identical survival rates across radical mastectomy, simple mastectomy, and simple mastectomy followed by radiation, refuting the theoretical basis of the dominant surgical procedure (Mukherjee, 2010). Mukherjee draws the methodological lesson that the randomized trial had proven itself the decisive arbiter: surgery without evidence was the problem, not surgery itself (Mukherjee, 2010).

In Sackett and colleagues’ own retrospective account, the term “evidence-based medicine” was consolidated and named in 1992 by a group led by Gordon Guyatt at McMaster University in Canada, with the volume of EBM-related publications then rising from one in 1992 to roughly a thousand by 1998 (Sackett, David L. et al., 2000). Sackett places the deeper roots earlier, in post-revolutionary Paris, where Pierre Louis rejected authoritarian pronouncements (most famously the doctrine that venesection was good for cholera) in favor of systematic observation, with even older antecedents in Qing-dynasty Chinese kaozheng or “evidential research” (Sackett, David L. et al., 2000). The movement’s UK diffusion followed a 1992 introduction of clinical epidemiology to William Rosenberg by Muir Gray, a McMaster visit, training sessions for Oxford registrars, and the founding of the Centre for Evidence-Based Medicine after David Sackett’s arrival a year later (Sackett, David L. et al., 2000). The evidence-based medicine (EBM) movement codified hierarchies of evidence quality, placing systematic reviews and meta-analyses of RCTs at the top and anecdotal case reports at the bottom (Stegenga, 2018). RCTs were assigned near-top status because randomization was understood as the most reliable method for eliminating confounding.


Key Figures

Ronald A. Fisher (1890–1962): Statistician and geneticist who developed the theoretical basis for randomization in experimental design. Fisher’s work in agricultural field trials provided the probabilistic foundation that later justified the use of random allocation in clinical experiments.

Austin Bradford Hill (1897–1991): Epidemiologist who designed the 1946 MRC streptomycin trial (Porter, 1997). Hill also developed Hill’s criteria — nine considerations (strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy) that can ground causal inference even in situations where RCTs are impractical (Stegenga, 2018). Hill’s nuanced view — that criteria, not algorithms, govern causal inference — is often lost in later invocations of his name as an authority for RCT orthodoxy.

Bernard Fisher (1918–2019): Surgical oncologist whose NSABP-04 trial demonstrated that breast cancer is a systemic rather than locally spreading disease (Mukherjee, 2010). Fisher’s work showed that the RCT method could overturn entrenched surgical doctrine backed by decades of practitioner consensus.

David Sackett (1934–2015) and the McMaster group: Sackett, an American clinical epidemiologist who founded McMaster University’s department of clinical epidemiology and biostatistics in the 1960s and later directed the Centre for Evidence-Based Medicine at Oxford, was the figure most responsible for translating EBM from a methodological program into a routine of bedside clinical practice. The 2000 textbook Evidence-Based Medicine: How to Practice and Teach EBM, written with Sharon Straus, Scott Richardson, William Rosenberg, and Brian Haynes, defines EBM as “the integration of best research evidence with clinical expertise and patient values” and structures the practice into five steps: converting an information need into an answerable question, tracking down the best evidence, critically appraising that evidence for validity, impact, and applicability, integrating the appraisal with clinical expertise and the patient’s circumstances, and evaluating one’s own performance (Sackett, David L. et al., 2000). Within this framework Sackett and colleagues distinguish three working modes: “appraising” (full search plus critical appraisal for everyday conditions), “searching” (using pre-appraised resources for less common problems), and “replicating” (accepting expert recommendations on rare problems), with the warning that the replicating mode is “blind” to whether the advice received is “authoritative (evidence-based) or merely authoritarian (opinion-based, resulting from pride and prejudice)” (Sackett, David L. et al., 2000). Audits of inpatient services applying these modes have reported that 82% of primary interventions were evidence-based, with 53% grounded in randomized trials or systematic reviews and 29% in non-experimental evidence (Sackett, David L. et al., 2000). 
Sackett’s group concedes that no RCT has yet shown EBM itself improves patient outcomes, noting the ethical awkwardness of randomizing clinicians to be denied access to evidence, and relies instead on observational outcomes research showing that patients receiving evidence-based therapies fare better than those who do not (Sackett, David L. et al., 2000). They also reject what they call “pseudo-limitations” of EBM, namely the charges that it denigrates clinical expertise, ignores patient values, or imposes cookbook medicine, while acknowledging genuine limitations including the skill burden, time pressure, and the slow accumulation of evidence that EBM works (Sackett, David L. et al., 2000).


Theoretical Framework

What Randomization Actually Does

Randomization addresses one specific threat to causal inference: the possibility that the treatment group differs from the control group at baseline in ways that independently affect outcomes. By allocating participants through a chance process, the investigator removes the possibility that their own expectations or patients’ characteristics determine group membership. Sackett’s textbook puts the case directly: random allocation “comes closer than any other research design to creating groups of patients at the start of the trial who are identical in their risk for the event we’re hoping to prevent,” balancing groups for known and unknown prognostic factors that could otherwise exaggerate, cancel, or counteract the effects of therapy (Sackett, David L. et al., 2000).

The reason this matters is that other techniques for handling confounding (exclusion, stratified sampling, matching, multivariate adjustment) all require that the investigator already know what the confounder is. Randomization is the only design that handles unknown confounders, because it distributes them by chance rather than by hypothesis (Sackett, David L. et al., 2000). Sackett’s appraisal checklist for an individual therapy trial follows from this: randomization with concealed allocation, follow-up sufficiently long and complete, and intention-to-treat analysis (all patients counted in the groups to which they were randomized), with secondary criteria of blinding, equal co-treatment, and baseline comparability (Sackett, David L. et al., 2000).

The design creates two populations that are, on average, equivalent across all known and unknown variables at the point of allocation. Subsequent differences in outcome can then be attributed, with probabilistic confidence, to the intervention. This reasoning explains why the RCT occupies a privileged position: no observational design can make this claim without auxiliary assumptions.
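
The phrase “on average” can be made concrete with a small simulation. The Python sketch below is illustrative only: the prognosis scores are simulated, and the helper names (`random_allocation`, `biased_allocation`, `mean_gap`) are hypothetical. It compares chance allocation with an allocation in which the investigator, consciously or not, gives the treatment to better-prognosis patients, and it measures the baseline gap in a prognostic factor that nobody has recorded.

```python
import random
import statistics

def mean_gap(assignment, prognosis):
    """Baseline difference in mean prognosis, treated minus control."""
    treated = [p for flag, p in zip(assignment, prognosis) if flag]
    control = [p for flag, p in zip(assignment, prognosis) if not flag]
    return statistics.mean(treated) - statistics.mean(control)

def random_allocation(n, rng):
    """Assign half the participants to treatment purely by chance."""
    flags = [True] * (n // 2) + [False] * (n - n // 2)
    rng.shuffle(flags)
    return flags

def biased_allocation(prognosis):
    """Hypothetical biased investigator: better-prognosis patients get treated."""
    cutoff = statistics.median(prognosis)
    return [p > cutoff for p in prognosis]

rng = random.Random(42)
n = 200
# Simulated prognostic factor that the investigator never measures.
prognosis = [rng.gauss(0, 1) for _ in range(n)]

# Repeat the random allocation many times: individual trials show some
# chance imbalance, but the imbalance averages out to roughly zero.
random_gaps = [mean_gap(random_allocation(n, rng), prognosis)
               for _ in range(2000)]
avg_random_gap = statistics.mean(random_gaps)
biased_gap = mean_gap(biased_allocation(prognosis), prognosis)

print(f"average gap under random allocation: {avg_random_gap:+.3f}")
print(f"gap under biased allocation:         {biased_gap:+.3f}")
```

Any single random allocation can still be imbalanced by chance; the guarantee is about the allocation procedure across repetitions, which is why baseline comparability remains a secondary check when a trial is appraised.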

The Black Box Thesis and Its Critics

One position, the “black box thesis,” holds that RCTs are both necessary and sufficient for causal inference about medical interventions: a well-conducted RCT producing a significant result justifies concluding that the treatment causes the outcome, without requiring knowledge of the mechanism (Stegenga, 2018). The competing “mechanista thesis” argues that mechanistic knowledge is additionally required to warrant a causal inference, because statistically significant associations can always be artifacts of unmeasured confounders, and because mechanisms provide the explanatory backbone that connects statistical association to biological reality (Stegenga, 2018).

Hill’s Criteria and the Limits of Randomization

Hill himself was not a strict advocate of RCT-only inference. In his famous 1965 address to the Royal Society of Medicine he enumerated nine considerations that could, in combination, ground causal inference even without experimental evidence. These have been called “Hill’s criteria,” though they are better understood as heuristics than as strict criteria (Stegenga, 2018). The criteria are relevant precisely because many important causal questions in medicine — whether smoking causes lung cancer, whether poverty causes poor health, whether a surgical era influences outcomes — cannot be randomized for ethical or practical reasons.

Sackett’s textbook adapts Hill’s framework into a working appraisal checklist for harm studies, where most evidence comes from cohorts and case-control designs rather than trials: ask whether the exposure preceded the outcome, whether there is a dose-response gradient, whether positive dechallenge-rechallenge evidence exists, whether the association is consistent across studies, and whether it makes biological sense.(Sackett, David L. et al., 2000) Effect-size thresholds matter as well. In a personal communication Sackett quotes, Richard Doll observed that “it’s almost impossible to set a level of risk which is so high that findings in a well-conducted epidemiological study would necessarily exclude confounding,” but that “20-fold excesses are in themselves almost sufficient to indicate causality.”(Sackett, David L. et al., 2000) The position concedes that an RCT remains the cleanest design for ruling out confounding while granting that, at sufficiently large effect sizes, observational inference can carry real causal weight on its own.


Reception and Controversy

The Replication Crisis and Research Bias

Stegenga documents a systematic problem: biases in medical research tend to generate evidence suggesting interventions are more effective and safer than they actually are (Stegenga, 2018). These biases operate at multiple points — study design, execution, analysis, and publication. P-hacking — analyzing data in multiple unconstrained ways to find a significant result — is one mechanism: when many statistical tests are applied to a single dataset, the probability of finding spurious signals increases substantially (Stegenga, 2018). Confirmation bias, the placebo effect, and regression to the mean in disease severity all operate to make ineffective interventions appear beneficial in uncontrolled settings (Stegenga, 2018).
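
The arithmetic behind the multiple-testing point can be sketched in a few lines of Python. This is a simplified illustration that assumes the tests are independent and each uses the conventional 0.05 threshold; neither assumption comes from the cited source, and real analyses of a single dataset are usually correlated. The function name `familywise_error` is hypothetical.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive among n independent tests
    when every null hypothesis is true and each test uses level alpha."""
    return 1 - (1 - alpha) ** n_tests

# How quickly a spurious "significant" result becomes likely.
for n_tests in (1, 5, 20, 50):
    print(f"{n_tests:>2} tests: {familywise_error(n_tests):.1%}")
```

Under these assumptions, a single pre-specified test keeps the false-positive risk at the nominal level, while unconstrained re-analysis lets it climb toward certainty.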

External Validity: Whom Do Trials Represent?

A persistent objection to RCTs concerns external validity — whether results in a trial population can be extrapolated to the broader clinical population. Stegenga argues that simple extrapolation from trial populations to clinical populations is the working assumption of EBM, but is unwarranted because features of real patients — age, gender, severity of illness, pre-existing medications — systematically differ from trial subjects, who are selected partly by their fitness to enroll (Stegenga, 2018). A trial that demonstrates efficacy in a carefully screened population may not predict effectiveness in the heterogeneous populations presenting in clinical practice. Sackett’s own answer to the applicability problem, offered from inside the EBM tradition rather than against it, is to invert the standard inclusion-criterion check: rather than demanding that a patient match every entry criterion, the clinician should ask whether the patient is “so different from those in the study that its results are useless to us,” reserving rejection for the rare cases of qualitatively different pharmacogenetics, absent immune responses, or prohibitive co-morbidities (Sackett, David L. et al., 2000). The Stegenga and Sackett positions are not strictly incompatible (both grant that the inferential gap exists), but they differ on whether the default presumption should be that trial results travel.

Sackett also warns against the closely related temptation of subgroup-specific extrapolation. Apparent qualitative differences in treatment response between subgroups (this drug helps men but not women, the elderly but not the young) are extremely rare; early aspirin trials for transient ischemic attack appeared to show benefit only in men, but later trials and systematic reviews showed this was a chance finding and that aspirin works in both sexes. Unless a subgroup difference makes biological sense, was hypothesized before the trial, and has been confirmed in an independent replication, the overall trial result is the better starting point for the individual patient than any post-hoc stratum.(Sackett, David L. et al., 2000)

Outcome Measure Framing

The choice of how to present trial results profoundly shapes clinical and public inference. Relative risk reduction is systematically more impressive-sounding than absolute risk reduction (risk difference), and physicians and patients systematically overestimate drug benefits when presented with relative figures (Stegenga, 2018). A drug that reduces the absolute probability of death from 8% to 6% produces a 25% relative risk reduction, a figure that sounds dramatic but corresponds to one additional life saved for every fifty patients treated. The number needed to treat (NNT) statistic addresses this distortion: it is the inverse of the absolute risk reduction and answers the practical question of how many patients must be treated to prevent one additional bad outcome, with the analogous Number Needed to Harm (NNH) equal to one over the absolute risk increase. Sackett describes the NNT/NNH pairing as an effort-to-yield ratio, “the poor clinician’s cost-effectiveness analysis,” and notes that NNT is less commonly reported in clinical literature than relative risk (Sackett, David L. et al., 2000). Sackett also argues that confidence intervals should generally be preferred to bare P-values, because a CI conveys the range of effect sizes consistent with the data rather than only whether the result crossed a significance threshold; non-significant results in particular are prone to misinterpretation as showing equivalence when the data are equally compatible with clinically important differences (Sackett, David L. et al., 2000).
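
The relationship between these measures can be worked through in Python. The first call uses the 8% and 6% figures from the example above; the second, low-baseline scenario (0.8% to 0.6%) is a hypothetical addition, included only to show that the same relative risk reduction can correspond to very different numbers needed to treat. The function name `risk_summary` is likewise hypothetical.

```python
import math

def risk_summary(control_risk, treated_risk):
    """Return (absolute risk reduction, relative risk reduction, NNT)."""
    arr = control_risk - treated_risk          # absolute risk reduction
    rrr = arr / control_risk                   # relative risk reduction
    nnt = math.inf if arr == 0 else 1 / arr    # number needed to treat
    return arr, rrr, nnt

# Figures from the text: death risk falls from 8% to 6%.
arr, rrr, nnt = risk_summary(0.08, 0.06)
print(f"ARR {arr:.1%}, RRR {rrr:.0%}, NNT {nnt:.0f}")

# Hypothetical low-risk population with the same relative reduction.
low_arr, low_rrr, low_nnt = risk_summary(0.008, 0.006)
print(f"ARR {low_arr:.1%}, RRR {low_rrr:.0%}, NNT {low_nnt:.0f}")
```

Because the relative figure is unchanged between the two scenarios while the NNT differs tenfold, reporting relative risk reduction alone hides the information a clinician needs to judge effort against yield.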

Sackett gives a specific warning about the equivalence fallacy. Sung et al.’s 100-patient randomized trial of octreotide infusion versus emergency sclerotherapy for acute variceal hemorrhage reported controlled-bleeding rates of 84% and 90% (P = 0.56) and concluded that the two treatments were “equally effective”; but the 95% CI for the 6% difference ran from -7% to +19%, wide enough that a clinically large difference in effectiveness could not be ruled out, so the equivalence conclusion was “certainly not valid.”(Sackett, David L. et al., 2000) Sackett also notes a peculiarity in deriving NNT confidence intervals when the underlying absolute-risk-reduction CI spans zero: taking reciprocals of the ARR endpoints yields negative and positive values that, properly read, denote NNT values from the lower bound to infinity and NNH values from the upper bound to infinity, rather than a tidy interval centered on the point estimate.(Sackett, David L. et al., 2000) A related reporting requirement: when two groups are compared, the appropriate CI is the one for the difference between groups, not separate CIs for each group; presenting the latter is “unhelpful” and “can be quite misleading,” because partial overlap of two single-group intervals does not correspond to a definite conclusion about the contrast of interest.(Sackett, David L. et al., 2000)
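
The interval in the octreotide example can be approximately reproduced with a short Python sketch. Two assumptions are made here that the text above does not state: that the 100 patients were split evenly, 50 per arm, and that a simple normal-approximation (Wald) interval is adequate. The function name `difference_ci` is hypothetical, and the reading of the reciprocals follows the description above rather than any particular software’s convention.

```python
import math

def difference_ci(p_a, n_a, p_b, n_b, z=1.96):
    """Wald (normal-approximation) 95% CI for the difference p_a - p_b."""
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z * se, diff + z * se

# Controlled-bleeding rates of 90% and 84%; the 50/50 split is an assumption.
low, high = difference_ci(0.90, 50, 0.84, 50)
print(f"CI for the difference: {low:+.0%} to {high:+.0%}")

# When the interval spans zero, the reciprocals of its endpoints do not
# form an ordinary interval: one side is read as NNT, the other as NNH,
# and both ranges extend to infinity (where the difference is zero).
if low < 0 < high:
    nnt_bound = 1 / high        # smallest NNT compatible with the data
    nnh_bound = 1 / abs(low)    # smallest NNH compatible with the data
    print(f"NNT from {nnt_bound:.1f} to infinity; "
          f"NNH from {nnh_bound:.1f} to infinity")
```

The sketch computes a single interval for the contrast between the two groups, which is the quantity the reporting requirement above asks for, rather than two separate single-group intervals.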

Commercial Influence

Stegenga counts industry conflicts of interest, together with research bias and small mean effect sizes, among the four pillars supporting “medical nihilism,” the position that medical interventions on average are not nearly as effective as commonly believed (Stegenga, 2018). In his framing, the bias problems documented in chapter 7 of Care and Cure are amplified rather than corrected by the commercial structure of contemporary trial sponsorship, with methods that already tend to overestimate effectiveness operating on data shaped by sponsors with a financial interest in positive results.

Ethical Requirements

The ethics of clinical research impose constraints on RCT design that have evolved substantially since 1946. AIDS activism beginning in the 1980s shifted biomedical research ethics from a framework concerned primarily with protecting subjects from harm toward a framework equally concerned with ensuring fair access to clinical trials (Tom L. Beauchamp, James F. Childress, 2013). The principle of therapeutic equipoise — genuine uncertainty among experts about whether the experimental or control condition is superior — is the standard justification for withholding one treatment from some participants. Without equipoise, a trial that allocates patients to an inferior treatment cannot be ethically justified.

Sinclair Lewis’s 1925 novel Arrowsmith anticipated these concerns decades before they became mainstream bioethical issues: the protagonist’s fictional plague trial, which withheld phage treatment from a portion of an afflicted plantation population to preserve a control group, was understood at the time as morally ambivalent rather than obviously wrong (Jackson (ed.), 2011). The Tuskegee syphilis study, beginning in 1932, later made the abuses of unethical controlled research impossible to ignore.


Legacy

The RCT reshaped medicine’s epistemic culture in the second half of the twentieth century. Before its widespread adoption, therapeutic decisions were grounded in pathophysiological reasoning (if this drug does X in the body, it should produce Y in the patient), expert opinion, and clinical experience. These are not worthless, but they are susceptible to systematic distortions that the RCT, at its best, can overcome. The streptomycin trial, the NSABP-04 breast cancer trial, and many subsequent landmark studies demonstrated that treatments widely practiced on plausible physiological grounds (including radical mastectomy, hormone replacement therapy in post-menopausal women, and routine episiotomy) either failed to produce the expected benefits or caused net harm when submitted to randomized evaluation.

At the same time, the RCT’s ascent to gold-standard status has generated its own distortions. The ranking within evidence hierarchies that places RCTs above observational data and mechanistic reasoning has been used to dismiss evidence for interventions that cannot feasibly be randomized, and the EBM framework as practiced has been captured in part by commercial interests for which it was not designed. Porter’s observation that medicine “never deliberately stopped to resolve the fundamental issues of truth and method” applies here: the RCT was adopted opportunistically, without resolving whether it is a generally valid standard or an appropriate standard only for specific types of questions (Porter, 1997).

Ackerknecht’s observation that antibiotic-resistant bacteria have created “an alarming new hospitalism” is a reminder that the diseases RCTs were designed to evaluate — acute, bacterial, and pharmacologically treatable — represent only one sector of human suffering, and that the methodology’s limitations become visible precisely in domains (chronic illness, complex multimorbidity, social determinants of health) where it is most difficult to apply (Ackerknecht, 1955).



See Also

  • evidence-based-medicine
  • austin-bradford-hill
  • hills-criteria
  • james-lind
  • streptomycin
  • statistical-inference
  • placebo-effect
  • therapeutic-equipoise
  • tuskegee-syphilis-study
  • nsabp-04
  • bernard-fisher
  • pharmaceutical-industry

Sources

  • Porter, Roy. The Greatest Benefit to Mankind: A Medical History of Humanity. 1997. Ch. 17 (port97-ch17-002)
  • Mukherjee, Siddhartha. The Emperor of All Maladies: A Biography of Cancer. 2010. Part III (muk10-part03-003, muk10-part03-004, muk10-part03-007)
  • Stegenga, Jacob. Care and Cure: An Introduction to Philosophy of Medicine. 2018. Ch. 7 and Ch. 9 (steg18-ch07-001, steg18-ch07-002, steg18-ch07-003, steg18-ch07-006, steg18-ch07-007, steg18-ch09-001, steg18-ch09-002, steg18-ch09-005)
  • Beauchamp, Tom L., and James F. Childress. Principles of Biomedical Ethics. 7th ed. 2013. Ch. 6 (bc13-ch06-010)
  • Jackson, Mark, ed. The Oxford Handbook of the History of Medicine. 2011. Ch. 23 (jac11-ch23-009)
  • Ackerknecht, Erwin. A Short History of Medicine. 1955. Ch. 21 (ack55-ch21-008)
  • Sackett, David L., Sharon E. Straus, W. Scott Richardson, William Rosenberg, and R. Brian Haynes. Evidence-Based Medicine: How to Practice and Teach EBM. 2nd ed. 2000. Preface, Introduction, Ch. 5 (Therapy), Appendix on Confidence Intervals (sack00-pre-003, sack00-ch00-003, sack00-ch00-004, sack00-ch00-007, sack00-ch00-008, sack00-ch00-009, sack00-ch00-010, sack00-ch00-011, sack00-ch05-001, sack00-ch05-002, sack00-ch05-003, sack00-ch05-004, sack00-ch05-010, sack00-ch15-001)

Editorial Notes

Gaps the encyclopaedia compiler flagged for future evidence work, collected from inline markers in the body and frontmatter.

R. A. Fisher and the Logic of Randomization

  • [GAP: specialist source needed — Box 1978 R. A. Fisher: The Life of a Scientist not in Library; Fisher’s statistical biography requires specialist mathematics/history-of-science source]

The Rise of Evidence-Based Medicine

  • [GAP: specialist source needed — Daly 2005 Evidence-Based Medicine and the Search for a Science of Clinical Care not in Library; independent historian’s account of the McMaster/Guyatt coining unattested]

Key Figures

  • Sackett TODO resolved via Sackett et al. Evidence-Based Medicine: How to Practice and Teach EBM, 2nd ed. (2000) — the second edition of the 1997 textbook, ingested 2026-05-01.

Sources

This article draws on 38 evidence cards from 8 sources.