2020
Predicting Success
Simulation can forecast probable success in clinical trials
by I. Elaine Allen and Christopher A. Seaman
Simulation in drug development is becoming standard practice for pharmaceutical and biotechnology companies. Considering that a phase three efficacy clinical trial for a potential new product could cost nearly $100 million, spending time on simulation activities before fully committing to developing a new product has proven to be increasingly worthwhile for more companies.
Knowing the probability of success in specific patient populations before investing big dollars can cut costs or even stop a trial before it starts. The use of mathematical or stochastic models of clinical trials, including information about drug activity, drug availability and the disease process, has been discussed in statistical and clinical literature.^{1} The methods have spawned software packages and niche clinical research organizations that specialize in simulating and forecasting probable success in a clinical trial.^{2-4}
Another related application of simulation to clinical trials is simulating a futility index for a trial and using it as an interim endpoint for stopping the trial during an interim analysis.^{5} Interim futility monitoring of phase three trials can substantially increase the probability of stopping trials early for futility when there is no treatment effect. In many disease settings, this can lead to meaningful reductions in study duration and patient resources without substantial loss of power for the primary test of efficacy.
The most complex simulations are multivariate and possibly Bayesian: They introduce patient and clinical covariates that may impact success and information derived from early clinical trials (phase one and phase two) or even preclinical results.
More recently, these simulations have been expanded to include evaluating test strategies for diagnostics^{6, 7} and building prediction models for the angel investment and venture capital community. All of these simulations can be considered as introducing efficiency and cost-effectiveness to the process.
We will examine two simulations that were constructed to evaluate the probability of success of ongoing trials, based on published information by biotechnology companies. These analyses were carried out for the investment community before it was recommended to commit funds to develop the product.
Antiviral compound example
Example one is a simulation of the results of an ongoing phase three trial comparing a new antiviral compound against standard treatment, with a primary outcome of a difference in the percentage of patients with a viral load less than 400 copies.
The trial includes 500 patients: 250 randomized per treatment group, increased from an initial 150 per group. This is a non-inferiority trial with a one-sided difference of 13% determined to be an important clinical difference between the two treatments. The power of the test between groups is between 80% and 90%, and the level of significance is assumed to be alpha = 0.025 (equivalent to a two-sided test at the 5% level). The simulation was run to determine whether the trial was sufficiently sized and powered to show a significant result.
Usual non-inferiority design: The non-inferiority test setup is that the new treatment must be within the confidence bounds of ±13% of the absolute percentage of patients with viral load < 400 copies in the control group, with the following null and alternate hypotheses:
Null hypothesis:
| Control – treatment | > 13%
Alternate hypothesis:
| Control – treatment | < 13%
There are two criteria based on this one-sided confidence interval: Is the treatment above the non-inferiority lower bound of -13%, which would show non-inferiority, or is it above the upper bound of +13%, which would show superiority?^{8}
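The two criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the patient counts are invented, and a simple Wald (normal approximation) lower bound stands in for whatever interval the trial's analysis plan actually specifies.

```python
from math import sqrt
from statistics import NormalDist

def lower_bound_difference(x_trt, n_trt, x_ctl, n_ctl, alpha=0.025):
    """One-sided lower confidence bound for (treatment - control) response rates.

    Uses a Wald normal approximation, which is an assumption of this sketch.
    """
    p_t = x_trt / n_trt
    p_c = x_ctl / n_ctl
    se = sqrt(p_t * (1 - p_t) / n_trt + p_c * (1 - p_c) / n_ctl)
    z = NormalDist().inv_cdf(1 - alpha)
    return (p_t - p_c) - z * se

def classify(lower_bound, margin=0.13):
    """Apply the two criteria from the text to the lower confidence bound."""
    if lower_bound > margin:
        return "superiority"
    if lower_bound > -margin:
        return "non-inferiority"
    return "inconclusive"

# Hypothetical counts: 200 of 250 responders on treatment, 190 of 250 on control.
example_bound = lower_bound_difference(200, 250, 190, 250)
example_result = classify(example_bound)
```

With these invented counts the lower bound sits a little below zero but well above -13%, so the sketch reports non-inferiority rather than superiority.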
Sample size requirements to show superiority: Table 1 includes required sample sizes to show a significant non-inferiority or superiority result. The simulation varied the required sample size per group (double for trial size); used rounded estimates of 80%, 85% or 90% for power; started with a control rate of 60%, 65%, 70% or 75% of the patients with viral load < 400 copies; and varied the treatment rate from 75% to 90%.
Results of the simulation: Given the non-inferiority boundary, superiority can be claimed only if the difference is greater than 13%. If we assume response rates of 85% and 71% (a difference of 14%), then as long as there are at least 175 analyzable patients per group (excluding dropouts), the trial has 87% power to show a statistically significant difference. If the number of patients is higher, power will increase. If the number per group is smaller, however, or the difference is less than 13%, superiority can't be claimed, but non-inferiority may be possible.
Another way to examine this is shown in Figure 1, which plots power against sample size for hypothesized differences between 10% and 16%.
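A power figure of this kind can be approximated with a small Monte Carlo simulation. The sketch below is not the authors' program: it assumes a one-sided pooled z-test for the difference in proportions at alpha = 0.025, and the function name, seed and number of simulated trials are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulated_power(p_treatment, p_control, n_per_group,
                    alpha=0.025, n_sims=2000, seed=1):
    """Share of simulated trials in which treatment beats control
    on a one-sided pooled z-test (an assumed test, not the article's)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    significant = 0
    for _ in range(n_sims):
        # Simulate the number of responders in each arm.
        x_t = sum(rng.random() < p_treatment for _ in range(n_per_group))
        x_c = sum(rng.random() < p_control for _ in range(n_per_group))
        pooled = (x_t + x_c) / (2 * n_per_group)
        se = sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se > 0 and (x_t - x_c) / n_per_group / se > z_crit:
            significant += 1
    return significant / n_sims
```

Calling `simulated_power(0.85, 0.71, 175)` should give an estimate in the high 80s to about 90%, broadly in line with the 87% quoted above; the exact value depends on the test variant and the random seed. Repeating the call over a grid of sample sizes and differences gives the raw material for a plot like Figure 1.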
Hodges-Lehmann estimate
Example two is a simulation of the Hodges-Lehmann estimate for treatment versus placebo in a pulmonary arterial hypertension study.
This trial, already under way, compares two groups on a six-minute walk, with an estimated mean baseline distance of 350 meters, based on results from a similar clinical trial. The comparison statistic in the statistical analysis plan is the Hodges-Lehmann estimate paired with a rank-based test to get a significance value comparing the two groups. The simulation was intended to determine the minimum difference between groups that would be statistically significant.
Significance test criteria: The Hodges-Lehmann estimate is based on pairwise comparisons between treatment and placebo patients: each placebo patient's walk distance is subtracted from each treatment patient's walk distance, and the estimate is the median of those differences. To compare the groups, we look at the same differences. A positive difference indicates a treatment patient walked farther than a placebo patient. Positive ranks (1, 2, 3 ...) are assigned to the positive differences and negative ranks (-1, -2, -3 ...) to the negative differences. The ranks are summed; a large positive sum favors treatment over placebo, and a sum close to zero indicates no difference. The p-value is calculated from the number of positive differences, the number of negative differences and the sum of the ranks.
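A minimal sketch of the two statistics follows. It uses the textbook Mann-Whitney-Wilcoxon formulation (counting the pairs in which the treatment patient walked farther) with a normal approximation and no tie correction, which may differ in detail from the ranking procedure described above. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, median

def hodges_lehmann(treatment, placebo):
    """Median of all pairwise (treatment - placebo) differences."""
    return median(t - p for t in treatment for p in placebo)

def mann_whitney_z(treatment, placebo):
    """Normal-approximation Z-score for the Mann-Whitney U statistic."""
    n_t, n_p = len(treatment), len(placebo)
    # U counts pairs where treatment exceeds placebo; ties count one half.
    u = sum(1.0 if t > p else 0.5 if t == p else 0.0
            for t in treatment for p in placebo)
    mean_u = n_t * n_p / 2
    sd_u = sqrt(n_t * n_p * (n_t + n_p + 1) / 12)
    return (u - mean_u) / sd_u

def two_sided_p(z):
    """Two-sided p-value for a standard normal Z-score."""
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

For example, with treatment distances [3, 4, 5] and placebo distances [1, 2, 3], the nine pairwise differences have a median of 2, so the Hodges-Lehmann estimate is 2.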
Results of the simulation: Assuming the six-minute walk distance outcome is normally distributed for treatment and for placebo, based on previous clinical data, a Monte-Carlo simulation was performed looking at all possible differences comparing treatment and placebo groups.
Tables 2 and 3 show the median difference in walk distance and in Z-score for different mean walk distance differences with a baseline of 325 meters (standard deviation = 65 meters). Sensitivity analyses changing the 325 baseline mean did not change the conclusions.
The inputs for the model were distributions for baseline walk distance and treatment walk distance because the placebo group was not expected to change from baseline. These were normal distributions with a mean of 325 meters for baseline and 325 meters plus the improvement for treatment. The standard deviation was assumed to be 65 meters for both. The trial's inclusion criteria also required that baseline walk distances be between 200 and 450 meters, and treatment walk distance was assumed to be greater than 100 meters.
For each improvement value, 1,000 trials were run, each estimating the Hodges-Lehmann median difference and the Mann-Whitney-Wilcoxon Z-score. If the mean walk distance were longer but the distributions otherwise remained identical, you would expect the same Hodges-Lehmann estimates and Z-score values. If the standard deviation were larger, there would be greater variation in the 1,000 Hodges-Lehmann estimates and Z-scores, but the median of the 1,000 would stay the same.
For a test between groups at the p = 0.05 level, a minimum difference between groups of 18 meters is needed to show statistical significance. For a test at the p = 0.10 level, the minimum difference is 15 or 16 meters.
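The simulation described in this section can be sketched as follows. The group size is not stated in the article, so the default of 100 patients per group is purely an assumption for illustration, as are the function names, the seed and the use of rejection sampling to enforce the distance limits. The placebo group is drawn from the baseline distribution, as the text describes.

```python
import random
from math import sqrt
from statistics import median

def truncated_normal(rng, mean, sd, low=float("-inf"), high=float("inf")):
    """Draw from a normal distribution, rejecting values outside [low, high]."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x

def simulate_trial(rng, improvement, n_per_group, base_mean=325.0, sd=65.0):
    """One simulated trial: returns (Hodges-Lehmann estimate, Z-score)."""
    # Placebo follows the baseline distribution, limited to 200-450 meters.
    placebo = [truncated_normal(rng, base_mean, sd, 200.0, 450.0)
               for _ in range(n_per_group)]
    # Treatment is shifted by the improvement and kept above 100 meters.
    treatment = [truncated_normal(rng, base_mean + improvement, sd, low=100.0)
                 for _ in range(n_per_group)]
    diffs = [t - p for t in treatment for p in placebo]
    hl = median(diffs)
    # Mann-Whitney U from the signs of the pairwise differences.
    u = sum(d > 0 for d in diffs) + 0.5 * sum(d == 0 for d in diffs)
    n = n_per_group
    z = (u - n * n / 2) / sqrt(n * n * (2 * n + 1) / 12)
    return hl, z

def summarize(improvement, n_per_group=100, n_trials=1000, seed=0):
    """Median Hodges-Lehmann estimate and median Z-score over many trials."""
    rng = random.Random(seed)
    results = [simulate_trial(rng, improvement, n_per_group)
               for _ in range(n_trials)]
    return median(r[0] for r in results), median(r[1] for r in results)
```

Running `summarize(improvement)` over a range of improvement values and noting where the median Z-score crosses the critical value (about 1.96 for the 0.05 level) mirrors the approach behind Tables 2 and 3. The thresholds obtained this way depend on the assumed group size, so they need not reproduce the 18-meter figure.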
Significance in simulation
Simulation techniques can be used to examine the probability of a clinical trial reaching significance. This information is important for the sponsor of the trial, regulatory agencies and business investors.
The first example calculated the likelihood of a trial showing a significant difference between treatment and control based on sample size, level of significance and power estimates. The difference between groups was calculated (as the percentage reaching optimal viral load), and the sample sizes at which significance was guaranteed, possible or not likely were given.
The second example offered the results of simulating a statistical test on ranks, using the Hodges-Lehmann estimate as the overall statistic together with the Mann-Whitney-Wilcoxon test. For two levels of significance, the minimum value of the Hodges-Lehmann estimate (the difference in meters walked between treatment and placebo) needed to reach significance was found.
As shown in these two examples involving clinical trials, simulation can leverage a small amount of known information to reduce the range of possible outcomes of an experiment and identify where there is the most risk in drawing conclusions. It is most important to identify and understand the assumptions of the simulation at the outset and determine how confident you are in these assumptions to be able to interpret the resulting predicted outcomes with validity.
References and note
1. R.L. Krall, K.H. Englemen, H.C. Ko and C.C. Peck, “Clinical Trial Modeling and Simulation—Work in Progress,” Drug Information Journal, 1998, No. 32, pp. 971-976.
2. Pharsight, www.pharsight.com.
3. CreatASoft, www.simulationsoftware.com.
4. Icebergs, www.randomization.org.
5. Bryan Goldman, Michael LeBlanc and John Crowley, “Interim Futility Analysis With Intermediate Endpoints,” Clinical Trials, 2008, No. 5, pp. 14-22.
6. Ann G. Zauber, Iris Landsdorp-Vogelaar, Amy B. Knudsen, Janneke Wilschut, Marjolein van Ballegooijen and Karen M. Kuntz, “Evaluating Test Strategies for Colorectal Cancer Screening: A Decision Analysis for the U.S. Preventive Services Task Force,” Annals of Internal Medicine, 2008, Vol. 149, No. 9, pp. 659-668.
7. Eva K. Lee and Marco Zaider, “Operations Research Advances Cancer Therapeutics,” Interfaces, 2008, Vol. 38, No. 1, pp. 5-25.
8. I. Elaine Allen and Christopher A. Seaman, “Superiority, Equivalence and Non-Inferiority,” Quality Progress, February 2007, pp. 52-54. This Statistics Roundtable column discusses forming and testing hypotheses for non-inferiority.
I. Elaine Allen is the research director and professor of statistics and entrepreneurship at Babson College in Wellesley, MA. She earned a doctorate in statistics from Cornell University in Ithaca, NY. Allen is a member of ASQ.
Christopher A. Seaman is a doctoral student in mathematics at the Graduate Center of City University of New York.