Helmut
Vienna, Austria
2024-08-26 14:52

Posting: # 24160

 A rant [Two-Stage / GS Designs]

To whom it may concern.

With ICH M13C on the horizon (initial work is expected to start in December 2024), a publication [1] came to my attention. It is a great review outlining the current approaches and conditions in various jurisdictions. However, I think that the authors erred in one specific case, which I quote in part below. I changed only the numbering of references and added two [5, 9] which are missing in this section of the discussion (pages 14–15 of the PDF).
  • Adaptive designs
    […]
    According to Reference [2], if the sample size of both stages is the same, the confidence level does not need to be 95% in both stages, which corresponds to the conservative Bonferroni’s adjusted. Instead, it can be 94.12% (which corresponds to an alpha of 0.0294, since the corresponding confidence level is (100-(2·alpha·100)), as indicated in the EMA guideline [3]. However, that paper refers to parallel designs and it has been highlighted that the value is also unnecessarily conservative for cross-over designs [4]. A confidence level of 93.92% (alpha of 0.0304) could be used in the case of the Pocock design for 2 × 2 cross-over trials. The WHO guideline [5] also includes the possibility of not spending any alpha in the interim analysis after stage 1. This is because BE is not intended to be demonstrated in the interim analysis (i.e., the confidence interval is not calculated). Instead, this information can be used for a futility analysis, which does not use any alpha, i.e., to stop the trial when demonstration of equivalence is unlikely with the pre-defined sample sizes. In Canada and the United States, the recommendation of method C published by Potvin et al is/was an acceptable approach [6]. However, additional papers [7, 8] provided different conclusions due to changes in the assumptions of the simulated scenarios. As a result, these simulations are not considered enough justification of preservation of the overall type I error (alpha) in the EU. In fact, as Maurer et al. [9] explained none of the 4 variations investigated by Potvin et al. formally controls the type I error rate of falsely claiming ABE, even though the amount of inflation produced by Method C was considered acceptable. A major disadvantage of assessing type I error rate inflation using simulation is that without exploring all possible scenarios for the intended design and analysis, it is impossible to ensure control of the type I error rate [10, 11].

Some comments:
  • According to Reference [2], if the sample size of both stages is the same, …

    Even in a group sequential design (GSD) this is practically never the case. An all too often overlooked detail: If the interim is at < ½N (due to dropouts), one has to use an error-spending function (e.g., Lan and DeMets, Jennison and Turnbull) to control the Type I Error.
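    A rough sketch of the idea (my assumptions: the gsDesign package, two stages with one interim at 40% of the information instead of 50%, and each of the two one-sided tests handled separately at an overall level of 0.05 – an illustration of how the nominal levels shift, not the exact method of [9]):

    library(gsDesign)
    # Lan-DeMets error-spending approximation of Pocock's bounds; the nominal
    # stage-wise levels depend on where the interim actually falls
    x <- gsDesign(k = 2, test.type = 1, # two stages, one-sided testing
                  alpha = 0.05,         # level of each one-sided test
                  timing = 0.4,         # interim at 40% of the information
                  sfu = sfLDPocock)     # Lan-DeMets Pocock-type spending
    # nominal one-sided significance levels at the interim and the final analysis
    round(pnorm(x$upper$bound, lower.tail = FALSE), 4)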


  • … the confidence level […] can be 94.12% (which corresponds to an alpha of 0.0294, since the corresponding confidence level is (100-(2·alpha·100)), as indicated in the EMA guideline [3].

    While correct, that’s a strange way to specify the confidence level in percent; more common is 100(1 – 2α). The guideline’s 94.12% = 100(1 – 2 × 0.0294) is given as an example; the guideline continues with »… there are many acceptable alternatives and the choice of how much alpha to spend at the interim analysis is at the company’s discretion.«


  • … the value is also unnecessarily conservative for cross-over designs [4]. A confidence level of 93.92% (alpha of 0.0304) could be used in the case of the Pocock design for 2 × 2 cross-over trials.

    That’s simply wrong. I wish they had read the paper(s) beyond the abstract(s). Pocock’s α = 0.0294 is for superiority testing in a parallel GSD with a fixed total sample size N, assuming equal variances (conventional t-tests), and one interim at exactly N/2 [2]. Kieser and Rauch [4] didn’t get it right either: α = 0.0304 is Pocock’s for equivalence, but again in a parallel GSD and not for a 2×2×2 crossover design. When exploring the supplementary material of [4], a slight inflation of the Type I Error can be seen in a few cases. For the conditions of [6] we found that α = 0.0301 is required to control the Type I Error.
    BTW, who is interested in a GSD? It requires a best-guess CV to estimate N, which is – due to the adjusted alpha – always larger than the fixed-sample design’s n.A
       CV   n   N  n to N
     0.30 156 186 +19.23%
     0.35 208 248 +19.23%
     0.40 266 318 +19.55%
     0.45 332 396 +19.28%
     0.50 400 478 +19.50%

    IMHO, a GSD would only be advantageous if planned for early stopping at the interim due to futility.
    On the other hand, if the CV in the first stage is much lower than the one we assumed, only a few more subjects would be needed to show BE. However, we still have to go full throttle and perform the second stage with N/2 subjects. This is both ethically and economically questionable. Essentially we ignore the information from the first stage and rely only on our assumptions. So why not use a fixed-sample design in the first place?


  • In Canada and the United States, the recommendation of method C published by Potvin et al is/was an acceptable approach [6].

    Correct according to the guidances and confirmed by the FDA [12].


  • However, additional papers [7, 8] provided different conclusions due to changes in the assumptions of the simulated scenarios.

    Of course, different assumptions (GMR, CV, target power) will require different alphas. That’s trivial. What the heck is meant by »different conclusions«?


  • As a result, these simulations are not considered enough justification of preservation of the overall type I error (alpha) in the EU.

    Well roared, lions! I have too many deficiency letters on my desk, which is not surprising [13, 14]. While for the 2×2×2 crossover design the exact method [9] has been implemented in the R package Power2Stage [15] since October 2017, only simulation-based methods are available for a parallel design [16, 17].


  • In fact, as Maurer et al. [9] explained none of the 4 variations investigated by Potvin et al. formally controls the type I error rate of falsely claiming ABE, even though the amount of inflation produced by Method C was considered acceptable.

    Of course, there is no formal proof of controlling the Type I Error in simulation-based methods.
    Maurer et al. [9] state correctly »The conclusion we want to emphasize is that if the parameters of the design are changed, there is no guarantee with Method B or any method that uses a TOST procedure on the pooled data from both stages as a second stage test, that the type I error rate will not be inflated above the nominal value.« (my emphasis)
    »[…] the method should be prespecified. This study did not seek to find the best possible two-stage design, but rather to find good ones that could be used by sponsors without further validation.« [6] In other words, if one wants something which is not published and validated, own simulations are required. There is no free lunch.
    A simulation-based method simply must not be modified. Nevertheless, the authors modified ‘method B’ (GMR 0.85, power 90%) and therefore observed an inflated Type I Error in Figure 6. That’s (i) not surprising, because such a setup is not covered by the framework (GMR 0.95, power 80%) [6], and (ii) thus cannot be used as an argument against simulation-based methods. Different GMR and/or power require a different adjusted α [18, 19]. Easy to show. Although not necessary for [9], we can also assess its empiric Type I Error [15].B
     setup                      CV  n1 alpha  GMR  power TIE
     Potvin et al. [6] method B 0.4 36 0.0294 0.95 0.8   0.04871
     Molins et al. [19] Type 1  0.4 36 0.0301 0.95 0.8   0.04984
     Maurer et al. [9] Figure 5 0.4 36 0.0263 0.95 0.8   0.04538
     Maurer et al. [9] Figure 6 0.4 36 0.0294 0.85 0.9   0.05524
     Molins et al. [19] Type 1  0.4 36 0.0262 0.85 0.9   0.04889
     Maurer et al. [9]          0.4 36 0.0263 0.85 0.9   0.04528

    More adjustment (i.e., smaller α → wider CI) has to be used if the GMR moves away from 1 and/or the desired power increases. Simple.


  • A major disadvantage of assessing type I error rate inflation using simulation is that without exploring all possible scenarios for the intended design and analysis, …

    What might be meant by »all possible scenarios«? An infinitesimally small step size for the CV and a step size of 1 for n1?
    How many simulations are considered sufficient? One million (which is the current standard) or even more?

    [image]

    I saw deficiency letters stating that the grid of [6] (step size for CV 10% and for n1 12) was too ‘coarse’. Come on! Easily doable with an extremely narrow grid size. The surfaces in the right panel of the plot below show the power in the first stage (lower one) and the final analysis (upper one).

    [image]
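    A minimal sketch of such a fine-grid scan (my assumptions: ‘method B’ of [6] with α 0.0294, GMR 0.95, 80% power; only 1e5 simulations per grid point to keep the run time bearable – 1e6 for anything you would report):

    library(Power2Stage)
    surf     <- expand.grid(CV = seq(0.10, 0.80, 0.01), n1 = seq(12, 60, 2))
    surf$TIE <- NA_real_
    for (j in seq_len(nrow(surf))) {
      # simulate at the upper BE limit (theta0 = 1.25) to obtain the empiric TIE
      surf$TIE[j] <- power.tsd(method = "B", alpha = rep(0.0294, 2),
                               CV = surf$CV[j], n1 = surf$n1[j], GMR = 0.95,
                               targetpower = 0.80, theta0 = 1.25, nsims = 1e5)$pBE
    }
    surf[which.max(surf$TIE), ] # where is the maximum inflation?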

    What do regulators suspect for simulation-based methods? That there is a sudden ‘spike’ 🚀 in the Type I Error surface at the CV and n1 observed in the study? Gimme a break! In two deficiency letters assessors required a post hoc evaluation of the empiric Type I Error. See also this discouraging post and that one.


  • … it is impossible to ensure control of the type I error rate [10, 11].

    I fail to find such a statement in either reference. Sorry folks, are you on a mission?

At the end of the discussion we find:
  • […] the current review publication […] from the IPRP BEWGG describe several topics that will be addressed under future ICH projects (e.g., ICH M13B and M13C guidelines). It is evident that the continued efforts of the IPRP BEWGG will have a notable impact with respect to promoting collaboration and achieving regulatory convergence and harmonization in the field of BE study performance.

    I sincerely hope that the ICH M13 Expert Working Group has more statistical expertise than the 30 (‼) authors of this publication.

We presented the exact method in two posters [20, 21] but were too lazy to write a paper. We wear the black belt in procrastination. A colleague of Byron Jones noticed our 2015 poster and conveyed its content. Thus it is referenced in [9].
I became interested in adaptive designs for bioequivalence almost thirty years ago [22, 23]. It is disheartening to observe the lack of advancement and the prevalence of misinterpretations of methods [24].


  1. Freitas Fernandes EA, van Oudtshoorn J, Tam A, Arévalo González LA, Aurela EG, Potthast H, Mettke K, Kuribayashi R, Shimojo K, Kasuga M, Morales L, Rodríguez Z, Jones B, Ahn C, Yun E, Kim SH, Rodrigues C, Tiong T, Crane C, Walther C, Roost MS, Chen T-L, Hsu L-f, Braddy AC, García-Arieta A, Abalos I, Divinsky M, Alsuwyeh A, Alzenaidy B, Alhar A. The bioequivalence study design recommendations for immediate-release solid oral dosage forms in the international pharmaceutical regulators programme participating regulators and organisations: differences and commonalities. J Pharm Pharmaceut Sci. 2024; 27: 12398. doi:10.3389/jpps.2024.12398
  2. Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika. 1977; 64(2): 191–9. doi:10.2307/2335684
  3. EMA (CHMP). Guideline on the Investigation of Bioequivalence. London. 20 January 2010.
  4. Kieser M, Rauch G. Two-stage designs for cross-over bioequivalence trials. Stat Med. 2015; 34(16): 2403–16. doi:10.1002/sim.6487
  5. WHO Expert Committee on Specifications for Pharmaceutical Preparations. Multisource (generic) pharmaceutical products: guidelines on registration requirements to establish interchangeability. Fifty-first report. Technical Report Series, No. 992, Annex 6. Geneva. April 2017.
  6. Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ, Smith RA. Sequential design approaches for bioequivalence studies with crossover designs. Pharm Stat. 2008; 7: 245–62. doi:10.1002/pst.294
  7. Montague TH, Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ. Additional results for ‘Sequential design approaches for bioequivalence studies with crossover designs’. Pharm Stat. 2011; 11: 8–13. doi:10.1002/pst.483
  8. Xu J, Audet C, DiLiberti CE, Hauck WW, Montague TH, Parr AF, Potvin D, Schuirmann DJ. Optimal adaptive sequential designs for crossover bioequivalence studies. Pharm Stat. 2016; 15(1): 15–27. doi:10.1002/pst.1721
  9. Maurer W, Jones B, Chen Y. Controlling the type 1 error rate in two-stage sequential designs when testing for average bioequivalence. Stat Med. 2018; 37(10): 1–21. doi:10.1002/sim.7614
  10. South African Health Products Regulatory Authority. Quality and Bioequivalence Guideline. Pretoria. 2023.
  11. Lee J, Feng K, Xu M, Gong X, Sun W, Kim J, Zhang Z, Wang M, Fang L, Zhao L. Applications of adaptive designs in generic drug development. Clin Pharmacol Ther. 2020; 110(1): 32–5. doi:10.1002/cpt.2050
  12. Jiang X. Adaptive design and alpha adjustment: FDA position. Presentation at: 5th GBHI conference. Amsterdam. 28 September 2022.
  13. Kaza M, Sokolowki A, Rudzki PJ. 10th Anniversary of a Two-Stage Design in Bioequivalence. Why Has it Still Not Been Implemented? Pharm Res. 2020; 37(7): 140. doi:10.1007/s11095-020-02871-3
  14. Urach S. Two stage Designs and their Acceptability in the EC Area. Presentation at: 6th International Workshop – GBHI 2024. Rockville, MD. 16 April 2024.
  15. Labes D, Lang B, Schütz H. Power2Stage: Power and Sample-Size Distribution of 2-Stage Bioequivalence Studies. Package version 0.5-4. 2021-11-20. https://cran.r-project.org/package=Power2Stage
  16. Fuglsang A. Sequential Bioequivalence Approaches for Parallel Design. AAPS J. 2014; 16(3): 373–8. doi:10.1208/s12248-014-9571-1
  17. Schütz H. Novel approaches in adaptive designs and α adjustment, e.g., with futility criteria and for parallel design studies. Presentation at: 5th GBHI conference. Amsterdam. 28 September 2022.
  18. Fuglsang A. Sequential Bioequivalence Trial Designs with Increased Power and Con­trolled Type I Error Rates. AAPS J. 2013; 15: 659–61. doi:10.1208/s12248-013-9475-5
  19. Molins E, Labes D, Schütz H, Cobo E, Ocaña J. An iterative method to protect the type I error rate in bioequivalence studies under two-stage adaptive 2×2 crossover designs. Biom J. 2021; 63(1): 122–33. doi:10.1002/bimj.201900388
  20. König F, Wolfsegger M, Jaki T, Schütz H, Wassmer G. Adaptive two-stage bioequivalence trials with early stopping and sample size re-estimation. Vienna: 2014; 35th Annual Conference of the International Society for Clinical Biostatistics. Poster P1.2.88. doi:10.13140/RG.2.1.5190.0967
  21. König F, Wolfsegger M, Jaki T, Schütz H, Wassmer G. Adaptive two-stage bioequivalence trials with early stopping and sample size re-estimation. Trials. 2015; 16(Suppl 2); P218. doi:10.1186/1745-6215-16-S2-P218
  22. Gould AL. Group Sequential Extensions of a Standard Bioequivalence Testing Procedure. J Pharmacokin Biopharm. 1995; 23(1): 57–86. doi:10.1007/bf02353786
  23. Hauck WW, Preston PE, Bois FY. A group sequential approach to crossover trials for average bioequivalence. J Biopharm Stat. 1997; 7(1): 87–96. doi:10.1080/10543409708835171
  24. Schütz H. Two-stage designs in bioequivalence trials. Eur J Clin Pharmacol. 2015; 71(3): 271–81. doi:10.1007/s00228-015-1806-2

R scripts:

  1. Comparison of fixed sample (n) and GSD (N):
    library(PowerTOST)
    alpha  <- c(0.05, 0.0304) # 2nd element: Pocock’s for equivalence, parallel design
    theta0 <- 0.90            # reasonable, given the high CVs
    CV     <- seq(0.3, 0.5, 0.05)
    x      <- data.frame(CV = CV, n = NA_integer_, N =  NA_integer_)
    for (j in seq_along(CV)) {
      x$n[j] <- sampleN.TOST(alpha = alpha[1], CV = CV[j], theta0 = theta0,
                             design = "parallel", print = FALSE)[["Sample size"]]
      x$N[j] <- sampleN.TOST(alpha = alpha[2], CV = CV[j], theta0 = theta0,
                             design = "parallel", print = FALSE)[["Sample size"]]
    }
    x$penalty   <- sprintf("%+.2f%%", 100 * (x$N / x$n - 1))
    names(x)[4] <- "n to N"
    print(x, row.names = FALSE)


  2. Misspecification of GMR and power in ‘method B’ (only valid for GMR 0.95 and power 80%) by Maurer et al. in Figure 6; the correct adjusted alpha would be 0.0262 in both stages. Assessment of the exact method by one million simulations for comparison.
    library(Power2Stage)
    x <- data.frame(setup = c("Potvin et al. [6] method B",
                              "Molins et al. [19] Type 1",
                              "Maurer et al. [9] Figure 5",
                              "Maurer et al. [9] Figure 6",
                              "Molins et al. [19] Type 1",
                              "Maurer et al. [9]"),
                    CV = rep(0.4, 6), n1 = rep(36, 6),
                    alpha = c(0.0294, 0.0301, NA, 0.0294, 0.0262, NA),
                    GMR = c(rep(0.95, 3), rep(0.85, 3)),
                    power = c(rep(0.8, 3), rep(0.9, 3)),
                    TIE = NA_real_) # initialise the TIE column explicitly
    for (j in 1:6) {
      if (j %in% c(1:2, 4:5)) { # method B  / Type 1
        x$TIE[j] <- power.tsd(method = "B", alpha = rep(x$alpha[j], 2),
                              CV = x$CV[j], n1 = x$n1[j], GMR = x$GMR[j],
                              theta0 = 0.8, targetpower = x$power[j])$pBE
      } else {                  # Maximum Combination Test with futility
        y          <- power.tsd.in(CV = x$CV[j], n1 = x$n1[j], GMR = x$GMR[j],
                                   theta0 = 0.8, targetpower = x$power[j])
        x$alpha[j] <- signif(y$alpha[1], 3)
        x$TIE[j]   <- y$pBE
      }
    }
    x$TIE <- signif(x$TIE, 4)
    print(x, row.names = FALSE, right = FALSE)

Dif-tor heh smusma 🖖🏼 Long live Ukraine!
Helmut Schütz

Achievwin
US
2024-08-28 04:16

@ Helmut
Posting: # 24166

 A rant

I feel your pain. Just curious: have two-stage designs gone beyond the 2×2 crossover, or are we stuck in the 1990s? Any example of how a two-stage adaptive design is applied/validated for a parallel study or a replicate design?

In the case of a parallel design, what kind of statistical adjustment do we need to use, Potvin or Bonferroni? And the same for a 4-period or 3-period replicate design?
Helmut
Vienna, Austria
2024-08-28 12:54

@ Achievwin
Posting: # 24168

 Parallel (replicate?)

Hi Achievwin,

❝ I feel your pain…

Thank you for the sympathy.

❝ Any example of how a two-stage adaptive design is applied/validated for a parallel study or a replicate design?

For the parallel design see Anders’ goody [16] in the post above (methods B and C validated for CV 10–100%, n1 48–120; GMR 0.95 and 80% power).
However, if you want something else, you need your own simulations to find a suitable adjusted α. Optionally you can also explore a futility criterion for the maximum total sample size. Not complicated with the R package Power2Stage.
Hint: In all TSDs the maximum inflation of the Type I Error occurs at a combination of low CV and small n1. Therefore, explore this area first (see the sketch below). Once you have found a suitable adjusted α, simulate power and the empiric Type I Error for the entire grid. Regulators will ask you for that.

For parallel TSDs there are two functions in Power2Stage, namely power.tsd.pAF() and power.tsd.p():
  • Function power.tsd.pAF() performs exactly as described in Fuglsang’s paper, namely the power monitoring steps and the sample size estimation are always based on the pooled t-test.
  • Function power.tsd.p() with argument test = "welch" on the other hand uses the genuine power of Welch’s test. Moreover, it accepts unequal treatment groups in stage 1.
I suggest opting for the second because – due to dropouts – unequal group sizes are common. A rather lengthy example is given at the end.
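Before that, a minimal sketch of the ‘explore the critical corner first’ hint above (my assumptions: the ‘method B’ analogue of [16] with α 0.0294, GMR 0.95, 80% power, Welch’s test; only 1e5 simulations per point for speed – use 1e6 for anything you submit):

library(Power2Stage)
corner     <- expand.grid(CV = seq(0.10, 0.25, 0.05), n1 = seq(24, 48, 8))
corner$TIE <- NA_real_
for (j in seq_len(nrow(corner))) {
  # simulate at the upper BE limit (theta0 = 1.25) to obtain the empiric TIE
  corner$TIE[j] <- power.tsd.p(method = "B", alpha = rep(0.0294, 2),
                               CV = corner$CV[j], n1 = corner$n1[j],
                               GMR = 0.95, targetpower = 0.80, test = "welch",
                               theta0 = 1.25, nsims = 1e5)$pBE
}
print(corner[order(-corner$TIE), ], row.names = FALSE)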

❝ In the case of a parallel design, what kind of statistical adjustment do we need to use, Potvin …


Anders used α = 0.0294 for the analogues of Potvin’s methods B and C. As usual, there is a slight inflation of the Type I Error in method C with CV ≤ 20% – which is unlikely in parallel designs anyway. Evaluation is by the Welch–Satterthwaite test (for unequal variances and group sizes).

If someone knows what might be meant in Section 2.2.3.4 of ICH M13A …

The use of stratification in the randomisation procedure based on a limited number of known relevant factors is therefore recommended. Those factors are also recommended to be accounted for […]

… please enlighten me.

❝ … or Bonferroni …


I think (‼) that it will be an acceptable alternative because it is the most conservative one (strictly speaking, it is not correct in a TSD because the hypotheses are not independent).
Assessors love Signore Bonferroni.
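Just to illustrate (my assumptions: the ‘method B’ analogue of [16], Welch’s test, GMR 0.95, 80% power, n1 = 48 at a low CV), Bonferroni’s α = 0.025 can only be more conservative than Fuglsang’s 0.0294:

library(Power2Stage)
for (a in c(0.025, 0.0294)) {             # Bonferroni vs Fuglsang's adjustment
  TIE <- power.tsd.p(method = "B", alpha = rep(a, 2), CV = 0.2, n1 = 48,
                     GMR = 0.95, targetpower = 0.80, test = "welch",
                     theta0 = 1.25, nsims = 1e6)$pBE
  cat(sprintf("alpha %.4f: empiric Type I Error %.5f\n", a, TIE))
}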

❝ … and same for 4-period or 3-period replicate design?


If you mean reference-scaling, no idea. You can try Bonferroni as well. Recently something was published by the FDA, but it is wacky (see this post for why I think so). I’m not convinced that it is worth the effort.
Plan the study for the assumed CVwR (and the CVwT if you have the information). In reference-scaling the observed CVwR is taken into account anyway. If the variability is higher than assumed, you can scale more and will gain power. If it is lower than assumed, bad luck. However, the crucial point is – as always – the GMR…
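A minimal sketch of that point (my assumptions: EMA ABEL, a 2×2×4 full replicate, GMR 0.90, study planned for an assumed CVwR of 40%, CVwT = CVwR):

library(PowerTOST)
# plan for the assumed CVwR …
plan <- sampleN.scABEL(CV = 0.40, theta0 = 0.90, targetpower = 0.80,
                       design = "2x2x4", print = FALSE, details = FALSE)
n    <- plan[["Sample size"]]
# … and see how power at that sample size changes if the observed CVwR differs
for (CVwR in c(0.30, 0.40, 0.50)) {
  cat(sprintf("CVwR %.2f: power %.4f\n", CVwR,
              power.scABEL(CV = CVwR, n = n, theta0 = 0.90, design = "2x2x4")))
}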

If you mean by ‘3-period replicate design’ the partial replicate (TRR|RTR|RRT) and want to use the FDA’s RSABE, please don’t (see this article for why). It is fine for the EMA’s ABEL. If you want a 3-period replicate for the FDA, please opt for one of the full replicates (TRT|RTR, TTR|RRT, or TRR|RTT). Otherwise, you might be in deep shit.


library(PowerTOST)
library(Power2Stage)
# start with a fixed sample design
CV     <- 0.2 # max. inflation of the Type I Error with small CV
GMR    <- 0.9 # realistic with large CV in parallel designs
target <- 0.9 # well…
n      <- sampleN.TOST(CV = CV, theta0 = GMR, targetpower = target,
                       design = "parallel", print = FALSE)[["Sample size"]]
# first stage n1
n1     <- n / 2
# assess the empiric Type I Error at one of the BE-limits (under the Null)
# always one mio simulations (very time consuming…)
# try a range of adjusted alphas

alpha  <- seq(0.0292, 0.0306, 0.0001)
sig    <- binom.test(0.05 * 1e6, 1e6, alternative = "less",
                     conf.level = 0.95)$conf.int[2]
res    <- data.frame(alpha = alpha, TIE = NA_real_, TIE.05 = FALSE,
                     signif = FALSE, TIE.052 = FALSE)
# TIE.05  checks whether the TIE > 0.05
# signif  checks whether the TIE > the limit of the binomial test for 1 mio sim’s
# TIE.052 checks whether the TIE > 0.052 (Potvin’s acceptable inflation)

pb     <- txtProgressBar(style = 3)
for (j in seq_along(alpha)) {
  res$TIE[j] <- power.tsd.p(method = "B", alpha = rep(alpha[j], 2), n1 = n1,
                            GMR = GMR, CV = CV, targetpower = target,
                            test = "welch", theta0 = 1.25, nsims = 1e6)$pBE
  if (res$TIE[j] > 0.05)  res$TIE.05[j]  <- TRUE
  if (res$TIE[j] > sig)   res$signif[j]  <- TRUE
  if (res$TIE[j] > 0.052) res$TIE.052[j] <- TRUE
  setTxtProgressBar(pb, j / length(alpha))
}
close(pb)
wary   <- which(res$TIE.05 == TRUE & res$TIE.052 == FALSE) # belt plus suspenders (EMA?)
res    <- res[(head(wary, 1) - 1):(tail(wary, 1) + 1), ]   # drop some alphas
names(res)[3:5] <- c(">0.05", "* >0.05",">0.052")          # cosmetics
print(res, row.names = FALSE)

Gives:

  alpha      TIE >0.05 * >0.05 >0.052
 0.0293 0.049518 FALSE   FALSE  FALSE
 0.0294 0.050004  TRUE   FALSE  FALSE
 0.0295 0.050178  TRUE   FALSE  FALSE
 0.0296 0.050182  TRUE   FALSE  FALSE
 0.0297 0.050486  TRUE    TRUE  FALSE
 0.0298 0.050777  TRUE    TRUE  FALSE
 0.0299 0.050772  TRUE    TRUE  FALSE
 0.0300 0.050806  TRUE    TRUE  FALSE
 0.0301 0.050974  TRUE    TRUE  FALSE
 0.0302 0.050890  TRUE    TRUE  FALSE
 0.0303 0.051308  TRUE    TRUE  FALSE
 0.0304 0.051535  TRUE    TRUE  FALSE
 0.0305 0.051616  TRUE    TRUE  FALSE
 0.0306 0.052007  TRUE    TRUE   TRUE

If you think that a nonsignificant inflation is fine (makes sense, IMHO), use 0.0296 (Type I Error 0.050182 < 0.050360). If you are a disciple of Madame Potvin, even 0.0305 would be OK (0.051616 < 0.052). Say you opted for belt plus suspenders with 0.0293 (0.049518 < 0.05), planned the first stage with 300 subjects, and observed a CV of 40%. You had some dropouts (15 in one group and 20 in the other). Therefore, instead of n1 = 300, specify n1 = c(135, 130). What can you expect?

power.tsd.p(method = "B", alpha = rep(0.0293, 2), n1 = c(135, 130),
            GMR = 0.9, CV = 0.4, targetpower = 0.9,
            npct = c(0.05, 0.25, 0.5, 0.75, 0.95))

Gives:

TSD with 2 parallel groups
Method B: alpha (s1/s2) = 0.0293 0.0293
CIs based on Welch's t-test
Target power in power monitoring and sample size est. = 0.9
Power calculation via non-central t approx.
CV1 and GMR = 0.9 in sample size est. used
No futility criterion
BE acceptance range = 0.8 ... 1.25

CV = 0.4; ntot(stage 1) = 265 (nT, nR = 135, 130); GMR = 0.9

1e+05 sims at theta0 = 0.9 (p(BE) = 'power').
p(BE)    = 0.91405
p(BE) s1 = 0.72275
Studies in stage 2 = 27.73%

Distribution of n(total)
- mean (range) = 312.8 (265 ... 628)
- percentiles
 5% 25% 50% 75% 95%
265 265 265 390 472

You have a chance of ≈72% to show BE already in the first stage and an overall chance of ≈91%; ≈28% of studies have to proceed to the second stage. As in Potvin’s methods, there is no futility criterion on the total sample size.
However, in this method you can specify one. Say, you don’t want more than 450 subjects:

power.tsd.p(method = "B", alpha = rep(0.0293, 2), n1 = c(135, 130),
            GMR = 0.9, CV = 0.4, targetpower = 0.9,
            npct = c(0.05, 0.25, 0.5, 0.75, 0.95), Nmax = 450)

Gives:

TSD with 2 parallel groups
Method B: alpha (s1/s2) = 0.0293 0.0293
CIs based on Welch's t-test
Target power in power monitoring and sample size est. = 0.9
Power calculation via non-central t approx.
CV1 and GMR = 0.9 in sample size est. used
Futility criterion Nmax = 450
BE acceptance range = 0.8 ... 1.25

CV = 0.4; ntot(stage 1) = 265 (nT, nR = 135, 130); GMR = 0.9

1e+05 sims at theta0 = 0.9 (p(BE) = 'power').
p(BE)    = 0.83875
p(BE) s1 = 0.72275
Studies in stage 2 = 17.91%

Distribution of n(total)
- mean (range) = 292 (265 ... 450)
- percentiles
 5% 25% 50% 75% 95%
265 265 265 265 434

Of course, you have the same chance to pass in the first stage as before. But since studies with a total sample size > 450 are considered a failure, fewer studies proceed to the second stage (≈18% vs ≈28%) and the overall power is lower than without the futility criterion (≈84% vs ≈91%).
Let’s now compare the empiric Type I Errors for both.

sig  <- binom.test(0.05 * 1e6, 1e6, alternative = "less",
                   conf.level = 0.95)$conf.int[2]
comp <- data.frame(study = c("no futility", "with futility"),
                   TIE = NA_real_, TIE.05 = FALSE,
                   signif = FALSE, TIE.052 = FALSE)
for (j in 1:2) {
  if (comp$study[j] == "no futility") {
    comp$TIE[j] <- power.tsd.p(method = "B", alpha = rep(0.0293, 2),
                               n1 = c(135, 130), GMR = 0.9, CV = 0.4,
                               targetpower = 0.9, test = "welch",
                               theta0 = 1.25, nsims = 1e6)$pBE
  } else {
    comp$TIE[j] <- power.tsd.p(method = "B", alpha = rep(0.0293, 2),
                               n1 = c(135, 130), GMR = 0.9, CV = 0.4,
                               targetpower = 0.9, test = "welch",
                               theta0 = 1.25, nsims = 1e6, Nmax = 450)$pBE
  }
  if (comp$TIE[j] > 0.05)  comp$TIE.05[j]  <- TRUE
  if (comp$TIE[j] > sig)   comp$signif[j]  <- TRUE
  if (comp$TIE[j] > 0.052) comp$TIE.052[j] <- TRUE
}
names(comp)[3:5] <- c(">0.05", "* >0.05",">0.052")
print(comp, row.names = FALSE)

Gives:

         study      TIE >0.05 * >0.05 >0.052
   no futility 0.045936 FALSE   FALSE  FALSE
 with futility 0.040638 FALSE   FALSE  FALSE

Lessons learned: We obtained the adjusted α for a CV of 20%. For a larger one (here 40%), the Type I Error will be similar or even lower. If we introduce a futility criterion, the Type I Error will always decrease because fewer studies will proceed to the second stage. This also holds for any published method. Therefore, you don’t have to repeat the simulations – that’s trivial and can be used as a justification.

A caveat: Actually it is not that simple. In practice you have to repeat this exercise for a range of unequal variances and group sizes in the first stage. It might be that you have to adjust more, based on the worst-case combination. I did that some time ago. It took me a week, four simultaneous R sessions, CPU load close to 90%…

Dif-tor heh smusma 🖖🏼 Long live Ukraine!
Helmut Schütz