## A rant [Two-Stage / GS Designs]

With ICH M13C on the horizon (initial work is expected to start in December 2024), a publication [1] came to my attention: a great review outlining the current approaches and conditions in various jurisdictions. However, I think that the authors erred in one specific case, which I quote below in part. I changed only the numbering of references and added two [5, 9] which are missing in this section of the discussion (pages 14–15 of the PDF).

**Adaptive designs**

[…]

According to Reference [2], if the sample size of both stages is the same, the confidence level does not need to be 95% in both stages, which corresponds to the conservative Bonferroni’s adjusted. Instead, it can be 94.12% (which corresponds to an alpha of 0.0294, since the corresponding confidence level is (100-(2·alpha·100)), as indicated in the EMA guideline [3]. However, that paper refers to parallel designs and it has been highlighted that the value is also unnecessarily conservative for cross-over designs [4]. A confidence level of 93.92% (alpha of 0.0304) could be used in the case of the Pocock design for 2 × 2 cross-over trials. The WHO guideline [5] also includes the possibility of not spending any alpha in the interim analysis after stage 1. This is because BE is not intended to be demonstrated in the interim analysis (i.e., the confidence interval is not calculated). Instead, this information can be used for a futility analysis, which does not use any alpha, i.e., to stop the trial when demonstration of equivalence is unlikely with the pre-defined sample sizes. In Canada and the United States, the recommendation of method C published by Potvin et al is/was an acceptable approach [6]. However, additional papers [7, 8] provided different conclusions due to changes in the assumptions of the simulated scenarios. As a result, these simulations are not considered enough justification of preservation of the overall type I error (alpha) in the EU. In fact, as Maurer et al. [9] explained none of the 4 variations investigated by Potvin et al. formally controls the type I error rate of falsely claiming ABE, even though the amount of inflation produced by Method C was considered acceptable. A major disadvantage of assessing type I error rate inflation using simulation is that without exploring all possible scenarios for the intended design and analysis, it is impossible to ensure control of the type I error rate [10, 11].

- According to Reference [2], if the sample size of both stages is the same, …
Even in a group sequential design (GSD) this is practically never the case. An all too often overlooked detail: if the interim analysis is performed at < ½ *N* (due to dropouts), one has to use an error-spending function (*e.g.*, Lan–DeMets, Jennison–Turnbull) to control the Type I Error.
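As an illustration of error-spending, the cumulative α spent at an arbitrary information fraction can be sketched with the Lan–DeMets spending function approximating Pocock’s bounds. This is a sketch under my assumptions (the function name is mine); note that it returns the *cumulative* α spent up to the look, not the nominal level of the test performed at that look.

```r
# Sketch: Lan–DeMets error-spending function approximating Pocock’s bounds,
#   alpha(t) = alpha * log(1 + (e - 1) * t),
# where t is the information fraction (subjects at the interim divided by N).
# Returns the cumulative alpha spent up to t. Function name is mine.
LanDeMets.Pocock <- function(alpha = 0.05, t) {
  alpha * log(1 + (exp(1) - 1) * t)
}
t <- c(0.40, 0.50, 0.60) # interim earlier than, at, and later than N / 2
data.frame(t = t, cum.alpha = signif(LanDeMets.Pocock(t = t), 4))
```

With this function no assumption about the timing of the interim has to be made in the protocol – the α actually spent follows from the information fraction actually reached.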

- … the confidence level […] can be 94.12% (which corresponds to an alpha of 0.0294, since the corresponding confidence level is (100-(2·alpha·100)), as indicated in the EMA guideline [3].
While correct, that’s a strange way to specify the confidence level in percent; more common is 100(1 − 2 *α*). The guideline’s 94.12% = 100(1 − 2 × 0.0294) is given as an *example*, and the guideline continues with »… there are many acceptable alternatives and the choice of how much alpha to spend at the interim analysis is at the company’s discretion.«
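The arithmetic is easily verified; as a sketch, PowerTOST’s `CI.BE()` gives the corresponding interval (the point estimate, CV, and sample size below are invented for illustration only):

```r
library(PowerTOST)
# 100(1 - 2 alpha): an adjusted alpha of 0.0294 corresponds to the 94.12% CI
100 * (1 - 2 * 0.0294)
# hypothetical example: the 94.12% CI in a 2x2x2 crossover with an assumed
# point estimate of 0.95, CV 0.25, and 28 subjects (all values invented)
round(100 * CI.BE(alpha = 0.0294, pe = 0.95, CV = 0.25, n = 28), 2)
```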

- … the value is also unnecessarily conservative for cross-over designs [4]. A confidence level of 93.92% (alpha of 0.0304) could be used in the case of the Pocock design for 2 × 2 cross-over trials.

That’s simply wrong. I wish they had read the paper(s) beyond the abstract(s). Pocock’s *α* = 0.0294 is for *superiority* testing in a parallel GSD with a *fixed* total sample size *N*, assuming equal variances (conventional *t*-tests), and one interim at exactly *N*/2 [2]. Kieser and Rauch [4] didn’t get it right either: *α* = 0.0304 is Pocock’s for *equivalence* but again in a parallel GSD and *not* for a 2×2×2 crossover design. When exploring the supplementary material of [4], in a few cases a slight inflation of the Type I Error can be seen. For the conditions of [6] we found that *α* = 0.0301 is required to control the Type I Error.

BTW, who is interested in a GSD? It requires a best-guess CV to estimate *N*, which is – due to the adjusted alpha – always larger than the fixed sample design’s *n*.^{A}

```
  CV   n   N   n to N
0.30 156 186 +19.23%
0.35 208 248 +19.23%
0.40 266 318 +19.55%
0.45 332 396 +19.28%
0.50 400 478 +19.50%
```

IMHO, a GSD would only be advantageous if planned for early stopping in the interim due to futility.

On the other hand, if the CV in the first stage is much lower than the one we assumed, only a few more subjects would be needed to show BE. However, we still have to go full throttle and perform the second stage with *N*/2 subjects. This is both ethically and economically questionable. Essentially we ignore the information of the first stage and rely only on our assumptions. So why not use a fixed sample design in the first place?

- In Canada and the United States, the recommendation of method C published by Potvin et al is/was an acceptable approach [6].
Correct according to the guidances and confirmed by the FDA [12].

- However, additional papers [7, 8] provided different conclusions due to changes in the assumptions of the simulated scenarios.
Of course, different assumptions (GMR, CV, target power) will require different alphas. That’s trivial. What the heck is meant by »different conclusions«?

- As a result, these simulations are not considered enough justification of preservation of the overall type I error (alpha) in the EU.
Well roared, lions! I got too many deficiency letters on my desk, which is not surprising [13, 14]. While for the 2×2×2 crossover design the exact method [9] has been implemented in the R package Power2Stage [15] since October 2017, only simulation-based methods are available for a parallel design [16, 17].

- In fact, as Maurer et al. [9] explained none of the 4 variations investigated by Potvin et al. formally controls the type I error rate of falsely claiming ABE, even though the amount of inflation produced by Method C was considered acceptable.
Of course, there is no formal proof of controlling the Type I Error in simulation-based methods.

Maurer et al. [9] state correctly »The conclusion we want to emphasize is that if the parameters of the design are changed, there is no guarantee with Method B or any method that uses a TOST procedure on the pooled data from both stages as a second stage test, that the type I error rate will not be inflated above the nominal value.« (my emphasis)

»[…] the method should be prespecified. This study did not seek to find the best possible two-stage design, but rather to find good ones that could be used by sponsors without further validation.« [6] In other words, if one wants something which is not published and validated, own simulations are required. There is no free lunch.

A simulation-based method simply must not be modified. Nevertheless, the authors modified ‘method B’ (GMR 0.85, power 90%) and therefore observed an inflated Type I Error in Figure 6. That’s (i) not surprising, because such a setup is not covered by *this* framework (GMR 0.95, power 80%) [6], and (ii) thus cannot be used as an argument against simulation-based methods. Different GMR and/or power require a *different* adjusted *α* [18, 19]. Easy to show. Although not necessary for [9], we can also assess its empiric Type I Error [15].^{B}

```
setup                      CV  n1 alpha  GMR  power TIE
Potvin et al. [6] method B 0.4 36 0.0294 0.95 0.8   0.04871
Molins et al. [19] Type 1  0.4 36 0.0301 0.95 0.8   0.04984
Maurer et al. [9] Figure 5 0.4 36 0.0263 0.95 0.8   0.04538
Maurer et al. [9] Figure 6 0.4 36 0.0294 0.85 0.9   0.05524
Molins et al. [19] Type 1  0.4 36 0.0262 0.85 0.9   0.04889
Maurer et al. [9]          0.4 36 0.0263 0.85 0.9   0.04528
```

More adjustment (*i.e.*, smaller *α* → wider CI) has to be used if the GMR moves away from 1 and/or the desired power increases. Simple.

- A major disadvantage of assessing type I error rate inflation using simulation is that without exploring all possible scenarios for the intended design and analysis, …
What might be meant by »all possible scenarios«? An infinitesimally small step size for the CV and a step size of 1 for *n*_{1}? How many simulations are considered sufficient? One million (which is the current standard) or even more?
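For perspective, the Monte Carlo uncertainty of an empiric TIE estimated at the nominal α 0.05 is simple binomial arithmetic – a sketch, nothing here depends on the particular BE setup:

```r
# Standard error of an empiric Type I Error estimated from nsims simulations
# when the true TIE equals the nominal alpha 0.05: sqrt(p * (1 - p) / nsims).
# With one million simulations the TIE is pinned down to roughly +/-0.0004
# (95% CI half-width) – tight enough for any practical purpose.
nsims <- c(1e5, 1e6, 1e7)
se    <- sqrt(0.05 * (1 - 0.05) / nsims)
data.frame(nsims = nsims, se = signif(se, 3), half.CI95 = signif(1.96 * se, 3))
```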

I saw deficiency letters stating that the grid of [6] (step size for the CV 10% and for *n*_{1} 12) was too ‘coarse’. Come on! Easily doable with an extremely narrow grid. The surfaces in the right panel of the plot below show the power in the first stage (lower one) and the final analysis (upper one).

What do regulators suspect for simulation-based methods? That there is a sudden ‘spike’ 🚀 in the Type I Error surface at the CV and *n*_{1} observed in the study? Gimme a break! In two deficiency letters assessors required a *post hoc* evaluation of the empiric Type I Error. See also this discouraging post and that one.

- … it is impossible to ensure control of the type I error rate [10, 11].
I fail to find such a statement in both references. Sorry folks, are you on a mission?

- […] the current review publication […] from the IPRP BEWGG describe several topics that will be addressed under future ICH projects (e.g., ICH M13B and M13C guidelines). It is evident that the continued efforts of the IPRP BEWGG will have a notable impact with respect to promoting collaboration and achieving regulatory convergence and harmonization in the field of BE study performance.
I sincerely hope that the ICH M13 Expert Working Group has more statistical expertise than the 30 (‼) authors of this publication.

I became interested in adaptive designs for bioequivalence almost thirty years ago [22, 23]. It is disheartening to observe the lack of advancement and the prevalence of misinterpretations of methods [24].

- Freitas Fernandes EA, van Oudtshoorn J, Tam A, Arévalo González LA, Aurela EG, Potthast H, Mettke K, Kuribayashi R, Shimojo K, Kasuga M, Morales L, Rodríguez Z, Jones B, Ahn C, Yun E, Kim SH, Rodrigues C, Tiong T, Crane C, Walther C, Roost MS, Chen T-L, Hsu L-f, Braddy AC, García-Arieta A, Abalos I, Divinsky M, Alsuwyeh A, Alzenaidy B, Alhar A.
*The bioequivalence study design recommendations for immediate-release solid oral dosage forms in the international pharmaceutical regulators programme participating regulators and organisations: differences and commonalities.* J Pharm Pharmaceut Sci. 2024; 27: 12398. doi:10.3389/jpps.2024.12398

- Pocock SJ.
*Group sequential methods in the design and analysis of clinical trials.* Biometrika. 1977; 64(2): 191–9. doi:10.2307/2335684

- EMA (CHMP).
*Guideline on the Investigation of Bioequivalence.* London. 20 January 2010.

- Kieser M, Rauch G.
*Two-stage designs for cross-over bioequivalence trials.* Stat Med. 2015; 34(16): 2403–16. doi:10.1002/sim.6487

- WHO Expert Committee on Specifications for Pharmaceutical Preparations.
*Multisource (generic) pharmaceutical products: guidelines on registration requirements to establish interchangeability.* Fifty-first report. Technical Report Series, No. 992, Annex 6. Geneva. April 2017.

- Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ, Smith RA.
*Sequential design approaches for bioequivalence studies with crossover designs.* Pharm Stat. 2008; 7: 245–62. doi:10.1002/pst.294

- Montague TH, Potvin D, DiLiberti CE, Hauck WW, Parr AF, Schuirmann DJ.
*Additional results for ‘Sequential design approaches for bioequivalence studies with crossover designs’.* Pharm Stat. 2011; 11: 8–13. doi:10.1002/pst.483

- Xu J, Audet C, DiLiberti CE, Hauck WW, Montague TH, Parr AF, Potvin D, Schuirmann DJ.
*Optimal adaptive sequential designs for crossover bioequivalence studies.* Pharm Stat. 2016; 15(1): 15–27. doi:10.1002/pst.1721

- Maurer W, Jones B, Chen Y.
*Controlling the type 1 error rate in two-stage sequential designs when testing for average bioequivalence.* Stat Med. 2018; 37(10): 1–21. doi:10.1002/sim.7614

- South African Health Products Regulatory Authority.
*Quality and Bioequivalence Guideline.* Pretoria. 2023.

- Lee J, Feng K, Xu M, Gong X, Sun W, Kim J, Zhang Z, Wang M, Fang L, Zhao L.
*Applications of adaptive designs in generic drug development.* Clin Pharmacol Ther. 2021; 110(1): 32–5. doi:10.1002/cpt.2050

- Jiang X.
*Adaptive design and alpha adjustment: FDA position.* Presentation at: 5^{th} GBHI conference. Amsterdam. 28 September 2022.

- Kaza M, Sokolowki A, Rudzki PJ.
*10th Anniversary of a Two-Stage Design in Bioequivalence. Why Has it Still Not Been Implemented?* Pharm Res. 2020; 37(7): 140. doi:10.1007/s11095-020-02871-3

- Urach S.
*Two stage Designs and their Acceptability in the EC Area.* Presentation at: 6^{th} International Workshop – GBHI 2024. Rockville, MD. 16 April 2024.

- Labes D, Lang B, Schütz H.
*Power2Stage: Power and Sample-Size Distribution of 2-Stage Bioequivalence Studies.* Package version 0.5-4. 2021-11-20. https://cran.r-project.org/package=Power2Stage

- Fuglsang A.
*Sequential Bioequivalence Approaches for Parallel Design.* AAPS J. 2014; 16(3): 373–8. doi:10.1208/s12248-014-9571-1

- Schütz H.
*Novel approaches in adaptive designs and α adjustment, e.g., with futility criteria and for parallel design studies.* Presentation at: 5^{th} GBHI conference. Amsterdam. 28 September 2022.

- Fuglsang A.
*Sequential Bioequivalence Trial Designs with Increased Power and Controlled Type I Error Rates.* AAPS J. 2013; 15: 659–61. doi:10.1208/s12248-013-9475-5

- Molins E, Labes D, Schütz H, Cobo E, Ocaña J.
*An iterative method to protect the type I error rate in bioequivalence studies under two-stage adaptive 2×2 crossover designs.* Biom J. 2021; 63(1): 122–33. doi:10.1002/bimj.201900388

- König F, Wolfsegger M, Jaki T, Schütz H, Wassmer G.
*Adaptive two-stage bioequivalence trials with early stopping and sample size re-estimation.* Poster P1.2.88 at: 35^{th} Annual Conference of the International Society for Clinical Biostatistics. Vienna. 2014. doi:10.13140/RG.2.1.5190.0967

- König F, Wolfsegger M, Jaki T, Schütz H, Wassmer G.
*Adaptive two-stage bioequivalence trials with early stopping and sample size re-estimation.* Trials. 2015; 16(Suppl 2): P218. doi:10.1186/1745-6215-16-S2-P218

- Gould AL.
*Group Sequential Extensions of a Standard Bioequivalence Testing Procedure.* J Pharmacokin Biopharm. 1995; 23(1): 57–86. doi:10.1007/bf02353786

- Hauck WW, Preston PE, Bois FY.
*A group sequential approach to crossover trials for average bioequivalence.* J Biopharm Stat. 1997; 7(1): 87–96. doi:10.1080/10543409708835171

- Schütz H.
*Two-stage designs in bioequivalence trials.* Eur J Clin Pharmacol. 2015; 71(3): 271–81. doi:10.1007/s00228-015-1806-2

R-scripts:

- Comparison of fixed sample (*n*) and GSD (*N*):

```r
library(PowerTOST)
alpha  <- c(0.05, 0.0304) # 2nd element: Pocock’s for equivalence, parallel design
theta0 <- 0.90            # reasonable, given the high CVs
CV     <- seq(0.3, 0.5, 0.05)
x      <- data.frame(CV = CV, n = NA_integer_, N = NA_integer_)
for (j in seq_along(CV)) {
  x$n[j] <- sampleN.TOST(alpha = alpha[1], CV = CV[j], theta0 = theta0,
                         design = "parallel", print = FALSE)[["Sample size"]]
  x$N[j] <- sampleN.TOST(alpha = alpha[2], CV = CV[j], theta0 = theta0,
                         design = "parallel", print = FALSE)[["Sample size"]]
}
x$penalty   <- sprintf("%+.2f%%", 100 * (x$N / x$n - 1))
names(x)[4] <- "n to N"
print(x, row.names = FALSE)
```

- Misspecification of GMR and power in method B (only valid for GMR 0.95 and power 80%) by Maurer et al. in Figure 6; the correct adjusted α is 0.0262 in both stages. Assessment of the exact method by one million simulations for comparison.

```r
library(Power2Stage)
x <- data.frame(setup = c("Potvin et al. [6] method B",
                          "Molins et al. [19] Type 1",
                          "Maurer et al. [9] Figure 5",
                          "Maurer et al. [9] Figure 6",
                          "Molins et al. [19] Type 1",
                          "Maurer et al. [9]"),
                CV = rep(0.4, 6), n1 = rep(36, 6),
                alpha = c(0.0294, 0.0301, NA, 0.0294, 0.0262, NA),
                GMR = c(rep(0.95, 3), rep(0.85, 3)),
                power = c(rep(0.8, 3), rep(0.9, 3)))
for (j in 1:6) {
  if (j %in% c(1:2, 4:5)) { # method B / Type 1
    x$TIE[j] <- power.tsd(method = "B", alpha = rep(x$alpha[j], 2),
                          CV = x$CV[j], n1 = x$n1[j], GMR = x$GMR[j],
                          theta0 = 0.8, targetpower = x$power[j])$pBE
  } else {                  # Maximum Combination Test with futility
    y <- power.tsd.in(CV = x$CV[j], n1 = x$n1[j], GMR = x$GMR[j],
                      theta0 = 0.8, targetpower = x$power[j])
    x$alpha[j] <- signif(y$alpha[1], 3)
    x$TIE[j]   <- y$pBE
  }
}
x$TIE <- signif(x$TIE, 4)
print(x, row.names = FALSE, right = FALSE)
```

*Dif-tor heh smusma* 🖖🏼 Long live Ukraine!


Helmut Schütz


### Complete thread:

- A rant Helmut 2024-08-26 12:52 [Two-Stage / GS Designs]
- A rant Achievwin 2024-08-28 02:16
- Parallel (replicate?) Helmut 2024-08-28 10:54
