## But what is the real problem? [Two-Stage / GS Designs]

Hi Mikalai,

» We have a drug and do not know it CV. We take an arbitrary sample …

Well, I always try to make an educated guess of the CV. If you aim too low in the first stage, the sample size penalty in the second stage will be larger.
Example: “Guesstimate” CV 25%, Potvin B, n1 12 or 24.

```r
library(Power2Stage)
power.tsd(CV=0.25, n1=12)
```

```
TSD with 2x2 crossover
Method B: alpha (s1/s2) = 0.0294 0.0294
Target power in power monitoring and sample size est. = 0.8
Power calculation via non-central t approx.
CV1 and GMR = 0.95 in sample size est. used
No futility criterion
BE acceptance range = 0.8 ... 1.25

CV = 0.25; n(stage 1) = 12; GMR = 0.95

1e+05 sims at theta0 = 0.95 (p(BE) = 'power').
p(BE)    = 0.81438
p(BE) s1 = 0.17895
Studies in stage 2 = 81.38%

Distribution of n(total)
- mean (range) = 32.4 (12 ... 126)
- percentiles
 5% 50% 95%
 12  32  60
```

```r
power.tsd(CV=0.25, n1=24)
```

```
TSD with 2x2 crossover
Method B: alpha (s1/s2) = 0.0294 0.0294
Target power in power monitoring and sample size est. = 0.8
Power calculation via non-central t approx.
CV1 and GMR = 0.95 in sample size est. used
No futility criterion
BE acceptance range = 0.8 ... 1.25

CV = 0.25; n(stage 1) = 24; GMR = 0.95

1e+05 sims at theta0 = 0.95 (p(BE) = 'power').
p(BE)    = 0.84244
p(BE) s1 = 0.63203
Studies in stage 2 = 33.56%

Distribution of n(total)
- mean (range) = 29 (24 ... 86)
- percentiles
 5% 50% 95%
 24  24  48
```

With 12 subjects in the first stage you will have, on average, a total sample size of 32.4 (median 32); with 24 it is only 29 (median 24). In the latter case you already have a 63% chance of showing BE in the first stage, in the former only 18%.

» … and then calculate the CV and real GMR in the first stage.

Yep. But in ‘Type 1’ TSDs you generally ignore the observed GMR and work with a fixed (assumed) T/R-ratio.

» In the second stage, we, if I understand correctly, should calculate post-hoc power and recalculate the sample size with the data (GMR and CV) obtained in the first stage, if bioequivalence has not been achieved in the first stage.

Nope. You calculate interim power after the first stage. If you want to use the GMR of the first stage as well (go fully adaptive) you might shoot yourself in the foot. Practically you need two futility criteria:
1. Stop if the GMR is outside [0.80, 1.25].
2. Stop if the re-estimated sample size is above a pre-specified limit (U).
The methods of Karalis & Macheras might (!) have terrible power, making their application ethically doubtful. Here it works with U = 120 if n1 = 24 but not with n1 = 12 (power only ~73%). There are alternatives where you don’t stop if n1+n2 > U but perform the second stage in U − n1 subjects. No problem with the Type I Error, but this might compromise power; I suggest simulations.
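The two futility criteria, plus the “capped” alternative, can be sketched in a few lines of base R. This is only an illustration of the decision logic after the interim analysis (BE not shown in stage 1): `interim_decision()` and its arguments are hypothetical names, not part of Power2Stage, and the re-estimated total sample size `n_total` is assumed to come from whatever method the protocol specifies.

```r
# Hypothetical helper (NOT a Power2Stage function): applies the two
# futility criteria once stage 1 has failed to show BE.
#   gmr1     observed GMR in stage 1
#   n1       stage 1 sample size
#   n_total  re-estimated total sample size (assumed to be given)
#   U        pre-specified upper limit of the total sample size
#   cap_at_U if TRUE, do not stop when n_total > U but use U - n1 instead
interim_decision <- function(gmr1, n1, n_total, U,
                             lower = 0.80, upper = 1.25,
                             cap_at_U = FALSE) {
  # Criterion 1: observed GMR outside the acceptance range
  if (gmr1 < lower || gmr1 > upper) {
    return(list(action = "stop", reason = "GMR outside limits", n2 = 0))
  }
  # Criterion 2: re-estimated total sample size above the limit U
  if (n_total > U) {
    if (cap_at_U) {
      # Alternative: run stage 2 in the remaining U - n1 subjects
      return(list(action = "continue", reason = "n(total) capped at U",
                  n2 = U - n1))
    }
    return(list(action = "stop", reason = "n(total) > U", n2 = 0))
  }
  # Otherwise continue with the re-estimated stage 2 sample size
  list(action = "continue", reason = "within limits",
       n2 = max(n_total - n1, 0))
}

# Illustrative inputs only (n_total = 140 is made up, not a simulation result)
interim_decision(gmr1 = 0.83, n1 = 24, n_total = 140, U = 120)$action
# "stop"
interim_decision(gmr1 = 0.83, n1 = 24, n_total = 140, U = 120,
                 cap_at_U = TRUE)$n2
# 96
```

The sketch only shows where each criterion bites; it says nothing about power, which has to be checked by simulations of the complete framework.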

» But what to do, if we have got bad GMR(0,83), whatever CV and low power (around 30) in the first stage.

You are free to include futility criteria for early stopping in the method. You don’t have to worry about the adjusted α because any futility criterion decreases the patient’s risk.

» In this case, as I understand, we should recruit much more subjects according to our protocol.

In general you should not give a total sample size in the protocol – unless it is part of the framework (simulations recommended: have an eye on power).

If you are courageous try the Inverse-Normal Combination Method / Maximum Combination Test (Maurer et al. 2018).
Pro: Proven to preserve the Type I Error (makes regulatory statisticians happy).
Con: Might be the first time they ever have seen sumfink like this. Expect questions.
Example: As above, but with two futility criteria: stop if the GMR is outside [0.80, 1.25] or if the total sample size would exceed 120. The GMR observed in the first stage is used (fully adaptive).

```r
library(Power2Stage)
power.tsd.in(CV=0.25, n1=24, usePE=TRUE, fClower=0.8, fCupper=1.25,
             fCNmax=120, fCrit=c("PE", "Nmax"))
```

```
TSD with 2x2 crossover
Inverse Normal approach
 - maximum combination test (weights = 0.5 0.25)
 - alpha (s1/s2) = 0.02635 0.02635
 - critical value (s1/s2) = 1.93741 1.93741
 - with conditional error rates and conditional power
Overall target power = 0.8
Threshold in power monitoring step for futility = 0.8
Power calculation via non-central t approx.
CV1 and PE1 in sample size est. used
Futility criterion Nmax = 120
Futility criterion PE outside 0.8 ... 1.25
Minimum sample size in stage 2 = 4
BE acceptance range = 0.8 ... 1.25

CV = 0.25; n(stage 1) = 24; GMR = 0.95

1e+05 sims at theta0 = 0.95 (p(BE) = 'power').
p(BE)    = 0.84312
p(BE) s1 = 0.60784
Studies in stage 2 = 30.03%

Distribution of n(total)
- mean (range) = 31.3 (24 ... 120)
- percentiles
 5% 50% 95%
 24  24  68
```
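For orientation, this is the general form of the test statistics behind the inverse-normal approach, as I understand the method. The notation is mine: $p_{1j}$ and $p_{2j}$ are the stage-wise p-values of the two one-sided tests ($j = 1, 2$), $w$ and $w^{*}$ are the two pre-specified weights (0.5 and 0.25 in the output above), and $\Phi^{-1}$ is the standard normal quantile function.

$$
Z_j(w) = \sqrt{w}\,\Phi^{-1}(1 - p_{1j}) + \sqrt{1 - w}\,\Phi^{-1}(1 - p_{2j}), \qquad j = 1, 2
$$

$$
Z_{\max,j} = \max\left\{ Z_j(w),\; Z_j(w^{*}) \right\}
$$

Each one-sided null hypothesis is rejected if $Z_{\max,j}$ exceeds the critical value, and BE is concluded only if both are rejected. The critical value of 1.93741 in the output matches the reported adjusted $\alpha$ of 0.02635, since $\Phi^{-1}(1 - 0.02635) \approx 1.937$. Because the weights are fixed in advance, the Type I Error is controlled whatever sample size is re-estimated at the interim.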

» The questions, how can we avoid slipping into "forced bioequivalence"? Or should we go straight away and recruit this large number of subjects? What should be put into the protocol?

As ElMaestro wrote above, I don’t see how you could run into “forced BE”.
• No defined total sample size. You perform the second stage in the re-estimated sample size.
• If at the end the CV is lower and/or the GMR “better” than assumed, be happy. Post hoc power is irrelevant. Open a bottle of champagne.

Dif-tor heh smusma 🖖
Helmut Schütz
