Parallel (replicate?) [Two-Stage / GS Designs]
❝ I feel your pain…
❝ Any example how two stage adaptive design is applied/validated for parallel study or replicate design?
However, if you want something else, you need your own simulations to find a suitable adjusted α. Optionally you can also explore a futility criterion for the maximum total sample size. Not complicated in the R package Power2Stage.
Hint: In all TSDs the maximum inflation of the Type I Error occurs at a combination of low CV and small n1. Therefore, explore this area first. Once you have found a suitable adjusted α, simulate power and the empiric Type I Error for the entire grid. Regulators will ask you for that.
For parallel TSDs there are two functions in Power2Stage, namely power.tsd.pAF() and power.tsd.p():
- Function power.tsd.pAF() performs exactly as described in Fuglsang’s paper, namely the power monitoring steps and the sample size estimation are always based on the pooled t-test.
- Function power.tsd.p() with argument test="welch" on the other hand uses the genuine power of Welch’s test. Moreover it accepts unequal treatment groups in stage 1.
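To see what the choice between the pooled t-test and Welch’s test means for the confidence interval itself, here is a minimal sketch in base R. The stage 1 data are made up (unequal variances and group sizes), and the adjusted α of 0.0294 is only taken as an example; nothing here calls Power2Stage.

```r
# Hypothetical stage 1 data on the log scale (all numbers are made up)
set.seed(123456)
nT <- 24                                   # subjects under Test
nR <- 20                                   # subjects under Reference
logT <- rnorm(nT, mean = log(0.95), sd = 0.25)
logR <- rnorm(nR, mean = log(1.00), sd = 0.40)
adj    <- 0.0294                           # example of an adjusted alpha
level  <- 1 - 2 * adj                      # i.e., a 94.12% confidence interval
pooled <- t.test(logT, logR, var.equal = TRUE,  conf.level = level)
welch  <- t.test(logT, logR, var.equal = FALSE, conf.level = level)
# back-transform the confidence limits to the T/R ratio in percent
res <- data.frame(test  = c("pooled t", "Welch"),
                  df    = unname(c(pooled$parameter, welch$parameter)),
                  lower = 100 * exp(c(pooled$conf.int[1], welch$conf.int[1])),
                  upper = 100 * exp(c(pooled$conf.int[2], welch$conf.int[2])))
print(res, row.names = FALSE)
```

The point estimate is the same in both; only the standard error and the degrees of freedom differ. The pooled test uses nT + nR − 2 degrees of freedom, Welch’s test the (smaller) Satterthwaite approximation.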
❝ In case of parallel design what kind of statistical factor we need to use Potvin …
Anders used α = 0.0294 for the analogues of Potvin’s methods B and C. As usual, a slight inflation of the Type I Error in method C with CV ≤ 20% – which is unlikely in parallel designs anyway. Evaluation by the Welch-Satterthwaite test (for unequal variances and group sizes).
If someone knows what might be meant in the ICH M13A’s Section …
The use of stratification in the randomisation procedure based on a limited number of known relevant factors is therefore recommended. Those factors are also recommended to be accounted for […]
… please enlighten me.
❝ … or Bonferroni …
I think (‼) that it will be an acceptable alternative because it is the most conservative one (strictly speaking, it is not correct in a TSD because the hypotheses are not independent).
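The arithmetic behind “most conservative” is short. A sketch in base R, where 0.0294 is only used as the familiar comparison value for Potvin’s method B:

```r
alpha  <- 0.05                        # nominal level
stages <- 2                           # two looks at the data
bonf   <- alpha / stages              # Bonferroni: alpha split evenly
comp   <- data.frame(method = c("Bonferroni", "Potvin B"),
                     adj    = c(bonf, 0.0294))
comp$CI <- 100 * (1 - 2 * comp$adj)   # confidence level (%) of the interval
print(comp, row.names = FALSE)
```

Bonferroni leads to a 95% confidence interval in both stages, compared with 94.12% for an adjusted α of 0.0294. The interval is wider, so you pay with power or sample size.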
Assessors love Signore Bonferroni.
❝ … and same for 4-period or 3-period replicate design?
If you mean reference-scaling, no idea. You can try Bonferroni as well. Recently something was published by the FDA but it is wacky (see this post why I think so). I’m not convinced that it is worth the effort.
Plan the study for the assumed CVwR (and the CVwT if you have the information). In reference-scaling the observed CVwR is taken into account anyway. If the variability is higher than assumed, you can scale more and will gain power. If it is lower than assumed, bad luck. However, the crucial point is – as always – the GMR…
If you mean by ‘3-period replicate design’ the partial replicate (TRR|RTR|RRT) and want to use the FDA’s RSABE, please don’t (see this article why). It is fine for the EMA’s ABEL. If you want a 3-period replicate for the FDA, please opt for one of the full replicates (TRT|RTR, TTR|RRT, or TRR|RTT). Otherwise, you might be in deep shit.
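To illustrate what “you can scale more” means, here is a small sketch for the EMA’s ABEL in base R. The helper abel.limits() is a hypothetical name made up for this post. It assumes the usual ABEL rules: regulatory constant 0.760, conventional limits up to a CVwR of 30%, and a cap of the expansion at a CVwR of 50%.

```r
# Hypothetical helper: expanded limits of the EMA's ABEL (in percent)
abel.limits <- function(CVwR) {
  k   <- 0.760                          # regulatory constant
  CVc <- pmin(CVwR, 0.50)               # expansion is capped at CVwR 50%
  swR <- sqrt(log(CVc^2 + 1))           # within-subject SD of R (log scale)
  L   <- ifelse(CVwR <= 0.30, 0.80, exp(-k * swR))
  U   <- ifelse(CVwR <= 0.30, 1.25, exp(+k * swR))
  data.frame(CVwR = CVwR, L = round(100 * L, 2), U = round(100 * U, 2))
}
lim <- abel.limits(c(0.25, 0.30, 0.35, 0.40, 0.50, 0.60))
print(lim, row.names = FALSE)
```

If the observed CVwR turns out higher than assumed, the limits widen up to 69.84–143.19% at the cap. If it is 30% or lower, you are back at 80–125%.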
# start with a fixed sample design
library(PowerTOST)   # sampleN.TOST()
library(Power2Stage) # power.tsd.p()
CV <- 0.2 # max. inflation of the Type I Error with small CV
GMR <- 0.9 # realistic with large CV in parallel designs
target <- 0.9 # well…
n <- sampleN.TOST(CV = CV, theta0 = GMR, targetpower = target,
                  design = "parallel", print = FALSE)[["Sample size"]]
# first stage n1
n1 <- n / 2
# assess the empiric Type I Error at one of the BE-limits (under the Null)
# always one mio simulations (very time consuming…)
# try a range of adjusted alphas
alpha <- seq(0.0292, 0.0306, 0.0001)
# upper limit of the one-sided binomial test for 1 mio simulations
sig <- binom.test(0.05 * 1e6, 1e6, alternative = "less",
                  conf.level = 0.95)$conf.int[2]
res <- data.frame(alpha = alpha, TIE = NA_real_, TIE.05 = FALSE,
                  signif = FALSE, TIE.052 = FALSE)
# TIE.05 checks whether the TIE > 0.05
# signif checks whether the TIE > the limit of the binomial test for 1 mio sim’s
# TIE.052 checks whether the TIE > 0.052 (Potvin’s acceptable inflation)
pb <- txtProgressBar(style = 3)
for (j in seq_along(alpha)) {
  res$TIE[j] <- power.tsd.p(method = "B", alpha = rep(alpha[j], 2), n1 = n1,
                            GMR = GMR, CV = CV, targetpower = target,
                            test = "welch", theta0 = 1.25, nsims = 1e6)$pBE
  if (res$TIE[j] > 0.05)  res$TIE.05[j]  <- TRUE
  if (res$TIE[j] > sig)   res$signif[j]  <- TRUE
  if (res$TIE[j] > 0.052) res$TIE.052[j] <- TRUE
  setTxtProgressBar(pb, j / length(alpha))
}
close(pb)
wary <- which(res$TIE.05 == TRUE & res$TIE.052 == FALSE) # belt plus suspenders (EMA?)
res <- res[(head(wary, 1) - 1):(tail(wary, 1) + 1), ] # drop some alphas
names(res)[3:5] <- c(">0.05", "* >0.05", ">0.052") # cosmetics
print(res, row.names = FALSE)
alpha TIE >0.05 * >0.05 >0.052
0.0293 0.049518 FALSE FALSE FALSE
0.0294 0.050004 TRUE FALSE FALSE
0.0295 0.050178 TRUE FALSE FALSE
0.0296 0.050182 TRUE FALSE FALSE
0.0297 0.050486 TRUE TRUE FALSE
0.0298 0.050777 TRUE TRUE FALSE
0.0299 0.050772 TRUE TRUE FALSE
0.0300 0.050806 TRUE TRUE FALSE
0.0301 0.050974 TRUE TRUE FALSE
0.0302 0.050890 TRUE TRUE FALSE
0.0303 0.051308 TRUE TRUE FALSE
0.0304 0.051535 TRUE TRUE FALSE
0.0305 0.051616 TRUE TRUE FALSE
0.0306 0.052007 TRUE TRUE TRUE
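Where does the significance limit in the column “* >0.05” come from, and why one mio simulations? A short sketch in base R. The helper sig.limit() is a hypothetical name, and the numbers of simulations are just examples.

```r
# Hypothetical helper: upper limit of the one-sided binomial test, i.e.,
# the largest empiric Type I Error still compatible with a true 0.05
sig.limit <- function(nsims, alpha = 0.05) {
  binom.test(round(alpha * nsims), nsims, alternative = "less",
             conf.level = 0.95)$conf.int[2]
}
nsims  <- c(1e4, 1e5, 1e6)
limits <- data.frame(nsims = nsims,
                     limit = signif(sapply(nsims, sig.limit), 5))
print(limits, row.names = FALSE)
```

The fewer simulations, the wider the limit and the less you can say about a small inflation. With one mio simulations the limit is about 0.05036.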
With an adjusted α of 0.0296 the inflation is not significant (Type I Error 0.050182 < 0.050360). If you are a disciple of Madame Potvin, even 0.0305 would be OK (0.051616 < 0.052). Say, you opted for belt plus suspenders with 0.0293 (0.049518 < 0.05), planned the first stage with 300 subjects, and observed a CV of 40%. You had some dropouts (15 in one group and 20 in the other). Therefore, instead of n1 = 300, specify n1 = c(135, 130). What can you expect?
power.tsd.p(method = "B", alpha = rep(0.0293, 2), n1 = c(135, 130),
            GMR = 0.9, CV = 0.4, targetpower = 0.9,
            npct = c(0.05, 0.25, 0.5, 0.75, 0.95))
TSD with 2 parallel groups
Method B: alpha (s1/s2) = 0.0293 0.0293
CIs based on Welch's t-test
Target power in power monitoring and sample size est. = 0.9
Power calculation via non-central t approx.
CV1 and GMR = 0.9 in sample size est. used
No futility criterion
BE acceptance range = 0.8 ... 1.25
CV = 0.4; ntot(stage 1) = 265 (nT, nR = 135, 130); GMR = 0.9
1e+05 sims at theta0 = 0.9 (p(BE) = 'power').
p(BE) = 0.91405
p(BE) s1 = 0.72275
Studies in stage 2 = 27.73%
Distribution of n(total)
- mean (range) = 312.8 (265 ... 628)
- percentiles
5% 25% 50% 75% 95%
265 265 265 390 472
The run above had no futility criterion. However, in this method you can specify one with the argument Nmax. Say, you don’t want more than 450 subjects:
power.tsd.p(method = "B", alpha = rep(0.0293, 2), n1 = c(135, 130),
            GMR = 0.9, CV = 0.4, targetpower = 0.9,
            npct = c(0.05, 0.25, 0.5, 0.75, 0.95), Nmax = 450)
TSD with 2 parallel groups
Method B: alpha (s1/s2) = 0.0293 0.0293
CIs based on Welch's t-test
Target power in power monitoring and sample size est. = 0.9
Power calculation via non-central t approx.
CV1 and GMR = 0.9 in sample size est. used
Futility criterion Nmax = 450
BE acceptance range = 0.8 ... 1.25
CV = 0.4; ntot(stage 1) = 265 (nT, nR = 135, 130); GMR = 0.9
1e+05 sims at theta0 = 0.9 (p(BE) = 'power').
p(BE) = 0.83875
p(BE) s1 = 0.72275
Studies in stage 2 = 17.91%
Distribution of n(total)
- mean (range) = 292 (265 ... 450)
- percentiles
5% 25% 50% 75% 95%
265 265 265 265 434
Let’s compare now the empiric Type I Errors for both.
sig <- binom.test(0.05 * 1e6, 1e6, alternative = "less",
                  conf.level = 0.95)$conf.int[2]
comp <- data.frame(study = c("no futility", "with futility"),
                   TIE = NA_real_, TIE.05 = FALSE,
                   signif = FALSE, TIE.052 = FALSE)
for (j in 1:2) {
  if (comp$study[j] == "no futility") {
    comp$TIE[j] <- power.tsd.p(method = "B", alpha = rep(0.0293, 2),
                               n1 = c(135, 130), GMR = 0.9, CV = 0.4,
                               targetpower = 0.9, test = "welch",
                               theta0 = 1.25, nsims = 1e6)$pBE
  } else {
    comp$TIE[j] <- power.tsd.p(method = "B", alpha = rep(0.0293, 2),
                               n1 = c(135, 130), GMR = 0.9, CV = 0.4,
                               targetpower = 0.9, test = "welch",
                               theta0 = 1.25, nsims = 1e6, Nmax = 450)$pBE
  }
  if (comp$TIE[j] > 0.05)  comp$TIE.05[j]  <- TRUE
  if (comp$TIE[j] > sig)   comp$signif[j]  <- TRUE
  if (comp$TIE[j] > 0.052) comp$TIE.052[j] <- TRUE
}
names(comp)[3:5] <- c(">0.05", "* >0.05", ">0.052")
print(comp, row.names = FALSE)
study TIE >0.05 * >0.05 >0.052
no futility 0.045936 FALSE FALSE FALSE
with futility 0.040638 FALSE FALSE FALSE
A caveat: Actually it is not that simple. In practice you have to repeat this exercise for a range of unequal variances and group sizes in the first stage. It might be that you have to adjust more based on the worst case combination. I did that some time ago. Took me a week, four simultaneous R sessions, CPU-load close to 90%…
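If you want to set up such an exercise, a grid of scenarios is a reasonable start. A minimal sketch in base R; all values (CVs, variance ratios, allocation, n1) are arbitrary placeholders and not the ones I used, and the simulation call itself is left out on purpose.

```r
# Illustrative values only; choose ranges that fit your own study
CV.T  <- c(0.2, 0.3, 0.4)           # CV of the Test group
ratio <- c(0.5, 1, 2)               # variance ratio T/R on the log scale
frac  <- c(0.4, 0.5, 0.6)           # fraction of stage 1 subjects under Test
n1    <- 100                        # total stage 1 sample size
grid  <- expand.grid(CV.T = CV.T, ratio = ratio, frac = frac)
var.T <- log(grid$CV.T^2 + 1)       # variance on the log scale from the CV
grid$CV.R <- sqrt(exp(var.T / grid$ratio) - 1)   # CV of the Reference group
grid$nT   <- round(n1 * grid$frac)
grid$nR   <- n1 - grid$nT
# distribute the scenarios over four sessions (row indices per session)
chunks <- split(seq_len(nrow(grid)), rep_len(1:4, nrow(grid)))
print(head(grid), row.names = FALSE)
```

Each row is one scenario to be simulated at a BE limit for every candidate α. The largest empiric Type I Error over all rows decides which adjusted α survives.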
