Ohlbe
★★★

France,
2020-12-04 12:04
(320 d 07:54 ago)

Posting: # 22112
Views: 2,432

## WHO frequent deficiencies in protocols and reports [BE/BA News]

Dear all,

The WHO has published a 7-page document listing frequent deficiencies in BE protocols, reports and practices.

Helmut will be happy with at least one recommendation: no need to calculate and report post-hoc power. I expect some others to stir some discussions here (e.g. 2-stage design).

Regards
Ohlbe
ElMaestro
★★★

Denmark,
2020-12-04 14:10
(320 d 05:48 ago)

@ Ohlbe
Posting: # 22113
Views: 2,142

## WHO frequent deficiencies in protocols and reports

Thanks Ohlbe,

does anyone know what this sentence means: "The calculation of the 90% confidence interval (CI) of the mean test/comparator ratio for the primary PK parameters should not be confused with the two one-sided t-tests employed to reject the null hypothesis of non-equivalence. The end result is the same, but these are not the same calculations."

Pass or fail!
ElMaestro
Helmut
★★★  Vienna, Austria,
2020-12-04 20:30
(319 d 23:28 ago)

@ ElMaestro
Posting: # 22115
Views: 2,134

## CI inclusion ≠ TOST

Hi ElMaestro,

» does anyone know what this sentence means:

Yes.

» "The calculation of the 90% confidence interval (CI) of the mean test/comparator ratio for the primary PK parameters…

$$\small{H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\notin \left [ \theta_1, \theta_2 \right]\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2}\tag{1}$$

» … should not be confused with the two one-sided t-tests employed to reject the null hypothesis of non-equivalence…

$$\small{H_\textrm{0L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}} \leq \theta_1\:vs\:H_\textrm{1L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}>\theta_1}\tag{2a}$$

$$\small{H_\textrm{0U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}} \geq \theta_2\:vs\:H_\textrm{1U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2}\tag{2b}$$

» … The end result is the same, but these are not the same calculations."

Exactly! For decades, global guidelines have asked for the confidence interval inclusion approach $$\small{(1)}$$. In Schuirmann’s famous TOST procedure $$\small{(2)}$$ one gets two p-values: one for $$\small{(2\textrm{a})}$$ and another one for $$\small{(2\textrm{b})}$$. Nothing else. BE is concluded if both Nulls are rejected.
I think that statisticians of the WHO are fed up reading in $$\frac{\mathfrak{protocols}}{\mathfrak{reports}}$$ …

Bioequivalence $$\frac{\textrm{will be}}{\textrm{has been}}$$ assessed by the Two-One-Sided Tests procedure (Schuirmann 1987).

… only to find the 90% CI in the report.
BTW, I have seen TOST performed only once (in 1991). It led to a deficiency letter: “The applicant should provide the 90% CI.”

Chow and Liu¹ erred when stating

The two one-sided t tests procedure is operationally equivalent* to the classic (shortest) confidence interval approach; that is, if the classic (1–α)×100% confidence interval for µT/µR is within (θL, θU), then both H01 and H02 are also rejected at the α level by the two one-sided t tests procedure.

I would not go as far as Brown et al.,² who state that

This similarity [between level α TOSTs and a 1–2α CI] is somewhat of a fiction, based more on an algebraic coincidence rather than a statistical equivalence.
[my insert]

More details are given by Berger and Hsu.³ Already in the abstract:

The misconception that size-α bioequivalence tests generally correspond to 100(1–2α)% confidence sets […] lead[s] to incorrect statistical practices, and should be abandoned.

When reviewing stuff I insist on deleting the – all too common – TOST-statement as well (i.e., claiming $$\small{(2)}$$ whilst performing $$\small{(1)}$$).

• I never understood what “operationally equivalent” means. Technically calculating 2 one sided t-tests is not the same as a two-sided 90% CI. Or do they mean hitting once the models are set up in the preferred software?
You can’t really say “similar” if it’s the same again you want.
“Similar” means something different.
Anthony Burgess (Enderby Outside, 1968)

1. Chow S-C, Liu J-p. Design and Analysis of Bioavailability and Bioequivalence Studies. Boca Raton: Chapman & Hall / CRC Press; 3rd ed. 2009. p. 98.
2. Brown LD, Casella G, Hwang JTG. Optimal Confidence Sets, Bioequivalence, and the Limaçon of Pascal. J Am Stat Assoc. 1995;90(431):880–9. doi:10.2307/2291322. Open access.
3. Berger RL, Hsu JC. Bioequivalence Trials, Intersection–Union Tests and Equivalence Confidence Sets. Stat Sci. 1996;11(4):283–319. doi:10.1214/ss/1032280304. Open access.

Dif-tor heh smusma 🖖
Helmut Schütz The quality of responses received is directly proportional to the quality of the question asked. 🚮
Science Quotes
d_labes
★★★

Berlin, Germany,
2020-12-05 16:19
(319 d 03:39 ago)

@ Helmut
Posting: # 22117
Views: 2,046

## CI inclusion ≠ TOST

Dear Helmut,

if the CI inclusion rule is really something different from TOST, why do we calculate power / sample size based on TOST?

Regards,

Detlew
ElMaestro
★★★

Denmark,
2020-12-05 16:49
(319 d 03:09 ago)

@ d_labes
Posting: # 22118
Views: 2,056

## CI inclusion ≠ TOST

Haha,

» if the CI inclusion rule is really something different from TOST, why do we calculate power / sample size based on TOST?

d_labes, you beat me to it; I was going to ask a similar question. If I am getting it right, this isn't about whether your product passes the test for BE or not; it is purely a matter of what you call it. Since semantics is now of such importance, I believe PowerTOST needs to be renamed.

Now someone kindly define robust and robustness for me so that I understand it. And please tell me how to use that definition to make a simulation robust enough that it convinces the WHO. Or, more generally, if I want to present an argument to a WHO regulator (not a simulation but just an argument which may or may not be based on simulation), how will I know a way to make my argument robust?

Pass or fail!
ElMaestro
d_labes
★★★

Berlin, Germany,
2020-12-07 11:39
(317 d 08:19 ago)

@ ElMaestro
Posting: # 22121
Views: 1,998

## CI inclusion ≠ TOST

Dear ElMaestro,

» If I am getting it right this isn't about whether your product passes the test for BE or not, it is purely a matter relating to what you call it. Since semantics is now of such importance, I believe PowerTOST needs to be renamed.

I can't understand what the poet is trying to tell us with that.

Regards,

Detlew
Helmut
★★★  Vienna, Austria,
2020-12-06 22:09
(317 d 21:49 ago)

@ d_labes
Posting: # 22119
Views: 2,013

## CI inclusion ≠ TOST

Dear Detlew,

» if the CI inclusion rule is really something different from TOST, why do we calculate power / sample size based on TOST?

Because you decided to baptize the package PowerTOST and not “Eierlegende Wollmilchsau” back in 2009. Granted, Donald used the phrase “operationally identical” on p.661 (right column, 2nd paragraph).

However, for me (!) those are two different “operations”. Results of an example:

```
theta1 = 0.8000
theta2 = 1.2500
alpha  = 0.05
90% CI: lower CL = 0.8448
        upper CL = 1.1003
        CI within 0.8000 and 1.2500: passed BE
TOST  : p(<0.8000) = 0.01239
        p(>1.2500) = 0.001565
        p(<0.8000) <0.05 and p(>1.2500) <0.05: passed BE

Checks with library PowerTOST
CI.BE(alpha = 0.05, ...):
  lower   = 0.8448
  upper   = 1.1003
pvalues.TOST():
  p.left  = 0.01239
  p.right = 0.001565
```

```r
set.seed(123456789)
alpha  <- 0.05
theta0 <- 0.95
theta1 <- 0.80
theta2 <- 1.25
CV     <- 0.20
subj   <- 1:20
per    <- 1:2
seq    <- c("TR", "RT")
trt    <- c("T", "R")
# simulate a balanced 2x2 crossover (subjects 1-10 in TR, 11-20 in RT)
data   <- data.frame(subj = as.factor(rep(subj, each = 2)),
                     per  = as.factor(rep(per, 2)),
                     seq  = as.factor(rep(seq, each = length(subj))),
                     trt  = as.factor(c(rep(trt, length(subj)/2),
                                        rep(rev(trt), length(subj)/2))))
data$PK[data$trt == "R"] <- rlnorm(n = length(subj),
                                   meanlog = log(1)-0.5*log(CV^2+1),
                                   sdlog = sqrt(log(CV^2+1)))
data$PK[data$trt == "T"] <- rlnorm(n = length(subj),
                                   meanlog = log(theta0)-0.5*log(CV^2+1),
                                   sdlog = sqrt(log(CV^2+1)))
mod    <- lm(log(PK) ~ seq + subj + per + trt, data = data)
#########################
# CI inclusion (base R) #
#########################
pe     <- exp(coef(mod)[["trtT"]])
ci     <- exp(as.numeric(confint(mod, "trtT", level = 1-2*alpha)))
names(ci) <- c("lower", "upper")
if (ci[["lower"]] >= theta1 && ci[["upper"]] <= theta2) {
  BE.ci <- sprintf("CI within %.4f and %.4f: %s",
                   theta1, theta2, "passed BE")
}
if (ci[["lower"]] < theta1 && ci[["upper"]] <= theta2) {
  BE.ci <- sprintf("lower CL <%.4f, upper CL \u2264%.4f: %s",
                   theta1, theta2, "failed BE (inconclusive)")
}
if (ci[["lower"]] >= theta1 && ci[["upper"]] > theta2) {
  BE.ci <- sprintf("lower CL \u2265%.4f, upper CL >%.4f: %s",
                   theta1, theta2, "failed BE (inconclusive)")
}
if (ci[["lower"]] < theta1 && ci[["upper"]] > theta2) {
  BE.ci <- sprintf("lower CL <%.4f and upper CL >%.4f: %s",
                   theta1, theta2, "inequivalent")
}
#################
# TOST (base R) #
#################
s      <- sqrt(anova(mod)["Residuals", "Mean Sq"])  # residual SD (log scale)
nu     <- anova(mod)["Residuals", "Df"]             # residual degrees of freedom
n1     <- length(as.numeric(data$subj[data$seq == "TR"]))/2
n2     <- length(as.numeric(data$subj[data$seq == "RT"]))/2
muR    <- mean(log(data$PK[data$trt == "R"]))
muT    <- mean(log(data$PK[data$trt == "T"]))
delta  <- muT-muR
se     <- s*sqrt(0.5*(1/n1+1/n2))  # was 1/n2+1/n2; same value here since n1 = n2
t1     <- (delta-log(theta1))/se
t2     <- (log(theta2)-delta)/se
p.TOST <- c(p1 = pt(t1, df = nu, lower.tail = FALSE),
            p2 = pt(t2, df = nu, lower.tail = FALSE))
if (max(p.TOST) < alpha) {
  BE.TOST <- sprintf("p(<%.4f) <%.2g and p(>%.4f) <%.2g: %s",
                     theta1, alpha, theta2, alpha, "passed BE")
}
if (p.TOST[["p1"]] < alpha && p.TOST[["p2"]] >= alpha) {
  BE.TOST <- sprintf("p(<%.4f) <%.2g and p(>%.4f) \u2265%.2g: %s",
                     theta1, alpha, theta2, alpha, "failed BE (inconclusive)")
}
if (p.TOST[["p1"]] >= alpha && p.TOST[["p2"]] < alpha) {
  BE.TOST <- sprintf("p(<%.4f) \u2265%.2g and p(>%.4f) <%.2g: %s",
                     theta1, alpha, theta2, alpha, "failed BE (inconclusive)")
}
if (p.TOST[["p1"]] >= alpha && p.TOST[["p2"]] >= alpha) {
  BE.TOST <- sprintf("p(<%.4f) \u2265%.2g and p(>%.4f) \u2265%.2g: %s",
                     theta1, alpha, theta2, alpha, "inequivalent")
}
#####################
# library PowerTOST #
#####################
library(PowerTOST)
pt.ci <- CI.BE(alpha = alpha, pe = pe, CV = se2CV(s), n = c(n1, n2))
pt.p  <- pvalues.TOST(pe = pe, CV = se2CV(s), n = c(n1, n2),
                      theta1 = theta1, theta2 = theta2)
cat(paste0("\ntheta1 = ", sprintf("%.4f", theta1),
           "\ntheta2 = ", sprintf("%.4f", theta2),
           "\nalpha  = ", alpha, sprintf("\n%g%% CI: ", 100*(1-2*alpha)),
           "lower CL = ", sprintf("%.4f", ci["lower"]),
           "\n        upper CL = ", sprintf("%.4f", ci["upper"]),
           "\n        ", BE.ci,
           "\nTOST  : p(<", sprintf("%.4f) = %.4g", theta1, p.TOST[["p1"]]),
           "\n        p(>", sprintf("%.4f) = %.4g", theta2, p.TOST[["p2"]]),
           "\n        ", BE.TOST,
           "\n\nChecks with library PowerTOST",
           "\nCI.BE(alpha = ", alpha, ", ...):",
           "\n  lower   = ", sprintf("%.4f", pt.ci["lower"]),
           "\n  upper   = ", sprintf("%.4f", pt.ci["upper"]),
           "\npvalues.TOST():",
           "\n  p.left  = ", sprintf("%.4g", pt.p[["p.left"]]),
           "\n  p.right = ", sprintf("%.4g", pt.p[["p.right"]])), "\n")
```

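To see that the two “operations” lead to the same decision although the arithmetic differs, here is a minimal base-R sketch. The helper names `ci_decision()` and `tost_decision()`, the grid of point estimates and standard errors, and the 18 degrees of freedom are assumptions for illustration only; they are not part of the example above.

```r
# Sketch: BE decision by CI inclusion and by TOST, both on the log scale.
# delta = log of the point estimate, se = its standard error, nu = residual df
ci_decision <- function(delta, se, nu, theta1 = 0.80, theta2 = 1.25,
                        alpha = 0.05) {
  tcrit <- qt(1 - alpha, df = nu)
  lower <- exp(delta - tcrit * se)       # lower limit of the 100(1-2*alpha)% CI
  upper <- exp(delta + tcrit * se)       # upper limit
  lower >= theta1 && upper <= theta2
}

tost_decision <- function(delta, se, nu, theta1 = 0.80, theta2 = 1.25,
                          alpha = 0.05) {
  p1 <- pt((delta - log(theta1)) / se, df = nu, lower.tail = FALSE)
  p2 <- pt((log(theta2) - delta) / se, df = nu, lower.tail = FALSE)
  max(p1, p2) <= alpha                   # both Nulls rejected
}

# hypothetical grid of point estimates and standard errors
grid <- expand.grid(pe = seq(0.75, 1.30, by = 0.01),
                    se = seq(0.02, 0.20, by = 0.02))
nu   <- 18
ci   <- mapply(function(pe, se) ci_decision(log(pe), se, nu),
               grid$pe, grid$se)
tost <- mapply(function(pe, se) tost_decision(log(pe), se, nu),
               grid$pe, grid$se)
agree <- all(ci == tost)
cat("decisions agree on all", nrow(grid), "grid points:", agree, "\n")
```

The first function never computes a p-value and the second never computes a confidence limit, which is the point of calling them different operations.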
Dif-tor heh smusma 🖖
d_labes
★★★

Berlin, Germany,
2020-12-07 11:11
(317 d 08:47 ago)

@ Helmut
Posting: # 22120
Views: 1,982

## CI inclusion operationally identical to TOST

Dear Helmut,

» ...
» Given, Donald used the phrase “operationally identical” on p.661 (right column, 2nd paragraph).
»
» However, for me (!) those are two different “operations”. Results of an example:
» 90% CI: lower CL = 0.8448
»         upper CL = 1.1003
»         CI within 0.8000 and 1.2500: passed BE
» TOST  : p(<0.8000) = 0.01239
»         p(>1.2500) = 0.001565
»         p(<0.8000) <0.05 and p(>1.2500) <0.05: passed BE

Of course the two calculations are different, no doubt about it.

I have understood “operationally identical” always as the fact that TOST and CI inclusion give the same answer with regard to the BE decision.

IMHO this is the meaning of the paragraph on page 661 in Donald's famous paper containing “operationally identical”:
"The two one-sided tests procedure turns out to be operationally identical to the procedure of declaring equivalence only if the ordinary 1 - 2α (not 1-α) confidence interval for µT-µR is completely contained in the equivalence interval [θ1, θ2]".
Emphasis by me.
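
A short sketch of why the decisions coincide, in notation that is assumed here and not taken from Donald's paper: $$\small{\hat{\delta}}$$ is the estimated difference on the log scale, $$\small{s_\delta}$$ its standard error, $$\small{\nu}$$ the residual degrees of freedom, and the limits are written as $$\small{\log\theta_1,\,\log\theta_2}$$ to match Helmut's ratio formulation.

$$\small{\begin{aligned}
\text{reject }H_\textrm{0L}\text{ at level }\alpha&\iff\frac{\hat{\delta}-\log\theta_1}{s_\delta}>t_{1-\alpha,\nu}\iff\hat{\delta}-t_{1-\alpha,\nu}\,s_\delta>\log\theta_1\\
\text{reject }H_\textrm{0U}\text{ at level }\alpha&\iff\frac{\log\theta_2-\hat{\delta}}{s_\delta}>t_{1-\alpha,\nu}\iff\hat{\delta}+t_{1-\alpha,\nu}\,s_\delta<\log\theta_2
\end{aligned}}$$

The right-hand sides are the lower and upper limits of the 100(1–2α)% CI on the log scale. Both Nulls are rejected exactly when that CI lies inside the equivalence interval, although one route reports p-values and the other reports confidence limits.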

Regards,

Detlew
Helmut
★★★  Vienna, Austria,
2020-12-07 13:02
(317 d 06:56 ago)

@ d_labes
Posting: # 22122
Views: 1,980

Dear Detlew,

» Of course the two calculations are different, no doubt about it.

Like $$\small{2+2+2+2=2\times4=2^3=8}$$. Different calculations, same result.

» I have understood “operationally identical” always as the fact that TOST and CI inclusion give the same answer with regard to the BE decision.

Acc. to Berger and Hsu not sure about the “always”. But that’s another story and of historical interest only.

» IMHO this is the meaning of the paragraph on page 661 in Donald's famous paper containing “operationally identical”:
» "The two one-sided tests procedure turns out to be operationally identical to the procedure of declaring equivalence only if the ordinary 1 - 2α (not 1-α) confidence interval for µT-µR is completely contained in the equivalence interval [θ1, θ2]".
» Emphasis by me.

Correct. Also in Chow and Liu (p.98):

The two one-sided t tests procedure is operationally equivalent to the classic (shortest) confidence interval approach; that is, if the classic (1–2α)100% confidence interval for μT/μR is within (θL, θU), then both H01 and H02 are also rejected at the α level by the two one-sided t tests procedure.

Coming back to the WHO’s rant:

The calculation of the 90% confidence interval (CI) of the mean test/comparator ratio for the primary PK parameters should not be confused with the two one-sided t-tests employed to reject the null hypothesis of non-equivalence. The end result is the same, but these are not the same calculations.

IMHO, they are just fed up reading “TOST” whilst the CI inclusion approach acc. to the GL was actually performed.
• Strict TOST (p-values only): Deficiency letter “Where is the CI?”
• Claiming TOST, reporting the CI: Sloppy terminology, copy&paste from the dark ages?

Dif-tor heh smusma 🖖
ElMaestro
★★★

Denmark,
2020-12-07 14:54
(317 d 05:04 ago)

@ Helmut
Posting: # 22123
Views: 1,901

Will they be cracking down on spelling errors too, you think?

Pass or fail!
ElMaestro
d_labes
★★★

Berlin, Germany,
2020-12-07 15:35
(317 d 04:23 ago)

@ Helmut
Posting: # 22124
Views: 1,873

Dear Helmut,

» Coming back to the WHO’s rant:

The calculation of the 90% confidence interval (CI) of the mean test/comparator ratio for the primary PK parameters should not be confused with the two one-sided t-tests employed to reject the null hypothesis of non-equivalence. The end result is the same, but these are not the same calculations.

» IMHO, they are just fed up reading “TOST” whilst the CI inclusion approach acc. to the GL was actually performed.

Totally correct to lament about that fact, I think. It should be described unequivocally in the protocol or the SAP which calculations will be done. The CI approach will be the favorite, I think. It is requested in all guidelines about BE studies, if I don't err.

Regards,

Detlew
Helmut
★★★  Vienna, Austria,
2020-12-22 19:00
(302 d 00:58 ago)

@ Ohlbe
Posting: # 22155
Views: 1,667

## WHO frequent deficiencies in protocols and reports

Dear Ohlbe,

» The WHO has published a 7-page document listing frequent deficiencies in BE protocols, reports and practices.
»
» Helmut will be happy with at least one recommendation: no need to calculate and report post-hoc power.

Oh yes, indeed. More in detail (my comments follow each quoted item):
• The test and comparator products should be administered under the usual conditions of use. Therefore, it is not acceptable to administer the products under yellow monochromatic light to avoid degradation; normal lighting conditions should be employed.
That’s extremely funny. Even if the drug is susceptible to hν-induced degradation (say, nitrendipine), what’s the penetration-depth into the formulation? A few microns? Which amount of the total API is contained in this volume and what’s the time of exposure? Less than a minute?
I have seen such procedures in protocols, though they show not only a complete absence of scientific thinking but also a lack of common sense.

• SAMPLING TIMES
• Sample collection after 72 hours is not necessary.
Hhm, for IR yes but for controlled release?
• The wash out period should not be excessively large compared with 5 times the largest expected half-life.
Why not? In crossover studies subjects have to be in the same physiological state in later periods as in the first. Some drugs are auto-inducers or -inhibitors. The t½ of the drug after a single dose tells only half of the story. OK, what is meant by “excessively large”? Possibly a long washout was intentional and for a reason.

• PHARMACOKINETIC ANALYSIS
• The protocol should indicate the software to be used for pharmacokinetic calculations as well as the trapezoidal method employed for AUC calculation.
Thank you very much! It’s high time to abandon the linear trapezoidal rule.
• Any below LLOQ value(s), including those between two valid concentration values, should be reported as zero.
What the heck?
• For drugs with long half-life, AUC truncated at 72 h could be used. However, in the event of a missing sample at 72 hours, that profile should be excluded. In the case of a 2×2 design, this implies the exclusion of all AUC data for that subject.
That’s bad because information is lost. What about using the estimated concentration at 72 hours (i.e., pAUC0–72 available in software) or AUC0–tlast(Common)?
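
Since the protocol is expected to name the trapezoidal method, a minimal base-R sketch of the two usual candidates may help to show the difference. The function names `auc_linear()` and `auc_linlog()` and the concentration profile are hypothetical; validated NCA software will handle more special cases (e.g., equal or zero concentrations) than this sketch does.

```r
# Linear trapezoidal rule: arithmetic mean of adjacent concentrations
auc_linear <- function(t, C) {
  sum(diff(t) * (head(C, -1) + tail(C, -1)) / 2)
}

# Linear-up / log-down: linear trapezoid while concentrations rise or stay
# equal, logarithmic trapezoid while they decline (both values must be > 0)
auc_linlog <- function(t, C) {
  dt   <- diff(t)
  C1   <- head(C, -1)
  C2   <- tail(C, -1)
  down <- C2 < C1 & C1 > 0 & C2 > 0
  seg  <- dt * (C1 + C2) / 2                        # linear segments
  seg[down] <- dt[down] * (C1[down] - C2[down]) /
               log(C1[down] / C2[down])             # logarithmic segments
  sum(seg)
}

# hypothetical profile: one compartment, first-order absorption
t <- c(0, 0.5, 1, 2, 4, 8, 12, 24)
C <- 100 * (exp(-0.1 * t) - exp(-1.5 * t))
print(c(linear = auc_linear(t, C), linlog = auc_linlog(t, C)))
```

On the declining part of a profile the linear rule connects points on a convex curve by straight lines and therefore gives the larger area; for a purely mono-exponential decline the logarithmic rule reproduces the integral between the sampling times.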

• SAMPLE SIZE CALCULATION
ESTIMATION, if you don’t mind.
• The sample size calculation for a replicate design with widening of the acceptance range should be calculated as described by Tothfalusi and Endrenyi in “Sample Sizes for Designing Bioequivalence Studies for Highly Variable Drugs”. J Pharm Pharmaceut Sci 15(1) 73-84, 2012.
What else?
The conventional methods for replicate designs do not take into account the impact of acceptance range widening on the sample size calculation.
Oops, that’s a “frequent deficiency”? Oh dear! Don’t forget the upper cap of scaling (50%) and the point estimate restriction (within 80.00–125.00%). Use PowerTOST’s function sampleN.scABEL() or simulate a couple of hours in SAS or MatLab.
• The sample size calculation for 2×2 cross-over designs or parallel designs are often not justified adequately or presented with sufficient detail. […] the expected inter-subject variability for parallel designs or intra-subject variability for the cross-over designs.
Picky: Should read ‘total variability for parallel designs’. The inter-subject variability is not accessible in a previous study in a parallel design.
• Sample size is sometimes calculated to detect a difference between treatments instead of being based on a calculation aimed to show equivalence.
Unbelievable – the FDA’s infamous ‘power approach’ of 1972!
• In case of a parallel design, it is extremely important to obtain balanced groups in all demographic characteristics that might impact the pharmacokinetics of the drugs. The methods employed to ensure balanced groups are generally not described in the protocols.
Cannot agree more. Don’t forget to pheno-/genotype subjects for drugs showing polymorphic metabolism and include extensive metabolizers only.
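
To make “a calculation aimed to show equivalence” concrete, here is a minimal base-R sketch for a 2×2 crossover. It uses the noncentral t approximation of the power of the equivalence decision and assumes balanced sequences; the function names `power_be()` and `sample_size_be()` are hypothetical. For real work PowerTOST's `sampleN.TOST()` would be the obvious choice; this sketch only shows which quantities enter the estimation.

```r
# Approximate power of the equivalence decision in a 2x2 crossover
# (noncentral t approximation, n = total number of subjects, balanced)
power_be <- function(n, CV, theta0 = 0.95, theta1 = 0.80, theta2 = 1.25,
                     alpha = 0.05) {
  s2    <- log(CV^2 + 1)          # within-subject variance on the log scale
  se    <- sqrt(2 * s2 / n)       # standard error of the T-R difference
  nu    <- n - 2                  # residual degrees of freedom
  tcrit <- qt(1 - alpha, df = nu)
  pw <- pt(-tcrit, df = nu, ncp = (log(theta0) - log(theta2)) / se) -
        pt( tcrit, df = nu, ncp = (log(theta0) - log(theta1)) / se)
  max(0, pw)
}

# Smallest even n reaching the target power
sample_size_be <- function(CV, target = 0.80, ...) {
  n <- 4
  while (power_be(n, CV, ...) < target) n <- n + 2   # keep sequences balanced
  n
}

n <- sample_size_be(CV = 0.20)
print(c(n = n, power = power_be(n, CV = 0.20)))
```

Note that the assumed T/R-ratio `theta0`, the limits and the CV all enter; a calculation set up to detect a difference between treatments has no place for the limits at all.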

• STATISTICAL ANALYSIS
• Due to the statistical complexity of the alpha level expenditure in two-stage bioequivalence cross-over trials, two stage designs are not encouraged and, if used, the design should be as simple as possible e.g., with equal sizes in both stages.
Sorry to say, but that’s just crap. Generally sponsors design TSDs in such a way that the chance to demonstrate BE already in the first stage is reasonably high and the – optional – second stage acts as a kind of “safety net”. That also implies that n1 > n2. Maybe they had a classical group-sequential design with fixed total sample size and one interim at N/2 in mind. Boring.
The Applicant should demonstrate that the consumer risk is not inflated above 5% with the proposed design and alpha expenditure rule, taking into account that simulations are not considered sufficiently robust and analytical solutions are preferred.
• Fine. An exact method exists for 2×2 crossovers (the ‘Inverse Normal Combination Method’ with either the ‘Standard Combination Test’ or the ‘Maximum Combination Test’).
• For parallel designs we have only simulation-based methods.
• Replicate designs / reference-scaling? Belongs to the realm of Science Fiction.
• The statistical procedure should be conducted without imputing values to the missing observations.
Why not? Regardless which trapezoidal method we use in NCA, missing concentrations are interpolated.
• In those cases where the subjects are recruited and treated in groups, it is appropriate to investigate the statistical significance of the group-by-formulation interaction e.g., with the following ANOVA model: Group, Sequence, Formulation, Period (nested within Group), Group-by-Sequence interaction, Subject (nested within Group*Sequence) and Group-by-Formulation interaction. […]
That’s copypasted from a yellowed FDA deficiency-letter. Complete nonsense and – like any pretest – inflates the Type I Error.
• The a posteriori power of the study does not need to be calculated.
I would prefer: “… must not be calculated.”
• It is not necessary to calculate the non-parametric 90% CI of Tmax. A numerical comparison of the median values and its range is considered sufficient.
Using the capital letter T (SI for the absolute temperature in Kelvin) is forgiven. However, a numerical comparison of the range of tmax is crap.

• EXCLUSION OF DATA
• In order to exclude the pharmacokinetic results of those subjects who vomit during the study, the protocol should define in hours the value of two times median Tmax (as documented in the literature)…
Regrettably quite often the innovator gives the arithmetic mean…
I would add for controlled release: Within the entire intended dosing interval.

» I expect some others to stir some discussions here (e.g. 2-stage design)

Correct. Not only that.

Dif-tor heh smusma 🖖
ElMaestro
★★★

Denmark,
2020-12-23 08:12
(301 d 11:46 ago)

@ Helmut
Posting: # 22156
Views: 1,592

## "obtain balanced groups" vs randomisation

Hi all,
»
• In case of a parallel design, it is extremely important to obtain balanced groups in all demographic characteristics that might impact the pharmacokinetics of the drugs. The methods employed to ensure balanced groups are generally not described in the protocols.

This is a very tricky way to write something for clarification.
It is like they give you a choice - Do you want to randomise or do you want to "obtain balanced groups" ? On average (whatever the hell that means) plain randomisation assures balance, but randomisation does not in any individual case guarantee balance. Perhaps they meant to combine stratification with randomisation. Or what?

Had they written "it is extremely important to aim for balanced groups" then I'd get the point. But that would not change anything from today's practice, as all CROs to the best of my knowledge are randomising.
What is it they really tried to clarify? And how do I comply?
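
To put a number on “randomisation does not in any individual case guarantee balance”, here is a small base-R simulation sketch. The study size, the single binary covariate (sex, half females) and the seed are all assumptions for illustration; the second part shows one possible reading of “stratification combined with randomisation”, not necessarily what the WHO had in mind.

```r
# Sketch: how often does simple randomisation split a binary covariate
# unevenly between the two groups of a parallel design?
set.seed(42)                                   # arbitrary seed
n_total <- 40                                  # hypothetical study size
sex     <- rep(c("f", "m"), each = n_total / 2)
imbalance <- replicate(10000, {
  grp <- sample(rep(c("T", "R"), each = n_total / 2))   # simple randomisation
  abs(sum(sex[grp == "T"] == "f") - sum(sex[grp == "R"] == "f"))
})
print(mean(imbalance == 0))   # share of perfectly balanced allocations
print(mean(imbalance >= 4))   # share differing by four or more females

# Stratified alternative: randomise separately within each sex
grp_strat <- character(n_total)
for (s in unique(sex)) {
  idx <- which(sex == s)
  grp_strat[idx] <- sample(rep(c("T", "R"), length.out = length(idx)))
}
print(table(sex, grp_strat))
```

Under these assumptions only about a quarter of the simple randomisations come out perfectly balanced, whereas the stratified scheme is balanced by construction, at least for the one covariate that was stratified on.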

Pass or fail!
ElMaestro