Bioequivalence and Bioavailability Forum

Helmut
★★★

Vienna, Austria,
2014-07-21 04:04
(3950 d 07:47 ago)

Posting: # 13279
Views: 14,616

EMA: α-inflation – the suspicion begins to mount [RSABE / ABEL]

Dear all,

maybe some of you remember this ~~interesting~~ scary discovery of Detlew. This week a paper was published by a German group* confirming his findings. Appetizers:

Surprisingly, when ‘scaled’ bioequivalence limits were set as bioequivalence limits, the highest rejection rate observed was at the lowest variability investigated. With CV_ANOVA of 30%, it was 7.05%, and it was still 5.39% with a variability of 40%. Therefore, taking into account a simulation error of roughly 0.5%, and the fact that our simulations are based on uncorrelated data and do not consider interindividual variabilities, one may doubt that an α-error of 5% is controlled even with the pre-set ‘scaled’ limits, at least for variabilities close to the cut-off point of CV_ANOVA of 30%.

Table II Rejection Rate After 10000 Simulations […] and Empirical α-Error Rate at the ‘Scaled’ BE Limits […] According to EMA with Increasing Intraindividual Variability

CV_ANOVA [%]    N    GMR    Empirical α-error rate [%]
────────────────────────────────────────────────
     30        22   1.250   7.05 *
     35        25   1.295   5.58 *
     40        27   1.340   5.39 *
     45        27   1.386   4.25
     50        28   1.432   3.51
────────────────────────────────────────────────

(* significantly >0.05; my addition)
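For orientation: the GMR column is the upper ‘scaled’ limit at each CV, because the type I error is assessed with the true ratio sitting exactly on the acceptance limit. Below is a minimal base-R sketch of that relationship, assuming EMA's regulatory constant k = 0.760, the switch at a CV of 30%, and the cap at 50%. The helper name abel_upper is made up for illustration; in practice PowerTOST's scABEL() does this job.

```r
# Hypothetical helper (not part of PowerTOST): upper ABEL limit as a
# function of the within-subject CV of the reference (given as a ratio).
abel_upper <- function(CV, k = 0.760, CV_switch = 0.30, CV_cap = 0.50) {
  s_wR <- sqrt(log(pmin(CV, CV_cap)^2 + 1))    # CV -> SD on the log scale
  ifelse(CV <= CV_switch, 1.25, exp(k * s_wR)) # conventional limit up to 30%
}

CV     <- seq(0.30, 0.50, 0.05)
limits <- data.frame(CV = 100 * CV, upper = round(abel_upper(CV), 3))
print(limits, row.names = FALSE)
```

Rounded to three decimals this should agree with the GMR column of Table II.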

Try this code:

    library(PowerTOST)
    CV  <- seq(30, 50, 5)
    res <- data.frame(CV = CV, N = c(22, 25, 27, 27, 28),
                      GMR = scABEL(CV/100)[, "upper"], pBE = NA, sig = "",
                      stringsAsFactors = FALSE)
    for (j in seq_along(CV)) {
      res$pBE[j] <- round(100*power.scABEL(CV = CV[j]/100, theta0 = res$GMR[j],
                                           n = res$N[j], design = "2x3x3",
                                           nsims = 1e6), 2)
    }
    sig <- binom.test(0.05*1e6, 1e6, alternative = "less")$conf.int[[2]]
    res$sig[res$pBE/100 > sig] <- "*"
    names(res)[5] <- ""
    print(res, row.names = FALSE)

Do these results look familiar?

     CV  N      GMR  pBE
     30 22 1.250000 6.88 *
     35 25 1.294796 5.42 *
     40 27 1.340165 5.04 *
     45 27 1.385915 4.34
     50 28 1.431910 3.32

If you are planning a study for evaluation with EMA’s ABEL-method think about it.

* Wonnemann M, Frömke C, Koch A. Inflation of the Type I Error: Investigations on Regulatory Recommendations for Bioequivalence of Highly Variable Drugs. Pharm Res. 31 (preprint published 18 July 2014). doi:10.1007/s11095-014-1450-z.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz

The quality of responses received is directly proportional to the quality of the question asked. 🚮
Science Quotes

ElMaestro
★★★

Denmark,
2014-07-21 11:24
(3950 d 00:26 ago)

@ Helmut
Posting: # 13281
Views: 13,061

EMA: α-inflation – the suspicion begins to mount

Hi Hötzi,

d_labes should publish a commentary in the journal. I think his figures were sort of more telling weren't they? By the way, one of the authors used to be a member of the PK subgroup back in the day.

—
Pass or fail!
ElMaestro

Helmut
★★★

Vienna, Austria,
2014-07-21 18:14
(3949 d 17:37 ago)

@ ElMaestro
Posting: # 13284
Views: 13,172

Why simulate a simple AB|BA at all?

Hi ElMaestro,

❝ d_labes should publish a commentary in the journal. I think his figures were sort of more telling weren't they?

Yes. Given the unpleasant experience we had last year submitting a letter to the journal in question, I’m not sure whether he will risk the effort…

BTW, I have some mixed feelings about the paper. I don’t have the slightest idea why the authors simulated the rejection rate for a conventional 2×2 cross-over. Power can be directly calculated for any given combination of α, CV, GMR, and N. One should never ever get anything >0.05! So where does this hump at N>30 in Fig. 1 come from?
Try:

    library(PowerTOST)
    n   <- seq(8, 100, 2)                  # total sample sizes
    CV  <- c(10, 20, seq(30, 55, 5))       # CVs in percent
    pBE <- vector("numeric", length=length(n))
    clr <- colorRampPalette(c("blue", "red"))(length(CV))
    for(j in seq_along(CV)) {
      for(k in seq_along(n)) {
        # exact probability of passing BE with the true ratio at the upper limit
        pBE[k] <- power.TOST(CV=CV[j]/100, theta0=1.25, n=n[k])
        if(j == 1 && k == length(n)) {     # open the plot once, for the first CV
          plot(n, pBE, type="l", ylim=c(0, 0.05), las=1, lwd=2, col=clr[j])
          abline(h=0.05, lty=3)            # nominal alpha
        }
      }
      lines(n, pBE, type="l", lwd=2, col=clr[j])
      text(n[j], pBE[j], labels=paste0(CV[j],"%"))  # label each curve with its CV
    }

Another point is the number of simulations. We know how slowly these kinds of sim’s converge. 5,000 for power and 10,000 for the empiric α are too low by a factor of 100.
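To put a number on the convergence issue, here is a rough base-R sketch that simulates the rejection rate of a conventional 2×2 cross-over with the true ratio at 1.25 and sets it beside the binomial standard error of an estimate near 0.05. It assumes a balanced design, a difference of means that is normal with variance 2σ²w/n, and a residual variance that follows a scaled χ² with n − 2 degrees of freedom. The function names tie_sim and mc_se, and the choice n = 24 with CV 30%, are purely illustrative.

```r
# Simulated rejection rate of TOST in a balanced 2x2 cross-over with the
# true ratio on the upper limit (sketch; not PowerTOST's algorithm).
tie_sim <- function(nsims, n = 24, CV = 0.30, theta0 = 1.25) {
  s2w <- log(CV^2 + 1)                              # within-subject variance
  df  <- n - 2
  pe  <- rnorm(nsims, log(theta0), sqrt(2 * s2w / n))  # point estimates
  s2  <- s2w * rchisq(nsims, df) / df               # estimated variances
  hw  <- qt(0.95, df) * sqrt(2 * s2 / n)            # half-width of the 90% CI
  mean(pe - hw >= log(0.80) & pe + hw <= log(1.25))
}

# Standard error of a simulated proportion close to p
mc_se <- function(nsims, p = 0.05) sqrt(p * (1 - p) / nsims)

set.seed(123456)
nsims  <- c(5e3, 1e4, 1e6)
res_mc <- data.frame(nsims     = nsims,
                     empirical = sapply(nsims, tie_sim),
                     mc_se     = mc_se(nsims))
print(res_mc, row.names = FALSE)
```

With 10⁴ simulations the standard error is roughly 0.0022, so an estimate can drift a few tenths of a percentage point away from the true value by chance alone. At 10⁶ it shrinks by a factor of ten.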

❝ […] one of the authors used to be a member of the PK subgroup back in the day.

Interesting.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


ElMaestro
★★★

Denmark,
2014-07-21 22:50
(3949 d 13:01 ago)

@ Helmut
Posting: # 13285
Views: 13,025

Why simulate a simple AB|BA at all?

Hi Hötzi,

❝ Yes. Given the unpleasant experience we had last year submitting a letter to the journal in question, I’m not sure whether he will risk the effort…

Are we talking about Detlew der Hosenscheisser or Detlew the Conqueror?

❝ BTW, I have some mixed feelings about the paper. I don’t have the slightest idea why the authors simulated the rejection rate for a conventional 2×2 cross-over. Power can be directly calculated for any given combination of α, CV, GMR, and N. One should never ever get anything >0.05! So where does this hump at N>30 in Fig. 1 come from?

Hmmmm that's a relevant question. Perhaps they just wanted to demonstrate internal validity of the sim algos? I also use GraphPad Prism for graphing, though an earlier version. It does not stack or offset curves, so they really do seem to have gotten a little hump of sorts.

One conclusion to draw is that the whole scaling business might not be that smart when it comes to type I errors. At the more general level another proposal is that before a group of guideline authors agree on a new requirement they should investigate the requirement by specific simulations rather than just follow their instincts or adopt concepts that were made for a different null hypothesis ("0.0294" .... do I need to say more?).

—
Pass or fail!
ElMaestro

Helmut
★★★

Vienna, Austria,
2014-07-22 01:13
(3949 d 10:37 ago)

@ ElMaestro
Posting: # 13286
Views: 13,142

Fancy smoothing?

Hi ElMaestro,

❝ Hmmmm that's a relevant question. Perhaps they just wanted to demonstrate internal validity of the sim algos? I also use GraphPad Prism for graphing, though an earlier version. It does not stack or offset curves, so they really do seem to have gotten a little hump of sorts.

Below a comparison: Calculated, 10⁴, and 10⁶ sim’s. The red dotted line on top is the significance limit (0.05373 for 10⁴ and 0.05036 for 10⁶).

[image]
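The significance limits quoted above can be reproduced in base R. The sketch uses the same binom.test() construct as the code in the first post: the upper one-sided 95% Clopper-Pearson limit for an observed rate exactly equal to α. Only the wrapper name sig_limit is an addition for illustration.

```r
# Upper one-sided 95% confidence limit of a simulated rate that equals alpha.
# An empirical type I error above this limit is flagged as significantly
# larger than alpha.
sig_limit <- function(nsims, alpha = 0.05) {
  binom.test(round(alpha * nsims), nsims, alternative = "less")$conf.int[2]
}

lims <- sapply(c(1e4, 1e6), sig_limit)
names(lims) <- c("1e4 sims", "1e6 sims")
print(round(lims, 5))
```

The limit tightens as the number of simulations grows, which is why a value that is only borderline with 10⁴ sim’s can become clearly significant with 10⁶.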

I tried also to repeat Japan’s bizarre method (Fig. 2 of the paper):

[image]

A lot of ‘noise’, but essentially I could reproduce the reported inflation (converging at ~7.5%). Inflation with this method is a textbook example (unadjusted multiple testing). I’m happy that finally someone demonstrated it. The inflation is what to expect from pooling two groups, where the size of the second one is 50% of the first: 0.05+0.05/2=0.075… Voilà.
For my experiences in 日本 see the end of this post. :lookaround:
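The back-of-the-envelope 0.05 + 0.05/2 can be checked with a crude simulation. The sketch below is a deliberately simplified model, not the guidance's analysis and not what power.2stage.GS() does. It assumes each subject contributes one log-difference (no period or study effects), the add-on is always half the initial size, no futility rule applies, and the pooled data are tested again at an unadjusted α of 0.05. The settings n1 = 48 and CV 20% are arbitrary.

```r
set.seed(123456)
nsims  <- 2e4
n1     <- 48                       # initial study (arbitrary choice)
n2     <- n1 / 2                   # add-on: half of the initial study
CV     <- 0.20
theta0 <- 1.25                     # true ratio on the upper BE limit
sd_d   <- sqrt(2 * log(CV^2 + 1))  # SD of within-subject log-differences

# TRUE where the 90% CI lies entirely within 80-125%; one study per row
be_pass <- function(d) {
  n  <- ncol(d)
  m  <- rowMeans(d)
  s  <- sqrt(rowSums((d - m)^2) / (n - 1))
  hw <- qt(0.95, n - 1) * s / sqrt(n)
  (m - hw >= log(0.80)) & (m + hw <= log(1.25))
}

d1 <- matrix(rnorm(nsims * n1, log(theta0), sd_d), nrow = nsims)
d2 <- matrix(rnorm(nsims * n2, log(theta0), sd_d), nrow = nsims)

pass1 <- be_pass(d1)               # initial study alone
pass2 <- be_pass(cbind(d1, d2))    # pooled data after the add-on

tie_initial <- mean(pass1)         # single test
tie_addon   <- mean(pass1 | pass2) # pass at either look
print(c(initial = tie_initial, with_addon = tie_addon))
```

Under these assumptions the single test should stay near 0.05, while passing at either look should land clearly above it, in the region of the inflation discussed above.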

❝ One conclusion to draw is that the whole scaling business might not be that smart when it comes to type I errors.

If EMA’s method is concerned, yes.

❝ At the more general level another proposal is that before a group of guideline authors agree on a new requirement they should investigate the requirement by specific simulations rather than just follow their instincts or adopt concepts that were made for a different null hypothesis […]

Exactly. It’s not only statistics which sucks. The PK group single-handedly invented a bunch of PK metrics whose relevance and sensitivity to formulation differences are not supported by a single publication. What the heck is AUC_τ/2? We had three conferences on the MR GL, hundreds of pages of comments… Guess what? Felt like talking to a brick wall.

❝ […] ("0.0294" .... do I need to say more?).

No.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


d_labes
★★★

Berlin, Germany,
2014-07-22 10:43
(3949 d 01:08 ago)

@ Helmut
Posting: # 13289
Views: 12,947

Arbitrary smoothing?

Dear Helmut,

❝ A lot of ‘noise’, but essentially I could reproduce the reported inflation (converging at ~7.5%). Inflation with this method is a textbook example (unadjusted multiple testing). I’m happy that finally someone demonstrated it. The inflation is what to expect from pooling two groups, where the size of the second one is 50% of the first: 0.05+0.05/2=0.075… Voilà.

Where does the tremendous noise in your pictures come from?

And Fig. 2 (as well as others) of the paper is with tremendous (arbitrary?) smoothing? Crossing of some of the curves seems to point in that direction. The plot form (continuous curves, without the points they have simulated) itself is suspicious IMHO.

BTW: Do you see any need to create a power.addon() in PowerTOST? Pedagogical?

—
Regards,

Detlew

Helmut
★★★

Vienna, Austria,
2014-07-22 15:31
(3948 d 20:20 ago)

@ d_labes
Posting: # 13290
Views: 13,024

Noise…

Dear Detlew,

❝ Where does the tremendous noise in your pictures come from?

Quick and dirty as ever; I set a different seed for every simulation. Looks better if keeping the same one:

[image]

❝ And Fig. 2 (as well as others) of the paper is with tremendous (arbitrary?) smoothing? Crossing of some of the curves seems to point in that direction. The plot form (continuous curves, without the points they have simulated) itself is suspicious IMHO.

Agree.

❝ BTW: Do you see any need to create a power.addon() in PowerTOST? Pedagogical?

I guess enthusiasts could already misuse* the – experimental – function power.2stage.GS() as I did. I tried alpha=c(0.05, 0.05) and n=c(n, n/2+(n/2)%%2). The modulo in the second part of the n-vector rounds up to the next even number because I don’t like imbalanced studies. Example: n = 18, n/2 = 9, n/2+(n/2) mod 2 = 10.
power.addon() would be nice, of course. ;-)
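The rounding trick is easy to check in isolation. This is a tiny sketch, assuming the initial n is even. The helper name addon_n is made up here and has nothing to do with PowerTOST.

```r
# Add-on size: half of the initial n, rounded up to the next even number.
# Assumes the initial n is itself even.
addon_n <- function(n) n / 2 + (n / 2) %% 2

n1    <- c(12, 18, 20, 22, 24)
sizes <- data.frame(n1 = n1, n2 = addon_n(n1))
print(sizes, row.names = FALSE)
```

For n = 18 this gives 10, as in the example above. For n = 20 the half is already even and stays at 10.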

BTW, the Japanese guidance states:

The add-on subject study should include at least one half of the number of subjects in the initial study.

I overlooked that until today (believing that n₂ is fixed at n₁/2). This opens the door to an infinite number of designs (adjusting for the observed CV, any power, even fully adaptive). Glad that I never dealt with submissions to Japan so far.

Due to the futility criterion in this function there are cases where the study stops after the first part. Anyhow, I saw inflation of ~7.55%.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


d_labes
★★★

Berlin, Germany,
2014-07-22 18:06
(3948 d 17:45 ago)

@ Helmut
Posting: # 13291
Views: 12,993

TSD Japonica…

Dear Helmut,

❝ ❝ BTW: Do you see any need to create a power.addon() in PowerTOST? Pedagogical?

❝

❝ I guess enthusiasts could already misuse the – experimental – function power.2stage.GS() as I did.

Clever, clever such enthusiasts! :cool:

Although not quite clear to me if this meets the "Addonionsis Japonica" in all respects.
At least it is not what Wonnemann et al. simulated. No futility criterion was used by them if I read the paper correctly (see below what power.2stage.GS() does).

❝ BTW, the Japanese guidance states:

❝ The add-on subject study should include at least one half of the number of subjects in the initial study.

❝ I overlooked that until today (believing that n₂ is fixed at n₁/2). This opens the door to an infinite number of designs (adjusting for the observed CV, any power, even fully adaptive).

Seems the story goes further (just before your quote):
"If bioequivalence cannot be demonstrated because of an insufficient number, an add-on subject study can be performed ..."

If one reads this, it may call for some criteria for what an insufficient number is. Knowing Potvin et al. TSD's this may be interpreted as a call for some power calculation step after 'stage 1'.

In power.2stage.GS() the criterion for continuing to stage 2 is:
- not BE in the first stage
- and result was not futile (via a PE or CI criterion)

—
Regards,

Detlew

Helmut
★★★

Vienna, Austria,
2014-07-22 19:50
(3948 d 16:01 ago)

@ d_labes
Posting: # 13292
Views: 12,959

TSD Japonica…

Dear Detlew,

❝ […] not quite clear to me if this meets the "Addonionsis Japonica" in all respects.

No, it doesn’t. That’s why I called it misuse. Thoughtlessly I edited my post not being aware of your answer already posted. I’ve set the futility criterion to extreme values, but I was aware that still some studies might simply stop.

❝ At least it is not what Wonnemann et al. simulated. No futility criterion was used by them if I read the paper correctly […]

Yes, you do.

❝ Seems the story goes further (just before your quote):

❝ "If bioequivalence cannot be demonstrated because of an insufficient number, an add-on subject study can be performed ..."

Oh no!

❝ If one reads this, it may call for some criteria what an insufficient number is. Knowing Potvin et al. TSD's this may be interpreted as call for some power calculation step after 'stage 1'.

Seems so. The terminology is consistent, since in the preceding sentence the guidance asks for a “sufficient” number of subjects in planning the sample size.

❝ Knowing Potvin et al. TSD's this may be interpreted as call for some power calculation step after 'stage 1'.

If one wants to find a suitable adjustment avoiding inflation the only difference to Potvin & Co. would be restricting n₂ to ≥n₁/2.

❝ In power.2stage.GS() the criterion for continuing to stage 2 is:

❝ - not BE in the first stage

❝ - and result was not futile (via a PE or CI criterion)

For the latter case I used 0.01/100, but the former may still hit.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


d_labes
★★★

Berlin, Germany,
2014-07-24 15:47
(3946 d 20:03 ago)

@ Helmut
Posting: # 13307
Views: 12,836

Noise debugged

Dear Helmut,

❝ ❝ Where does the tremendous noise in your pictures come from?

❝

❝ Quick and dirty as ever...

Mea culpa, mea maxima culpa.
Quick and dirty was on my side. The noise comes from a nasty bug in power.2stage.GS() which made the power values jump around instead of converging with an increasing number of sims. :crying:
I had already noticed some of this behaviour earlier and therefore called the function "experimental". But then I forgot about it.

Corrected version is under way. Sorry for any inconvenience.

BTW: To calculate without futility criterion set fCrit="PE" and set fClower=0 (implies fCupper=Inf).

—
Regards,

Detlew

Helmut
★★★

Vienna, Austria,
2014-07-24 17:07
(3946 d 18:43 ago)

@ d_labes
Posting: # 13308
Views: 12,823

Noise debugged

Dear Detlew,

❝ Quick and dirty was on my side. […] Corrected version is under way.

THX a lot!

❝ BTW: To calculate without futility criterion set fCrit="PE" and set fClower=0 (implies fCupper=Inf).

Ah – yes! Looks much better:

Fig 2:
[image]

Fig 4:

PS: In help/NEWS correct to Version 0.1-04 before uploading to CRAN.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


d_labes
★★★

Berlin, Germany,
2014-07-24 18:09
(3946 d 17:42 ago)

@ Helmut
Posting: # 13309
Views: 12,766

Shit happens

Dear Helmut,

❝ PS: In help/NEWS correct to Version 0.1-04 before uploading to CRAN.

Oh no! Too late my Dear.
I hate this, have to change version number and date in more than one place and you can bet that I forgot at least one. Only the entry in DESCRIPTION is checked automatically during build. Fortunately this is only a "Schönheitsfehler" (a cosmetic flaw).

—
Regards,

Detlew

d_labes
★★★

Berlin, Germany,
2014-07-22 10:18
(3949 d 01:32 ago)

@ ElMaestro
Posting: # 13288
Views: 12,953

Commentary

My Dear,

❝ ❝ Yes. Given the unpleasant experience we had last year submitting a letter to the journal in question, I’m not sure whether he will risk the effort…

❝

❝ Are we talking about Detlew der Hosenscheisser or Detlew the Conqueror?

We are talking about Detlew the Mahatma :-D

"God, grant me the serenity to accept the things I cannot change,
courage to change the things I can
and wisdom to know the difference."

And changing the policy of Pharm. Res. (no letter to the editor since many years) is beyond my reach as experienced.

—
Regards,

Detlew

nobody
nothing

2014-09-15 10:24
(3894 d 01:27 ago)

@ d_labes
Posting: # 13511
Views: 12,296

Commentary

"And changing the policy of Pharm. Res. (no letter to the editor since many years) ..."

Rrrrrrreally sure? Look here: click me... Top of the list...

—
Kindest regards,
nobody

Helmut
★★★

Vienna, Austria,
2014-09-15 12:12
(3893 d 23:39 ago)

@ nobody
Posting: # 13512
Views: 12,404

Commentary

Hi nobody,

that’s amazing! Last year I searched the archives of Pharm Res and couldn’t find a single one. Now I got 34; I guess I was just too stupid.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz

Helmut
★★★

Vienna, Austria,
2014-09-29 05:33
(3880 d 06:17 ago)

@ ElMaestro
Posting: # 13611
Views: 12,169

IBE/PBE = Two-Stage

Hi ElMaestro,

❝ One conclusion to draw is that the whole scaling business might not be that smart when it comes to type I errors.

Found a goody in Chow/Liu (3rd ed., but I guess you find it in the earlier ones as well). Chapter 19, Review of Regulatory Guidances on Bioequivalence, 19.2 Guidances on Statistical Procedures, 19.2.10 Two-Stage Test Procedure:

To apply the proposed criteria for assessment of PBE or IBE, the 2001 FDA guidance suggests that a constant scale be used if the observed estimator of σ_TR or σ_WR is smaller than σ_T0 or σ_W0. However, statistically, the observed estimator of σ_TR or σ_WR being smaller than σ_T0 or σ_W0 does not mean that σ_TR or σ_WR is smaller than σ_T0 or σ_W0. A test on the null hypothesis that σ_TR or σ_WR is smaller than σ_T0 or σ_W0 is necessarily performed. As a result, the proposed statistical procedure for assessment of PBE or IBE becomes a two-stage test procedure. It is then recommended that the overall type I error rate and the calculation of power be adjusted accordingly.

PBE and IBE were never seriously implemented, but with RSABE and ABEL we face the same shyte.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz


ElMaestro
★★★

Denmark,
2014-09-29 17:16
(3879 d 18:34 ago)

@ Helmut
Posting: # 13617
Views: 12,047

IBE/PBE = Two-Stage

Hi Helmut,

❝ PBE and IBE were never seriously implemented, but with RSABE and ABEL we face the same shyte.

thanks for posting. It is an interesting quote but I do not understand much of its implications, to be honest. That might be just because I am rather inexperienced with rsABE etc.

Having said that, FDA have gone full tilt ahead with population BE for budesonide and other inhalation drugs at the in vitro level. I wonder how the pieces actually fit together.

—
Pass or fail!
ElMaestro

Helmut
★★★

Vienna, Austria,
2014-09-30 17:03
(3878 d 18:47 ago)

@ ElMaestro
Posting: # 13621
Views: 12,114

Sequential Design = Inflation likely…

Hi ElMaestro

❝ It is an interesting quote but I do not understand much of its implications, to be honest. That might be just because I am rather inexperienced with rsABE etc.

See Fig.3 from Davit et al.*

[image]

Is this a sequential design or not? Forget the protocol review, just look at the left branch. Will the type I error be inflated? In some cases, yes. Remember that in conventional (unscaled) ABE the empiric α for some combinations of sample size and CV is substantially lower than the nominal α of 0.05. Similar here. If we don’t adjust α, the fact that the empiric level of TOST is lower than the nominal one will protect us (maybe). But: We can expect to face inflation, especially close to a CV_WR of 30%.
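Why the neighbourhood of 30% is the danger zone can be illustrated with the first ‘stage’ alone, i.e. the decision whether to scale. The base-R sketch below computes the chance that the observed CV_WR exceeds the 30% switch although the true one does not. It assumes that the estimated variance follows σ²·χ²(df)/df. The value df = 22 is an arbitrary illustration and not tied to any particular design, and p_scale is a made-up name.

```r
# Probability that the observed CV_WR exceeds the switch (and the limits get
# scaled) for a given true CV_WR. Assumes s2_wR ~ sigma2 * chisq(df) / df;
# df = 22 is an illustrative value only.
p_scale <- function(CV_true, df = 22, CV_switch = 0.30) {
  s2_true   <- log(CV_true^2 + 1)    # true variance on the log scale
  s2_switch <- log(CV_switch^2 + 1)  # variance at the switching CV
  pchisq(df * s2_switch / s2_true, df, lower.tail = FALSE)
}

CV_true <- c(0.20, 0.25, 0.30)
p_wide  <- sapply(CV_true, p_scale)
print(data.frame(CV_true = 100 * CV_true, p_scaled = round(p_wide, 3)),
      row.names = FALSE)
```

With a true CV_WR of exactly 30%, close to half of all studies would be evaluated against widened limits under this model, even though the drug is not highly variable by definition. Further below the switch the probability drops quickly.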

❝ […] FDA have gone full tilt ahead with population BE for budesonide and other inhalation drugs at the in vitro level. I wonder how the pieces actually fit together.

Not my field of expertise. Go and have my next puff now.

* Davit BM, Chen M-L, Conner DP, Haidar SH, Kim S, Lee CH, Lionberger RA, Makhlouf FT, Nwakama PE, Patel DT, Schuirmann DJ, Yu LX. Implementation of a Reference-Scaled Average Bioequivalence Approach for Highly Variable Generic Drug Products by the US Food and Drug Administration. AAPS J. 2012;14(4):915–24. doi:10.1208/s12248-012-9406-x.

—
Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz

