Dr_Dan
★★  

Germany,
2015-05-19 10:29
(3236 d 10:11 ago)

Posting: # 14853
Views: 19,919
 

 Does unequal alpha distribution make sense? [Two-Stage / GS Designs]

Dear all
There are some people who recommend using a very small alpha in the first stage and a large one in the second.

Stage I:  0.001 leading to 99.8% CI
Stage II: 0.049 leading to 90.2% CI

Chance of approval in the first go is low, but you will have a comfortable CI in the second stage. However, as I learned from our famous captain, it is almost impossible to argue that it makes a practical difference. Do you agree or disagree? What is your experience?
Looking forward to your valuable input
Kind regards
Dr_Dan

Kind regards and have a nice day
Dr_Dan
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-05-28 17:04
(3227 d 03:36 ago)

@ Dr_Dan
Posting: # 14870
Views: 18,614
 

 Maybe; but watch the TIE!

Hi Dan,

❝ There are some people who recommend using a very small alpha in the first stage and a large one in the second.


❝ Stage I:  0.001 leading to 99.8% CI

❝ Stage II: 0.049 leading to 90.2% CI


Your values are according to Haybittle/Peto. However, like Pocock’s “magic 0.0294” these numbers were derived for superiority testing in group sequential designs (normally distributed data, parallel groups). I know some companies that routinely apply this “method” but obviously have never assessed the type I error in the BE setting (equivalence, crossover). The GL is clear:

… appropriate steps must be taken to preserve the overall type I error of the experiment …
… the choice of how much alpha to spend at the interim analysis is at the company’s discretion.

“is at the company’s discretion” should read “has to be justified by the applicant” since – with one exception* – nothing is published so far! 0.001/0.049 may lead to an unacceptable inflation of the TIE. Example (location of the maximum TIE in the assessed grid of n1 12–72, CV 10–80%):

library(Power2Stage)
power.2stage(method="B", alpha=c(0.001, 0.049), n1=12, GMR=0.95,
  CV=0.22, targetpower=0.8, theta0=1.25, nsims=1e6)

# Method B: alpha (s1/s2) = 0.001 0.049
# Target power in power monitoring and sample size est. = 0.8
# BE margins = 0.8 ... 1.25
# CV = 0.22; n(stage 1)= 12; GMR = 0.95
# GMR = 0.95 and mse of stage 1 in sample size est. used
# Futility criterion Nmax = Inf

# 1e+06 sims at theta0 = 1.25 (p(BE)='alpha').
# p(BE)    = 0.058031

In our pile of half-baked manuscripts we explored adjusted alphas which maintain the TIE. For a “type 1” TSD, GMR 0.95, target power 80%, α1 0.001 we found an adjusted α2 of 0.0413 suitable to maintain the patient’s risk:

library(Power2Stage)
power.2stage(method="B", alpha=c(0.001, 0.0413), n1=12, GMR=0.95,
  CV=0.22, targetpower=0.8, theta0=1.25, nsims=1e6)

# Method B: alpha (s1/s2) = 0.001 0.0413
# Target power in power monitoring and sample size est. = 0.8
# BE margins = 0.8 ... 1.25
# CV = 0.22; n(stage 1)= 12; GMR = 0.95
# GMR = 0.95 and mse of stage 1 in sample size est. used
# Futility criterion Nmax = Inf

# 1e+06 sims at theta0 = 1.25 (p(BE)='alpha').
# p(BE)    = 0.049939

Zheng’s 0.01/0.04 at the location of maximum inflation:

library(Power2Stage)
power.2stage(method="B", alpha=c(0.01, 0.04), n1=12, GMR=0.95,
  CV=0.24, targetpower=0.8, theta0=1.25, nsims=1e6)

# Method B: alpha (s1/s2) = 0.01 0.04
# Target power in power monitoring and sample size est. = 0.8
# BE margins = 0.8 ... 1.25
# CV = 0.24; n(stage 1)= 12; GMR = 0.95
# GMR = 0.95 and mse of stage 1 in sample size est. used
# Futility criterion Nmax = Inf

# 1e+06 sims at theta0 = 1.25 (p(BE)='alpha').
# p(BE)    = 0.048782


❝ Chance of approval in the first go is low, but you will have a comfortable CI in the second stage.


Correct. If one does not want to take the chance to show BE already in the first stage and has enough time to almost always proceed to the second stage, why not? The first stage serves only to get an estimate of the CV and can be seen as an “internal pilot study”. Compared to published methods with an equal split of alphas the sample size penalty is lower.
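As a quick illustration (a sketch of mine, not taken from the half-baked manuscripts): simulating under the alternative (theta0 = 0.95) with the same function as above compares the expected total sample sizes of an equal split and the adjusted asymmetric split; nsims is reduced here to keep it fast.

library(Power2Stage)
# Expected total sample size: equal split (Pocock 0.0294) vs. adjusted
# asymmetric split (0.001/0.0413), simulated for a bioequivalent product.
# Increase nsims for stable estimates.
for (a in list(c(0.0294, 0.0294), c(0.001, 0.0413))) {
  res <- power.2stage(method="B", alpha=a, n1=12, GMR=0.95, CV=0.22,
                      targetpower=0.8, theta0=0.95, nsims=1e5,
                      print=FALSE, details=FALSE)
  cat("alpha =", a, " E[N] =", signif(res$nmean, 4), "\n")
}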

❝ However, as I learned from our famous captain, it is almost impossible to argue that it makes a practical difference.


Ahoy! Difference to what?

❝ What is your experience?


See above for the thoughtless application of 0.001/0.049 which leads to an inflated TIE. Sooner or later the BSWP (which has TSDs on its work plan for 2015) will realize this problem. Consequences? Recalculate studies with a wider CI? What if a study which passed with the reported 90.20% CI fails with a 91.74% CI?
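To put the “wider CI” into perspective, a back-of-the-envelope sketch (base R only): on the log scale the half-width of the 1-2α interval is qt(1-α, df) times the standard error, so the penalty is just a ratio of t quantiles.

# Relative width of the 91.74% CI (alpha 0.0413) vs. the 90.20% CI
# (alpha 0.049), e.g. for 24 subjects in a 2x2 crossover (22 df)
df <- 24 - 2
qt(1 - 0.0413, df) / qt(1 - 0.049, df)   # about 1.05, i.e. ~5% wider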


  • Zheng C, Zhao L, Wang J. Modifications of sequential designs in bioequivalence trials. Pharm Stat. 2015;14(3):180–8. doi:10.1002/pst.1672.

Helmut Schütz
d_labes
★★★

Berlin, Germany,
2015-06-05 11:35
(3219 d 09:05 ago)

@ Helmut
Posting: # 14914
Views: 17,938
 

 Advantages

Hi Helmut, hi Dan,

Sorry for joining this thread late; I was on holidays at the time.

❝ ❝ Chance of approval in the first go is low, but you will have a comfortable CI in the second stage.


❝ Correct. If one does not want to take the chance to show BE already in the first stage and has enough time to almost always proceed to the second stage, why not? The first stage serves only to get an estimate of the CV and can be seen as an “internal pilot study”. Compared to published methods with an equal split of alphas the sample size penalty is lower.

Emphasis mine.

This is exactly the reason why my Brötchengeber (the man who signs the paycheck) recommends these settings, especially for cases where the information about the variability of the drugs under consideration is very vague and the classical approach would require a pilot study followed by a pivotal study.

Let's see whether this expectation holds:
Type 1 TSD (Potvin B) with n1=12.
Somewhat small IMHO, but you have to choose a small n1 to avoid wasting money, and for ethical reasons, in case the variability proves to be small and 12 subjects are already sufficient.

Nominal alphas according to Pocock (0.0294, 0.0294) or adjusted Haybittle/Peto (0.001, 0.0416 to control the overall TIE), target power 80%. Assume we have a bioequivalent product.

                      sample size  p stop at
CV    alpha's  power   mean   p95%  interim
--------------------------------------------
0.15   Poc    88.13%   13.5    22    79.3%   
       adjHP  88.32%   14.2    22    56.2%
0.25   Poc    81.17%   32.3    60    18.6%
       adjHP  79.91%   30.5    54     3.9%
0.4    Poc    74.95%   78.8   142     1.0%
       adjHP  75.18%   70.5   128     0.1%
0.7    Poc    72.42%  207.3   378     0%
       adjHP  72.93%  185.1   338     0%
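
A hedged sketch of how a single row of this table can be reproduced (not my original script; the arguments simply mirror the power.2stage() calls earlier in the thread, only simulated under a bioequivalent product, theta0 = 0.95, with reduced nsims). The percentage of studies proceeding to stage 2 reported in the output is the complement of the "p stop at interim" column.

library(Power2Stage)
# CV = 0.25 row: Pocock 0.0294/0.0294 vs. adjusted Haybittle/Peto 0.001/0.0416
power.2stage(method="B", alpha=c(0.0294, 0.0294), n1=12, GMR=0.95,
             CV=0.25, targetpower=0.8, theta0=0.95, nsims=1e5)
power.2stage(method="B", alpha=c(0.001, 0.0416), n1=12, GMR=0.95,
             CV=0.25, targetpower=0.8, theta0=0.95, nsims=1e5)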


With the exception of small variability, where the Pocock settings have a slight advantage w.r.t. the expected sample size and the chance of stopping at interim, the "asymmetric" alpha split proved to be somewhat better w.r.t. sample size. Differences in power or in the probability to stop at interim are of no practical relevance, I think.

Judging whether this is an advantage in practice is up to you :cool:.

Regards,

Detlew
ElMaestro
★★★

Denmark,
2015-06-05 12:07
(3219 d 08:33 ago)

@ d_labes
Posting: # 14915
Views: 17,832
 

 Advantages

Hi all,

                      sample size  p stop at
CV    alpha's  power   mean   p95%  interim
--------------------------------------------
0.15   Poc    88.13%   13.5    22    79.3%
       adjHP  88.32%   14.2    22    56.2%
0.25   Poc    81.17%   32.3    60    18.6%
       adjHP  79.91%   30.5    54     3.9%
0.4    Poc    74.95%   78.8   142     1.0%
       adjHP  75.18%   70.5   128     0.1%
0.7    Poc    72.42%  207.3   378     0%
       adjHP  72.93%  185.1   338     0%

❝ (...) Judging whether this is an advantage in practice is up to you :cool:.


This is a good example showing how complex the issue is.
Around CV=0.25 in this example the Pocock alphas perform better in terms of power than HP, at a slightly higher sample size. The CV is never known in advance of the experiment, but the region around CV=0.25 is where most drugs (drug products) happen to lie.

We really need some kind of summary statistic that is the ultimate indicator of whether one method is better than another. Power divided by average sample size would be my own best initial guess (to me this would be the bang you get for your buck):
I1 = pwr/Navg

Or perhaps better:

I2 = (pwr - 0.7)/Navg


The more positive I2 the better, something like that. A negative I2 would possibly go directly into the waste bin, as I can't imagine anyone being interested in a power lower than 0.7 or thereabouts (0.7 is taken a little out of the blue, but I note that Potvin's methods all have power > 0.7 for a target power of 0.8).
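For concreteness, a throw-away sketch of the two indices applied to the CV 0.25 row of Detlew's table (the power and mean N values are simply copied from that table, nothing is re-simulated):

# I1 = "bang for the buck", I2 penalises anything below 70% power
perf <- data.frame(method = c("Poc", "adjHP"),
                   pwr    = c(0.8117, 0.7991),
                   Navg   = c(32.3, 30.5))
perf$I1 <- perf$pwr / perf$Navg
perf$I2 <- (perf$pwr - 0.7) / perf$Navg
perf

Interestingly, for this row the two indices already point in opposite directions (I1 favours adjHP, I2 favours Pocock), which rather underlines how tricky a single summary statistic will be.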

Let me hear your opinion: do you have ideas for a summary statistic that would help us decide between the various methods and versions?

PS: Perhaps this Q should be moved into its own separate thread?

Pass or fail!
ElMaestro
d_labes
★★★

Berlin, Germany,
2015-06-05 13:21
(3219 d 07:19 ago)

@ ElMaestro
Posting: # 14917
Views: 17,801
 

 Advantages – performance measure

Dear ElMaestro!

❝ We really need some kind of summary statistic that is the ultimate indicator of whether one method is better than the other. ...


Really good point!

Have a look at:
Kieser, Rauch. Two-stage designs for cross-over bioequivalence trials. Stat Med. 2015. doi:10.1002/sim.6487.

They cite and use a performance score according to:
Liu et al. Evaluating the adaptive performance of flexible sample size designs with treatment difference in an interval. Stat Med. 2008;27:584–96.

But don't ask me for details! I haven't looked into the latter reference yet :lookaround:.

Regards,

Detlew
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-05 15:29
(3219 d 05:10 ago)

@ d_labes
Posting: # 14918
Views: 17,941
 

 Advantages – performance measure

Dear both!

❝ Liu et al. Evaluating the adaptive performance of flexible sample size designs with treatment difference in an interval. Stat Med. 2008;27:584–96.


❝ But don't ask me for details!


Tricky. Essentially they compare adaptive designs in terms of the expected total sample size and power to a fixed-sample design. They call the latter “ideal” which is only true for a known CV. Such a comparison is not fair but the only one we probably have. ;-)
Below some stuff (the adjusted HP α2 is 0.0413, not 0.0416 – which is for OF). “Type 1”, T/R 0.95, target power 0.80. N is the sample size of the “ideal” design or the expected average sample size in the TSDs. f1 is the ratio of sample sizes and f2 the ratio of expected power:

 CV  method    alpha(s)       N    power   f1    f2  
0.15 fixed     0.05    –      12   0.8305  –      –  
     adjPotvin 0.0302 0.0302  13.5 0.8830 1.123 1.063
     adjHP     0.001  0.0413  14.2 0.8825 1.187 1.063
     adjOF     0.005  0.0416  14.1 0.8738 1.179 1.052
0.25 fixed     0.05    –      28   0.8074   –     –  
     adjPotvin 0.0302 0.0302  32.0 0.8126 1.143 1.006
     adjHP     0.001  0.0413  30.5 0.7967 1.090 0.987
     adjOF     0.005  0.0416  30.4 0.7966 1.087 0.987
0.40 fixed     0.05    –      66   0.8053   –     –  
     adjPotvin 0.0302 0.0302  78.2 0.7500 1.184 0.931
     adjHP     0.001  0.0413  70.7 0.7516 1.071 0.933
     adjOF     0.005  0.0416  70.5 0.7528 1.069 0.935
0.70 fixed     0.05    –     174   0.8031   –     –  
     adjPotvin 0.0302 0.0302 205.6 0.7252 1.181 0.903
     adjHP     0.001  0.0413 185.6 0.7296 1.067 0.908
     adjOF     0.005  0.0416 185.1 0.7293 1.064 0.908

How to calculate their APS (average performance score) combining the sample size and power is beyond me.
I think there is no general rule (i.e., independent of the expected CV and the planned stage 1 sample size) to judge which method performs “best”. For low CVs the larger average sample size may be outweighed by higher power (leaving some “headroom” for the ratio). If the CV is higher than ~0.25 any TSD with such a small n1 will lack power. Let’s play the game with an expected CV of 0.3 and an n1 of 24:
method    alpha(s)       N    power   f1    f2 
fixed     0.05    –      40   0.8158  –      – 
adjPotvin 0.0302 0.0302  39.4 0.8284 0.984 1.015
adjHP     0.001  0.0413  42.3 0.8110 1.057 0.994
adjOF     0.005  0.0416  41.9 0.8104 1.048 0.993

The winner is the symmetric split, at least for my personal favorite of n1 ≈ 0.75 × the sample size of the fixed design.
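A sketch of how the “fixed” reference and the f1/f2 ratios of one row of the tables can be obtained; the component and column names ($pBE, "Sample size", "Achieved power") are assumptions based on current PowerTOST/Power2Stage versions, $nmean appears in the code further down this thread.

library(PowerTOST)    # fixed-design reference
library(Power2Stage)  # TSD simulations
fix <- sampleN.TOST(CV=0.3, theta0=0.95, targetpower=0.8, print=FALSE)
tsd <- power.2stage(method="B", alpha=c(0.001, 0.0413), n1=24, GMR=0.95,
                    CV=0.3, targetpower=0.8, theta0=0.95, nsims=1e5,
                    print=FALSE, details=FALSE)
f1 <- tsd$nmean / fix[["Sample size"]]     # ratio of (expected) sample sizes
f2 <- tsd$pBE   / fix[["Achieved power"]]  # ratio of (expected) power
round(c(N.fix=fix[["Sample size"]], pwr.fix=fix[["Achieved power"]],
        N.tsd=tsd$nmean, pwr.tsd=tsd$pBE, f1=f1, f2=f2), 4)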
Furthermore, in TSDs the distribution of the total sample size is not necessarily normal. Actually it might even be bimodal (with increasing α1 the distribution gets “contaminated” by the fraction of studies stopping in the first stage). 10⁵ sims each:


[image]

Whether it makes sense to compare designs based on the arithmetic mean (or even the median) remains an open issue.

Helmut Schütz
d_labes
★★★

Berlin, Germany,
2015-06-05 23:27
(3218 d 21:13 ago)

@ Helmut
Posting: # 14920
Views: 17,826
 

 Advantages – performance measure

Dear Helmut!

❝ Tricky. Essentially they compare adaptive designs in terms of the expected total sample size and power to a fixed-sample design. They call the latter “ideal” which is only true for a known CV. Such a comparison is not fair but the only one we probably have. ;-)


THX for educating me.
But I wonder why an unfair comparison should guide me, whether it is the only one we have or not. And a known CV would answer all our kwestiones. Even 42.

❝ Below some stuff (the adjusted HP α2 is 0.0413, not 0.0416 – which is for OF).


I expected that nitpicking :-D. Mine was from a sparser grid than the one we have now. But I was too lazy to recalculate (remember the Mahatma story?). IMHO this makes no practical difference.

❝ ... f1 is the ratio of sample sizes and f2 the ratio of expected power:


To me this means that a lower f1 is preferable, and a higher f2 as well. But you can arrive at higher power with a higher than necessary sample size. I think this is what is meant by "overshoot". And that isn't preferable.

❝ ... How to calculate their APS (average performance score) combining the sample size and power is beyond me.


I will have a look next time. Maybe Buddha will enlighten me :cool:.

❝ ... The winner is the symmetric split for my personal favorite n1 ~0.75 of fixed.


I can't follow you here.

Could it be that not only old beliefs die hard but also old Vorurteile (prejudices) :-D?

I don't see an overall better performance of the symmetric split w.r.t. your f1 and f2 for a fixed n1. Sometimes f1 points to a preference for adjusted Pocock, sometimes to a preference for HP or OBF (adjusted, of course, to control the TIE), and sometimes f2 points in the same direction or vice versa.
And the rule of n1 ≈ 0.75 × the fixed design's n is not applicable if you bear in mind that the premise of my post was considerable uncertainty about the intra-subject variability (CV).

Of course, if you have some really good guesstimate of the CV your preferred rule is preferred from the preferred (aka initiated) :cool:.

BTW: How did you arrive at the histograms of the sample size?
Would it be helpful to have an option to obtain the data for such graphics from the functions in Power2Stage?

Regards,

Detlew
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-06 01:29
(3218 d 19:11 ago)

@ d_labes
Posting: # 14921
Views: 17,727
 

 Advantages – performance measure

Dear Detlew!

❝ THX for educating me.


Come on! Didn’t I get the paper from you?

❝ But I wonder why an unfair comparison should guide me, whether it is the only one we have or not. And a known CV would answer all our kwestiones.


Agree. That’s why I wrote

❝ ❝ Such a comparison is not fair but the only one we probably have.


❝ Even 42.

Indeed! In 1979 I had a great evening in Crete (Beer ⇒ Retsina ⇒ Ouzo ⇒ Raki) with Douglas Adams. Inspiring!

❝ ❝ Below some stuff (the adjusted HP α2 is 0.0413, not 0.0416 – which is for OF).


❝ I expected that nitpicking :-D.


Sorry.

❝ Mine was from a sparser grid than the one we have now.


I took mine from the Banana splits.

❝ But I was too lazy to recalculate (remember the Mahatma story?).


Sure. Didn’t I send you my sim’s?

❝ To me this means that a lower f1 is preferable, and a higher f2 as well.


Yes.

❝ But you can arrive at higher power with a higher than necessary sample size. I think this is what is meant by "overshoot". And that isn't preferable.


Exactly. That’s why they have this ƒs scaling factor in the formula of the ROS.

❝ I will have a look next time. Maybe Buddha will enlighten me :cool:.


Buddha doesn’t enlighten anybody. Takes years of meditation to reach some degree of self-awareness. Maybe the Holy Spirit or the Flying Spaghetti Monster? :-D

❝ ❝ ... The winner is the symmetric split, at least for my personal favorite of n1 ≈ 0.75 × the sample size of the fixed design.


❝ I can't follow you here.


❝ Could it be that not only old beliefs die hard but also old Vorurteile (prejudices) :-D?


Best only in this particular case. You cured me already!

❝ I don't see an overall better performance of the symmetric split w.r.t. your f1 and f2 for a fixed n1.


I don’t think that there is one.

❝ Sometimes f1 points to a preference for adjusted Pocock, sometimes to a preference for HP or OBF (adjusted, of course, to control the TIE), and sometimes f2 points in the same direction or vice versa.


Exactly.

❝ And the rule of n1 ≈ 0.75 × the fixed design's n is not applicable if you bear in mind that the premise of my post was considerable uncertainty about the intra-subject variability (CV).


I missed that!

❝ Of course, if you have some really good guesstimate of the CV your preferred rule is preferred from the preferred (aka initiated) :cool:.


Yep.

❝ BTW: How did you arrive at the histograms of the sample size?


Crude example:
library(Power2Stage)
N <- as.numeric(power.2stage(CV=0.3, alpha=c(0.001, 0.0413), n1=24,
     npct=seq(0.001, 0.999, 0.001), print=FALSE, details=FALSE)$nperc)
hist(N, freq=FALSE)
abline(v=c(median(N), mean(N)), lty=c(1, 3))


❝ Would it be helpful to have an option to obtain the data for such graphics from the functions in Power2Stage?


For novices, why not? Add a parameter hist=TRUE|FALSE and calculate the tight percentiles – regardless of what might already be set in the parameter npct.

Helmut Schütz
d_labes
★★★

Berlin, Germany,
2015-06-06 14:58
(3218 d 05:42 ago)

@ Helmut
Posting: # 14923
Views: 17,572
 

 Histogram wonder

Dear Helmut!

❝ ❝ TXH for educating me.


❝ Come on! Didn’t I get the paper from you?


Not the Liu et al. paper, which I don't have AFAIK.
Not so sure though, since lately I forget things every now and then; seems to be aging or too much beer :-D.

❝ ...

❝ ❝ BTW: How did you arrive at the histograms of the sample size?


❝ Crude example:

library(Power2Stage)

  N <- as.numeric(power.2stage(CV=0.3, alpha=c(0.001, 0.0413), n1=24,
       npct=seq(0.001, 0.999, 0.001), print=FALSE, details=FALSE)$nperc)
  hist(N, freq=FALSE)
  abline(v=c(median(N), mean(N)), lty=c(1, 3))


Wow! That made me dumbfounded! Clever, clever.
:clap:

Regards,

Detlew
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-09 15:43
(3215 d 04:57 ago)

@ d_labes
Posting: # 14932
Views: 17,638
 

 Histogram charlatanry

Dear Detlew!

❝ Wow! That made me dumbfounded! Clever, clever.

:clap:


THX for the premature praise. Unfortunately my code is just approximate (euphemism for quick & dirty):

mean(N)
[1] 42.25025

Whereas correct is
power.2stage(CV=0.3, alpha=c(0.001, 0.0413), n1=24, print=F, details=F)$nmean
[1] 42.26836


If one wants to get the histograms right, download the source from CRAN and after the line
  } # end stage 2 calculations
add
  hist <- hist(ntot, plot=FALSE)
At the res-list change the last line to
  nmean=mean(ntot), nrange=range(ntot), nperc=quantile(ntot, p=npct),
  hist=hist
)


Now the data needed to plot a histogram of the total sample size are accessible for further use:

CV  <- 0.3
n1  <- 24
col <- "#AAF7FF"
alpha <- matrix(nrow=4, ncol=2, byrow=T, dimnames=list(
  c("Haybittle/Peto 0.0010/0.0413",
    "O\u2019Brien/Fleming 0.0050/0.0416",
    "Zheng 0.0100/0.0412",
    "Potvin 0.0302/0.0302"),
  c("alpha1", "alpha2")))
alpha[1, ] <- c(0.001, 0.0413)
alpha[2, ] <- c(0.005, 0.0416)
alpha[3, ] <- c(0.010, 0.0412)
alpha[4, ] <- rep(0.0302, 2)
invisible(split.screen(c(2, 2)))
for(j in 1:nrow(alpha)) {
  screen(j)
  res <- power.2stage(CV=CV, alpha=alpha[j, ], n1=n1, print=F, details=F)
  plot(res$hist, freq=F, xlim=c(12, 120), ylim=c(0, 0.1),
    main=rownames(alpha)[j], xlab=expression(n[total]),
    cex.main=0.95, cex.axis=0.9, col=col, las=1)
  abline(v=c(res$nmean, res$nperc[["50%"]]), lty=c(1, 3))
  legend("topright", box.lty=0, legend=c("mean", "median"), lty=c(1, 3),
    cex=0.9)
}
close.screen(all.screens=T)


[image]

I luf open source!

Helmut Schütz
d_labes
★★★

Berlin, Germany,
2015-06-09 16:55
(3215 d 03:45 ago)

@ Helmut
Posting: # 14933
Views: 17,704
 

 Histogram beauty

Dear Helmut!

A better way would be to return the vector of ntot (your suggestion in our correspondence) instead of some histogram-data with standard settings.
That would give you all freedom to design the histogram to your needs.

Example: 100 bins in the histogram.
Code snippet (res$ntot refers to the modified function returning the vector of total sample sizes, and des[j] to a vector of panel titles):
  hist(res$ntot, freq=F, breaks=100, main=des[j],
       cex.main=0.95, cex.axis=0.9,
       xlim=c(12, 120), ylim=c(0, 0.5),
       xlab=expression(N[total]),
       col=col, las=1)
  abline(v=c(res$nmean, res$nperc[["50%"]]), lty=c(1, 3))
  legend("topright", box.lty=0, legend=c("mean", "median"), lty=c(1, 3),
         cex=0.9)


[image]

Note the pronounced and outstanding role of N=24, the stage 1 sample size. This is not so clearly seen in your pictures above for OBF or Zheng, and now partly visible for HP as well.

❝ I luf open source!

:cool:

@All: Be warned! You need an understanding of all the code in Power2Stage to get this up and running.

Regards,

Detlew
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-09 17:24
(3215 d 03:16 ago)

@ d_labes
Posting: # 14934
Views: 17,284
 

 Histogram beast

Dear Detlew!

❝ A better way would be to return the vector of ntot (your suggestion in our correspondence) …


Well, you convinced me that it doesn’t make sense to return such a large vector (easily consisting of a million elements).

❝ … instead of some histogram-data with standard settings.


Maybe. Maybe not. IMHO, it is good practice to select the number of bins based on the range of data. See help(nclass.Sturges)

❝ That would give you all freedom to design the histogram to your needs.


True.

❝ Example: 100 bins in the histogram


IMHO, that’s a bit over the top. ;-)
I guess that 50 bins are more than enough to separate the first stage and show the log-distribution of sample sizes in the second stage. Well, log="x" rightly throws warnings since this is not a proper histogram any more!

[image]

❝ Note the pronounced and outstanding role of N=24, the stage 1 sample size. This is not so clearly seen in your pictures above for OBF or Zheng, and now partly visible for HP as well.


Righty-right! With some tweaks:

[image]

Helmut Schütz
Dr_Dan
★★  

Germany,
2015-05-29 10:33
(3226 d 10:07 ago)

@ Dr_Dan
Posting: # 14879
Views: 18,277
 

 Does unequal alpha distribution make sense?

Dear Helmut
For a non-statistician the whole problem is very complicated. The method according to Haybittle/Peto

Stage I:  0.001 leading to 99.8% CI
Stage II: 0.049 leading to 90.2% CI

may lead to an unacceptable inflation of the TIE, as far as I understood. As you explained, in your pile of half-baked manuscripts you explored adjusted alphas which maintain the TIE. For a “type 1” TSD, GMR 0.95, target power 80%, α1 0.001 you found an adjusted α2 of 0.0413 suitable to maintain the patient’s risk. With this method, which CI do you get for the second stage?

Stage I:  0.001 leading to 99.8% CI
Stage II: 0.0413 leading to 9?.?% CI ???

For the sample size calculation of the second stage, do you use the GMR of 0.95 or the GMR as calculated from the stage I results?

Kind regards
Dr_Dan

Kind regards and have a nice day
Dr_Dan
ElMaestro
★★★

Denmark,
2015-05-29 11:31
(3226 d 09:09 ago)

@ Dr_Dan
Posting: # 14882
Views: 18,548
 

 Does unequal alpha distribution make sense?

Hi Dr_Dan,

❝ Stage I:  0.001 leading to 99.8% CI

❝ Stage II: 0.0413 leading to 9?.?% CI ???


It is generally a 1-2*alpha confidence interval, so in this case the coverage is 1-2*0.0413 = 91.74%.
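(Sketch with made-up numbers only: PowerTOST's CI.BE() gives the 1-2*alpha interval from a point estimate, CV and total sample size of a 2×2 crossover.)

library(PowerTOST)
# 91.74% CI (alpha 0.0413) vs. the misused 90.20% CI (alpha 0.049)
# for an illustrative outcome: PE 0.95, CV 25%, n 36
CI.BE(alpha=0.0413, pe=0.95, CV=0.25, n=36)
CI.BE(alpha=0.0490, pe=0.95, CV=0.25, n=36)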

❝ For the sample size calculation of the second stage, do you use the GMR of 0.95 or the GMR as calculated from the stage I results?


So far no method has been published where the observed GMR can be used without power suffering and/or the sample size going through the roof. And, equally troubling, no one has published proof that using the observed GMR doesn't work. I know of at least three different groups of researchers (or two groups and one individual) who have been looking at it. I think it is a matter of journals not wishing to publish negative results. The lack of these publications has already caused a few projects to fail, as far as I know.
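If you want to see the effect yourself, a sketch (usePE is assumed to be the argument of power.2stage() which switches the stage 2 sample size estimation to the observed point estimate; Nmax caps the total sample size as a futility bound):

library(Power2Stage)
# "Type 1" TSD with the stage 2 sample size re-estimated from the observed
# PE of stage 1 (usePE=TRUE) instead of the fixed GMR 0.95; compare power
# and expected N with the default run below.
power.2stage(method="B", alpha=c(0.0294, 0.0294), n1=12, GMR=0.95, CV=0.3,
             targetpower=0.8, theta0=0.95, nsims=1e5, usePE=TRUE, Nmax=150)
power.2stage(method="B", alpha=c(0.0294, 0.0294), n1=12, GMR=0.95, CV=0.3,
             targetpower=0.8, theta0=0.95, nsims=1e5)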

Pass or fail!
ElMaestro
nobody
nothing

2015-05-29 14:56
(3226 d 05:44 ago)

@ ElMaestro
Posting: # 14883
Views: 18,327
 

 Does unequal alpha distribution make sense?

OT: Is that a new trend, to set up trial designs without checking the appropriate statistics to evaluate the results (see also 2-stage scaled BE trials)? Just asking... :-(

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-05-29 15:27
(3226 d 05:13 ago)

@ nobody
Posting: # 14884
Views: 18,474
 

 Old beliefs die hard

Hi nobody,

❝ Is that a new trend, to set up trial designs without checking the appropriate statistics to evaluate the results (see also 2-stage scaled BE trials)? Just asking... :-(


No, it isn’t.
As I wrote above, »I know some companies that routinely apply this “method”«… At least one of them is a very (very!) big player. During two in-house trainings they didn’t believe me (“We never [sic] faced any problems in our submissions!”). :not really:

Old beliefs die hard
even when demonstrably false.
   E.O. Wilson

Helmut Schütz
nobody
nothing

2015-05-29 15:41
(3226 d 04:59 ago)

@ Helmut
Posting: # 14885
Views: 18,319
 

 Old beliefs die hard

Hmm, I was specifically referring to this here from Denmark:

❝ And, equally troubling, no one has published proof that using the observed GMR doesn't work. I know of at least three different groups of researchers (or two groups and one individual) who have been looking at it. I think it is a matter of journals not wishing to publish negative results. The lack of these publications has already caused a few projects to fail, as far as I know.


...or did I get this completely wrong?

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-05-29 20:08
(3226 d 00:32 ago)

@ nobody
Posting: # 14886
Views: 18,501
 

 Old beliefs die hard

Hi nobody,

as a deckhand I’m not qualified to speak for our Capt’n. Only some remarks.

❝ ❝ And, equally troubling, no one has published proof that using the observed GMR doesn't work. I know of at least three different groups of researchers (or two groups and one individual) who have been looking at it.


I would say we don’t need a proof here. There is an abundance of literature about fully adaptive methods in superiority testing (i.e., adjusting both for the effect size and the variance). Many papers warn about a “too early” interim analysis because the estimate(s) are not reliable enough. However, adaptive methods in Phase III are still valuable tools. Nevertheless, adjusting for two estimates has its price. There is no free lunch.

In the BE context the story is a little bit different. Given the limited sample sizes in the first stage the estimated GMR (and yes, the CV as well…) is not that good. All adaptive methods published so far contain at least one futility rule (papers by Karalis/Macheras, Kieser/Rauch 2015, and our wacky poster doi 10.13140/RG.2.1.5190.0967). A natural one is stopping after the first stage if the GMR is outside the acceptance range.¹ Another one is to set an upper limit on the estimated total sample size. Note that adding futility rule(s) to any method decreases the type I error (since studies are more likely to stop). But: the impact on power may be unacceptable (see Anders’ 2013 paper and my 2015 review where I explored Karalis’/Macheras’ methods); a sketch of how this can be quantified follows the example below. In other words, regulators don’t have a problem (since the TIE is preserved) but such methods might be problematic for economic and ethical reasons.

Example: You state a maximum total sample size of 120 in the protocol.
  a. You run some simulations beforehand and know that – contrary to what you expected from the paper(s) – the actual power might be (far) below 60%. Would you give this information to the EC, risking that the protocol is rejected – or hide it and cross your fingers?
  b. You estimate the total sample size (based on the GMR and CV) as 124. According to the protocol you would have to stop and throw the first stage’s data into the waste bin. Or would you let the protocol walk the plank, write some :blahblah: like “… in order to compensate for a potentially higher dropout rate …” and continue?
  c. You look at the extensive tables of the publications or perform your own simulations. You don’t state any futility rule but are aware that there is some chance that the total sample size will exceed your budget by far (in our Capt’n’s words: going through the roof). Internally (i.e., following the “Guy in the Armani Suit”) you know that in such a case you will stop the study (every protocol contains such a clause somewhere). Essentially you make the EC believe that the power is 80%, whereas in reality it is <60% (like in #a).
In my experience #c is quite common.
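As announced above, a sketch of how the power penalty of the Nmax futility bound from the example (a maximum total sample size of 120) could be quantified; same "type 1" settings as earlier in the thread, simulated for a bioequivalent product with a high CV so that the bound actually bites. The Nmax argument corresponds to the "Futility criterion Nmax" line in the outputs above.

library(Power2Stage)
# Potvin B with and without the Nmax = 120 futility criterion; the drop in
# the overall probability of BE (expected power) is the hidden cost.
power.2stage(method="B", alpha=c(0.0294, 0.0294), n1=12, GMR=0.95, CV=0.5,
             targetpower=0.8, theta0=0.95, nsims=1e5)
power.2stage(method="B", alpha=c(0.0294, 0.0294), n1=12, GMR=0.95, CV=0.5,
             targetpower=0.8, theta0=0.95, nsims=1e5, Nmax=120)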

I don’t say that fully adaptive methods are futile, only that they don’t work “out of the box”. If one is able to come up with an educated guess about both the “most likely” and the “worst case” CV it makes sense to perform one’s own simulations. Since the chance to stop in the first stage is higher, less adjustment of α (i.e., a narrower CI) is necessary. Whether it outweighs the loss in power² has to be assessed.
It is amazing that many people are so interested in the GMR at the interim. The same people had no problem throwing away failed fixed-sample studies in the last 30+ years (“Too bad that our assumptions were wrong. Let’s perform another study.”)… TSDs are not a jack-of-all-trades device.


  1. Mandatory in fully adaptive methods. Otherwise any software you are using for the sample size estimation will show you the finger.
  2. In my experience rarely (if ever).

Helmut Schütz
ElMaestro
★★★

Denmark,
2015-05-29 20:57
(3225 d 23:43 ago)

@ Helmut
Posting: # 14887
Views: 18,176
 

 Old beliefs die hard

Hi Hötzi,

that was another good post. To state it a little differently:
We are trying to build confidence in the point estimate; this is the whole point of the (any) BE study. At a time when we have very little confidence in it, we will not be rewarded for trying to make use of it. That is, in a nutshell, why the two-stage approaches do not work when we try to use the PE from stage 1.

I think we need it published, or some other way to get the message across. Telling companies that it doesn't work, well... doesn't work.
"Mr. Birdsong is not helping the project forward." vs "Mr. Birdsong is offering an opportunity to avoid a sad and brainless waste of money." The interpretation is commonly the former, rarely the latter :-D

Pass or fail!
ElMaestro
nobody
nothing

2015-05-29 22:06
(3225 d 22:34 ago)

@ ElMaestro
Posting: # 14888
Views: 18,225
 

 Old beliefs die hard

...Mr. Birdsong... made my weekend :-D

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-02 03:52
(3222 d 16:48 ago)

@ ElMaestro
Posting: # 14890
Views: 18,210
 

 ♩ ♪♫♬`·.¸¸.·´`·.¸¸.

¡ Hola el Jeffe !

❝ We are trying to build confidence in the point estimate; this is the whole point of the (any) BE study. At a time when we have very little confidence in it, we will not be rewarded for trying to make use of it. That is, in a nutshell, why the two-stage approaches do not work when we try to use the PE from stage 1.


Yep.

❝ I think we need it published, …


Sooner or later. Mainly (solely?) to satisfy our egos. It is not particularly new that adjusting for the observed GMR sucks in terms of power.¹,² Nobody seemingly reads this stuff… Why should anybody read ours?

❝ … or some other way to get the message across.


Ahem. How / where?

❝ Telling companies that it doesn't work, well... doesn't work.


Yes. See the end of this post. Same company, two countries.
1st training:

»Our statisticians [plural!] told us that there is no problem.« – »??‽!«
These “experts” did not participate. Is BE too primitive a topic for them? Anything below the Kullback-Leibler divergence seems to be of no interest to them.*

2nd training:

Statistician present. »I have been told that you routinely use Haybittle-Peto? Did you try to assess the type I error?« – »We always use SAS, which is validated.« – »??‽!«


❝ "Mr. Birdsong is not helping the project forward." vs "Mr. Birdsong is offering an opportunity to avoid a sad and brainless waste of money." The interpretation is commonly the former, rarely the latter :-D


Good story, ♩ ♪♫♬! You speak from my ♥.


    References:
  1. Jennison C, Turnbull BW. Mid-course sample size modification in clinical trials based on the observed treatment effect. Stat Med. 2003;22(6):971–93. doi:10.1002/sim.1457.
  2. Tsiatis AA, Mehta C. On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika. 2003;90(2):367–78. doi:10.1093/biomet/90.2.367.


Helmut Schütz
Dr_Dan
★★  

Germany,
2015-06-02 10:00
(3222 d 10:39 ago)

@ Helmut
Posting: # 14891
Views: 18,142
 

TSD only an option in exceptional cases

Dear all,
Thank you very much for the effort you spent teaching me where the pitfalls of a TSD study lie. Fortunately it was in vain (not the teaching: I just heard that the project has been cancelled. I guess performing a study with a supposed CVintra of 50% for AUC and 80% for Cmax is only interesting from a scientific point of view). As you know I am not a friend of TSDs and this discussion confirmed my antipathy. IMHO, as long as you have a "normal" drug substance you should go with an accepted design. I would use a full replicate design with the possibility to widen the acceptance range (at least for Cmax). The savings in terms of money of a TSD (if any!) do not justify the risks.
Kind regards and thanks again
Dr_Dan

Kind regards and have a nice day
Dr_Dan
nobody
nothing

2015-06-02 11:23
(3222 d 09:16 ago)

@ Dr_Dan
Posting: # 14892
Views: 18,021
 

TSD only an option in exceptional cases

2 golden rules for surviving in big organisations:
  1. Nobody is ever responsible for errors occurring. There is always some expert (not present in the current meeting, hidden somewhere deep, deep inside some flowchart) who can explain the nonsense going on (for decades).

  2. Never question the professional expertise of someone with a longer job title than yours (global, senior, president, master-of-the-universe, whatsoever). This will undermine the hierarchy and therefore lead to death (yours, or of the organization, choose yourself) within minutes.

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-02 14:07
(3222 d 06:33 ago)

@ nobody
Posting: # 14893
Views: 17,882
 

 OT

Hi nobody,

❝ 1. Nobody is ever responsible for errors occurring.


Given your nick: may I conclude that you are the scapegoat in your company – being responsible for all errors occurring? :-D

Helmut Schütz
nobody
nothing

2015-06-02 14:41
(3222 d 05:59 ago)

@ Helmut
Posting: # 14894
Views: 17,880
 

 OT

100% right... :-D

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2015-06-02 02:48
(3222 d 17:52 ago)

@ Dr_Dan
Posting: # 14889
Views: 18,499
 

 Banana splits

Hi Dan,

plots from my garbage collection about asymmetric splits of the adjusted α. GMR 0.95, target power 80%. Stage 1 sample sizes 12–72 and CVs 10–80%, both assessed with a step size of two. 10⁶ sims for the type I error and 10⁵ for power.
  • Left panels: TIE
    Contour lines in steps of 0.0025. Thick lines enclose the area of significant inflation of the TIE (i.e., >0.05036). Yellow circle: location of TIEmax. Gray plane at the nominal (target) α 0.05.
  • Right panels: Power
    Lower surface: power in stage 1; upper surface: power in the final analysis. Contour lines and gray plane at the target power of 80%.

Haybittle/Peto: misused 0.001/0.049 (CIs 99.80/90.20%)
TIEmax 0.0585 (at n1 12, CV 24%)


[image]

Haybittle/Peto: adjusted 0.001/0.0413 (CIs 99.80/91.74%)
TIEmax 0.0499 (at n1 12, CV 22%)


[image]

O’Brien/Fleming: misused 0.005/0.048 (CIs 99.00/90.40%)
TIEmax 0.0570 (at n1 12, CV 24%)


[image]

O’Brien/Fleming: adjusted 0.005/0.0416 (CIs 99.00/91.68%)
TIEmax 0.0501 (at n1 12, CV 24%)


[image]

Zheng et al.: 0.01/0.04 (CIs 98.00/92.00%)
TIEmax 0.0488 (at n1 12, CV 24%)


[image]

Zheng et al.: adjusted 0.01/0.0412 (CIs 98.00/91.76%)
TIEmax 0.0498 (at n1 12, CV 26%)


[image]

With the misused Haybittle/Peto split the TIE may be inflated all over the place; O’Brien/Fleming performs slightly better. Zheng et al. is overly conservative.
Note that these adjustments are applicable to a GMR of 0.95 and a target power of 80% only. More deviation of the GMR from unity and/or a higher target power generally requires more adjustment (i.e., a lower α2 ⇒ a wider CI). I know some people using HP for a GMR of 0.90. :no:
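For readers who want to check a pair of alphas themselves, a sketch of the kind of grid search behind these surfaces (coarse grid and reduced nsims for speed; the surfaces above used a step size of two and 10⁶ simulations; the component name $pBE for the empirical TIE is an assumption based on current Power2Stage versions):

library(Power2Stage)
# Empirical TIE (simulated at theta0 = 1.25) of the misused Haybittle/Peto
# split over a small n1 x CV grid; the maximum marks the location of TIEmax.
grid <- expand.grid(n1 = seq(12, 36, 12), CV = seq(0.15, 0.35, 0.05))
grid$TIE <- mapply(function(n1, CV)
              power.2stage(method="B", alpha=c(0.001, 0.049), n1=n1, GMR=0.95,
                           CV=CV, targetpower=0.8, theta0=1.25, nsims=1e5,
                           print=FALSE, details=FALSE)$pBE,
            grid$n1, grid$CV)
grid[which.max(grid$TIE), ]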

Helmut Schütz