Hello everyone,

I would like to perform a sample size calculation for a bioequivalence study which would aim at assessing drug effect and food effect at the same time (in order to bridge the clinical formulation with the market formulation) with a 3-treatment Williams' design (3x6x3).

The study would include 3 treatments:
A = the clinical formulation (reference) under fasting status,
B = the market formulation (test 1) under fasting status,
C = the market formulation (test 2) under fed status

The goal of the study is to assess the FE on the market formulation in addition to BE between the market and clinical formulations, so that we would test C vs B and A vs B.

Primary PK parameters for the study would be AUC0-inf, AUC0-t, and Cmax, for which I have intra-subject CV estimates from a previous single-sequence study where subjects took the clinical formulation as a single dose in the first period, and then took repeated doses of the clinical formulation in the second period.

My first question would be: do I need to power the study to get average bioequivalence on each of the PK parameters, or is just one sufficient?

The FDA guidance "Food-Effect Bioavailability and Fed Bioequivalence Studies" notes that:

"For an NDA, an absence of food effect on BA is not established if the 90 percent CI for the ratio of population geometric means between fed and fasted treatments, based on log-transformed data, is not contained in the equivalence limits of 80-125 percent for either AUC0-inf (AUC0-t when appropriate) or Cmax".

If I had to compute power to get bioequivalence for all the PK parameters, how should I proceed? If possible using the package PowerTOST.

Then, since I am interested in the comparisons C vs B and A vs B, how should I proceed to get power estimates, likewise using the package PowerTOST?

If I remember correctly the EMA advises to analyse the comparisons one at a time, so should I perform a multiplicity adjustment?
Since only the marketed formulation would then be prescribed to the patients I would not use multiplicity correction, but since I plan to interpret both comparisons independently I'm not sure...

Many thanks for your input!

Hi Olivbood,

» […] bioequivalence study which would aim at assessing drug effect and food effect at the same time (in order to bridge the clinical formulation with the market formulation) with a 3-treatment Williams' design (3x6x3).
»
» A = the clinical formulation (reference) under fasting status,
» B = the market formulation (test 1) under fasting status,
» C = the market formulation (test 2) under fed status
»
» The goal of the study is to assess the FE on the market formulation in addition to BE between the market and clinical formulations, so that we would test C vs B and A vs B.
»
» Primary PK parameters for the study would be AUC0-inf, AUC0-t, and Cmax, for which I have intra-subject CV estimates from a previous single-sequence study where subjects took the clinical formulation as a single dose in the first period, and then took repeated doses of the clinical formulation in the second period.

Some obstacles. That was a paired design. Not unusual, but you have to assume no period effect. This comparison only makes sense for AUC_{0–τ} (steady state) vs. AUC_{0–∞} (SD), i.e., if you were within some prespecified limits you demonstrated linear PK. Since CV_{intra} of AUC_{0–∞} generally is larger than the one of AUC_{0–t}, no worries. On the other hand, what you got from this study is pooled from AUC_{0–∞} and AUC_{0–τ}, where quite often the variability in steady state is lower than the one after a SD. Hence, allow for a safety margin.

The most important CV is the one of C_{max}. Here I would allow for an even larger safety margin (in steady state its variability might be substantially lower than after a SD). In other words, the lower CV in steady state dampens the pooled CV and the one you will observe after a SD likely will be higher.

In the crossover you will have one degree of freedom less than in the paired design (the CI will be wider). Given, peanuts.

Now it gets nasty. In many cases the variability in fed state is (much) higher than in fasted state.
I have seen too many studies of generics (sorry) where the fasting study passed (“We perform it first because that’s standard.”) – only to face a failed one in fed state. Oops! Hence, I always recommend to perform the fed study first. For you that’s tricky since you only have data of the fasted state. What about a pilot or a two-stage design? I prefer the latter because with the former you throw away information. In a TSD you can stop the arms which are already BE in the first stage (likely the fasting part) and continue the others to the second stage.

» My first question would be: do I need to power the study to get average bioequivalence on each of the PK parameters, or is just one sufficient?
»
» The FDA guidance "Food-Effect Bioavailability and Fed Bioequivalence Studies" notes that:
»
» "For an NDA, an absence of food effect on BA is not established if the 90 percent CI for the ratio of population geometric means between fed and fasted treatments, based on log-transformed data, is not contained in the equivalence limits of 80-125 percent for either AUC0-inf (AUC0-t when appropriate) or Cmax".

If you are not within the limits (C vs. B) the food effect goes into the SmPC (EMA) or the label (FDA).

» If I had to compute power to get bioequivalence for all the PK parameters, how should I proceed? If possible using the package PowerTOST.

You only have to observe the one with the highest variability / largest deviation from unity. Yep, doable in PowerTOST.

» Then, since I am interested in the comparisons C vs B and A vs B, how should I proceed to get power estimates, likewise using the package PowerTOST?

Although you plan for a 3×6×3 Williams’ design, the two parts will be evaluated as incomplete block designs (IBD), having the same degrees of freedom as the conventional 2×2×2 crossover. Hence, in sampleN.TOST() use the argument design="2x2x2" and not design="3x6x3". You will get a small reward:
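
A minimal sketch of such a comparison, assuming a CV of 0.30 and a target power of 90% (the values used further down in this thread, not necessarily the ones meant here) and the package default theta0 of 0.95. The rounding step reflects that the study is still dosed in six sequences.

```r
library(PowerTOST)

# Assumed planning values: CV 0.30, 90% target power, theta0 at its default (0.95)
CV     <- 0.30
target <- 0.90

# Sample size if each comparison is evaluated like a 2x2x2 crossover
n.2x2x2 <- sampleN.TOST(CV = CV, design = "2x2x2", targetpower = target,
                        print = FALSE)[["Sample size"]]

# Sample size if the full 3x6x3 Williams' design is specified
n.3x6x3 <- sampleN.TOST(CV = CV, design = "3x6x3", targetpower = target,
                        print = FALSE)[["Sample size"]]

# Subjects are still randomized to six sequences: round up to a multiple of 6
n.dosed <- 6 * ceiling(n.2x2x2 / 6)

print(c("2x2x2" = n.2x2x2, "3x6x3" = n.3x6x3, "dosed" = n.dosed))
```

Whether any reward is left after rounding up to a multiple of 6 depends on the assumed CV and power.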
Yep, that's fine. 

Hi Helmut,

Thanks for your instructive answer!

» The most important CV is the one of C_{max}. Here I would allow for an even larger safety margin (in steady state its variability might be substantially lower than after a SD). In other words, the lower CV in steady state dampens the pooled CV and the one you will observe after a SD likely will be higher.
» In the crossover you will have one degree of freedom less than in the paired design (the CI will be wider). Given, peanuts.

I see. Would you know of any guidance or "good practice" on how to determine this safety margin?

» Now it gets nasty. In many cases the variability in fed state is (much) higher than in fasted state. I have seen too many studies of generics (sorry) where the fasting study passed (“We perform it first because that’s standard.”) – only to face a failed one in fed state. Oops! Hence, I always recommend to perform the fed study first. For you that’s tricky since you only have data of the fasted state. What about a pilot or a two-stage design? I prefer the latter because with the former you throw away information. In a TSD you can stop the arms which are already BE in the first stage (likely the fasting part) and continue the others to the second stage.

That's interesting! As far as I know no BE study is planned in the fed state (since the clinical formulation is only administered in the fasting state), but I'll pass on the note.

» (what does “when appropriate” mean?).

No idea either :)

» You only have to observe the one with the highest variability / largest deviation from unity. Yep, doable in PowerTOST.

I see, so I would only need to power the study for (supposedly) Cmax, and not use, for instance, the function power.2TOST to calculate the power of two TOST procedures for Cmax and AUC (either 0-inf or 0-t)?

» » Then, since I am interested in the comparisons C vs B and A vs B, how should I proceed to get power estimates, likewise using the package PowerTOST?
» » … so should I perform a multiplicity adjustment? Since only the marketed formulation would then be prescribed to the patients I would not use multiplicity correction, but since I plan to interpret both comparisons independently I'm not sure...
»
» Yep, that’s fine.

So, no need to perform a multiplicity adjustment, is that right? I tried to find some publications on the issue, but since it is neither a joint decision rule nor a multiple decision rule situation, so far no luck...

Thanks again for your help!

Hi Olivbood,

» Thanks for your instructive answer!

My pleasure.

» » The most important CV is the one of C_{max}. Here I would allow for an even larger safety margin […].
»
» I see. Would you know of any guidance or "good practice" on how to determine this safety margin?

No idea. If you have variability/ies you could use an upper confidence limit of the CV (that’s what I do). The higher the sample size of the previous study/ies, the lower the uncertainty of the CV and hence, the tighter the CI of the CV. Various approaches in PowerTOST:
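
A minimal sketch of the confidence-limit approaches, under loudly hypothetical inputs: an observed CV of 0.25 from a paired design with 12 subjects (df = n − 1), 80% target power and theta0 0.95. None of these numbers come from the thread; only CVCL() and sampleN.TOST() are the package's own functions.

```r
library(PowerTOST)

# Hypothetical previous study: CV 0.25, paired design with 12 subjects
CV.obs <- 0.25
df     <- 12 - 1

# One-sided upper confidence limits of the CV: 95% (default alpha), 80%, 75%
alpha <- c(0.05, 0.20, 0.25)
CV.up <- sapply(alpha, function(a)
  CVCL(CV = CV.obs, df = df, side = "upper", alpha = a)[["upper CL"]])

# Sample sizes based on the observed CV and on each upper limit
CVs <- c(CV.obs, CV.up)
n   <- sapply(CVs, function(cv)
  sampleN.TOST(CV = cv, theta0 = 0.95, targetpower = 0.80,
               design = "2x2x2", print = FALSE)[["Sample size"]])

res <- data.frame(approach = c("observed CV", "95% CL", "80% CL", "75% CL"),
                  CV = signif(CVs, 4), n = n)
print(res, row.names = FALSE)
```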
The default alpha of function CVCL() calculates a one-sided 95% CI. Pretty conservative (2^{nd} row). In the spirit of a producer’s risk of 20% I try to convince my clients to use an 80% CI (3^{rd} row). Gould* recommends a 75% CI (4^{th} row). You can walk the Bayesian route as well (5^{th} row). Up to you.

» » Now it gets nasty. In many cases the variability in fed state is (much) higher than in fasted state. […].
»
» That's interesting! As far as I know no BE study is planned in the fed state (since the clinical formulation is only administered in the fasting state), but I'll pass on the note.

Well… If it is already known that you will write sumfink like “X has to be administered on an empty stomach” in the SmPC/label, why bother about trying to demonstrate lacking food effects? If you find one, write in the SmPC/label “Food de/increases the absorption of X by y%”. If you want – only for marketing purposes (better compliance) – to write “Food does not influence the absorption of X. However, administration on an empty stomach is recommended.” you have to power the study to be within the limits in order to support such a claim.

Unfortunately your story does not end here. Actually that’s only the start. Power (and therefore, the sample size) is much more dependent on the PE than on the CV. Try this:
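
A minimal sketch of that sensitivity, assuming a hypothetical grid of CVs and T/R-ratios (theta0) and 80% target power. Rows are CVs, columns are assumed ratios.

```r
library(PowerTOST)

# Hypothetical grid of assumed CVs and T/R-ratios
CVs    <- c(0.20, 0.25, 0.30)
theta0 <- c(0.95, 0.90, 0.85)

# Sample size for every combination (inner loop: CV, outer loop: theta0)
n <- sapply(theta0, function(t0)
  sapply(CVs, function(cv)
    sampleN.TOST(CV = cv, theta0 = t0, targetpower = 0.80,
                 design = "2x2x2", print = FALSE)[["Sample size"]]))
dimnames(n) <- list(CV = CVs, theta0 = theta0)
print(n)
```

Reading the table along a row (worse ratio at the same CV) and down a column (higher CV at the same ratio) shows which of the two assumptions drives the sample size more.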
You have no data about fed/fasting so far. What now? With the Bayesian approach you could take the uncertainty of the PE into account as well:
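
A minimal sketch of the expected-power approach with expsampleN.TOST(), assuming hypothetical prior information (CV 0.25 and a ratio of 0.95 from a paired study in 24 subjects). With no fed/fasting data at all, such a prior would itself be an assumption. The last call can take a while to run.

```r
library(PowerTOST)

# Hypothetical prior information: CV 0.25 from a paired design with 24 subjects
CV.obs <- 0.25
prior  <- list(m = 24, design = "paired")

# Conventional planning: CV and theta0 treated as known
n.fix <- sampleN.TOST(CV = CV.obs, theta0 = 0.95, targetpower = 0.80,
                      design = "2x2", print = FALSE)[["Sample size"]]

# Expected power: uncertainty of the CV only
n.cv <- expsampleN.TOST(CV = CV.obs, theta0 = 0.95, targetpower = 0.80,
                        design = "2x2", prior.type = "CV",
                        prior.parms = prior, print = FALSE)[["Sample size"]]

# Expected power: uncertainty of both the CV and the point estimate
n.both <- expsampleN.TOST(CV = CV.obs, theta0 = 0.95, targetpower = 0.80,
                          design = "2x2", prior.type = "both",
                          prior.parms = prior, print = FALSE)[["Sample size"]]

print(c(fixed = n.fix, CV = n.cv, both = n.both))
```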

» » You only have to observe the one with the highest variability / largest deviation from unity. Yep, doable in PowerTOST.
»
» I see, so I would only need to power the study for (supposedly) Cmax, and not use, for instance, the function power.2TOST to calculate the power of two TOST procedures for Cmax and AUC (either 0-inf or 0-t)?

No. You have to show BE for all PK metrics with α 0.05 each, and the Intersection-Union Test (IUT) maintains the type I error. Just estimate the sample size for the nastiest PK metric. I feel a little bit guilty since I was such a nuisance to Detlew and Benjamin. They worked hard to implement power.2TOST(). The idea sounds great, but the correlation of, say, AUC and C_{max} is unknown in practice. See here and there. For a while we were even considering removing it from PowerTOST…

» So, no need to perform a multiplicity adjustment, is that right?

Yes.

Thanks again for your answers Helmut, that's really helpful!

Unfortunately I am not sure I completely understand the following:

» Although you plan for a 3×6×3 Williams’ design, the two parts will be evaluated as incomplete block designs (IBD), having the same degrees of freedom as the conventional 2×2×2 crossover. Hence, in sampleN.TOST() use the argument design="2x2x2" and not design="3x6x3". You will get a small reward:

If my understanding is correct,
sampleN.TOST(CV=0.3, design="3x6x3", targetpower=0.9, print=FALSE)[["Sample size"]]
would give me the sample size necessary for a power of 90% to detect BE for at least one given comparison (either A vs B or C vs B) if I evaluated all data at the same time, while
sampleN.TOST(CV=0.3, design="2x2x2", targetpower=0.9, print=FALSE)[["Sample size"]]
would give me the sample size for a given comparison if I evaluated the data two at a time (keeping in mind that I should potentially increase the result so that the sample size is a multiple of 6).

However, what should I do if I want the sample size for a power of 90% to detect BE for both comparisons? If I assume that there is no correlation between both comparisons, and let
n = sampleN.TOST(CV=0.3, design="2x2x2", targetpower=0.9, print=FALSE)[["Sample size"]]
then
power.TOST(CV=0.3, design="2x2x2", n=n) * power.TOST(CV=0.3, design="2x2x2", n=n)
would evaluate to less than 0.9.

Is it correct to calculate the sample size based on only one comparison?

Furthermore, while the two parts of the trial will be evaluated as incomplete block designs, it seems to me that the original sequences and periods are preserved (e.g. an observation from period 3 is still coded as period 3), so that the degrees of freedom would not be the same as for the conventional 2x2x2 crossover, no?

Lastly, when you said no multiplicity adjustment procedure was needed, was it implied that I should specify that BE between the clinical and marketed formulation will be tested first, before the food effect (i.e. hierarchical testing)? It is just to be in accordance with EMA and FDA.

Hi Olivbood,

» If my understanding is correct,
» sampleN.TOST(design="3x6x3", ...)[["Sample size"]] would give me the sample size necessary for a power of 90% to detect BE for at least one given comparison (either A vs B or C vs B) if I evaluated all data at the same time, while sampleN.TOST(design="2x2x2", ...)[["Sample size"]] would give me the sample size for a given comparison if I evaluated the data two at a time…

Correct.

» … (keeping in mind that I should potentially increase the result so that the sample size is a multiple of 6).

Oops! So no incentive in terms of sample size in this example.

» However, what should I do if I want the sample size for a power of 90% to detect BE for both comparisons? If I assume that there is no correlation between both comparisons, and let n = sampleN.TOST(CV=0.3, design="2x2x2", targetpower=0.9, print=FALSE)[["Sample size"]], then
» power.TOST(CV=0.3, design="2x2x2", n=n) * power.TOST(CV=0.3, design="2x2x2", n=n) would evaluate to less than 0.9.

Right, if the comparisons are uncorrelated: with no correlation, overall power = power_{1} × power_{2}. If correlation = 1, then overall power ~ power of the part driving the sample size (higher CV and/or worse PE). I suggest to use function sampleN.2TOST in this case…

library(PowerTOST) …

which gives

fun CV.1 CV.2 theta0.1 theta0.2 rho n1 n2 n pwr.1 pwr.2 pwr

Maybe the ratios and CVs are not identical. Try:

CV.1 <- 0.20 # fasting might be lower

Then you end up with this:

fun CV.1 CV.2 theta0.1 theta0.2 rho n1 n2 n pwr.1 pwr.2 pwr

» Is it correct to calculate the sample size based on only one comparison?

Depends on what you want. If you want to show similarity of both (BE and lacking food effects), no. If not, base the sample size on the BE part.

» Furthermore, while the two parts of the trial will be evaluated as incomplete block designs, it seems to me that the original sequences and periods are preserved (e.g.
an observation from period 3 is still coded as period 3), so that the degrees of freedom would not be the same as for the conventional 2x2x2 crossover, no?

Correct. Has some strange side effects (see there).

» Lastly, when you said no multiplicity adjustment procedure was needed, was it implied that I should specify that BE between the clinical and marketed formulation will be tested first, before the food effect (i.e. hierarchical testing)?

No. These tests aim at completely different targets. Hence, the order is not relevant (though in practice, if the BE part fails the fed/fasting part is history).

» It is just to be in accordance with EMA and FDA.

Not sure what you mean here.

Cheers,
Helmut Schütz
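
A minimal sketch of the overall-power question discussed above, assuming CV 0.30 and theta0 0.95 for both comparisons and a purely hypothetical correlation of 0.5. The product only applies if the two comparisons are independent; sampleN.2TOST() is simulation-based and may take a moment.

```r
library(PowerTOST)

# Assumed planning values for the two comparisons (A vs B, C vs B)
CV     <- c(0.30, 0.30)
theta0 <- c(0.95, 0.95)

# Sample size and power for a single comparison at 90% target power
n   <- sampleN.TOST(CV = CV[1], theta0 = theta0[1], targetpower = 0.90,
                    design = "2x2", print = FALSE)[["Sample size"]]
pwr <- c(power.TOST(CV = CV[1], theta0 = theta0[1], n = n, design = "2x2"),
         power.TOST(CV = CV[2], theta0 = theta0[2], n = n, design = "2x2"))

# Independent comparisons: both pass with probability pwr1 * pwr2
pwr.indep <- prod(pwr)

# Sample size for passing both TOSTs at an assumed correlation (hypothetical rho)
res    <- sampleN.2TOST(CV = CV, theta0 = theta0, rho = 0.5,
                        targetpower = 0.90, design = "2x2", print = FALSE)
n.both <- res[["Sample size"]]

print(c(n.single = n, pwr.indep = round(pwr.indep, 4), n.both = n.both))
```

Lowering one of the CVs (e.g. for the fasting comparison) shows how the part with the higher CV drives the result.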

To repeat the recommendation in this post: The degrees of freedom are different for the 2x2 design and the design of the TaaTP (Two-at-a-Time Principle). We can mimic the df's, at least approximately, if we use the robust df's.
But it seldom makes a difference worth thinkin' about:

library(PowerTOST)

2 subjects more, but only due to the balanced design, i.e., the sample size has to be a multiple of 6. If we go for an unbalanced design we also have a power of 90% with 52 subjects:

power.TOST(CV=0.3, design="3x6x3", robust=TRUE, n=52)
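
A minimal sketch of that comparison, assuming the CV of 0.30 and 90% target power used earlier in the thread and the default theta0 of 0.95:

```r
library(PowerTOST)

# Conventional 2x2x2 evaluation
n.2x2 <- sampleN.TOST(CV = 0.3, design = "2x2x2", targetpower = 0.9,
                      print = FALSE)[["Sample size"]]

# 3x6x3 with robust degrees of freedom, mimicking the two-at-a-time evaluation
n.rob <- sampleN.TOST(CV = 0.3, design = "3x6x3", targetpower = 0.9,
                      robust = TRUE, print = FALSE)[["Sample size"]]

# Power of the unbalanced case; PowerTOST reports the imbalance as a message
pwr.52 <- suppressMessages(
  power.TOST(CV = 0.3, design = "3x6x3", robust = TRUE, n = 52))

print(c(n.2x2 = n.2x2, n.robust = n.rob, power.52 = round(pwr.52, 4)))
```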

Many thanks to both of you!

I come back yet again with another question. Since the number of treatments is relatively high, the number of blood samples drawn from each subject would be quite high and the study would be time-consuming (thus subjects would be more likely to drop out).

In such a case, a balanced incomplete block design with 2 periods and 6 sequences (where each subject would take only 2 treatments) is appealing.
However, I guess going with such a design has its own disadvantages, such as an increase in sample size, and imbalance between the sequences would be more problematic.

Do you know if such a design is practicable and likely to be accepted by the regulatory agencies? I did some research but did not find anything convincing.

Also, how should I proceed to power such a study, if possible using the package PowerTOST? Would I be bound to perform simulations and base my sample size on expected power?

Thanks!

Hi Olivbood,

» Since the number of treatments is relatively high, the number of blood samples drawn from each subject would be quite high and the study would be time-consuming (thus subjects would be more likely to drop out).

Yes and no. The loss in power due to dropouts is overrated by many. Try the function pa.ABE() in PowerTOST. Unless the drug is nasty (i.e., drug-related AEs leading to withdrawals) it should not hurt.
Concerning the blood volume: I just performed a five (!) period Xover with 19 samples/period. Total blood volume (including pre- / post-study clin. chemistry) was 440 mL. In short: Develop a better bioanalytical method.

» In such a case, a balanced incomplete block design with 2 periods and 6 sequences (where each subject would take only 2 treatments) is appealing.
» However, I guess going with such a design has its own disadvantages, such as an increase in sample size, and imbalance between the sequences would be more problematic.

Maybe. Though BIBDs are mentioned in most textbooks of BE, I didn't see a single one in my entire career. Phase III is another story, of course.

» Do you know if such a design is practicable and likely to be accepted by the regulatory agencies?

(1) Dunno and (2) why not? However, expect questions since assessors likely are not familiar with their application in BE.

» Also, how should I proceed to power such a study, if possible using the package PowerTOST? Would I be bound to perform simulations and base my sample size on expected power?

Good question, next question.
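
A minimal sketch of the dropout point made above, assuming CV 0.30, theta0 0.95, 90% target power and a hypothetical loss of up to six subjects. The loop uses power.TOST() directly; pa.ABE() runs the package's own power analysis over CV, theta0 and sample size.

```r
library(PowerTOST)

# Assumed planning values (not from a real study)
CV <- 0.30; theta0 <- 0.95; target <- 0.90

n.plan <- sampleN.TOST(CV = CV, theta0 = theta0, targetpower = target,
                       design = "2x2", print = FALSE)[["Sample size"]]

# Power if 0 to 6 subjects drop out (unbalanced cases are reported as messages)
dropouts <- 0:6
pwr <- sapply(n.plan - dropouts, function(n)
  suppressMessages(power.TOST(CV = CV, theta0 = theta0, n = n, design = "2x2")))
print(data.frame(dropouts, eligible = n.plan - dropouts, power = round(pwr, 4)),
      row.names = FALSE)

# Package power analysis; minpower is the lowest power considered acceptable
pa <- pa.ABE(CV = CV, theta0 = theta0, targetpower = target,
             minpower = 0.80, design = "2x2")
print(pa, plotit = FALSE)
```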