## Evaluation ☑️ [Surveys]

Dear all,

Here are the results of 101 respondents as of 23 October.
THX to all participants (the survey is closed).¹

Below, the percentage of complete answers is given for each question in decreasing order. Since multiple choices are possible in some questions, the percentages can sum to more than 100%. Undisclosed information is excluded.

Below each question are my very personal opinions on the outcome.
1. How do you estimate the sample size?
○ Software: 64%
○ Both: 34%
○ Sample size tables: 2%
• If one uses solely sample size tables, how does one deal with combinations that are not available (T/R-ratio, CV, power)? Power is highly nonlinear and therefore, interpolation is difficult.

2. Which software do you use?
☐ Open source (e.g., R-packages like PowerTOST, bear, …): 53%
☐ Free (e.g., FARTSSIE, EFG, …): 44%
☐ Commercial (off-the-shelf, e.g., SAS Proc Plan, NQuery Advisor, PASS, StudySize, …): 31%
☐ In-house (e.g., own SAS-macros, R, C, Excel-template, …): 14%
☐ Web-based: 3%
☐ Optional: Please give the software you use most (incl. version, year of release)
PowerTOST (5), FARTSSIE (3), SAS (3), PASS (2), Julia: ClinicalTrialUtilities (1), NQuery (1), SPSS (1), Statistics101 Resampling Simulator (1), WinNonlin (1)
• Open source and free software² is in the lead. Only because it comes at no cost? No commercial software provides sample sizes for the reference-scaling methods out of the box (though it is possible to write code in SAS or MATLAB). One respondent simulated higher-order designs and incomplete block designs (>4 treatments), which are not available in any software.
Phoenix WinNonlin provides only post hoc power. Hence, how is the sample size estimated?
It would be interesting to know which web-based method is used. I know only of ones for parallel groups based on the large sample approximation.

3. How often do you update the software you use most?
○ Occasionally: 41%
○ Regularly: 36%
○ Never: 20%
• It is not a good idea to never update the software (20%). Not only might bugs have been corrected in the meantime, but new methods might have been added as well (note that in Q10 20% reported that a planned design is not available).
Given the paranoia of IT departments, updating may be a cumbersome task in some companies. However, try not to be more than two releases behind the current one. The worst I have ever seen was a fifteen-year-old one…

4. Is the software you use most validated?
☐ Cross-validated with other software: 30%
☐ Comparison with sample size tables: 29%
☐ IQ (Installation Qualification acc. to procedures provided by the vendor): 23%
☐ No: 23%
☐ PQ (Performance Qualification): 16%
☐ OQ Type 1 (Operational Qualification acc. to procedures provided by the vendor): 14%
☐ Partly (i.e., only some of the procedures): 10%
☐ OQ Type 2 (Operational Qualification acc. to own pre-specified procedures): 6%
5%
• Interesting that almost ¼ of respondents reported that the software is not validated. That is fine only if you are a regulator yourself. If you are a sponsor, I would require that the CRO’s software is validated. If you are with a CRO, I recommend doing something (at least compare with sample size tables). Some respondents answered that the software passed IQ and OQ Type 1.
I would never trust validation routines provided by the vendor alone (a.k.a. rubbish in, rubbish out).

5. Were you ever asked by a regulatory agency about software validation?
☐ No: 82%
☐ Yes: 15%
Four answers: 2019 (2), 2018, 2017
• Seemingly regulators don’t care much. IMHO, regulators and members of IECs should (in assessing the protocol). According to ICH E9 “The number of subjects in a clinical trial should always be large enough to provide a reliable answer to the questions addressed.” Seemingly some – flawed – software routines give a higher than required sample size. That’s nice for the applicant but ethically questionable.

6. Do you repeat the estimation in-house if provided by an external entity (CRO, sponsor, consultant)?
☐ Always: 55%
☐ Regularly: 21%
☐ Sometimes: 17%
☐ Never: 7%
• Trust is good, control is better. (Russian proverb)
I suggest always repeating the estimation. It takes a minute and prevents surprises later.

7. Do you perform a Sensitivity Analysis in order to assess the impact on power if in the study values (e.g., T/R-ratio, CV, number of dropouts) will deviate from assumptions?
○ Always: 37%
○ Sometimes: 34%
○ Never: 14%
○ I don’t know what a Sensitivity Analysis is: 11%
• A sensitivity analysis is recommended by ICH E9 (Section 3.5) and E9(R1). The 14% who reported never performing one are possibly believers in the “carved in stone” approach (i.e., that the assumed values are the true ones and will be exactly realized in the study). That’s extremely risky, esp. if the T/R-ratio turns out to be worse than assumed. The impact on power is massive. Hence, I suggest always performing a sensitivity analysis (as 37% already do).
If you use PowerTOST, I recommend its functions `pa.ABE()` and `pa.scABEL()` as a start. Examples are given in the vignette.
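
To illustrate how strongly power reacts to the T/R-ratio, here is a minimal sketch in Python. It is not what PowerTOST does: it assumes a balanced 2×2 crossover, the conventional limits 0.80–1.25, and the large sample (normal) approximation, which tends to be optimistic for small sample sizes. The function name `approx_power_tost` and the example values (CV 25%, n = 28) are my own assumptions for illustration only.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power_tost(cv, theta0, n, alpha=0.05, theta1=0.80, theta2=1.25):
    """Approximate TOST power for a balanced 2x2 crossover with n subjects in total.

    Large sample (normal) approximation only; a rough sketch, not an exact method.
    """
    z = NormalDist()
    sigma_w2 = log(cv ** 2 + 1.0)      # within-subject variance on the log scale
    se = sqrt(2.0 * sigma_w2 / n)      # standard error of the log-difference
    z_alpha = z.inv_cdf(1.0 - alpha)   # one-sided critical value
    upper = z.cdf((log(theta2) - log(theta0)) / se - z_alpha)
    lower = z.cdf((log(theta1) - log(theta0)) / se + z_alpha)
    return max(0.0, upper - lower)


# Sensitivity to the T/R-ratio: same CV and sample size, worse ratios.
for theta0 in (0.95, 0.925, 0.90):
    print(f"T/R = {theta0:.3f}: approx. power = {approx_power_tost(0.25, theta0, 28):.3f}")
```

The same loop can be run over the CV or over a reduced n (dropouts) to see which assumption hurts most. For anything that goes into a protocol, use exact methods in validated software.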

8. Do you increase the estimated sample size according to the expected dropout rate?
○ Yes (chosen by management): 37%
○ Yes (formula: n’ = n × (100 + dropout-rate in %) / 100): 29%
○ Yes (formula: n’ = n / (100 – dropout-rate in %) × 100): 22%
○ Yes (as provided by the software; I don’t know the formula): 9%
○ No (since the impact on power is limited): 3%
• Bad that the management increases the sample size (based on gut feeling, reading tea leaves, budget, or what?). IMHO, FARTSSIE’s is not a good idea. To quote Stephen Senn:
Power. That which statisticians are always calculating but never have.
Regrettably 29% fell into the trap and used the wrong formula to adjust the sample size n’ (if one faces exactly the anticipated dropout-rate and all other assumptions are correct, the number of eligible subjects can be too low). The dropout-rate is based on dosed subjects and hence, 22% used the correct formula (see also
For the 9% who don’t know what the software does: At least in PASS the correct formula is implemented.
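
The difference between the two formulas is easy to check with a few lines of Python. This is only a sketch: the function names and the example (n = 24, 15% anticipated dropouts) are made up for illustration, and rounding up to a multiple of the number of sequences is left out.

```python
import math


def adjust_additive(n, dropout_pct):
    """Formula chosen by 29%: n' = n * (100 + dropout-rate in %) / 100."""
    return math.ceil(n * (100 + dropout_pct) / 100)


def adjust_dosed(n, dropout_pct):
    """Formula chosen by 22%: n' = n / (100 - dropout-rate in %) * 100."""
    return math.ceil(n / (100 - dropout_pct) * 100)


def eligible(n_dosed, dropout_pct):
    """Subjects left if exactly the anticipated dropout-rate is realized."""
    return math.floor(n_dosed * (100 - dropout_pct) / 100)


n, dropout = 24, 15  # hypothetical estimated sample size and dropout-rate in %
for name, adjust in (("additive", adjust_additive), ("dosed", adjust_dosed)):
    n_dosed = adjust(n, dropout)
    print(f"{name}: dose {n_dosed}, expect {eligible(n_dosed, dropout)} eligible (need {n})")
```

With the first formula the number of eligible subjects ends up below the estimated sample size although the dropout-rate is exactly as anticipated. With the second one it does not.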

9. Please give general problems that you faced in sample size estimation.
☐ Estimated sample size was substantially smaller/larger than expected (compared to PARs / other studies): 41%
☐ Result of re-assessment differed from the estimate given (by CRO, sponsor, consultant): 37%
☐ Software, version, setup not given (by CRO, sponsor, consultant): 28%
☐ Other (please give a short description): 10%
• Differences from PARs are not uncommon, and neither are differences from other studies (if not your own). The clinical setup might differ, as might the bioanalytical method. Another good reason to perform a sensitivity analysis. Perhaps your own assumptions were too optimistic? Sometimes your own estimation gives a much larger sample size than the study of the PAR had. It might well be that it passed by sheer luck. Always check the power of the study.
Amazing that the sample size estimation differed from the external one in so many cases. If information is not given, one is fishing in the dark (which one is correct?). Ask for it.

10. Did you face problems with the software you use most?
☐ No: 56%
☐ Planned design not available: 20%
☐ User manual insufficient (too short/verbose, methods not/poorly documented, lacking/outdated references, …): 16%
☐ Only one design-variant provided (although alternatives exist): 9%
☐ Methods based on simulations not reproducible (e.g., for reference-scaling): 9%
☐ Operation is complicated: 8%
4%
• Soothing that the majority is happy with the software.
I’m asking myself: What have the 20% done (since, e.g., the partial replicate is not available in PASS)? Trusted the external estimate? If it’s commercial software, I suggest asking for an implementation (and crossing fingers). The same holds for authors of free and open source software. In my experience they are more responsive than vendors of commercial software.
IMHO, user manuals are a weak point of any software.
PowerTOST’s simulation methods are reproducible since a fixed seed is issued by default (don’t change to `setseed = FALSE`). If you use your own code, make sure to use a fixed seed as well.
I came across debates between sponsors and CROs about designs because, e.g., PASS provides only three setups for replicate designs: ABBA|BAAB, AABB|BBAA|ABBA, and ABB|BAA. The second one should be avoided (FDA 2001, Appendix B 1). Since the design constants and degrees of freedom are identical, the first one covers all 4-period 2-sequence replicates (ABAB|BABA, ABBA|BAAB, and AABB|BBAA) and the third one both 3-period 2-sequence replicates (ABA|BAB and ABB|BAA).
To be clear: Simulation-based methods for reference-scaling (HVD(P)s and NTIDs) are currently not implemented in any of the commercial packages, FARTSSIE (since v2.5 the code for PowerTOST is given), EFG, or the R-package bear.
Beauty is in the eye of the beholder. It’s a matter of taste whether clicking through menus or providing arguments in the R-console is considered complicated.
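
On fixed seeds in your own simulation code: a minimal sketch in Python of what “reproducible” means here. The simulated quantity is a toy (the fraction of standard normal draws below 1), not a power calculation, and the function name and seed value are arbitrary assumptions.

```python
import random


def simulated_fraction(n_sims, seed=None):
    """Toy Monte Carlo estimate; a fixed seed makes the result repeatable."""
    rng = random.Random(seed)  # own generator, independent of the global state
    hits = sum(1 for _ in range(n_sims) if rng.gauss(0.0, 1.0) < 1.0)
    return hits / n_sims


# Same seed, same result: anyone can reproduce the reported number.
first = simulated_fraction(10_000, seed=123456)
second = simulated_fraction(10_000, seed=123456)
print(first == second)

# Without a seed the estimate generally changes from run to run.
print(simulated_fraction(10_000), simulated_fraction(10_000))
```

Report the seed together with the software and its version, otherwise nobody can re-run the estimation.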

1. Since the survey is not public (I’ve sent out invitations by e-mail): Some participants answered only the first question and skipped all the others. That’s not helpful.
2. Not like in “free beer” but like in “free speech”.

The IP is not recorded, only the country:
India (27), Russia (12), Germany (10), Spain (8), Czechia (7), Jordan (5), USA (5), Austria (2), China (2), Mexico (2), The Netherlands (2), Poland (2), Ukraine (2), Australia (1), Belarus (1), Brazil (1), Denmark (1), Egypt (1), France (1), Greece (1), Italy (1), Portugal (1), Slovenia (1), South Africa (1), Taiwan (1), Tanzania (1), Turkey (1), UK (1), Uruguay (1).

Dif-tor heh smusma 🖖
Helmut Schütz
