Helmut
★★★

Vienna, Austria,
2018-07-12 17:47

Posting: # 19039
Views: 2,666

## Validation of PhEq_bootstrap [Software]

Dear all,

I was asked whether it is possible to validate PhEq_bootstrap (see this post). Sure. Since the source-code is available and if one is familiar with Object Pascal even a “white-box” validation is doable.
I had a quick look based on the example data sets given by Shah et al.* (which is referred in the software’s documentation).

I got for the 90% CI (PhEq_bootstrap v1.2, 2014-06-22, 64bit Windows):

        500 Bootstraps              1,000 Bootstraps # Shah et al.   PhEq_bootstrap  Shah et al.   PhEq_bootstrap 1 52.79, 68.15   53.10, 67.65   53.01, 68.34   52.82, 67.85 2 48.33, 53.68   48.01, 53.50   48.25, 53.69   48.16, 53.64 3 48.56, 54.10   48.40, 54.49   48.54, 54.56   48.12, 54.14 4 48.39, 51.55   48.23, 51.45   48.38, 51.59   48.31, 51.51 5 46.11, 50.09   45.90, 49.69   46.05, 50.04   46.00, 50.00

Good news. With one exception (test batch 1, 500 bootstraps) the lower CL is more conservative than the ones reported by Vinod. With 1,000 bootstraps results are always slightly more conservative.

Now for the bad news. There is one thing stupid in the software when it comes to validation (line 535 of the source Conv_to_learn1.pas):

RandSeed:=random(random(1000000))+random(random(1000000));

That means that the seed of the (pseudo)random number generator is always a random number itself. Therefore, it is impossible to reproduce runs (try it: you will not get exactly my results).
IMHO, that’s bad coding practice. In our R-packages (PowerTOST, Power2Stage) we have an option to work either with a fixed or a random seed. With the former (which is the default) it is possible to reproduce a given run. Detlew’s randomizeBE is even more flexible. You can work with a random seed and the output gives the seed used. Then you can use this seed as an argument to reproduce the run. Clever.

OK, how reproducible are runs of PhEq_bootstrap?

run      5,000 Bootstraps    25,000 Bootstraps          ƒ2*     90% CI       ƒ2*     90% CI  1     59.793 52.77, 67.66  59.839 52.81, 67.72  2     59.786 52.75, 67.70  59.840 52.81, 67.72  3     59.864 52.82, 67.71  59.839 52.81, 67.72  4     59.829 52.77, 67.75  59.840 52.81, 67.72  5     59.840 52.81, 67.70  59.839 52.81, 67.72  6     59.850 52.82, 67.73  59.843 52.81, 67.72  7     59.823 52.77, 67.71  59.842 52.81, 67.72  8     59.843 52.84, 67.70  59.839 52.81, 67.72  9     59.859 52.77, 67.75  59.836 52.81, 67.72 10     59.803 52.75, 67.70  59.841 52.81, 67.72 11     59.864 52.82, 67.75  59.836 52.81, 67.72 12     59.805 52.75, 67.71  59.840 52.81, 67.72 13     59.860 52.84, 67.72  59.836 52.81, 67.72 14     59.843 52.77, 67.73  59.839 52.81, 67.72 15     59.808 52.77, 67.66  59.840 52.81, 67.72 min    59.786 52.75, 67.66  59.836 52.81, 67.72 Q I    59.806 52.77, 67.70  59.839 52.81, 67.72 med    59.840 52.77, 67.71  59.839 52.81, 67.72 Q III  59.855 52.82, 67.73  59.840 52.81, 67.72 max    59.864 52.84, 67.75  59.843 52.81, 67.72 CV (%) 0.0449 0.063  0.041  0.0034 0.000  0.000

Not so nice. With 5,000 bootstraps the range of the lower CL is 0.07. Imagine that you claimed similarity (ƒ2 50) and an inspector asks you to demonstrate how you calculated it. You fire up the software with its default of 5,000 bootstraps and are slapped in the face with:

f2* = 49.93  Similarity NOT confirmed: Lower CI is below the limit (f2 = 50)

Oops.
Bootstrapping converges to the true value with increasing repetitions. If you want to get reproducible results you have to use a workaround and go for 25,000+ bootstraps. Then the results are stable but you have to accept ~80MB result-files…
Of course this is only valid for these data sets (12 tablets, 4 sampling time points). Good luck with more time points.

• Shah VP, Tsong Y, Sathe P, Liu J-P. In Vitro Dissolution Profile Comparison—Statistics and Analysis of the Similarity Factor, f2. Pharm Res. 1998;15(6):889–96. doi:10.1023/A:1011976615750

Cheers,
Helmut Schütz

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
ElMaestro
★★★

Belgium?,
2018-07-12 22:21

@ Helmut
Posting: # 19040
Views: 2,276

## Validation of PhEq_bootstrap

Hi Hötzi,

» Bootstrapping converges to the true value with increasing repetitions. If you want to get reproducible results you have to use a workaround and go for 25,000+ bootstraps. Then the results are stable but you have to accept ~80MB result-files…
» Of course this is only valid for these data sets (12 tablets, 4 sampling time points). Good luck with more time points.

The magnitude of the variability of the estimate is dependent on the number of bootstraps and the within- and between-variability. This is just the nature of it.

What I don't quite get is how the need for 80 Mb files (files, really???) arises. For 25000 bootstraps you should only need to allocate e.g. 25000x a double datatype (8 bytes) for the storage of f2*, which isn't even a megabyte from the heap. But note I did not care to read any of the Pascal source code. However, speed or memory consumption is a mundane issue. If the calculation is proper then no need to whine at all.

When I do this in C, I allocate 8 Mbs from the heap, and it takes a blink of an eye or two with a million bootstraps and sorting, and no intermediary files anywhere, and it provides derivation of the bootstrap CI in various flavours: Raw, bias corrected, and bias corrected and accelerated.

I could be wrong, but...
Best regards,
ElMaestro
Helmut
★★★

Vienna, Austria,
2018-07-13 01:29

@ ElMaestro
Posting: # 19041
Views: 2,273

## Validation of PhEq_bootstrap

Hi ElMaestro,

» The magnitude of the variability of the estimate is dependent on the number of bootstraps and the within- and between-variability. This is just the nature of it.

Yessir, rightyright!

» What I don't quite get is how the need for 80 Mb files (files, really???) arises.

Dokumentation, darling! The file contains not only the four estimates (ƒ1, ƒ2, unbiased ƒ2, ƒ2*) in full precision for the X bootstraps but also the raw data (+average, variance) of all samples. That’s a fucking lot.

» For 25000 bootstraps you should only need to allocate e.g. 25000x a double datatype (8 bytes) for the storage of f2*, which isn't even a megabyte from the heap.

Sure. The memory consumption is low, regardless which software you use.

» But note I did not care to read any of the Pascal source code.

No need for that. Download the stuff and give it a try.

» However, speed or memory consumption is a mundane issue. If the calculation is proper then no need to whine at all.

ACK. The code is reasonably fast. The documentation states “Please bear in mind that the algorithm is very demanding in terms of CPU time, therefore observe progress bar at the bottom and be patient…” On my machine 25,000 bootstraps took ~3 seconds.

» When I do this in C, I allocate 8 Mbs from the heap, and it takes a blink of an eye or two with a million bootstraps and sorting, …

Oh dear! On my machine 1 mio took ~1 hour with ~90% of the time spent on sorting. My Pascal is rusty but lines 162–181 of the source stinks of Bubble Sort. In R I would likely opt for a parallelized version of QuickSort.

» … and no intermediary files anywhere, …

Making validation interesting. 1 mio produced a nice 3GB result file and a 63MB log-file. Print the ~18 mio lines of the result file and add an impressive amount of paper to the dossier.

» … and it provides derivation of the bootstrap CI in various flavours: Raw, bias corrected, and bias corrected and accelerated.

Congratulations!

Cheers,
Helmut Schütz

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
ElMaestro
★★★

Belgium?,
2018-07-13 10:14

@ Helmut
Posting: # 19043
Views: 2,295

## Validation of PhEq_bootstrap

Hi Hötzi,

» Dokumentation, darling! The file contains not only the four estimates (ƒ1, ƒ2, unbiased ƒ2, ƒ2*) in full precision for the X bootstraps but also the raw data (+average, variance) of all samples. That’s a fucking lot.

Yes, that is a lot. Perhaps a lot of time is spent writing the datasets to file streams and flushing file buffers.

Still, let us say 6 time points, 12 chambers, 2 formulation: ~144 Mb, plus the array of f2*. Allocating that sort of memory should not be an issue ever on newer hardware. The browser you have opened right now to read this post consumes a lot more, probably by a factor 4-10

What is the use of having all that info cached??

Under bootstrapping the i'th dataset is as valid as the j'th dataset where (i,j) is in ([0;B], [0;B]). When I write resampling code I usually just extract a few dataset and print e.g. the 111'th sample or whatever and check that it has the right set of elements (i.e. that the 111'th dataset (or whatver) for Test only has Test values at the 'right' time points and so forth). And I can check that the derived f2* is one I can reproduce by manual calculation in another application. It isn't such a lot of work. But if I had to do that for a million datasets, then I'd be very, very late for dinner.

What I find little more tricky is to convince myself that the derivation of CI's from the vector of f2* values is valid. A key element here is the sorting. What I did once was to output the sorted first 65536 rows of the sorted array and import them in a spreadsheet (if I recall correctly 65536 is the maximum number of rows in my spreadsheet). Then copy those values to another column, sort this column, and check if row difference exist. A row difference would indicate a sorting mishap in one of the applications. Never observed it, fortunately.

I could be wrong, but...
Best regards,
ElMaestro
Helmut
★★★

Vienna, Austria,
2018-07-13 12:54

@ ElMaestro
Posting: # 19045
Views: 2,230

## Validation of PhEq_bootstrap

Hi ElMaestro,

» » Dokumentation, darling! […]
» Yes, that is a lot. Perhaps a lot of time is spent writing the datasets to file streams and flushing file buffers.

This was my first guess as well. I assumed that the author used a HD and the 25,000 on my machine were so fast because I have SSDs. Then I inspected the source and observed what happened with the 1 mio bootstraps. The file is allocated with the computed size, then bootrapping is performed, the array sorted, and at the end written to the file. No disk IO, everything in RAM.

» […] ~144 Mb, plus the array of f2*. Allocating that sort of memory should not be an issue ever on newer hardware.

Sure.

» What is the use of having all that info cached??

Cause at the end it has to be written to the file. Not my code but I would include options to specify what – if anything at all – has to be documented.

» Under bootstrapping the i'th dataset is as valid as the j'th dataset where (i,j) is in ([0;B], [0;B]). When I write resampling code I usually just extract a few dataset and print e.g. the 111'th sample or whatever and check that it has the right set of elements (i.e. that the 111'th dataset (or whatver) for Test only has Test values at the 'right' time points and so forth).

Yep.

» What I find little more tricky is to convince myself that the derivation of CI's from the vector of f2* values is valid. A key element here is the sorting.

Not only that. Getting the CI from the array index needs interpolation. More further down.

» What I did once was to output the sorted first 65536 rows of the sorted array and import them in a spreadsheet (if I recall correctly 65536 is the maximum number of rows in my spreadsheet). Then copy those values to another column, sort this column, and check if row difference exist. A row difference would indicate a sorting mishap in one of the applications.

Checking sorting? Kudos!
Tricky is the CI. Try it with a very small data set and use the spreadsheet’s PERCENTILE() function. I was suspicious when I looked at the source of PhEq_bootstrap whether it will be the same. Example in R:

x      <- c(51.6730571183459, 52.3852052174518, 51.4902456409739,             52.1787527612649, 50.9739796238878, 52.1273407615101,             52.6282854486968, 51.5029428204135, 51.5056701956140,             52.5994758210502, 52.3035556756045, 51.6613007946535) n      <- length(x) mean.x <- mean(x) var.x  <- var(x) sd.x   <- sqrt(var.x) alpha  <- 0.05 # for two-sided 90% CI CI.col <- c(paste0(100*alpha, "%"), paste0(100*(1-alpha), "%")) err.z  <- qnorm(1-alpha)*sd.x/sqrt(n)      # based on normal distribution err.t  <- qt(1-alpha, df=n-1)*sd.x/sqrt(n) # based on t-distribution CI.z   <- c(mean.x-err.z, mean.x+err.z) CI.t   <- c(mean.x-err.t, mean.x+err.t) names(CI.t) <- names(CI.z) <- CI.col CI.rep <- c(lower=50.97, upper=52.60) # reported 90% CI result <- matrix(c(CI.z, CI.t, rep(FALSE, 2), rep(FALSE, 2)), nrow=2, ncol=4,                  dimnames=list(c("CI.z", "CI.t"), c(CI.col, "lo", "hi"))) for (type in 1:9) {   CI.boot.type. <- c(quantile(x, probs=c(alpha, 1-alpha), type=type),                      rep(FALSE, 2))   result <- rbind(result, CI.boot.type.) } row.names(result)[3:11] <- paste0(row.names(result)[3:11], 1:9) for (j in 1:nrow(result)) {   if (round(as.numeric(result[j,  "5%"]), 2) == CI.rep[1])     result[j, "lo"] <- TRUE   if (round(as.numeric(result[j, "95%"]), 2) == CI.rep[2])     result[j, "hi"] <- TRUE } print(as.data.frame(result)) # 1=match, 0=no match                      5%      95% lo hi CI.z           51.67159 51.64886  0  0 CI.t           52.16671 52.18944  0  0 CI.boot.type.1 50.97398 52.62829  1  0 CI.boot.type.2 50.97398 52.62829  1  0 CI.boot.type.3 50.97398 52.59948  1  1 CI.boot.type.4 50.97398 52.61100  1  0 CI.boot.type.5 51.02561 52.62540  0  0 CI.boot.type.6 50.97398 52.62829  1  0 CI.boot.type.7 51.25793 52.61244  0  0 CI.boot.type.8 50.97398 52.62829  1  0 CI.boot.type.9 50.97398 52.62829  1  0

R and SAS provide various types of interpolations. PhEq_bootstrap uses “Type 3” (in R lingo) which, AFAIK, is also the default in SAS (PCTLDEF=5). All spreadsheets I have tested (Excel, OO Calc, Gnumeric) use “Type 7”. Of course, with already moderate samples differences will practically disappear. But checking sumfink like this is the fun of validation, right?

Cheers,
Helmut Schütz

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
ElMaestro
★★★

Belgium?,
2018-07-13 14:28

@ Helmut
Posting: # 19048
Views: 2,182

## Validation of PhEq_bootstrap

Hi again,

re. the index: A simple solution is not to sample 1000000 times but 999999. Another solution is to sample so many times that the result is the same regardless of whether you round up or down when finding the index of the 5% percentile etc. This, however, takes a bit of time to figure out.

I could be wrong, but...
Best regards,
ElMaestro
Shuanghe
★★

Spain,
2018-07-17 15:54

@ Helmut
Posting: # 19077
Views: 2,117

## Validation of PhEq_bootstrap

Hi Helmut!

» Now for the bad news. There is one thing stupid in the software when it comes to validation (line 535 of the source Conv_to_learn1.pas):

RandSeed:=random(random(1000000))+random(random(1000000));

That means that the seed of the (pseudo)random number generator is always a random number itself. Therefore, it is impossible to reproduce runs (try it: you will not get exactly my results).

What a coincidence! I was in a meeting recently with a regulatory agency and the issue of reproducibility came up since the agency couldn't reproduce the bootstrap f2 values we reported in the dossier. My first thought was the different seed number. I opened the source files one by one and searched words like random, seed, to get the line.

» IMHO, that’s bad coding practice.

Indeed. End user should be able to specify the seed number to be able to reproduce the analysis. I always leave a variable for seed number in my SAS code for bootstrap of BE data.

All the best,
Shuanghe
d_labes
★★★

Berlin, Germany,
2018-08-15 09:45

@ Helmut
Posting: # 19168
Views: 1,809

## PhEq_bootstrap in R

Dear Helmut, dear All,

just found:
bootf2BCA, a R equivalent of PhEq_bootstrap.
Contained are scripts for the R console and a GUI based on the shiny package.

May be better suited for validation since the source code may be modified to fit your needs, without access to the rarely used Object Pascal. E.g. introduce a setting of the seed to assure the same results between different runs.

Be warned: The source code is hard to read, nearly no comments at all.

Be further warned: If you rise the number of bootstrap samples (default is 1000, too low IMHO) the result file will contain a huge number of entries, each bootstrap sample is recorded! The same mess as in the 'Object Pascal' implementation. But could of course also changed with a little knowledge of R.

Regards,

Detlew
Bioequivalence and Bioavailability Forum |  Admin contact
19,688 posts in 4,178 threads, 1,353 registered users;
online 7 (0 registered, 7 guests [including 6 identified bots]).
Forum time (Europe/Vienna): 10:08 CEST

Power. That which statisticians are always calculating
but never have.    Stephen Senn

The BIOEQUIVALENCE / BIOAVAILABILITY FORUM is hosted by
Ing. Helmut Schütz