Validation of PhEq_bootstrap [Software]

posted by Helmut Homepage – Vienna, Austria, 2018-07-13 14:54 (1964 d 20:51 ago) – Posting: # 19045
Views: 13,240

Hi ElMaestro,

❝ ❝ Dokumentation, darling! […]

❝ Yes, that is a lot. Perhaps a lot of time is spent writing the datasets to file streams and flushing file buffers.

This was my first guess as well. I assumed that the author used a HD and the 25,000 on my machine were so fast because I have SSDs. Then I inspected the source and observed what happened with the 1 mio bootstraps. The file is allocated with the computed size, then bootrapping is performed, the array sorted, and at the end written to the file. No disk IO, everything in RAM.

❝ […] ~144 Mb, plus the array of f2*. Allocating that sort of memory should not be an issue ever on newer hardware.


❝ What is the use of having all that info cached??

Cause at the end it has to be written to the file. Not my code but I would include options to specify what – if anything at all – has to be documented.

❝ Under bootstrapping the i'th dataset is as valid as the j'th dataset where (i,j) is in ([0;B], [0;B]). When I write resampling code I usually just extract a few dataset and print e.g. the 111'th sample or whatever and check that it has the right set of elements (i.e. that the 111'th dataset (or whatver) for Test only has Test values at the 'right' time points and so forth).


❝ What I find little more tricky is to convince myself that the derivation of CI's from the vector of f2* values is valid. A key element here is the sorting.

Not only that. Getting the CI from the array index needs interpolation. More further down.

❝ What I did once was to output the sorted first 65536 rows of the sorted array and import them in a spreadsheet (if I recall correctly 65536 is the maximum number of rows in my spreadsheet). Then copy those values to another column, sort this column, and check if row difference exist. A row difference would indicate a sorting mishap in one of the applications.

Checking sorting? Kudos!
Tricky is the CI. Try it with a very small data set and use the spreadsheet’s PERCENTILE() function. I was suspicious when I looked at the source of PhEq_bootstrap whether it will be the same.
Example in [image]:

x      <- c(51.6730571183459, 52.3852052174518, 51.4902456409739,
            52.1787527612649, 50.9739796238878, 52.1273407615101,
            52.6282854486968, 51.5029428204135, 51.5056701956140,
            52.5994758210502, 52.3035556756045, 51.6613007946535)
n      <- length(x)
mean.x <- mean(x)
var.x  <- var(x)
sd.x   <- sqrt(var.x)
alpha  <- 0.05 # for two-sided 90% CI
CI.col <- c(paste0(100*alpha, "%"), paste0(100*(1-alpha), "%"))
err.z  <- qnorm(1-alpha)*sd.x/sqrt(n)      # based on normal distribution
err.t  <- qt(1-alpha, df=n-1)*sd.x/sqrt(n) # based on t-distribution
CI.z   <- mean.x + c(-1, +1)*err.z
CI.t   <- mean.x + c(-1, +1)*err.t
names(CI.t) <- names(CI.z) <- CI.col
CI.rep <- c(lower=50.97, upper=52.60) # reported 90% CI
result <- matrix(c(CI.z, CI.t, rep(FALSE, 2), rep(FALSE, 2)), nrow=2, ncol=4,
                 dimnames=list(c("CI.z", "CI.t"), c(CI.col, "lo", "hi")))
for (type in 1:9) {
  CI.boot.type. <- c(quantile(x, probs=c(alpha, 1-alpha), type=type),
                     rep(FALSE, 2))
  result <- rbind(result, CI.boot.type.)
row.names(result)[3:11] <- paste0(row.names(result)[3:11], 1:9)
for (j in 1:nrow(result)) {
  if (round(as.numeric(result[j,  "5%"]), 2) == CI.rep[1])
    result[j, "lo"] <- TRUE
  if (round(as.numeric(result[j, "95%"]), 2) == CI.rep[2])
    result[j, "hi"] <- TRUE
print( # 1=match, 0=no match

                     5%      95% lo hi
CI.z           51.67159 51.64886  0  0
CI.t           52.16671 52.18944  0  0
CI.boot.type.1 50.97398 52.62829  1  0
CI.boot.type.2 50.97398 52.62829  1  0

CI.boot.type.3 50.97398 52.59948  1  1
CI.boot.type.4 50.97398 52.61100  1  0
CI.boot.type.5 51.02561 52.62540  0  0
CI.boot.type.6 50.97398 52.62829  1  0

CI.boot.type.7 51.25793 52.61244  0  0
CI.boot.type.8 50.97398 52.62829  1  0
CI.boot.type.9 50.97398 52.62829  1  0

R and SAS provide various types of interpolations. PhEq_bootstrap uses “Type 3” (in [image] lingo) which, AFAIK, is also the default in SAS (PCTLDEF=5). All spreadsheets I have tested (Excel, OO Calc, Gnumeric) use “Type 7”. Of course, with already moderate samples differences will practically disappear. But checking sumfink like this is the fun of software validation, right?

Dif-tor heh smusma 🖖🏼 Довге життя Україна! [image]
Helmut Schütz

The quality of responses received is directly proportional to the quality of the question asked. 🚮
Science Quotes

Complete thread:

UA Flag
 Admin contact
22,811 posts in 4,783 threads, 1,642 registered users;
11 visitors (0 registered, 11 guests [including 6 identified bots]).
Forum time: 10:45 CET (Europe/Vienna)

Every man gets a narrower and narrower field of knowledge
in which he must be an expert in order to compete with other people.
The specialist knows more and more about less and less
and finally knows everything about nothing.    Konrad Lorenz

The Bioequivalence and Bioavailability Forum is hosted by
BEBAC Ing. Helmut Schütz