ElMaestro
★★★

Belgium?,
2019-07-18 13:43

Posting: # 20388
Views: 2,188
 

 Spreadsheet failures, any recent examples? [Software]

Hi all,

does anyone have recent examples of spreadsheet functions truly misbehaving with real-life realistic data?

I am not talking about, for example, the cases of standard deviations or variances that went funky when the application was fed extreme data ranges, or similar waterboarding of software – all of which was debated 10 years ago and led regulators to conclude they were the only ones who could be trusted to make proper use of such pernicious creations.

Realistic.
Data.
Only.
Please.

I will leave it to the reader to quantitatively define what "realistic" and "recent" actually mean and I reserve the right to challenge the definitions, obviously :-D

Of course I ask because I never thought spreadsheets were or are as bad as their reputation. The users, on the other hand, are every bit as bad as their reputation. This certainly includes myself. Most likely I am the worst. But that will be the topic of another thread :-)

I could be wrong, but...
Best regards,
ElMaestro
Ohlbe
★★★

France,
2019-07-18 14:32

@ ElMaestro
Posting: # 20389
Views: 2,071
 

 Spreadsheet failures, any recent examples?

Dear ElMaestro,

» does anyone have recent examples of spreadsheet functions truly misbehaving with real-life realistic data?

Nope. I have examples of users using the wrong functions, or unlocked spreadsheets getting modified at some point, but that's about all. The problem is most often somewhere between the chair and the keyboard: wetware rather than software.

Well, there are some "features" of Excel you need to be aware of when using it for stats, of course. Already mentioned on the forum.

» [...] which led regulators to conclude they were the only ones who could be trusted to make proper use of such pernicious creations.

Maybe their coding skills are not any better than mine*, and they consider that they will always trust whatever they do with a spreadsheet more than anything they might obtain with R ? :-D

* sharing with the forum some experience already discussed by email with ElMaestro a couple of years ago: I installed R on my home computer. With the help of some "R for R-seholes" tutorials, it took me something like 4 hours to manage to open a simple set of data and make a simple plot out of it :PCchaos:. I spent a few more hours trying to have the colour of the plot change based on simple conditions (red if under a threshold value, green above it). I was even waking up at night with ideas on new ways to test. I never succeeded. Never used R again. I would never trust any of my results anyway.:lamer:

Regards
Ohlbe
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-18 16:12

@ Ohlbe
Posting: # 20390
Views: 2,072
 

 Spreadsheet addiction

Dear Ohlbe

» » [...] which led regulators to conclude they were the only ones who could be trusted to make proper use of such pernicious creations.
»
» Maybe their coding skills are not any better than mine*, and they consider that they will always trust whatever they do with a spreadsheet more than anything they might obtain with R ? :-D

There is a series of papers about indirect adjusted comparisons in BE published by regulators. All done in Excel.

» * […] I spent a few more hours trying to have the colour of the plot change based on simple conditions (red if under a threshold value, green above it). I was even waking up at night with ideas on new ways to test. I never succeeded.

I believe it. R is a nasty beast like SAS (© Detlew). Waking up in the middle of the night or – worse – not being able to fall asleep at all is a common side-effect. Since you will never use R again, skip this code:

op  <- par(no.readonly = TRUE)    # Save the original graphics parameters
par(pty = "s")                    # I want a square plotting region
x   <- runif(n=50, min=1, max=20) # Sample from the uniform distribution
a   <- 0.5                        # Intercept
b   <- 2                          # Slope
y   <- a + b * x + rnorm(n=length(x), mean=0, sd=2) # Response + random error
th  <- 10                         # Threshold
plot(x, y, type = "n", las = 1)   # Important: type="n", will add points later
grid()                            # Nice to have one
abline(h = th, col = "red")       # Line for threshold
abline(lm(y ~ x), lwd = 2, col = "blue")    # Linear regression
points(x[y < th], y[y < th], pch = 21, cex = 1.5,
       col = "red", bg = "#FFD700AA")       # Below threshold
points(x[y >= th], y[y >= th], pch = 21, cex = 1.5,
       col = "darkgreen", bg = "#32CD32AA") # At or above threshold
par(op)                           # Restore original graphics parameters


Which gives:

[image]

Semitransparent colors in Excel? I don’t think that’s possible. ;-)

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
Ohlbe
★★★

France,
2019-07-18 17:30

@ Helmut
Posting: # 20391
Views: 2,060
 

 Nasty beast

Dear Helmut,

» Since you will never use R again, skip this code

Thx... I made multiple attempts with "if" and conditions, and got multiple different error messages. ElMaestro gave me some hints, each time resulting in an "oh yes of course" reaction and more error messages... I probably misinterpreted the hints.

Regards
Ohlbe
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-18 20:26

@ Ohlbe
Posting: # 20396
Views: 2,030
 

 Nasty beast

Dear Ohlbe,

» I made multiple attempts with "if" and conditions, and got multiple different error messages. ElMaestro gave me some hints, resulting each time in a "oh yes of course" reaction and more error messages... I probably misinterpreted the hints.

Using if(), its relatives, and loop constructs (for(), while(), repeat()) is in many cases prone to errors – as you experienced – and can be slooow. If you have a vector of data (say y) there are various ways to access its values. Example with 25 random integers in the range 1…40:

n  <- 25
y  <- round(runif(n = n, min = 1, max = 40))
th <- 10             # threshold
n  <- length(y)      # we know but useful later
head(y, 10)          # the first 10
y[1:10]              # same
tail(y, 10)          # the last 10
y[(n - 9):n]         # same but tricky
blq <- which(y < th) # the ones below the threshold
length(blq)          # how many?
y[blq]               # show them
y[y < th]            # same

In such a simple case which() is overkill, though one can combine many conditions, which [sic] makes the code easier to understand. If you have two vectors (say x, y) like in my last post, you can select values of x depending on which(y == condition).
Let’s check:

n    <- 50
x    <- runif(n = n, min = 1, max = 20)
a    <- 0.5
b    <- 2
y    <- a + b * x + rnorm(n = length(x),  mean = 0, sd = 2)
th   <- 10

fun1 <- function(x, y, th) { # clumsy
  x.th <- numeric()
  y.th <- numeric()
  bql  <- 0L
  for (k in seq_along(x)) {
    if (y[k] < th) {
      bql       <- bql + 1
      x.th[bql] <- x[k]
      y.th[bql] <- y[k]
    }
  }
  z <- data.frame(x.th, y.th)
  return(invisible(z))
}

fun2 <- function(x, y, th) { # better
  blq  <- which(y < th)
  x.th <- x[blq]
  y.th <- y[blq]
  z    <- data.frame(x.th, y.th)
  return(invisible(z))
}

fun3 <- function(x, y, th) { # a little bit confusing for beginners
  x.th <- x[y < th]
  y.th <- y[y < th]
  z    <- data.frame(x.th, y.th)
  return(invisible(z))
}

res1 <- fun1(x, y, th)
res2 <- fun2(x, y, th)
res3 <- fun3(x, y, th)
identical(res1, res2); identical(res1, res3) # same?
[1] TRUE
[1] TRUE

Bingo! Which one is easier to read?
What about speed?

library(microbenchmark)
res <- microbenchmark(fun1(x, y, th),
                      fun2(x, y, th),
                      fun3(x, y, th), times = 1000L)
print(res, signif = 4)
Unit: microseconds
           expr   min    lq  mean median    uq  max neval cld
 fun1(x, y, th) 173.9 181.1 196.1  186.3 189.9 1530  1000   b
 fun2(x, y, th) 163.0 170.3 183.6  175.4 179.0 3311  1000  a
 fun3(x, y, th) 163.0 169.0 185.5  174.5 178.4 3310  1000  ab

Practically the same, since the vectors are short. If we play this game with longer vectors, fun2() shows its strengths.

[image]
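If you want to try it yourself, here is a minimal sketch (my addition, not part of the original post) which repeats the benchmark with a longer vector; it assumes fun1(), fun2(), and fun3() from above are still defined, and n = 1e4 is an arbitrary choice.

library(microbenchmark)
n   <- 1e4                                # arbitrary, just longer than before
x   <- runif(n = n, min = 1, max = 20)
y   <- 0.5 + 2 * x + rnorm(n = n, mean = 0, sd = 2)
th  <- 10
res <- microbenchmark(fun1(x, y, th),     # loop + if()
                      fun2(x, y, th),     # which()
                      fun3(x, y, th),     # direct logical indexing
                      times = 100L)
print(res, signif = 4)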


If one has more conditions, any external construct will suck. We have 100 random integers (1…50) and want to get the even ones between 20 and 30 in increasing order.

x <- round(runif(n = 100, min = 1, max = 50))
x
  [1] 37 29  8 34  6 13 47 22 30 44  6 14 33 18 12 37 32
 [18] 10  6 37 27  2 43 40  7  5 47  4 32 17  7 50 39 36
 [35] 38 48 34  5 21 43 34 50 29 20 33  6 45 32 28  8  1
 [52] 26 29 19 42  9 38 31 25  4  1 23 37 31  2 26 29 24
 [69] 40 43 17 16 41 17  5 17 36 16  7  5 36 30  5  8 19
 [86] 40 42 30 33 21 13 25 21 33 16  7 33 36 19 37

One-liner:

sort(x[which(x >= 20 & x <= 30 & x %% 2 == 0)])
[1] 20 22 24 26 26 28 30 30 30

Or even shorter:

sort(x[x >= 20 & x <= 30 & x %% 2 == 0])
[1] 20 22 24 26 26 28 30 30 30

Good luck coding that with a loop and a nested if() { do this } else { do that } elseif { oops }.
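For contrast, a hypothetical loop-and-if() version of the same task (my sketch, not from the original post), reusing the x defined above – many lines instead of one:

even.20.30 <- integer()              # will grow inside the loop
for (k in seq_along(x)) {
  if (x[k] >= 20 && x[k] <= 30) {    # within the range?
    if (x[k] %% 2 == 0) {            # even?
      even.20.30 <- c(even.20.30, x[k])
    }
  }
}
sort(even.20.30)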

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
Ohlbe
★★★

France,
2019-07-19 11:32

@ Helmut
Posting: # 20397
Views: 1,954
 

 Nasty beast

Dear Helmut,

» Using if(), its relatives, and loop-constructs (for(), while(), repeat()) in many cases are prone to errors

» Good luck coding that with a loop and a nested if() { do this } else { do that } elseif { oops }

I think that's really what pissed me off. I thought something like if... else... would be very logical and easy. Maybe because these are real words from a language I can understand. But it looks like R and I are following different kinds of logic.

When you learn a new language and start speaking it with somebody, that person will usually show some goodwill, ignore grammatical mistakes and try and understand what you mean. R made no efforts of any kind, even though what I was trying to achieve was pretty obvious :-D

» If you have a vector of data (say y)

Yeah, ElMaestro started using this kind of vocabulary too when we were exchanging email. That got me even more confused. All I learnt at school about vectors is the geometric side of the concept - what Wikipedia apparently calls Euclidean vector. I had some problems understanding what an arrow between two points had to do with what I was trying to achieve.
Sorry ElMaestro, we're going quite far from your original question... :offtopic!:

Regards
Ohlbe
ElMaestro
★★★

Belgium?,
2019-07-19 12:32

@ Ohlbe
Posting: # 20399
Views: 1,942
 

 Nasty beast

Hi Ohlbe,
» Sorry ElMaestro, we're going quite far from your original question... :offtopic!:

That's OK.
The occasional drift on this forum always pleases me.


As for the logic, it is absolutely !7==!3 what you are saying: programming involves a kind of logic that is entirely bound to ones and zeros, and that can be hard to grasp when the reference is our everyday language.

If it is any comfort, I can confess I usually can't read Hötzi's code right away. I usually spend a lot of time deciphering it, or give up. This is because it is syntactically condensed. I tend to use much longer lines and bigger functions to achieve the same goals. I can't maintain code otherwise, but this is just me.

Vector: Something that carries something, according to Birdsong's Internally Hosted Vocabulary (the platform is walnut-sized).
The object of this carriage may be information, like "two steps in this direction and then five steps in the perpendicular direction". Or "Apple and thereafter Syphilis".
The origin of the word is from the Iliad by Homer, song 13, in which Thetis says: "Vector bears it in triumph upon his own shoulders", possibly expressed as such because he was in love?! Or at least this is my belief since it was said in relation to some amour.

I could be wrong, but...
Best regards,
ElMaestro
Ohlbe
★★★

France,
2019-07-19 13:38

@ ElMaestro
Posting: # 20402
Views: 1,909
 

 Decidedly off topic

Hi ElMaestro,

» The origin of the word is from the Iliad by Homer, song 13, in which Thetis says: "Vector bears it in triumph upon his own shoulders", possibly expressed as such because he was in love?! Or at least this is my belief since it was said in relation to some amour.

:lol2:

But if I understand the text well, this is only applicable to Trojan virus coding...

Regards
Ohlbe
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-19 13:00

@ Ohlbe
Posting: # 20401
Views: 1,933
 

 OT: R limbo 101

Dear Ohlbe,

» I thought something like if... else... would be very logical and easy. Maybe because these are real words from a language I can understand.

Perfectly understandable. Of course, these constructs exist in R and we use them all the time. The point is only that direct access of values (with a condition) is simply more efficient.
If you don’t have to deal with large data sets, both are fine. Since I deal a lot with simulations (1 mio BE studies, sometimes with a high number of subjects / periods / sequences, …) I learned the hard way to avoid the ‘simple’ constructs whenever possible (granted, sometimes I knew it could be done but was too stupid).
Think about my previous example. For 1 mio calls the direct access is 15 times faster than the loop with if() inside. It’s not unusual to have nested calls. For two nesting levels it would be ~230 times slower and for three already ~3,500 times. That would test your patience.
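To give an idea of the kind of comparison meant here, a minimal sketch (my own, with made-up data rather than the simulations referred to above); the actual speed-up depends on the machine and the task.

y      <- runif(1e6, min = 0, max = 1)       # one million made-up values
th     <- 0.5
direct <- function(y, th) y[y < th]          # vectorised access
looped <- function(y, th) {                  # element-by-element with if()
  out <- numeric(length(y))                  # preallocated
  j   <- 0L
  for (k in seq_along(y)) {
    if (y[k] < th) {
      j      <- j + 1L
      out[j] <- y[k]
    }
  }
  out[seq_len(j)]
}
identical(direct(y, th), looped(y, th))      # same result
system.time(direct(y, th))                   # fast
system.time(looped(y, th))                   # typically much slower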

» But it looks like R and I are following different kinds of logics.

Not necessarily so. See above.

» When you learn a new language and start speaking it with somebody, that person will usually show some goodwill, ignore grammatical mistakes and try and understand what you mean. R made no efforts of any kind, even though what I was trying to achieve was pretty obvious :-D

I agree. See the subject line. The man-pages of R’s functions were written by statisticians, regularly forgetting non-expert users. Sometimes the response you get is bizarre. Want to know how for() works?

?for
+

+
Fuck, get me out of here! esc
help(for)
Error: unexpected ')' in "help(for)"

What the heck? Shall I really try it without the closing bracket? Strange. OK, OK.
help(for
+

Aha, the closing bracket is missing!
)
Error: unexpected ')' in:
"
)"

You can’t be serious! Google, google, reading the R-Inferno Section 8.2.30. Oh my!
?`for`
help(`for`)

Both work as expected. I don’t know what the logic behind this is. Beyond me.


» » If you have a vector of data (say y)
»
» Yeah, ElMaestro started using this kind of vocabulary too when we were exchanging email. […] All I learnt at school about vectors is the geometric side of the concept - what Wikipedia apparently calls Euclidean vector.

Correct.

» I had some problems understanding what an arrow between two points had to do with what I was trying to achieve.

Absolutely nothing, of course. In R a one-dimensional array is meant. The indexing (i.e., where it starts) differs between languages.
  • 0
    C, C++, BASIC-dialects, Pascal (default; can be declared as 1), PHP, Perl, Java, JavaScript, Lisp, Python, Ruby, Scheme, …
  • 1
    S, R, Julia, FORTRAN, ALGOL, AWK, COBOL, Matlab, Mathematica, Wolfram Language, XQuery, …
In many languages there are three types of data structures: scalars (a single number, character, string, or boolean), one-dimensional arrays (R: vector), and multidimensional arrays (R: data.frame and matrix). In R there are no scalars, only vectors containing a single element.

x1   <- 1
x2   <- 1L
x3   <- TRUE
x4   <- NA
x5   <- "a"
comp <- matrix(data = c(c(x1, x2, x3, x4, x5),
                        c(is.vector(x1), is.vector(x2), is.vector(x3),
                          is.vector(x4), is.vector(x5)),
                        c(length(x1), length(x2), length(x3),
                          length(x4), length(x5)),
                        c(typeof(x1), typeof(x2), typeof(x3),
                          typeof(x4),typeof(x5))),
               nrow = 5, ncol = 4,
               dimnames = list(paste0("x", 1:5),
                               c("value", "vector?", "length", "type")))
print(as.data.frame(comp))

   value vector? length      type
x1     1    TRUE      1    double
x2     1    TRUE      1   integer
x3  TRUE    TRUE      1   logical
x4  <NA>    TRUE      1   logical
x5     a    TRUE      1 character


What makes R so powerful is yet another data type, namely the list. A list can contain any of the other types and a mixture of them (even other lists…).

char.vect <- letters[1:2]
int.vect  <- as.integer(1:3)
mydf      <- data.frame(x = 1:4, y = c(4:6, NA))
mymat     <- matrix(data = c(100:102, 2*100:102), nrow = 3, ncol = 2,
                    dimnames = list(paste0("row", 1:3), c("x", "y")))
mylist    <- list(char.vect = char.vect, int.vect = int.vect,
                  mydf = mydf, mymat = mymat)
print(mylist)

$char.vect
[1] "a" "b"

$int.vect
[1] 1 2 3

$mydf
  x  y
1 1  4
2 2  5
3 3  6
4 4 NA

$mymat
       x   y
row1 100 200
row2 101 202
row3 102 204


Elements can be accessed by their name or index.

mylist$mymat["row2", "y"]
[1] 202
mylist$mymat[2, 2]
[1] 202
mylist[[4]][2, 2]
[1] 202


The double square brackets are needed to extract an element of a list itself; single brackets return a sub-list. Hence, mylist[4][2, 2] does not work:
Error in mylist[4][2, 2] : incorrect number of dimensions
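A quick way to see the difference (my addition, using the mylist object built above):

class(mylist[4])   # "list" – a list of length one that contains the matrix
class(mylist[[4]]) # "matrix" (and "array" in recent R versions) – the matrix itself
mylist[[4]][2, 2]  # hence this works
[1] 202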

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
Shuanghe
★★  

Spain,
2019-07-18 18:59

@ Helmut
Posting: # 20392
Views: 2,060
 

 Spreadsheet addiction

Dear all

» There is a series of papers about indirect adjusted comparisons in BE published by regulators. All done in Excel.

As far as I can tell, Excel and R give the same result. I was interested in the method, so several weeks back I played with R using data from the EPARs (none of the articles gives the sample size of the BE study, so without going to the EPAR for this additional information you cannot reproduce/validate their results at all). The difference in some comparisons is 0.01%, due to the different rounding methods used in Excel and R.

However, as Ohlbe said, there are some errors in the articles. For example, in the latest article (link), Table 2 gives wrong data (I compared it with the original data from the EPAR and it's clearly a mistake in the article), but the result of the comparison using those data is correct, so this might just be a typo. Furthermore, some comparisons were reversed, e.g., Table 3 describes generic 2 vs generic 4 and generic 3 vs generic 4, but in fact they were generic 4 vs 3 and 2, ... Apparently, someone typed the data into the wrong cell in Excel, so the confidence intervals were all reversed. :-D

» Semitransparent colors in Excel? I don’t think that’s possible. ;-)

Haha, this is where you err... :-D Actually you can do that in Excel (not that you should). At least I can in my 2013. It's ugly, but semi-transparent it is. :cool:

All the best,
Shuanghe
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-18 20:04

@ Shuanghe
Posting: # 20395
Views: 2,056
 

 Spreadsheet addiction

Hi Shuanghe,

» I was interested in the method, so several weeks back I played with R using data from the EPARs (none of the articles gives the sample size of the BE study, so without going to the EPAR for this additional information you cannot reproduce/validate their results at all) …

Yep. I wrote a package for this stuff. I think they were using Excel because (a) everybody has it, (b) its learning curve is flat, and (c) – most important – Luther Gwaza* gave his “template” to the others.

» … and the difference in some comparisons is 0.01%, due to the different rounding methods used in Excel and R.

If you want to reproduce Excel’s wacky “commercial rounding”:

cround <- function(x, d) ((trunc(abs(x) * 10^d + 0.5)) / 10^d) * sign(x)
x      <- c(79.994, 79.995, 125.004, 125.005)
comp   <- data.frame(x = x, R = round(x, 2), Excel = cround(x, 2))
print(comp, row.names = FALSE)
       x      R  Excel
  79.994  79.99  79.99
  79.995  80.00  80.00
 125.004 125.00 125.00
 125.005 125.00 125.01


» » Semitransparent colors in Excel? I don’t think that’s possible. ;-)
»
» Haha, this is where you err... :-D Actually you can do that in Excel (not that you should). At least I can in my 2013. It's ugly, but semi-transparent it is. :cool:

Well, my version is Excel 2000. BTW, navigate up to my OP and hit ctrl+F5 to clear the browser’s cache / reload. Is it possible in Excel to have the color depend on the value (the lower, the darker)? :smoke:
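In R that mapping is only a few lines. A minimal sketch (my own, with made-up data, not the plot in the OP): map each value of y to a grey ramp so that lower values get darker points.

x   <- runif(50, min = 1, max = 20)
y   <- 0.5 + 2 * x + rnorm(50, mean = 0, sd = 2)
pal <- colorRampPalette(c("grey10", "grey85"))(100)  # dark … light
idx <- cut(y, breaks = 100, labels = FALSE)          # map y to 1…100
plot(x, y, pch = 21, cex = 1.5, col = "black", bg = pal[idx], las = 1)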


  • Gwaza L. Adjusted Indirect Treatment Comparisons of Bioequivalence Studies. PhD Thesis: Utrecht. 2016. Free resource.

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
Shuanghe
★★  

Spain,
2019-07-19 12:41

@ Helmut
Posting: # 20400
Views: 1,932
 

 OT: Spreadsheet addiction

Hi Helmut,

» Yep. I wrote a package for this stuff.

Well, you're always several steps ahead of me. I just wrote some lousy function that serves an educational purpose (for myself). My output is for 4 methods only (Z-distribution, homo- and heteroscedastic, and pragmatic) since I'm still struggling with Chow and Shao/Liu's methods. By the way, I didn't see the package on your GitHub page. Any plan to put it there so we can have a peek? :-D

» If you want to reproduce Excel’s wacky “commercial rounding”:
» ...

I doubt I'll ever use it, but it's nice to know that it can be done. ;-)

» Well, my version is Excel 2000. BTW, navigate up to my OP and hit ctrl+F5 to clear the browser’s cache / reload. Is it possible in Excel to have the color depend on the value (the lower, the darker)? :smoke:

No way. Now you just want to piss off those Excel lovers :lol3:

»
  • Gwaza L. Adjusted Indirect Treatment Comparisons of Bioequivalence Studies. PhD Thesis: Utrecht. 2016. Free resource.

Thanks. I have many articles but I didn't know about the thesis. I'll take a look at it later.

ElMaestro, apology for the off-topic posting.

All the best,
Shuanghe
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-19 14:29

@ Shuanghe
Posting: # 20403
Views: 1,888
 

 OT: Spreadsheet addiction

Hi Shuanghe,

» » Yep. I wrote a package for this stuff.
» Well, you're always several steps ahead of me.

Nope. I started on July 11th.

» My output is for 4 methods only (Z-distribution, homo- and hetero-scedastic and pragmatic) …

Mine only for homo- and heteroscedastic variances (the latter is the package’s default). I think that relying on large-sample approximations doesn’t make sense at all. The CI will always be narrower than with the other approaches, but why would one go this way? Approximations of the degrees of freedom for unequal variances were published in 1946/47 (Satterthwaite, Welch), the t-distribution in 1908 (Gosset a.k.a. Student). Going back more than two centuries (Laplace, Poisson, Gauß)? I don’t get it. BTW, I don’t like the pragmatic method.
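For readers who wonder what the heteroscedastic approach roughly looks like, a sketch as I understand it (my own, with invented numbers – not the package mentioned above): combine the two study-level estimates against the common reference, pool the standard errors, and use Welch–Satterthwaite degrees of freedom for the confidence interval.

pe1   <- log(1.04); se1 <- 0.045; df1 <- 22     # study 1: generic 1 vs reference (log scale)
pe2   <- log(0.97); se2 <- 0.060; df2 <- 30     # study 2: generic 2 vs reference (log scale)
delta <- pe1 - pe2                              # indirect comparison: generic 1 vs generic 2
se    <- sqrt(se1^2 + se2^2)                    # pooled standard error
df    <- (se1^2 + se2^2)^2 /
         (se1^4 / df1 + se2^4 / df2)            # Welch–Satterthwaite approximation
ci    <- exp(delta + c(-1, +1) * qt(0.95, df) * se) # 90% CI, back-transformed
round(100 * c(PE = exp(delta), lower = ci[1], upper = ci[2]), 2)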

» … since I'm still struggling with Chow and Shao/Liu's methods.

You need the raw data, right?

» […] I didn't see the package in your Github page. Any plan to put it there so we can have a peek? :-D

Paid job – chained by a CDA. If we publish sumfink, sure. Not my decision.

» » If you want to reproduce Excel’s wacky “commercial rounding”:
» » ...
» I doubt I'll very use it but nice to know that it can be done. ;-)

I would never ever use it. However, if you want to reproduce sumfink which was rounded in Excel you can only experiment till you get the same in R.

» » […] navigate up to my OP and hit ctrl+F5 to clear the browser’s cache / reload. Is it possible in Excel?
»
» No way. Now you just want to piss off those Excel lovers :lol3:

Yessir! R is a language and environment for statistical computing and graphics.

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
nobody
nothing

2019-07-19 15:53

@ Helmut
Posting: # 20404
Views: 1,855
 

 OT: Spreadsheet addiction

....reaaaaalllly boring in Vienna these days, huh? :-D

Kindest regards, nobody
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-19 19:32

@ nobody
Posting: # 20406
Views: 1,823
 

 OT: Spreadsheet addiction

Hi nobody,

» ....reaaaaalllly boring in Vienna these days, huh? :-D

OT about R is a welcome excuse.

   All paid jobs
absorb and degrade the mind.
     Aristotle

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
zizou
★    

Plzeň, Czech Republic,
2019-07-20 00:18

@ Ohlbe
Posting: # 20408
Views: 1,783
 

 Spreadsheet etc. failures

Hello everybody and nobody.

» Nope. I have examples of users using the wrong functions, or unlocked spreadsheets getting modified at some point, but that's about all. The problem is most often somewhere between the chair and the keyboard: wetware rather than software.
»
» Well, there are some "features" of Excel you need to be aware of when using it for stats, of course. Already mentioned on the forum.
Right.

Recently, because of a "floating point error" I failed to use a spreadsheet for comparing differences of two similar values against a certain value (my source data almost all had two decimal places, but some of them were calculated with "full precision", and I wanted to easily check how many of the differences are lower than 0.05; unfortunately there was a case with the error).

See following examples:

      |          |   Default View    | Number Format with
 Cell | Formula  | in General Format | Additional Decimal Places
──────┼──────────┼───────────────────┼───────────────────────────
  A1  | 25.05    |                   |
  A2  | 25       |                   |
  A3  | =A1-A2   |       0.05        | 0.0500000000000007
  A4  | =A3=0.05 |       FALSE       |
  A5  | =A3>0.05 |       TRUE        |
Difference is higher than expected.

      |          |   Default View    | Number Format with
 Cell | Formula  | in General Format | Additional Decimal Places
──────┼──────────┼───────────────────┼───────────────────────────
  A1  | 5.05     |                   |
  A2  | 5        |                   |
  A3  | =A1-A2   |       0.05        | 0.0499999999999998
  A4  | =A3=0.05 |       FALSE       |
  A5  | =A3<0.05 |       TRUE        |
Difference is lower than expected.



Nevertheless it is not a problem of the spreadsheet itself, but only a precision issue. In R:

a=5.05
b=5
a-b
#[1] 0.05
a-b==0.05
#[1] FALSE
a-b<0.05
#[1] TRUE


Everyone knows, only smart machines don't.

Best regards,
zizou
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-20 02:12

@ zizou
Posting: # 20409
Views: 1,763
 

 As designed ☺

Hi zizou,

» Recently, because of a "floating point error" I failed to use a spreadsheet for comparing differences of two similar values against a certain value (my source data almost all had two decimal places, but some of them were calculated with "full precision", and I wanted to easily check how many of the differences are lower than 0.05; unfortunately there was a case with the error).

Haha, I know this game and therefore, always round before the comparison. ;-)
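For the record, what I mean (my sketch, reusing zizou’s example): round both sides to the relevant number of decimals before comparing, or let all.equal() compare with a tolerance.

a <- 25.05 - 25            # stored as 0.0500000000000007…
a == 0.05                  # FALSE – exact comparison of doubles
round(a, 2) == 0.05        # TRUE  – round to the relevant precision first
isTRUE(all.equal(a, 0.05)) # TRUE  – default tolerance ≈ 1.5e-8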

» Everyone knows, only smart machines don't.

Can’t be blamed. That’s a property of floating point arithmetic (see also IEEE 754). Unless you have powers of 2 (dec→bin: 1→1, 2→10, 4→100, 8→1000, …) you will always have precision issues in the last significant digits. Everything beyond the 15th (dec-) significant figure is just noise – even with 64bit software on a 64bit OS.
Try this in Excel:
Cells A1–A9: 0, 1, … , 8
Cells B1–B9: =2^A1, =2^A2, … , =2^A9
Cells C2–C9: =B2-B1, =B3-B1, …
Format to 16 decimal places. What do you get?
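The same experiment translated to R (my addition), mirroring the Excel columns:

A <- 0:8
B <- 2^A                      # powers of two – exactly representable
C <- B[-1] - B[1]             # differences from B1
sprintf("%.16f", C)           # clean integers, no trailing noise
sprintf("%.16f", 5.05 - 5)    # compare with a non-representable difference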

Little bit deeper into R:

a <- 5.05
b <- 5
c <- a - b
identical(c, 0.05)
[1] FALSE
sprintf("%.17f", c)
[1] "0.04999999999999982"
a <- 4
b <- 2
c <- a - b
identical(c, 2)
[1] TRUE
sprintf("%.17f", c)
"2.00000000000000000"
identical(c, 2L)
[1] FALSE
is.integer(c)
[1] FALSE
a <- 4L # or as.integer(4)
b <- 2L # or as.integer(2)
c <- a - b
identical(c, 2)
[1] FALSE
sprintf("%.17f", c)
"2.00000000000000000"
identical(c, 2L)
[1] TRUE


GNU Octave 5.1.1.0 (all calculations internally in 64bit double):

>> format long
>> a = 5.05
a = 5.050000000000000
>> b = 5
b = 5
>> b - a
ans = -4.999999999999982e-02
>> a = 4
a = 4
>> b = 2
b = 2
>> a - b
ans = 2


Not even software for symbolic mathematics does better. Maxima 5.35.12.1 / wxMaxima 14.12.1:

(%i1) [a:5.05, b:5, a-b];
(%o1) [5.05,5,0.04999999999999982]
(%i2) [a:4, b:2, a-b];
(%o2) [4,2,2]


Only for R-nerds: Rmpfr. Doesn’t help. ;-)

library(Rmpfr)
a <- mpfr(5.05, 256)
b <- mpfr(5, 256)
diff(c(b, a))
1 'mpfr' number of precision  256   bits
[1] 0.04999999999999982236431605997495353221893310546875

mpfr(a-b, 32)
1 'mpfr' number of precision  32   bits
[1] 0.05

a <- mpfr(4, 256)
b <- mpfr(2, 256)
diff(c(b, a))
1 'mpfr' number of precision  256   bits
[1] 2


Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
ElMaestro
★★★

Belgium?,
2019-07-20 09:08

@ Helmut
Posting: # 20411
Views: 1,759
 

 As designed ☺

Hi Helmut,

» Haha, I know this game and therefore, always round before the comparison. ;-)

or never bring yourself into a situation where you need to compare floats.

I've taken the worst way out and written functions that check if x is similar to y plus/minus a wee fraction like 1e-8 or something. It is ugly and clumsy, it works, and it feels every time like I am suffering defeat.


"John Brown's got a little Endian" :-):-):-)

I could be wrong, but...
Best regards,
ElMaestro
Helmut
★★★
avatar
Homepage
Vienna, Austria,
2019-07-20 12:53

@ ElMaestro
Posting: # 20412
Views: 1,732
 

 To round or not to round…

Hi ElMaestro,

» » Haha, I know this game and therefore, always round before the comparison. ;-)
»
» or never bring yourself into a situation where you need to compare floats.

Did I tell you to how many significant digits I round? :-D

» I've taken the worst way out and written functions that check if x is similar to y plus/minus a wee fraction like 1e-8 or something.

I don’t know which kind of measurements you are comparing. Sumfink ultra-precise (according to the International Bureau of Weights and Measures: t ±5×10⁻¹⁶, l ±2.1×10⁻¹¹)?

Zizou (I guess) and I were talking about concentrations. The AP (accuracy and precision) criteria according to the GLs are 20% at the LLOQ and 15% above it. Hence, everything reported substantially beyond that is not relevant (shall I call it noise?). What substantially means depends on the field. Some people go crazy with 6σ, physicists are fine with just 3σ.

x       <- pi
p       <- 15
s       <- 8
mult    <- c(3, 6)
sigma   <- x * p / 100
sigma.m <- sigma * mult
lo      <- x - sigma.m
hi      <- x + sigma.m
df      <- data.frame(x = rep(x, 2), prec = p,
                      sigma = sigma, mult = mult,
                      sigma.m = sigma.m,
                      lo = lo,  hi = hi)
df      <- signif(df, s)
names(df)[2] <- "prec. (%)"
print(df, row.names = FALSE)

       x prec. (%)     sigma mult  sigma.m        lo       hi
3.141593        15 0.4712389    3 1.413717 1.7278760 4.555309
3.141593        15 0.4712389    6 2.827433 0.3141593 5.969026

Do you get the idea? The true value might be anywhere between our self-imposed limits. Therefore, I don’t give a shit whether I get 3.141 or 3.141 592 653 589 793 in an electronic file.¹ However, I insist on a CRC checksum² to verify the data transfer. If I deal with bloody Excel, I round to one decimal beyond what is given in the analytical report,³ being aware that this is still far beyond the analytical AP. If the data transfer of analytical results to stats was done electronically in “full numeric precision” (haha), I want to see a validation for it.⁴
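If one wants to verify a transferred file within R itself, a minimal sketch (my own; the file name is made up) – base R ships md5sum(), and the ‘digest’ package offers SHA-256:

tools::md5sum("results.csv")                 # MD5, base R
# install.packages("digest")                 # once, if not yet installed
digest::digest("results.csv", file = TRUE,
               algo = "sha256")              # SHA-256, preferred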

» It is ugly and clumsy, it works, and it feel everytime like I am suffering defeat.

Well, you are the C-man here. What about printf("%.yg\n", x); where y is the desired number of significant digits? With R’s signif():

options("digits" = 15)
x    <- 12345678.999999
y    <-        0.12345678999999
prec <- 8
fmt1 <- "%33.17f"
fmt2 <- paste0("%.", prec, "g")
cat(x, "\n",
    y, "\n",
    sprintf(fmt1, x), "\u2190 fake news\n",
    sprintf(fmt1, y), "\u2190 fake news\n",
    sprintf(fmt1, signif(x, prec)), "\u2190", prec, "significant digits\n",
    sprintf(fmt1, signif(y, prec)), "\u2190", prec, "significant digits\n",
    sprintf(fmt2, x), "\u2190", "directly with", paste0("'", fmt2, "'\n"),
    sprintf(fmt2, y), "\u2190", "directly with", paste0("'", fmt2, "'\n"))

12345678.999999
 0.12345678999999
        12345678.99999899975955486 ← fake news
               0.12345678999999000 ← fake news
        12345679.00000000000000000 ← 8 significant digits
               0.12345679000000000 ← 8 significant digits
12345679 ← directly with '%.8g'
0.12345679 ← directly with '%.8g'


If you are interested in whether rubbish in ≈ rubbish out, ask for a checksum and verify it. That is probably better than diving into the murky waters of (likely irrelevant) rounding.


  1. I would be fine with just 3.1 for π. The RE is –1.32%, again much below what is achievable in bioanalytics.
  2. SHA-256 or higher preferred (collisions reported for SHA-1, i.e., different inputs give the same hash). MD5 is better than nothing.
  3. That’s the only GxP-compliant (dated/signed, released by QUA) document, right? Did you ever see result-tables with 15 significant digits?
  4. Here it gets tricky. These data are different to what is given in the analytical report. Now what? At least I expect a statement about this discrepancy in the protocols (analytical and/or statistical). Regularly I see something like “calculations were performed in full numeric precision”. How could one ever hope to verify that having only the analytical report with rounded results?

Cheers,
Helmut Schütz
[image]

The quality of responses received is directly proportional to the quality of the question asked. ☼
Science Quotes
ElMaestro
★★★

Belgium?,
2019-07-20 19:59

@ Helmut
Posting: # 20415
Views: 1,696
 

 To round or not to round…

Hi Hötzi,

» I don’t know which kind of measurements you are comparing.

It stinks but there really is no good solution when you have to compare.
(294.1/1000) == 0.2941
[1] FALSE
(2941/10000) == 0.2941
[1] TRUE


One can read ISO standards and manuals about internal float representation (I tried and fell asleep), but at the end of the day, if you need to compare you need to compare and hope for the best. Some numbers just cannot be represented accurately in binary.

If you accept an xyz fraction of error on the comparison then the solution is straightforward, like

Match <- function(x, y, relative.tol)
{ # TRUE if y lies within ±relative.tol of x (note: assumes x > 0)
  if (y < ((1 + relative.tol) * x))
    if (y > ((1 - relative.tol) * x))
      return(TRUE)
  return(FALSE)
}

Match(6, 6.001, 0.01)


Ugly as hell :vomit:, and will not work well if you get into extreme binary representations.

I could be wrong, but...
Best regards,
ElMaestro
mittyri
★★  

Russia,
2019-07-20 22:43

@ ElMaestro
Posting: # 20416
Views: 1,677
 

 floating-point math is always more complex than you think it is

Hi ElMaestro,

» Match(6, 6.001, 0.01)
»
» Ugly as hell :vomit:, and will not work well if you get into extreme binary representations.

take a look at
if (Abs(x - y) <= absTol * Max(1.0f, Abs(x), Abs(y)))
the description is here

One may also consider ULP-based comparison; the pros and cons are described here.
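A possible R translation of that check (my sketch, not from the linked article): combine an absolute floor with a relative tolerance so that both tiny and large magnitudes are handled.

near <- function(x, y, tol = 1e-8) {
  abs(x - y) <= tol * max(1, abs(x), abs(y))  # absolute floor of tol, relative otherwise
}
near(6, 6.001, 0.01)     # TRUE  – within 1 % (relative part dominates)
near(294.1/1000, 0.2941) # TRUE  – the earlier comparison now passes
near(0.05, 0.051)        # FALSE – well outside the default tolerance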

Kind regards,
Mittyri