Geometric mean? [Regulatives / Guidelines]

posted by Helmut – Vienna, Austria, 2024-11-11 15:53 – Posting: # 24272

Hi Mittyri & BEQool,

❝ While intuitively, you might think that the geometric mean of a set containing 0 should be 0 (since any product involving 0 is 0), the formal definition and intended purpose of the geometric mean don't align well with this interpretation.


Right. The complete definition [1–3] is $$\overline{x}_\text{geom}=\sqrt[n\phantom{i}]{\prod_{i=1}^{n}x_i}=\sqrt[n]{x_1\cdot x_2\cdots x_n}\phantom{m}\color{Red}{\forall x_i\in}\color{Red}{\mathbb{R}^{+}}\tag{1}$$ In other words, the geometric mean must only be calculated for positive real numbers (\(\small{x_i>0}\)). A correct implementation in software should throw an error if the data contain zero(s).
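A minimal sketch of what such a strict implementation could look like in R. The name gm_strict() is hypothetical and not taken from any package. It refuses non-positive input via stop() and otherwise works on the log scale, which is algebraically the same as \(\small{(1)}\) but avoids overflow of the product for long vectors.

```r
# Hypothetical helper, not part of any package
gm_strict <- function(x) {
  x <- x[!is.na(x)]                        # drop missing values
  if (length(x) == 0) stop("no non-missing values")
  if (any(x <= 0)) {
    stop(sum(x <= 0), " non-positive value(s): geometric mean undefined")
  }
  exp(mean(log(x)))                        # same as prod(x)^(1 / length(x))
}

gm_strict(c(1, 2, 4))                      # 2
try(gm_strict(c(1, 1, 1, 1, 0)))           # error instead of a silent 0
```

Whether to stop with an error or to return NA with a warning is a design choice; an error is harder to overlook in a script.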

❝ Imagine 2 datasets:

c(1,1,1,1,0)

c(0,0,0,0,1)

❝ Then both will have the same geometric mean value. It loses information about the other values and doesn't accurately reflect the "typical value" or "average rate" that the geometric mean aims to represent.

I agree (giving 0 for both sets) but only with the naïve calculation allowing zeros. However, according to \(\small{(1)}\) it must not be calculated for either set. Even GEOMEAN() in Excel ‘knows’ that: it returns the #NUM! error if any value is ≤ 0.


There is no function to calculate the geometric mean in base R, but there is one in each of two packages: psych is happy with zeros (WTF?), whereas EnvStats is not. Try my homebrew:

library(psych)    # contains geometric.mean()
library(EnvStats) # contains geoMean()
# gm(): strict version; gives NA if any value is non-positive, see (1)
gm <- function(x, print = TRUE) {
  x   <- x[!is.na(x)]
  pos <- sign(x) == 1
  msg <- NULL
  if (sum(!pos) >= 1) {
    if (sum(!pos) == 1) {
      msg <- "[1 non-positive value]"
    } else {
      msg <- paste0("[", sum(!pos), " non-positive values]")
    }
    res <- NA
  } else {
    res <- prod(x)^(1 / length(x))
  }
  if (print) {
    cat("gm()            :", res, msg, "\n")
  } else {
    return(res)
  }
}
# gm1(): drops negative values only; zeros are kept and force the result to 0
gm1 <- function(x, print = TRUE) {
  x   <- x[!is.na(x)]
  neg <- sign(x) == -1
  msg <- NULL
  if (sum(neg) >= 1) {
    msg <- paste0("[", sum(neg), " negative(s) excluded]")
    x   <- x[!neg]
  }
  res <- prod(x)^(1 / length(x))
  if (print) {
    cat("gm1()           :", res, msg, "\n")
  } else {
    return(res)
  }
}
# gm2(): drops all non-positive values (zeros and negatives) before calculating
gm2 <- function(x, print = TRUE) {
  x   <- x[!is.na(x)]
  pos <- sign(x) == 1
  msg <- NULL
  if (sum(!pos) >= 1) {
    if (sum(!pos) == 1) {
      msg <- "[1 non-positive excluded]"
    } else {
      msg <- paste0("[", sum(!pos), " non-positives excluded]")
    }
    x <- x[pos]
  }
  res <- prod(x)^(1 / length(x))
  if (print) {
    cat("gm2()           :", res, msg, "\n")
  } else {
    return(res)
  }
}
gm3 <- function(x, print = TRUE) {
  res <- geometric.mean(x)
  if (print) {
    cat("geometric.mean():", res, "\n")
  } else {
    return(res)
  }
}
gm4 <- function(x, print = TRUE) {
  res <- geoMean(x)
  if (print) {
    cat("geoMean()       :", res, "\n")
  } else {
    return(res)
  }
}
am <- function(x, print = TRUE) {
  res <- mean(x, na.rm = TRUE)
  if (print) {
    cat("mean()          :", res, "\n")
  } else {
    return(res)
  }
}

set1 <- c(1, 1, 1, 1, 0)
set2 <- c(0, 0, 0, 0, 1)

gm(set1); gm1(set1); gm3(set1); gm2(set1); gm4(set1); am(set1)

gm()            : NA [1 non-positive value]
gm1()           : 0
geometric.mean(): 0
gm2()           : 1 [1 non-positive excluded]
geoMean()       : NA
Warning message:
In geoMean(x) : Non-positive values in 'x'
mean()          : 0.8


gm(set2); gm1(set2); gm3(set2); gm2(set2); gm4(set2); am(set2)

gm()            : NA [4 non-positive values]
gm1()           : 0
geometric.mean(): 0
gm2()           : 1 [4 non-positives excluded]
geoMean()       : NA
Warning message:
In geoMean(x) : Non-positive values in 'x'
mean()          : 0.2

Only my gm() and geoMean() of EnvStats get it right: both return NA for data containing zeros.
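Why a log-based implementation returns 0 without complaint can be reproduced in base R. This is only a sketch of the mechanism; that geometric.mean() of psych works on the log scale is my assumption, not checked against its current source.

```r
set1 <- c(1, 1, 1, 1, 0)
log(set1)                  # 0 0 0 0 -Inf
mean(log(set1))            # -Inf
exp(mean(log(set1)))       # 0, no warning and no error
```

Since log(0) is -Inf in R and not an error, a single zero drags the mean of the logs to -Inf and the back-transformed result to 0, whatever the other values are.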

❝ Regarding geometric mean: kind of tricky taking into account the information above. The median is better, but why don't you want to follow the method mentioned in BE GL at the first place (arithmetic mean)? Yes, it is bad, but at least not worse than others :-D

Well, the geometric mean is the maximum likelihood estimator of the median of log-normally distributed data. I don’t care what’s stated in the GL. I prefer the median over the arithmetic mean given the expected distribution of concentrations. Not normal… :-D
I also agree with what you wrote about the loss of information. That is another reason why I prefer the median, which keeps it: \(\widetilde{x}\small{\left\{1,1,1,1,0\right\}=1}\) and \(\widetilde{x}\small{\left\{0,0,0,0,1\right\}=0}\).
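Both statements can be checked in base R. The simulation is only an illustration: the parameters (meanlog = 0, sdlog = 1, hence a true median of 1) and the seed are arbitrary assumptions, not study data.

```r
median(c(1, 1, 1, 1, 0))   # 1
median(c(0, 0, 0, 0, 1))   # 0

# Simulated log-normal data; true median = exp(meanlog) = 1
set.seed(123456)
x <- rlnorm(1e5, meanlog = 0, sdlog = 1)
c(geometric  = exp(mean(log(x))),  # estimates the median
  median     = median(x),
  arithmetic = mean(x))            # estimates exp(meanlog + sdlog^2 / 2)
```

With these parameters the geometric mean and the sample median should both be close to 1, whereas the arithmetic mean estimates exp(0.5) ≈ 1.65.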

Note that the ICH M13A states:

Multiple baseline endogenous concentrations should be measured from each subject in the time period before administration of the study drug. The time-averaged baseline [… is] subtracted from post-dose concentrations for those subjects in an appropriate manner consistent with the PK properties of the drug. For the time-averaged method, either the mean or median value may be used.

(my emphasis)

PS: I always used the median in my studies. Was never – ever – questioned by any agency.
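The time-averaged correction quoted above could be sketched as follows. All numbers are made up for one hypothetical subject, and setting negative corrected concentrations to zero is my assumption of what ‘an appropriate manner’ means here.

```r
# Hypothetical data for one subject (arbitrary concentration units)
baseline <- c(10.2, 9.6, 11.1)             # pre-dose endogenous concentrations
postdose <- c(9.8, 25.4, 40.3, 31.0, 18.7, 10.1)

bl   <- median(baseline)                   # time-averaged baseline (median)
corr <- postdose - bl                      # baseline-corrected concentrations
corr <- pmax(corr, 0)                      # assumption: negative values set to zero
round(corr, 1)
```

Replacing median() by mean() gives the other variant allowed by the guideline.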


  1. Netz H. Formeln der Mathematik. München: Hanser; 6th ed. 1986. p. 18.
  2. Nolan D, Speed T. Stat Labs. Mathematical Statistics Through Applications. New York: Springer; 2001. p. 68.
  3. Sachs S, Hedderich J. Angewandte Statistik. Methodensammlung mit R. Berlin: Springer; 12th ed. 2006. p. 76.

Dif-tor heh smusma 🖖🏼 Long live Ukraine!
Helmut Schütz
The Bioequivalence and Bioavailability Forum is hosted by
BEBAC Ing. Helmut Schütz