## Observations in a linear model [General Sta­tis­tics]

Dear all,

off-list I was asked whether it makes a difference if we have single observations or replicates. When you look at the standard equations, clearly the answer is no.
If you don’t believe that, an -script for the simulation example at the end.

x = 1, 2, 3, 4, 5  estimate    a.hat  b.hat      mean +0.00001 1.9999 x = 1, 1, 3, 5, 5  estimate    a.hat  b.hat      mean -0.00003 2.0000 x = 1, 1, 1, 5, 5  estimate    a.hat  b.hat      mean +0.00011 1.9999 x = 1, 1, 1, 1, 5  estimate    a.hat  b.hat      mean +0.00012 1.9999

Very small differences in the estimated intercept, practically identical slope.

lr <- function(x, y, n) { # standard equations are faster than lm(y ~ x)   sum.y <- sum(y)   sum.x <- sum(x)   b.hat <- (sum(x * y) - sum.x * sum.y / n) / (sum(x^2) - sum.x^2 / n)   a.hat <- sum.y / n - sum.x / n * b.hat   return(c(a.hat, b.hat)) } a     <- 0   # intercept b     <- 2   # slope nsims <- 1e5 # number of simulations x.n   <- 5   # number of x-levels x.min <- 1   # minimum level x.max <- 5   # maximum level x1    <- x2 <- x3 <- x4 <- seq(x.min, x.max, length.out = x.n) x2[c(c(1:2), c((x.n-1):x.n))] <- c(rep(x.min, 2), rep(x.max, 2)) x3[c(c(1:3), c((x.n-1):x.n))] <- c(rep(x.min, 3), rep(x.max, 2)) x4[1:4] <- rep(x.min, 4) s1    <- data.frame(sim = rep(1:nsims, each = x.n), x = x1) # singlets s2    <- data.frame(sim = rep(1:nsims, each = x.n), x = x2) # 2 duplicates s3    <- data.frame(sim = rep(1:nsims, each = x.n), x = x3) # tri- and duplicate s4    <- data.frame(sim = rep(1:nsims, each = x.n), x = x4) # quadruplet t     <- c(paste("x =", paste(signif(x1, 3), collapse = ", "), "\n"),                  paste("x =", paste(signif(x2, 3), collapse = ", "), "\n"),                  paste("x =", paste(signif(x3, 3), collapse = ", "), "\n"),                  paste("x =", paste(signif(x4, 3), collapse = ", "), "\n")) set.seed(123456) s1$y <- rnorm(n = nsims * x.n, mean = a + b * s1$x, sd = 0.5) set.seed(123456) s2$y <- rnorm(n = nsims * x.n, mean = a + b * s2$x, sd = 0.5) set.seed(123456) s3$y <- rnorm(n = nsims * x.n, mean = a + b * s3$x, sd = 0.5) set.seed(123456) s4$y <- rnorm(n = nsims * x.n, mean = a + b * s4$x, sd = 0.5) c1    <- c2 <- c3 <- c4 <- data.frame(sim = 1:nsims, a.hat = NA_real_, b.hat = NA_real_) pb    <- txtProgressBar(style = 3) for (i in 1:nsims) {   c1[i, 2:3] <- lr(s1$x[s1$sim == i], s1$y[s1$sim == i], x.n)   c2[i, 2:3] <- lr(s2$x[s2$sim == i], s2$y[s2$sim == i], x.n)   c3[i, 2:3] <- lr(s3$x[s3$sim == i], s3$y[s3$sim == i], x.n)   c4[i, 2:3] <- lr(s4$x[s4$sim == i], s4$y[s4$sim == i], x.n)   setTxtProgressBar(pb, i / nsims) } close(pb) comp  <- data.frame(estimate = rep("mean", 4),                     a.hat = sprintf("%+.5f", c(mean(c1$a.hat), mean(c2$a.hat),                                                mean(c3$a.hat), mean(c4$a.hat))),                     b.hat = sprintf("%.4f", c(mean(c1$b.hat), mean(c2$b.hat),                                               mean(c3$b.hat), mean(c4$b.hat)))) cat(t); print(comp[1, ], row.names = FALSE) cat(t); print(comp[2, ], row.names = FALSE) cat(t); print(comp[3, ], row.names = FALSE) cat(t); print(comp[4, ], row.names = FALSE)

Dif-tor heh smusma 🖖🏼 Довге життя Україна! Helmut Schütz The quality of responses received is directly proportional to the quality of the question asked. 🚮
Science Quotes  Ing. Helmut Schütz 