Normal distribution assessment [General Statistics]

posted by Helmut – Vienna, Austria, 2022-02-20 23:59 – Posting: # 22799

Hi BE-proff,

❝ Let's say I have a randomly generated set of 1 million values.


Before we apply a statistical method, we have to understand the data generating process.
Therefore: How did you generate your data set? Obtained from RANDOM.ORG? With a hardware random number generator? With software? If so, which PRNG? In the last case, most software (even Excel since 2010) implements the Mersenne Twister, which with its period of 2^19937 − 1 ≈ 4.3×10^6001 is fine for generating large data sets. However, VBA still implements an LCG, which is bad for large data sets due to its much shorter period.
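To see why the period matters, here is a minimal sketch in R. The first line only reports which generator your session uses. The function lcg() and its parameters (a, c, m) are my own toy choices for illustration and not those of VBA or any other product. With m = 2^16 the sequence repeats after 65,536 draws, so a data set of one million values would cycle through the same numbers about fifteen times.

```r
RNGkind()[1]                 # generator in use; R's default is the Mersenne Twister

# Toy linear congruential generator: s[i+1] = (a * s[i] + c) mod m
# (hypothetical helper; a, c, m are illustrative assumptions)
lcg <- function(n, seed = 1, a = 25173, c = 13849, m = 2^16) {
  x <- numeric(n)
  s <- seed
  for (i in seq_len(n)) {
    s    <- (a * s + c) %% m # next state
    x[i] <- s
  }
  x
}

m <- 2^16
s <- lcg(m + 5)
length(unique(s[1:m]))                # number of distinct states within one period
identical(s[1:5], s[(m + 1):(m + 5)]) # does the sequence start over after m draws?
```

These parameters satisfy the conditions for a full period (c odd, a − 1 divisible by 4, m a power of two), so every state appears exactly once before the sequence starts over.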

❝ What criterion should be used to check whether the set is normally distributed?


Look at the histogram first. ;-)

set.seed(123456)                                  # for reproducibility
x   <- rnorm(1e6, mean = 0, sd = 1)               # or your data instead
lim <- c(-1, 1) * max(abs(x))                     # symmetric limits for the plots
hist(x, breaks = "FD", freq = FALSE, xlim = lim, col = "bisque", border = NA, las = 1)
rug(x, side = 1, ticksize = 0.02)
legend("topright", x.intersp = 0,
       legend = c(paste("mean(x) =", signif(mean(x), 6)),
                  paste("sd(x) =", signif(sd(x), 6))))

Does it look normal? Happy with the mean (should be ≈0) and the standard deviation (should be ≈1)?
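If you want a number for "≈", a rough sketch: under a true standard normal the sample mean has a standard error of 1/√n and the sample standard deviation one of approximately 1/√(2(n − 1)). The names se.mean, se.sd, z.mean, and z.sd are mine. Values of |z| well above 3 would make me suspicious.

```r
set.seed(123456)                        # for reproducibility
x       <- rnorm(1e6, mean = 0, sd = 1) # or your data instead
n       <- length(x)
se.mean <- 1 / sqrt(n)                  # standard error of the mean if sd = 1
se.sd   <- 1 / sqrt(2 * (n - 1))        # approximate standard error of the sd
z.mean  <- (mean(x) - 0) / se.mean      # deviation of the mean in SE units
z.sd    <- (sd(x) - 1) / se.sd          # deviation of the sd in SE units
signif(c(mean = mean(x), se.mean = se.mean, z.mean = z.mean,
         sd = sd(x), se.sd = se.sd, z.sd = z.sd), 4)
```

With one million values se.mean is 0.001, so a mean of, say, 0.01 would already be ten standard errors away from zero.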
If in doubt, overlay the histogram with a kernel density estimate.

lines(density(x, n = 2^10), lwd = 3, col = "#FF000080")

Does it match? Not sure? Overlay the normal distribution.

curve(dnorm, lim[1], lim[2], n = 2^10, lwd = 2, col = "#0000FF80", add = TRUE)

Still in doubt?

plot(lim, lim, type = "n", xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles", main = "Normal Q-Q Plot", las = 1)
grid()
qq <- qqnorm(x, plot.it = FALSE)
points(qq$x, qq$y, pch = 21, cex = 1.25, col = "#87CEFA80", bg = "#87CEFA80") # patience...
qqline(x)
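If you run out of patience: a thinned variant, as a sketch only. Instead of one million points it plots the sample quantiles at 1,000 evenly spaced probabilities against the matching normal quantiles. The names p, tq, and sq are mine, and I draw the identity line with abline() instead of qqline().

```r
set.seed(123456)                               # for reproducibility
x  <- rnorm(1e6, mean = 0, sd = 1)             # or your data instead
p  <- ppoints(1000)                            # 1,000 evenly spaced probabilities
tq <- qnorm(p)                                 # theoretical quantiles
sq <- quantile(x, probs = p, names = FALSE)    # matching sample quantiles
plot(tq, sq, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (thinned)", las = 1, pch = 21,
     col = "#87CEFA80", bg = "#87CEFA80")
grid()
abline(a = 0, b = 1)                           # identity line of the standard normal
```

Mind that the thinned plot stops at the 0.05% and 99.95% quantiles. If you are interested in the extreme tails, use the full plot.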

If you insist on a test comparing the data’s empirical cumulative distribution function to the cumulative distribution function of the standard normal distribution:

ks.test(x, "pnorm", alternative = "two.sided")
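What the test does, as a sketch: the statistic D is the largest vertical distance between the empirical CDF and the CDF of the standard normal. Below it is computed by hand and compared with the result of ks.test(). The shift of 0.02 in the second part is a hypothetical example of mine. It shows that with one million values even a deviation you would hardly see in the histogram gives a tiny p-value.

```r
set.seed(123456)                          # for reproducibility
x   <- rnorm(1e6, mean = 0, sd = 1)       # or your data instead
n   <- length(x)
res <- ks.test(x, "pnorm", alternative = "two.sided")

# D by hand: the ECDF jumps from (i - 1) / n to i / n at the i-th sorted value
Fx  <- pnorm(sort(x))                     # standard normal CDF at the sorted data
i   <- seq_len(n)
D   <- max(i / n - Fx, Fx - (i - 1) / n)  # largest distance above or below
c(by.hand = D, ks.test = unname(res$statistic), p.value = res$p.value)

# Hypothetical small deviation: mean 0.02 instead of 0
y     <- x + 0.02
res.y <- ks.test(y, "pnorm", alternative = "two.sided")
c(D = unname(res.y$statistic), p.value = res.y$p.value)
```

Hence, with such a large data set judge the size of D (and the plots above) and not only the p-value.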


Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz

