Normal distribution assessment [General Statistics]

posted by Helmut – Vienna, Austria, 2022-02-20 23:59 – Posting: # 22799

Hi BE-proff,

❝ Let's say I have a randomly generated set of 1 million values.


Before we apply a statistical method, we have to understand the data-generating process.
Therefore: How did you generate your data set? Obtained from RANDOM.ORG? With a hardware random number generator? With software? If the latter, which PRNG? Most software (even Excel since 2010) implements the Mersenne Twister, which, with its period of 2^19937 − 1 ≈ 4.3×10^6001, is fine for generating large data sets. However, VBA still implements an LCG, which is bad for large data sets due to its much shorter period.
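
If the data came from R, a quick way to see which generator was used and where the period comes from is sketched below. The variable names are mine; the period of the Mersenne Twister (MT19937) is 2^19937 − 1, which overflows a double, hence the detour via log10.

```r
# Which generator is R using? The first element is the uniform PRNG;
# it is "Mersenne-Twister" unless the default was changed.
RNGkind()

# Period of the Mersenne Twister (MT19937): 2^19937 - 1.
# Far too large for a double, so work on the log10 scale
# (the "- 1" is negligible at this magnitude).
log10.period <- 19937 * log10(2)
expo         <- as.integer(floor(log10.period)) # power of ten
mant         <- 10^(log10.period - expo)        # leading digits
cat(sprintf("period = %.1f x 10^%d\n", mant, expo))
```

For data generated elsewhere (Excel, VBA, ...), there is nothing to query from R; you have to consult the documentation of the respective software.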

❝ What criterion should be used to check if the set has a normal distribution?


Look at the histogram first. ;-)

set.seed(123456)                                  # for reproducibility
x   <- rnorm(1e6, mean = 0, sd = 1)               # or your data instead
lim <- c(-max(abs(range(x))), max(abs(range(x)))) # for the plots
hist(x, breaks = "FD", freq = FALSE, xlim = lim, col = "bisque", border = NA, las = 1)
rug(x, side = 1, ticksize = 0.02)
legend("topright", x.intersp = 0,
       legend = c(paste("mean(x) =", signif(mean(x), 6)),
                  paste("sd(x) =", signif(sd(x), 6))))

Does it look normal? Happy with the mean (should be ≈0) and the standard deviation (should be ≈1)?
If in doubt, overlay it with a kernel density estimate.

lines(density(x, n = 2^10), lwd = 3, col = "#FF000080")

Does it match? Not sure? Overlay the normal distribution.

curve(dnorm, lim[1], lim[2], n = 2^10, lwd = 2, col = "#0000FF80", add = TRUE)

Still in doubt?

plot(lim, lim, type = "n", xlab = "Theoretical Quantiles",
     ylab = "Sample Quantiles", main = "Normal Q-Q Plot", las = 1)
grid()
qq <- qqnorm(x, plot.it = FALSE)
points(qq$x, qq$y, pch = 21, cex = 1.25, col = "#87CEFA80", bg = "#87CEFA80") # patience...
qqline(x)
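
If plotting one million points tests your patience too much, a thinned Q-Q plot may do. This is only a sketch of mine: it compares sample and theoretical quantiles on a grid of 1,000 probabilities (an arbitrary choice) instead of plotting every value. Note that the grid stops short of the extreme tails, which is where departures from normality often show up, so use it as a quick look and not as a replacement for the full plot.

```r
set.seed(123456)                             # for reproducibility
x  <- rnorm(1e6, mean = 0, sd = 1)           # or your data instead
p  <- ppoints(1000)                          # grid of probabilities (arbitrary size)
th <- qnorm(p)                               # theoretical quantiles
sa <- quantile(x, probs = p, names = FALSE)  # matching sample quantiles
plot(th, sa, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot (thinned)", las = 1, pch = 21,
     col = "#87CEFA80", bg = "#87CEFA80")
grid()
abline(a = 0, b = 1)                         # line of identity, expected if x ~ N(0, 1)
```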

If you insist on a test comparing the data’s empirical cumulative distribution function to the cumulative distribution function of the standard normal:

ks.test(x, "pnorm", alternative = "two.sided")
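
To see what the test actually compares, the two-sided statistic D (the largest vertical distance between the ECDF and the hypothesized CDF) can be reproduced by hand. A minimal sketch with a smaller sample, only to keep it quick; the object names are mine.

```r
set.seed(123456)              # for reproducibility
x   <- rnorm(1e4)             # smaller set than above, only for speed
n   <- length(x)
res <- ks.test(x, "pnorm")    # H0: standard normal (mean = 0, sd = 1)
Fx  <- pnorm(sort(x))         # hypothesized CDF at the ordered data
i   <- seq_len(n)
# the ECDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so check the distance on both sides of each jump
D   <- max(pmax(i / n - Fx, Fx - (i - 1) / n))
c(by.hand = D, ks.test = unname(res$statistic), p.value = res$p.value)
```

Without further arguments "pnorm" means mean = 0 and sd = 1. Other values can be passed, e.g., ks.test(x, "pnorm", mean = 10, sd = 2), but they must be specified in advance. If they are estimated from the same data, the p-value is no longer valid (see ?ks.test).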


Dif-tor heh smusma 🖖🏼 Довге життя Україна!
Helmut Schütz