The “sin” (sinus or evil action) destroying the “normality statistical test”
[Not-normal Sinus bells]
by Daddy Tophe Meunier, French tourist in Bacolod City, July 10th-15th 2015, updated July 18th + August 8th & 10th
(written in English in case my niece Raisha, with mathematical talent, could check/correct/improve)


Introduction
10 parts
Conclusion
(supplement 1)
(supplement 2)
(supplement 3)

Introduction
Parametric statistical tests require the data to be normally distributed, and a mistake is usually made (up to FDA/NCBI guidelines) when checking this normality: what is verified is the “possibility” of normality and not at all the “proof” of it (with a huge misunderstanding about the risk attached to the conclusion, the claimed risk being almost the opposite of the hidden risk actually taken). To understand how big this mistake is, it is necessary to add an alternative hypothesis H1 alongside the null hypothesis H0 of a Normal/Gaussian distribution, in order to perform a true hypothesis test, not a pretended check.

1- Sinus principle
The sinus shape is not a bell but an S-shape going up and down between -1 and +1:

Anyway, if you take only one half turn of it, between 0 and 180, this is a dome. And if you take the square of it, you get a bell (even more flattened with its square, or power 4, 6, 8 and so on) :

This completely denies the lesson “you have a bell-shaped distribution, so this is normal, Gaussian”. No, there are other bells, with a main advantage: their limits are precise figures, not infinite. For instance, if you are measuring the size of a metal screw, a sinus bell shape can tell you that the minimum cannot be below zero, while the Gaussian bell shape is crazy, pretending that a negative size is rare but very possible (and pretending as well that a screw bigger than the Solar System is also rare but with a frequency that is not zero, which is completely wrong, as the machine building the screw has to contain it).
So, we are going to prepare a process for defining H1 as the best sinus corresponding to the data (or expected data, if this must be defined before actual measure). Usually, the data gives: average, standard deviation (and encountered minimum and maximum). How to define the sinus law which is related to this? For instance, here above, the source was 0 to 180, and not average 0.230” std.dev.0.020”, there is a need to establish the settings for matching. The average (and/or mode, median) is 90 for the apex of the bell, but how to handle the standard deviation on one side, the minimum-maximum and power on the other side?

2- First study of the sinus^k law
For the sinus graphic between 0 and 180, the sinus function gives the height but not the standard deviation. I do not know the general calculation expressing the standard deviation of a sinus law, but it is possible to measure it through a little model. I used an Excel spreadsheet, with each integer angle corresponding to a number of measured pieces, the maximum class including 200 pieces. This is not the mathematical answer for the continuous sinus law but an approximated value.
X= degrees, Y=sin(X/180*PI() radians)*200 pieces, rounded to the closest integer (with the Excel default rule for mid-points: up to the integer above).
(Note: instead of the standard calculation rules m=sum(x)/N and s=sqrt(sum((x-m)^2)/N), I have (reinvented then checked and) used m=sum(n*x)/sum(n) and s=sqrt(sum(n*(x-m)^2)/sum(n))).
The result, with 181 classes 0, 1 to 180, and 0 to 200 pieces per class is:
Sinus: 22,920 pieces --> average 90, standard deviation: 39.1672, minimum: 1, maximum: 179
Sinus^2: 18,000 pieces --> average 90, standard deviation: 32.5361, minimum: 3, maximum: 177
Sinus^4: 13,502 pieces --> average 90, standard deviation: 25.4588, minimum: 13, maximum: 167
Sinus^6: 11,250 pieces --> average 90, standard deviation: 21.5897, minimum: 22, maximum: 158
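The spreadsheet model above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `sin_power_table` and `weighted_stats` are mine, and the Excel rounding rule is imitated with `floor(value + 0.5)`. It uses the weighted formulas m=sum(n*x)/sum(n) and s=sqrt(sum(n*(x-m)^2)/sum(n)).

```python
import math

def sin_power_table(k, peak=200):
    """Pieces per integer degree 0..180 for a sin^k bell, mid-points rounded up."""
    return [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
            for x in range(181)]

def weighted_stats(counts):
    """Total, average, SD, lowest and highest non-empty class of a count table."""
    total = sum(counts)
    mean = sum(n * x for x, n in enumerate(counts)) / total
    var = sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total
    occupied = [x for x, n in enumerate(counts) if n > 0]
    return total, mean, math.sqrt(var), occupied[0], occupied[-1]

for k in (1, 2, 4, 6):
    total, mean, sd, lo, hi = weighted_stats(sin_power_table(k))
    print(f"sin^{k}: {total} pieces, average {mean:.2f}, SD {sd:.4f}, min {lo}, max {hi}")
```

Changing `peak` is a quick way to explore how much the standard deviation and the minimum-maximum depend on the chosen size.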
This is very interesting but several things introduce further questions:
- Is the decrease of Standard Deviation related to the power increase (as expected) or to the number of pieces decrease (unexpectedly)? A standardized checking should be performed, with the biggest amount per class being possibly different from 200. For instance with a total of about 20,000 pieces, constant, and variable highest value per class to reach this.
- Is the Standard Deviation measurement precise or highly impacted by the chosen size? Of course minimum and maximum, averaged to 1 piece or more, depends completely on this size (and maybe depends on the step, here 1.0) but what about the SD?
- As the step between k=2 or 4 or 6 is high, can we perform the same with k=3 or k=3.5 even if this does not exclude negative values anymore (outside the 180 range, though)?

3- Complement to the sinus^k study
First, the odd figures and decimal figures seem welcome:

Now about size effect on SD (and minimum-maximum), a further check gives this:

So, if I have a data distribution of average 90 and SD 25.5 the perfect H1 would be sin^k with k=4 (k=3 if SD=28.4). Practically this means calculating the class x theoretical number of pieces with X=x/180*Pi and Y=sin^k of this. But…
- What if the average is 110 not 90? (multiply or add to change the origin ?)
- What about the number per classes when there are not 20,000 measures but 200?

4- First adaptation to a “not degree distribution”
As I said almost by fantasy, at random, classical data could give: average 0.230”, SD 0.020”. The Gaussian model (H0) for that is theoretically the law 1/(SD*sqrt(2*Pi)) * exp(-(x-m)^2/(2*SD^2)), but what about the related sinus?
Obviously there are 2 kinds of parameters: Di for changing the degrees into inches and Si for changing the scale to match the SD of the data.
A/ If we adjust first the maximum, there are also 2 ways:
  A1/ Addition: D1= Average-90 (here D1=-89.77), so 90--> 0.23 and 89.77-->0
  A2/ Multiplication: D2=Average/90 (here D2=0.002555), so 90-->0.23 and 89.77--> 0.2294
B/ Then changing the scale should provide the result, but this is less easy than expected: of course the SD 0.02” can be obtained from 32.5 (sin^2) or 21.6 (sin^6) [or from these values multiplied by D2, if the A2 way is chosen], but which one should be chosen?
I think I need another parameter, which could be the kurtosis of the distribution. According to Wikipedia, this kurtosis E[(x-m)^4]/SD^4 (minus 3, which gives a zero result for normal/Gaussian) describes the sides (narrow or big) compared to the extremes (average and rare values), so this is losing information, and I prefer checking the sinus^k distributions part by part.
Anyway, it is not certain that such a conversion is necessary, if the shape can be analysed by itself through some standardisation, just as the data x (giving m, SD) are often compared as (x-m)/SD to Normal(0,1) rather than directly to Normal(m,SD).
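To make the conversion question concrete, here is a minimal sketch of one possible answer, which is an assumption of mine and not a settled choice: a single affine map sending 90 degrees to the data average and one model SD to the data SD. The name `degree_to_unit` is hypothetical. The point of interest is that 0 and 180 degrees then give finite limits for the size.

```python
import math

def sin_power_sd(k, peak=200):
    """SD (in degrees) of the discrete sin^k model on classes 0..180."""
    counts = [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
              for x in range(181)]
    total = sum(counts)
    mean = sum(n * x for x, n in enumerate(counts)) / total
    return math.sqrt(sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total)

def degree_to_unit(deg, k, target_mean, target_sd):
    """Assumed affine map: 90 degrees -> target_mean, one model SD -> target_sd."""
    return target_mean + (deg - 90) * target_sd / sin_power_sd(k)

for k in (2, 6):
    lo = degree_to_unit(0, k, 0.230, 0.020)
    hi = degree_to_unit(180, k, 0.230, 0.020)
    print(f"sin^{k}: possible sizes from {lo:.4f} to {hi:.4f} inch")
```

With this map the choice of k is still open: each k gives the same average and SD but different finite limits, which is why another criterion is needed.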

5- Standardized comparison
In a normal/Gaussian distribution, there are several famous laws: 90% of the values are inside m ± 1.645 SD, 95% inside m ± 1.96 SD (almost 2 SD), 99.73% inside m ± 3 SD.
And in the same way :

This could be compared on about 20,000 pieces to our sinus distributions. But, in order to ease the calculation of frequency, I prefer another way, with sum at the lower end of distribution:

Then I could complete my previous model sheets (sin^1.0 to sin^7 or more if interesting) with 2 columns: x’ = (x-average)/SD (aka "t") and p’=sum of numbers/N (aka "cumul.f"). If I get a x’ for p’ = 4% and for 7%, I will not create a new model but estimate linearizing between both values.
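The two extra columns and the linear estimate between two classes can be sketched as follows. This is a sketch under assumptions: the helper names are mine, the cumulative frequency p’ includes the class itself, and the normal reference comes from Python's standard `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def sin_power_table(k, peak=200):
    """Pieces per integer degree 0..180 for a sin^k bell, mid-points rounded up."""
    return [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
            for x in range(181)]

def standardized_quantile(counts, p):
    """x' = (x - average)/SD at cumulated frequency p, interpolating linearly."""
    total = sum(counts)
    mean = sum(n * x for x, n in enumerate(counts)) / total
    sd = math.sqrt(sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total)
    running, previous = 0, 0.0
    for x, n in enumerate(counts):
        running += n
        cumul = running / total
        if cumul >= p:
            # linear estimate between class x-1 and class x
            x_at_p = x if x == 0 else (x - 1) + (p - previous) / (cumul - previous)
            return (x_at_p - mean) / sd
        previous = cumul
    return (len(counts) - 1 - mean) / sd

for p in (0.025, 0.05, 0.25, 0.5):
    z_sin = standardized_quantile(sin_power_table(8.5, 419), p)
    z_normal = NormalDist().inv_cdf(p)
    print(f"p'={p:.3f}: sin^8.5 x'={z_sin:+.3f}, normal {z_normal:+.3f}")
```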
And this gives the following:

Here, the best (closest to normal) seems the sin^8.5 law, having the minimum relative difference with the normal distribution: 3.07% (for the 8 points arbitrarily chosen here).
Other interesting points:
– sin^6 has 95% of its measures within 1.960 SD, like Gauss
– sin^7 has 95.45% of its measures within 2.00 SD, like Gauss
This sin^8.5 having been added in order to get an optimum, it is useful to add the details below like for the first ones:
181 classes 0, 1 up to 180, 0 to 419 pieces per class, total 20,037 pieces --> average 90, standard deviation: 18.5584, minimum: 27, maximum: 153
So, my decision is to take sin^8.5 as the most normal sinus distribution.
Now, this is ready for comparison tests: either Shapiro-Wilk or Kolmogorov-Smirnov. But the imperfect model with 181 discontinuous steps should be checked first, so as not to lose time if there is a bias in it.

6- Gauss in the same approach
Using the same 181 classes and about 20,000 pieces, would a normal distribution give a perfect result (compared to the theoretical continuous normal law) the way I analyzed it?
To check this, I have selected the 0 to 180 range, degree by degree. With an average of 90 and first an SD of 30 (for almost everything within 3 SD), alas there were not 0 pieces at 0 and 180 but many more, which means this was not at all like the sin^k. So I have selected as SD: 90/4=22.5 (for everything within 4 SD, 0.3 piece being rounded to zero), and the model, with a multiplier of 20,010, gave a total of 20,001 pieces, with average 90 and SD 22.49 (close to the SD of sin^5.5). The x’ checking also gave surprising results:

The average relative result -1.3% (like sin^8.5) confirms that sin^8.5 is the best, without any need to calculate everything again with more classes and a different step, nor to turn to the continuous law with searches in books for integral calculation in trigonometry.
With this discontinuous tool, it is also possible to add a kurtosis test and check it (its value being 0 for a Gaussian distribution, according to Wikipedia). But the Excel kurtosis function has a maximum of 256 values and does not handle a table of several x with their related n. Well, I have programmed a column with n*(x-m)^4, and this gives a final result of beta2=753,657 (gamma2=753,654), not zero at all. Thinking about it, a zero value is impossible for this raw figure: a power 4 transforms the negative values into positive ones, so the average of all-positive values is positive. The zero announced for a Gaussian concerns the standardized figure, the fourth moment divided by SD^4, minus 3: here 753,657/22.49^4 is about 2.95, so the result is about -0.05, close to zero as expected.
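A kurtosis check on a table of classes x with their counts n can be sketched like this, assuming the usual definition of excess kurtosis (fourth central moment divided by SD^4, minus 3). The function names are mine, and the Gaussian table reuses the settings above (average 90, SD 22.5, multiplier 20,010) with mid-points rounded up.

```python
import math
from statistics import NormalDist

def gauss_table(mean=90.0, sd=22.5, multiplier=20010):
    """Pieces per integer class 0..180 for the discretized Gaussian model."""
    pdf = NormalDist(mean, sd).pdf
    return [math.floor(multiplier * pdf(x) + 0.5) for x in range(181)]

def excess_kurtosis(counts):
    """Weighted fourth central moment / SD^4 - 3 (zero for an exact Gaussian)."""
    total = sum(counts)
    m = sum(n * x for x, n in enumerate(counts)) / total
    m2 = sum(n * (x - m) ** 2 for x, n in enumerate(counts)) / total
    m4 = sum(n * (x - m) ** 4 for x, n in enumerate(counts)) / total
    return m4 / m2 ** 2 - 3

counts = gauss_table()
print(sum(counts), "pieces, excess kurtosis", round(excess_kurtosis(counts), 3))
```

The result is slightly negative rather than exactly zero, because the rounded table has no pieces in the far tails.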

7- Normality test of sinus^k
Before using sin^k (with k=8.5 probably) as an H1 hypothesis, let us try to check the normality of them.
The Wikipedia definition of the Shapiro-Wilk test is unclear: the definition of W relies on an unexplained T and covariance matrix, there is no table to transform the calculated W into a p-value, and Excel provides neither a calculation formula for W nor an inverse formula for its p-value.
(What is interesting, though, is that the Shapiro-Wilk test significantly rejects the normal hypothesis H0 if the p-value is below 5% and does not reject it if the p-value is above 5%. I would add: it would be completely erroneous to think that this non-rejection with a 5% risk (of wrong rejection) means an acceptance with a 5% risk (of wrong acceptance). With a 5% threshold, the p-values 80% and 8% are accepted but the p-values 2% and 0.002%, very improbable according to the normal law, are rejected, BUT the mistaken interpretation would pretend these last ones are also accepted with a tiny 0.001% risk (of acceptance, because accepting the normality hypothesis is what is done), which is a self-contradiction.)
Well, let us turn to Kolmogorov-Smirnov test: the Wikipedia explanation of it seems very complicated, but I try to simplify it. The maximum absolute difference of cumulated frequency and cumulated normal probability multiplied by sqrt(N) has a probability 5% to be above 1.36. The result is this:

And this is simply wonderful:
– sin^k is confirmed to be significantly different from normal (result 1.96 to 4.50 > 1.36)
– Gauss modelled our way, in 181 steps (not an infinity) at 4 SD and 20,000 pieces, is not rejected as different from normal (result 1.26 < 1.36)
– sin^8.5 is confirmed as the sinus closest to normal (result 1.96 as the minimum of all sinus)
– there is a tiny difference between sin^8.5, 8, 9, so no need to look for a better approximation like k= 8.3 or 8.7 instead of 8.5.
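The simplified Kolmogorov-Smirnov rule used here (maximum absolute difference between cumulated frequency and cumulated normal probability, multiplied by sqrt(N), compared to 1.36) can be sketched as below. Assumptions of mine: the function names, the cumulated frequency including the class itself, and the normal law taking the average and SD measured on the table.

```python
import math
from statistics import NormalDist

def sin_power_table(k, peak=200):
    """Pieces per integer degree 0..180 for a sin^k bell, mid-points rounded up."""
    return [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
            for x in range(181)]

def gauss_table(mean=90.0, sd=22.5, multiplier=20010):
    """Pieces per integer class 0..180 for the discretized Gaussian model."""
    pdf = NormalDist(mean, sd).pdf
    return [math.floor(multiplier * pdf(x) + 0.5) for x in range(181)]

def ks_index(counts):
    """max |cumulated frequency - normal cumulated probability| * sqrt(N)."""
    total = sum(counts)
    mean = sum(n * x for x, n in enumerate(counts)) / total
    sd = math.sqrt(sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total)
    normal = NormalDist(mean, sd)
    running, worst = 0, 0.0
    for x, n in enumerate(counts):
        running += n
        worst = max(worst, abs(running / total - normal.cdf(x)))
    return worst * math.sqrt(total)

for label, counts in (("sin^1", sin_power_table(1)),
                      ("sin^8.5", sin_power_table(8.5, 419)),
                      ("Gauss", gauss_table())):
    index = ks_index(counts)
    print(f"{label}: {index:.2f}", "rejected" if index > 1.36 else "not rejected")
```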

8- Final test, true
Usually, a test of normality is performed only against H0 normal at 5% risk (rejected if probability <5%, not rejected if probability >5%). I have found 2 examples of it:
A/ In French Wikipedia “droite de Henry”:
Marks (French “notes”) always between 0 and 20, with on this day in this class: 10% of values up to 4, 30% up to 8, 60% up to 12, 80% up to 16, with a conclusion: normal. With calculated t and graphically estimated average and SD.
These round figures, 10% by 10%, seem to mean 10 pupils, with 1 up to 4, 3 up to 8, 6 up to 12, 8 up to 16 (and 10 up to 20). But why not give the 10 marks, to understand perfectly the basis of the calculations (as there are several ways to calculate)? I cannot succeed in calculating this SD, under several hypotheses (up to 4 could be 4 or be 2, and so on):


So I try an interpretation of mine, for these data, and related calculations:

The result is impressive: yes, the normal hypothesis is not rejected (result <1.36) BUT the alternative hypothesis H1 (sin^8.5) is not rejected either, so the conclusion is “no conclusion at all”, and not “H0 (normality) is confirmed”. Moreover, H1 is closer to the data than H0 (lower maximum difference) so… choosing H0 (like Wikipedia concluding "confirmed normal", with which I disagreed) is taking a risk above 50% in a Bayesian equiprobability context, while the announced risk (for result <1.36) was “less than 5%”. This is a mathematical mistake and a shame all at once (and dishonesty if money is involved somehow, as in the acceptance/validation of a commercial product).
A graphical aspect helps showing how close normal and sin^8.5 are:


B/ In CLSI guideline about capability (EP17-A2 page 55/80 = 45/67): (converting the histogram into counts and performing calculations of mine):

Here the situation is similar, with a different result: yes, the normal hypothesis is not rejected (result <1.36) BUT the alternative hypothesis H1 (sin^8.5) is not rejected either, so the conclusion is also “no conclusion at all”, and not “H0 (normality) is confirmed”; while, here, H1 is not closer to the data than H0 (higher maximum difference). More data is absolutely required to separate the H0 and H1 hypotheses.
The graphical aspect shows how imperfect both normal and sinus models are, similar to one another but the data shows a different shape:


9- Double-checking and consequences
For the Henry data, the normal law calculated there above was 2.6% of values below zero or equal to it (and sin^8.5 said 2.7%) while a further step down provides absurdity; 1.7% estimated frequency for minus-1 and below, while a negative score is impossible by principle, the overall range being 0 to 20, strictly:

This means precisely that both H0 and H1 are completely wrong, with a 100% risk of error in accepting one of them.
Saying “1.7% of values are below minus-1” is of course a description of the whole population, with 0.0% still possible on one rare sample (just as 3.4%, the other side of 1.7% ± 1.7%) BUT another sample or many other samples would completely deny this: we do not have “sometimes 0.0% and 3.4% with a tendency towards 1.7%”, we have “always 0.0% (0.0% ± 0.0%)” for minus-1 and below. So the process of analysis was completely wrong. And this confirms the true analysis: when H0 and H1 are both not rejected, you must not accept H0 but you must greatly increase your number of values, until one of them is rejected (or both, like here, then finding a proper H1’).
Besides, one of the main advantages of the sin^k hypothesis (a minimum and maximum that are not infinite) has been lost, trying to be as close as possible to a normal law. Another approach would be not to copy a normal law but to fit the data.

10- Far from normal but better?
For this new approach, the basis is 0% for minus-1 (and below), 0% for 21 (and above), rejecting completely the Gaussian model, proven wrong.
A sinus^k with 0 for 0/20 and 180 for 20/20 would not be appropriate, though, because 0 and 20 would give 0% (0.0% ± 0.0%), while they are very possible. So it is better to say that -1 (“/20”) to 21 (“/20”) correspond to 0 to 180, in 23 classes (1 to 20, + 21 and 0 and -1).
As the average is 11, not the central 10, this is not simply a sin^k with central maximum.
The correspondence between score/20 and degree would be:
-1 --> 0 ; Average 11 --> 90 (with 0 to 10 by step of 90/12 = 7.5) ; 21 --> 180 (with 12 to 20 by step of 90/10 = 9)
That could mean a sin^k below the average and sin^k’ above the average. To explore this way, I will come back to the raw analysis of a few k, with a step 2 (k=1, 2, 4, 6, 8).
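The correspondence between score/20 and degree given above is a two-slope linear map, which can be sketched as follows (the function name `score_to_degree` is mine):

```python
def score_to_degree(score, low=-1.0, average=11.0, high=21.0):
    """Map a score to degrees: low -> 0, average -> 90, high -> 180."""
    if score <= average:
        # 12 steps of 7.5 degrees between -1 and 11
        return 90.0 * (score - low) / (average - low)
    # 10 steps of 9 degrees between 11 and 21
    return 90.0 + 90.0 * (score - average) / (high - average)

for score in (-1, 0, 5, 11, 16, 20, 21):
    print(score, "-->", score_to_degree(score))
```

The different step below and above the average is what makes the resulting sinus asymmetrical.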

There are several learnings from this:
– Getting about 60% for the average 11 is related to the asymmetry problem mentioned above, though these 5 very imperfect asymmetrical sinus (sin^k with k= 1 or 2, 4, 6, 8) are not rejected (result below 1.36), which means the normality check is not a challenge at all, so many alternative hypotheses being not rejected either.
– Surprisingly, sin^8 is not at all the best here but sin^1 (and with a worse adjustment than was obtained using the normal law, or sin^8.5 with its impossible multitude of negative values like the “normal” way).
– As the wrong normal way (and big-range sin^8.5) gives better results than the optimum here, there is no need for an asymmetrical below/above power (it seems power 2 below average would decrease the 13.9% figure to 11.1%, and power 1 above would decrease the 17.0% to 13.4%, but the maximum would stay 13.4%, above the normal 11.1% in a previous chapter).
– Another approach could be "keeping the symmetrical sinus centered on the average with a maximum above 21 (e.g. 23) and making the sums 20=20+above, 0=0+below", but that would create a multimodal roller-coaster distribution, far from the expected "very simple model".
This means this kind of adjustment is completely wrong; it would be far better to graphically use Bezier nodal points on the cumulative frequency, with a direction and length at the minimum, a direction at the average with different lengths below and above, and a direction and length at the maximum. This would give a smooth sinusoid, allowing calculations, but most of the time this will not be the normal/Gaussian way at all.
Basis:

The fact that the sinus do not give 50% for the average 11 is due (as explained above) to the imperfect asymmetry, having different steps below and above average. As this sinus^k with k=1 or 2 have been judged worse than the big-range sin^8.5, we will not insist much trying to correct this.
Approximation is performed manually/visually using Corel Draw X7: (the new estimate being the black thick line)

This can be modeled mathematically by 2 parabolic curves, one above the average (p=a1*x^2+b1*x+c1) and one below (p=a2*x^2+b2*x+c2), with the same slope at the average xM (2*a1*xM+b1=2*a2*xM+b2). Or, without this cautious step, the parabolas could be calculated through 3 points, the extrema and the middle.
There are (-1, 0%); (11; 50%); (5; 18%) and (21, 100%); (11, 50%); (16, 83%)
And a*x^2+b*x+c=y --> a(x1^2-x2^2)+b(x1-x2)=y1-y2
--> a(x1^2-x2^2)(x1-x3)+b(x1-x2)(x1-x3)=(y1-y2)(x1-x3)
and a(x1^2-x3^2)+b(x1-x3)=y1-y3
--> a(x1^2-x3^2)(x1-x2)+b(x1-x3)(x1-x2)=(y1-y3)(x1-x2)
--> a[(x1^2-x2^2)(x1-x3)-(x1^2-x3^2)(x1-x2)]=(y1-y2)(x1-x3)-(y1-y3)(x1-x2)
--> a=[(y1-y2)(x1-x3)-(y1-y3)(x1-x2)]/[(x1^2-x2^2)(x1-x3)-(x1^2-x3^2)(x1-x2)]
And b=[(y1-y2)-a(x1^2-x2^2)]/(x1-x2)
And c=y1-a*x1^2-b*x1
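These three formulas, with the squared terms written out, can be checked with a short sketch (the function name `parabola_through` is mine), applied to the two triples of points listed above:

```python
def parabola_through(p1, p2, p3):
    """Coefficients a, b, c of y = a*x^2 + b*x + c through three points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    numerator = (y1 - y2) * (x1 - x3) - (y1 - y3) * (x1 - x2)
    denominator = ((x1 ** 2 - x2 ** 2) * (x1 - x3)
                   - (x1 ** 2 - x3 ** 2) * (x1 - x2))
    a = numerator / denominator
    b = ((y1 - y2) - a * (x1 ** 2 - x2 ** 2)) / (x1 - x2)
    c = y1 - a * x1 ** 2 - b * x1
    return a, b, c

below = parabola_through((-1, 0), (11, 50), (5, 18))
above = parabola_through((21, 100), (11, 50), (16, 83))
print("below average: a=%.4f b=%.4f c=%.4f" % below)
print("above average: a=%.4f b=%.4f c=%.4f" % above)
```

For the points above the average this gives a=-0.32, b=15.24, c=-78.92, and for the points below a=84/432 (about 0.194), b about 2.222, c about 2.028; both parabolas pass exactly through their three points.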
With the data above this gives :

And the final Kolmogorov test gives :

Yes, the result is a further improvement, with Kolmogorov index c (0.26) even lower (smaller differences data-model) than the normal’s one (0.35), and this without any need to pretend there is something at minus-1 (and below) or at 21 (and above). The normal/Gaussian way was not at all the true optimum but a very simplified and very wrong model.

Conclusion
The sinus^k model analysis has proven that :
– Usually, when the normal hypothesis is not rejected, another bell-shaped model is not rejected either, so accepting normality as true is a mistake (even before that, it was easily demonstrated wrong to declare it proven acceptable with a small risk like 5% while this is the rejection risk, out of the subject, and not the acceptance risk, here relevant).
– Usually, the normal hypothesis is completely wrong (with a 100% risk of mistake) because it pretends there are values up to infinity even when this is completely impossible; the good way is not “checking” whether normality is rather close but optimizing the sinusoid curve describing the cumulated frequency/probability.
– Here has been developed an improved way of modelling cumulated frequencies, with vector drawing software, 3 Bezier nodal points, 2 parabolas (far better than all the sinus in the end), but a calculated parabola may not be always increasing, and a best-parameter calculation may ease, objectify (and further improve) the process.
--> The normality/Gaussian checking test has here been killed and buried. Without flowers: it is a shame that it was officially approved (1900-2015? even if I protested already in 1985, aged 21, against such industries'/researchers' "proofs by non-significance").
    [Tophe in holidays]

Supplement 1 - No double-parabola either (July 18th 2015)
Doing the same with the second example provided interesting news: (the new estimate being the light-blue thick line)

This seemed the same as for the Henry notes, but the calculations went in a different direction, as mentioned in the conclusion: the risk with 3 points defining a parabola was that the law is not continuously increasing but may go up and down. This going up and down did not fit the Bezier drawing at all (even if the Kolmogorov result is better than "normal"):

This can be seen whenever the extremum falls within the range of use. And this extremum of y=a*x^2+b*x+c is the point where the derivative y'=2*a*x+b=0, i.e. x=-b/(2a):
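The extremum check can be sketched as a small helper (the name `extremum_inside` is mine). The first example uses the coefficients a=-0.32, b=15.24 that the three-point formulas give for the upper Henry points (21, 100%), (11, 50%), (16, 83%); the second set of coefficients is purely hypothetical, chosen to show a failing case.

```python
def extremum_inside(a, b, x_min, x_max):
    """True if the parabola's extremum x = -b/(2a) lies strictly inside the range."""
    if a == 0:
        return False  # a straight line has no extremum
    vertex = -b / (2 * a)
    return x_min < vertex < x_max

# Upper Henry parabola: extremum near 23.8, outside 11..21, so always increasing there
print(extremum_inside(-0.32, 15.24, 11, 21))
# Hypothetical coefficients: extremum at 15, inside 11..21, so the curve goes up then down
print(extremum_inside(-1.0, 30.0, 11, 21))
```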

So another way than the double parabola needs to be re-invented. The Bezier explanation on Wikipedia seems very complicated, but I may invent something again, for the particular case of a simple slope evolution. The minimum has a vector with slope and "length", and the maximum also, so there may be a continuous evolution of this slope, weighted by the corresponding vector lengths (if the lengths are very short, this is almost the straight line between the points; if the lengths are very long, this is almost the non-continuous broken line following both slopes up to their intersection). The third point could fix that.
Alas, there are many different/discrepant ways, even between 3 points:

Everything seems possible, playing manually with the lengths of the vectors (in the slopes' directions). Well, I do not pretend here to find the formula for the optimum fit of a continuous law matching the data; I am just trying to show "far better than normal (and without contradiction like probability >100%)", so I am simply going to take an example, hoping to get on the CLSI data a Kolmogorov result lower than the normal 0.55 (and lower than the 0.49 of the ">100% law"). Here my choice has been a slope 0 at the maximum of the CLSI data or at a point a little further (extremum of the curve), with a curve that passes through the mid-point.
So, the extremum is at (x0,y0) for Y=a*X^K --> (y-y0)=a(x-x0)^K. So there are 2 unknowns (a, K), and 2 equations for them if we have 3 points (the extremum x0,y0 and then x1,y1 and x2,y2). This gives (y1-y0)=a(x1-x0)^K and (y2-y0)=a(x2-x0)^K --> a=(y1-y0)/(x1-x0)^K=(y2-y0)/(x2-x0)^K --> [(x1-x0)/(x2-x0)]^K=(y1-y0)/(y2-y0) --> K=log[(y1-y0)/(y2-y0)]/log[(x1-x0)/(x2-x0)], i.e. the logarithm of (y1-y0)/(y2-y0) in base (x1-x0)/(x2-x0), and a=(y1-y0)/(x1-x0)^K, then y=[a(x-x0)^K]+y0.
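The calculation of K and a can be sketched as below. Assumptions of mine: the function names, the use of absolute values of (x-x0) to avoid decimal powers of negative numbers, and the illustration on the upper Henry points with the extremum placed at (21, 100%), since the CLSI counts are not listed in this text.

```python
import math

def fit_power_curve(p0, p1, p2):
    """a and K of y = a*|x-x0|^K + y0, with extremum p0, passing through p1 and p2."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    K = (math.log((y1 - y0) / (y2 - y0))
         / math.log(abs(x1 - x0) / abs(x2 - x0)))
    a = (y1 - y0) / abs(x1 - x0) ** K
    return a, K

def power_curve(x, p0, a, K):
    """Value of the fitted curve at x."""
    x0, y0 = p0
    return a * abs(x - x0) ** K + y0

extremum = (21, 100)
a, K = fit_power_curve(extremum, (16, 83), (11, 50))
print(f"a={a:.4f}, K={K:.4f}")
print(power_curve(16, extremum, a, K), power_curve(11, extremum, a, K))
```

As K is positive, the slope at the extremum is zero only when K is above 1, and the curve cannot go up and down between the extremum and the other points.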
The new curve with horizontal tangents at minimum and maximum is: (the new estimate being the light-blue thick line)

Not surprisingly, these new points would fail with parabolas:

But that works with the power function ^K:

[Due to Excel problems with decimal powers of negative numbers, I had to use weird formulas =-(D17-D16)/(ABS((C17-C16))^E16) and =-F$16*ABS(C16-C$16)^E$16+D$16, instead of the ~16 version of the ~13: =(D14-D13)/((C14-C13)^E13) and =F$13*(C13-C$13)^E$13+D$13. Such false problems are confusing but solvable it seems.]
And the final result is this:

(I used truncation of the values below 0% and above 100%; this is more satisfying even if the result is unchanged. It corresponds to a start at 0% and a finish at 100% without horizontal tangent, which is not a problem and was even the first project. This is different from a cumulated probability going over 100% then going down, which is impossible.)
The result is confirmed as much improved (0.34, below the 0.49 of the imperfect ">100% law" and the 0.55 of the bad but officially accepted "normal"). So what has been chosen is not a sinus but a (truncated) double power function, far better (and far, far better than the wrongly validated normal law), and this is a curved S-shaped cumulative probability, that is a "sinusoid", abnormal, so much better than normal.

Supplement 2 - Not the same at all (August 08th 2015)
On the graphics above, paragraph 8 (A&B), the situation looked like "sin^8.5 is almost exactly Normal", so it would be silly to conclude like I did "normality is not proven because the sin^8.5 hypothesis is not rejected at all, either".
Well, the text said there is a complete difference on extreme values going to infinity, but there was no graphical illustration of this. Here it is, with cumulated probabilities corresponding to the number of standard deviations (from minus 8 to minus 1, -1.96 giving 2.500% in the normal way):

While the Normal/Gauss way pretends there is something out to infinity, sin^8.5 is completely different, with absolutely nothing below -3.5 standard deviations.
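This difference in the tails can be put into figures with a small sketch, assuming the same discrete sin^8.5 model as before (181 classes, 419 pieces in the top class, mid-points rounded up); the names are mine.

```python
import math
from statistics import NormalDist

def sin_power_table(k, peak):
    """Pieces per integer degree 0..180 for a sin^k bell, mid-points rounded up."""
    return [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
            for x in range(181)]

counts = sin_power_table(8.5, 419)
total = sum(counts)
mean = sum(n * x for x, n in enumerate(counts)) / total
sd = math.sqrt(sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total)

lowest = min(x for x, n in enumerate(counts) if n > 0)
z_lowest = (lowest - mean) / sd                   # lowest non-empty class, in SD
expected_normal = total * NormalDist().cdf(-3.5)  # pieces a normal law puts below -3.5 SD

print(f"sin^8.5: lowest class {lowest}, i.e. {z_lowest:.2f} SD")
print(f"normal law: about {expected_normal:.1f} pieces expected below -3.5 SD")
```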
So, it is confirmed: the way normality of data is validated is completely wrong, without considering other bell shape candidates, that could be (and usually are) more correlated with the data, measured and possible.

Supplement 3 - Still the size effect (August 10th 2015)
Reading again what is written above, as a deep and progressive reflection on the normal-check, it seems there is a contradiction: I verified that sin^1 to sin^9 are not normal (chapter 7), then I said that the data of chapters 8A and 8B would fit both the normal and sin^8.5 hypotheses. So, belatedly, I raise an objection here, finding a contradiction there.
But… this is not a true contradiction, and solving it may help understanding:
a/ On a huge theoretical “sample” of about 20,000 pieces in 180 classes, I demonstrated that sin^1 to sin^9 (including sin^8.5) are not normal
b/ On a true small sample of 10 pieces in 22 classes (8A) or medium sample of 120 pieces in 35 classes (8B), I failed to reject both normal and sin^8.5
c/ Now I show that on a theoretical medium sample of about 120 pieces in 180 or 30 classes following sin^8.5 distribution, I fail to demonstrate that it is not normal:

(Sorry, the comma here above means decimal point, which is a comma in the French way: the holidays are over and my computer is no longer set to Filipino English but to French).
This demonstration means (I repeat it once more) : when you fail to reject the normal hypothesis, that does not mean your data are proven normal (with a small risk), that proves you have not gathered enough data and so you would probably fail to reject another bell as well.
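The size effect can be reproduced with the same simplified Kolmogorov rule on two theoretical sin^8.5 tables, one huge and one small. This is a sketch under assumptions: the function names are mine, and the top-class value 2.5 is my own choice to obtain a total of roughly a hundred pieces.

```python
import math
from statistics import NormalDist

def sin_power_table(k, peak):
    """Pieces per integer degree 0..180 for a sin^k bell, mid-points rounded up."""
    return [math.floor(peak * math.sin(math.radians(x)) ** k + 0.5)
            for x in range(181)]

def ks_index(counts):
    """max |cumulated frequency - normal cumulated probability| * sqrt(N)."""
    total = sum(counts)
    mean = sum(n * x for x, n in enumerate(counts)) / total
    sd = math.sqrt(sum(n * (x - mean) ** 2 for x, n in enumerate(counts)) / total)
    normal = NormalDist(mean, sd)
    running, worst = 0, 0.0
    for x, n in enumerate(counts):
        running += n
        worst = max(worst, abs(running / total - normal.cdf(x)))
    return worst * math.sqrt(total)

for peak in (419, 2.5):
    counts = sin_power_table(8.5, peak)
    index = ks_index(counts)
    verdict = "normality rejected" if index > 1.36 else "normality not rejected"
    print(f"{sum(counts)} pieces: index {index:.2f}, {verdict}")
```

The same sinus law is rejected as not normal on the huge table and not rejected on the small one: only the number of pieces has changed.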