Saturday, December 03, 2022

Fun with The Normal Distribution Curve

 I came across on the web that for 2022, the Mean (the statistics word meaning average) and the standard deviation for household income in United States was $73,000 and $30,000 respectively.  Standard Deviation  is a formula to show the spread of numbers clustered around the average.  It is heavily relied on in statistics, business and science.  Those two numbers above are approximate numbers, the web site I looked at only published the approximate numbers. This standard deviation grouping forms a curve known as Normal Distribution Curve or more commonly known as the Bell Curve.  Warning, not all groups of random numbers form in this bell curve.  The easiest test to determine if it has formed a bell curve is to check if approximatly 68% of your data falls within plus or minus one standard deviation of mean, 95% falls within plus or minus 2 deviations of mean and 99.7% falls within plus or minus 3 deviations of mean.

What is this Standard Deviation thing good for anyway.  Well if you have a begin and end point, you can determine what percent of your population falls between those numbers.   To do this, one has to convert the two numbers into z variables using the formula

z=(x-Mean)/STD

Mean of course is the average and STD is the Standard Deviation. And then historically one would look these values up in a table and then to get it in a percent, we would have to multiply it by one hundred. 

Now some time ago I wrote some routines to calculate these percents in Pascal and I put them in my own personal library.  I used the hand method (I guess that is the only way to do it with the Bell Curve) which basically means that I broke the area up under the curve into a bunch of rectangles and calculated those areas, but to improve the results I made little triangles at the top of each rectangle to grab most of the curved area above each rectangle.  I have run tests and am getting fairly accurate numbers out of my function.  It's actually not that much source code:

Function Zee(std, Mean, Val: Double): Double;
Begin
	Zee := (Val - Mean) / std;
End;

Function CalculateHeight(sv, lv, x: Double): Double;
Begin
   CalculateHeight := sv * Exp(lv * 0.5 * x * x);
end;

Function BellCurveArea(BeginVal, EndVal: Double): Double;
{
         This function calculates the area under
         the Bell Curve using the hand method.
         The line for the curve is equal to:

         (1/(2 * Pi)^.5) * e^(1/x^2)
}
Var
   x, y1, y2, YVal, Sum, SqrtVal, LogVal: Double;
Begin
   SqrtVal := 1.0 / Sqrt(2.0 * Pi());
   LogVal  := Ln(1.0 / Exp(1));
   Sum := 0; x := BeginVal;
   Repeat
      y1 := CalculateHeight(SqrtVal, LogVal, x);
      y2 := CalculateHeight(SqrtVal, LogVal, x + 0.01);
      If (y1>=y2) Then
         YVal := y2
      Else YVal := y1;
      Sum := Sum + YVal * 0.01;
      // This adds the triangle .5(h * w)
      // Width = .01 and multiply that by .5 = .005
      Sum := Sum + Abs(y2-y1) * 0.005;
      x := x + 0.01;
   Until (x > EndVal);
   BellCurveArea := Sum;
End; 

Function RPad(s: String; n: LongInt): String;
Var
	i:					LongInt;
Begin
	If Length(s) < n Then
		For i := Length(s) To n Do
			s := s + ' ';
	RPad := s;
End;

Now for the fun.  To take the Mean and  the Standard Deviation and get some numbers (and Graphs) out of it.  First I wrote a little Lazarus program to produce a graph:

The source code to produce this isn't very difficult, all I did was create some arrays:

HighArray: Array[1..26] of Double = (10000, 20000, 30000, 40000, 50000, 60000,
70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000,
150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000,
230000, 240000, 250000, 1e6);
LowArray: Array[1..26] of Double = (0, 10000.01, 20000.01, 30000.01, 40000.01,
50000.01, 60000.01, 70000.01, 80000.01, 90000.01, 100000.01,
110000.01, 120000.01, 130000.01, 140000.01, 150000.01, 160000.01,
170000.01, 180000.01, 190000.01, 200000.01, 210000.01, 220000.01,
230000.01, 240000.01, 250000.01);
NameArray: Array[1..26] of String = ('0-10k', '10k-20k', '20k-30k', '30k-40k',
'40k-50k', '50k-60k', '60k-70k', '70k-80k', '80k-90k', '90k-100k',
'100-110k', '110k-120k', '120k-130k', '130k-140k', '140k-150k',
'150k-160k', '160k-170k', '170k-180k', '180k-190k', '190k-200k',
'200-210k', '210k-220k', '220k-230k', '230k-240k', '240k-250k',
'250k-One Mill.');

And then the following procedure was added:

procedure TForm1.CalcButClick(Sender: TObject);
Var
i: LongInt;
Mean, Std, Pct: Double;
begin
For i := Chart1BarSeries1.Count Downto 1 Do
Chart1BarSeries1.Delete(i-1);
Mean := StrToFloat(MeanEdit.Text);
Std := StrToFloat(StdEdit.Text);
For i := 1 To 26 Do
Begin
Pct := BellCurveArea(Zee(Std, Mean, LowArray[i]),
Zee(Std, Mean, HighArray[i]));
Pct := Pct * 100;
Chart1BarSeries1.Add(Pct,NameArray[i],clred);
end;

end;


Now some people including me like to see the actual numbers:

Mean: $73,000.00
STD:  $30,000.00
Household Earnings             Percent
0-10k                          1.06798031%
10k-20k                        2.13399700%
20k-30k                        3.81964654%
30k-40k                        6.12427552%
40k-50k                        8.79611544%
50k-60k                        11.31706203%
60k-70k                        13.04320719%
70k-80k                        13.46615998%
80k-90k                        12.45408995%
90k-100k                       10.31783691%
100-110k                       7.65724858%
110k-120k                      5.09052615%
120k-130k                      3.03149098%
130k-140k                      1.61715203%
140k-150k                      0.77275731%
150k-160k                      0.33077244%
160k-170k                      0.12682508%
170k-180k                      0.04355779%
180k-190k                      0.01340004%
190k-200k                      0.00369250%
200-210k                       0.00091139%
210k-220k                      0.00020149%
220k-230k                      0.00003990%
230k-240k                      0.00000708%
240k-250k                      0.00000112%
250k-One Mill.                 0.00000018%

I generated that with the following simple little pascal program, the curly brackets with the $I just pull in my personal library.


Program HouseHoldInc2022;
Uses SysUtils, DateUtils;
Const
HighArray: Array[1..26] of Double = (10000, 20000, 30000, 40000, 50000, 60000,
70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000,
150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000,
230000, 240000, 250000, 1e6);
LowArray: Array[1..26] of Double = (0, 10000.01, 20000.01, 30000.01, 40000.01,
50000.01, 60000.01, 70000.01, 80000.01, 90000.01, 100000.01,
110000.01, 120000.01, 130000.01, 140000.01, 150000.01, 160000.01,
170000.01, 180000.01, 190000.01, 200000.01, 210000.01, 220000.01,
230000.01, 240000.01, 250000.01);
NameArray: Array[1..26] of String = ('0-10k', '10k-20k', '20k-30k', '30k-40k',
'40k-50k', '50k-60k', '60k-70k', '70k-80k', '80k-90k', '90k-100k',
'100-110k', '110k-120k', '120k-130k', '130k-140k', '140k-150k',
'150k-160k', '160k-170k', '170k-180k', '180k-190k', '190k-200k',
'200-210k', '210k-220k', '220k-230k', '230k-240k', '240k-250k',
'250k-One Mill.');
{$I /home/terry/Documents/fpc/MyLibrary.inc}

Var
i: LongInt;
Pct, Mean, Std: Double;


Begin
WriteLn;
Mean := 7.3e4;
Std := 3.0e4;
WriteLn('Mean: ', '$73,000.00');
WriteLn('STD: ', '$30,000.00');
WriteLn(RPad('Household Earnings', 30), 'Percent');
For i := 1 To 26 Do
Begin
Pct := BellCurveArea(Zee(Std, Mean, LowArray[i]),
Zee(Std, Mean, HighArray[i]));
Pct := Pct * 100;
WriteLn(RPad(NameArray[i], 30), Pct:6:8, '%');
End;
End.

No comments: