====== Statistics ======

See also [[Probability]], where we try to predict future events based on statistics.

Parent [[Statistics and Probability]].

Tell a story: good or bad.  Stretch or squeeze the axes.

sports stats\\
nba.com

Data
  * collecting
  * analyzing
  * presenting

Variability
  * 0,0,0,0,0
  * 0,235,17,5,318

Statistical Questions:  to answer, you need to collect data with variability

Population and Sample
  * subset
  * Simple random sample - random number generator, for example
  * Stratified sample - by age or gender
  * Clustered sample - randomly choose clusters (classrooms in school), then take the whole cluster
  * Voluntary - likely to introduce bias, non-random
  * Convenience - non-random
  * Block design - variation of stratified sample
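
The sampling methods above can be sketched in Python. This is a minimal illustration, not a survey design: the population of 30 students, the classroom labels, and the ages are all made up, and the seed is fixed only so that the result is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30 students in 3 classrooms, ages 10 to 12
population = [
    {"id": i, "classroom": "ABC"[i // 10], "age": 10 + i % 3}
    for i in range(30)
]

# Simple random sample: every student is equally likely to be chosen
simple = rng.sample(population, 6)

# Stratified sample: split by age, then sample 2 students from each age group
by_age = {}
for student in population:
    by_age.setdefault(student["age"], []).append(student)
stratified = [s for group in by_age.values() for s in rng.sample(group, 2)]

# Clustered sample: pick one classroom at random, then take the whole classroom
chosen_room = rng.choice("ABC")
clustered = [s for s in population if s["classroom"] == chosen_room]
```

The stratified sample always contains every age group, while the simple random sample may miss one by chance.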

Bias
  * voluntary response sampling (non random)
  * response bias (persons do not want to answer truthfully)
  * undercoverage (missing certain groups)
  * convenience sampling
  * nonresponse (not everyone sampled actually responds)
  * wording of a survey (influences the answers)

Correlation vs causality

types of statistical studies
  * Sample Study - select sample from population
  * Observational Study - looking for correlation, not causality
  * Experiment - look for causality, use control group

"margin of error"

Experiment Design
  * explanatory variable
  * response variable
  * blind
  * double-blind
  * triple-blind, people analyzing the data don't know which is which
  * matched pairs experiment, switch the control and treatment groups after a period of time

  ? Correlation Coefficient (r)
  : Calculated as 
\begin{align}
r = \frac{1}{n-1}  \sum \left ( \frac{x_i - \overline{x} }{s_x} \right ) \left ( \frac{y_i - \overline{y} }{s_y} \right )
\end{align}
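
The formula can be followed term by term in Python. This is a sketch using only the standard library; the function name ''correlation'' and the data points are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r: average product of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)  # about 0.77, a fairly strong positive correlation
```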

=== conventions ===

\begin{align}
y &= \text{actual } y \\
\hat{y} &= \text{predicted } y \\
\overline{y} &= \text{mean of all } y \text{ values}
\end{align}

=== Coefficient of Determination ===

Represented by r-squared.

What percentage of the total variation in y is described by the variation in x?

What percentage of the variation is described by the line?

How good is the line as a predictor?

Total variation is the squared error of y from the mean of y.

\begin{align}
r^2 &= 1 - \frac{SE_{LINE}}{SE_{\bar{y}}} \\
SE_{LINE} &= \text{squared error from the line, the variation not described by the line} \\
SE_{\bar{y}} &= \text{total variation in y from the mean}
\end{align}

Squared error for the line, the lower the value, the better the fit.

The coefficient of determination normalizes the squared error by expressing it as
a fraction of the total variation, giving a value between 0 and 1 (often quoted as a percentage).
It is not a probability.

The smaller the squared error, the higher the coefficient, and the better
the line fits the data.
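
As a sketch of the calculation, the snippet below fits a least-squares line and then computes r-squared as 1 minus the squared error from the line divided by the squared error from the mean. The function names and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar

def r_squared(xs, ys, slope, intercept):
    """1 - (squared error from the line) / (squared error from the mean of y)."""
    y_bar = sum(ys) / len(ys)
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    se_y_bar = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_y_bar

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)          # 0.6 and 2.2
r2 = r_squared(xs, ys, slope, intercept)     # 0.6: the line describes 60% of the variation in y
```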

  ? Root mean square error (RMSE), aka root mean square deviation (RMSD)
  : Standard deviation of residuals

  ? Covariance
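  : expected value of the product of the deviations of two variables from their means, $E[(X - E[X])(Y - E[Y])]$, where the expected value can be 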
  * arithmetic mean
  * probability weighted sum, or probability weighted integral (in a continuous distribution)

  ? Coefficient
  : can mean multiplier, factor, scalar

===== Confidence Intervals =====

Khan Unit: Confidence Intervals

95% confidence is based on about 2 standard deviations of the sampling distribution (more precisely, 1.96)

the margin of error is about 2 standard deviations of the sampling distribution

the confidence interval is $\hat{p}$ plus or minus the margin of error

we want to know the proportion of the population that favors a candidate

sample the population
  * N = 100,000
  * n = 100
  * take multiple samples
  * calc the proportion for all of the samples
  * assume the mean of the sample proportions = the proportion of the population
  * assume the sample proportions are normally distributed, the sampling distribution
  * the standard deviation can be calculated by a formula

\begin{align}
\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \\
\end{align}

calculate $\hat{p}$ as the proportion of the sample that favors your candidate

from that $\hat{p}$, n and the 95% level, calculate the margin of error, using $\hat{p}$ in place of the unknown $p$ in the standard deviation formula
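
A minimal sketch of that calculation, assuming the sample is large enough for the normal approximation and using z = 1.96 for 95% confidence. The function name and the poll numbers (54 of 100 in favor) are made up.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z=1.96):
    """Return p_hat, the margin of error, and the (low, high) interval."""
    p_hat = successes / n
    # p is unknown, so p_hat stands in for it in the standard deviation formula
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Hypothetical poll: 54 of 100 sampled voters favor the candidate
p_hat, margin, interval = proportion_confidence_interval(54, 100)
# p_hat is 0.54 and the margin of error is a little under 0.10
```

With only 100 voters the interval runs from roughly 44% to 64%, so this sample cannot tell us whether the candidate is ahead.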

=== ratio, proportion, percentage, percentage proportion ===

ratio - comparison of two quantities

part-to-part ratio

proportion - equality of two ratios, can be used for interpolation

percentage - a fraction with 100 in the denominator

percentage proportion - an equality of two ratios 
where the second ratio has a denominator of 100

==== discrete, Binomial, two-point, bernoulli distributions ====

binomial - the binomial distribution with parameters n and p is the 
discrete probability distribution of the 
number of successes in a sequence of n independent experiments, 
each asking a yes–no question, 
and each with its own Boolean-valued outcome: 
  * success (with probability p) or 
  * failure (with probability q = 1 − p).

A single success/failure experiment is also called a Bernoulli trial
or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process.

For a single trial, i.e., n = 1, 
the binomial distribution is a Bernoulli distribution. 

The binomial distribution is the basis for the popular 
binomial test of statistical significance.

two-point - a random variable which can take 1 of 2 possible values

Bernoulli - a two-point distribution where the two possible outcomes are 0 and 1
  * a random variable which takes the value 0 or 1
  * success, yes, true, one
  * failure, no, false, zero


==== Random Variables ====

variable x = a single value

random variable X = a quantity whose value is the numeric result of a random experiment; it can take many values, each with a probability P(X = x)


===== Distribution =====

median, range, IQR

box-and-whisker plot


box plot vs dot plot


===== Frequency Distribution =====

dot plot

mean absolute deviation (MAD)


central tendency
  * mean, aka arithmetic mean, $\mu$ = average
  * median = middle number of rank ordered set
  * mode = most frequent number

spread
  * range = max - min
  * interquartile range (IQR) = diff between top half median and bottom half median
  * variance, sigma squared, $\sigma^2$ = $\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$ 
  * standard deviation, sigma, $\sigma$ = $\sqrt{\sigma^2}$
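
The measures of central tendency and spread can be computed with Python's ''statistics'' module. This is a sketch on made-up data; the IQR follows the definition in the list (median of the top half minus median of the bottom half, leaving out the middle value when the count is odd) and assumes at least two data points.

```python
from statistics import mean, median, mode, pstdev, pvariance

data = [1, 3, 3, 6, 7, 8, 9]  # made-up data

# Central tendency
center_mean = mean(data)      # 37 / 7, about 5.29
center_median = median(data)  # 6, the middle of the rank-ordered values
center_mode = mode(data)      # 3, the most frequent value

# Spread
ordered = sorted(data)
data_range = ordered[-1] - ordered[0]   # max - min = 8
half = len(ordered) // 2
bottom_half, top_half = ordered[:half], ordered[-half:]
iqr = median(top_half) - median(bottom_half)  # 8 - 3 = 5
variance = pvariance(data)    # population variance, divides by N
std_dev = pstdev(data)        # square root of the population variance
```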

with a tight distribution, mean and standard deviation work best

with a skewed distribution, median and interquartile range work best


do algebra on the variance formula


\begin{align}
\mu &= \frac{\sum_{i=1}^{N}x_i}{N} &&\text{mean formula}\\
\sigma^2 &= \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N} &&\text{variance formula 1}\\
\sigma^2 &= \frac{\sum_{i=1}^{N}x_i^2}{N} - \mu^2  &&\text{variance formula 2}\\
\sigma^2 &= \frac{\sum_{i=1}^{N}x_i^2}{N} - \frac{\left [\sum_{i=1}^{N}x_i\right ]^2}{N^2}  &&\text{variance formula 3}
\end{align}
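
A quick numeric check that the three variance formulas agree, using a small made-up data set:

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
N = len(data)
mu = sum(data) / N  # 5.0

# Formula 1: mean of the squared deviations from the mean
formula_1 = sum((x - mu) ** 2 for x in data) / N

# Formula 2: mean of the squares minus the square of the mean
formula_2 = sum(x ** 2 for x in data) / N - mu ** 2

# Formula 3: the same, with the mean written out as a sum
formula_3 = sum(x ** 2 for x in data) / N - sum(data) ** 2 / N ** 2

# All three give 4.0 for this data
```

Formulas 2 and 3 need only the running totals of x and x squared, so they can be computed in a single pass over the data.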


=== Population vs Sample ===

We can work with an entire population.\\
Or we can work with a sample from the population.\\
The math is the same except for two things.

  - The symbols are different.
    * For a population, we use 
      * $\mu$ for the mean, 
      * $\sigma$ for standard deviation, and 
      * $N$ for the number of points.
    * For a sample, we use 
      * $\bar{x}$ for the mean, 
      * $s_x$ for standard deviation, and
      * $n$ for the number of points.
  - The divisor in the standard deviation formula is different.
    * For a population, we divide by N.
    * For a sample, we divide by n-1.
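
Python's ''statistics'' module has separate functions for the two divisors. A small sketch on made-up data:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, 8 points, squared deviations sum to 32

sigma = pstdev(data)  # treats the data as a whole population: sqrt(32 / 8) = 2.0
s_x = stdev(data)     # treats the data as a sample: sqrt(32 / 7), about 2.14
```

The sample value is always a little larger, and the difference shrinks as n grows.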

==== Relative Frequency ====

y axis is percent instead of raw frequency count


==== Density Curve ====

Make histogram bars more and more narrow until the top becomes a line.

The data points can take on any value in a continuum, 
as opposed to being lumped into coarse buckets.

Area under the curve is 100%.

The curve will never go negative.

Measure the area under the density curve between two values,
to get the percentage of data points falling between those two values.
This can sometimes be estimated by calculating the area of the rectangle.

You cannot calculate the percent of a single value, 
because there is no rectangle.
The line has no width.
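
As an illustration, the area between two values can be measured as the difference of two cumulative areas. The sketch assumes one particular density curve, the standard normal, which is only an example choice; the function name is made up.

```python
from statistics import NormalDist

# Assumed density curve for the example: normal with mean 0 and standard deviation 1
curve = NormalDist(mu=0, sigma=1)

def area_between(a, b):
    """Share of data points falling between a and b (area under the curve)."""
    return curve.cdf(b) - curve.cdf(a)

share = area_between(-1, 1)   # about 0.68
single = area_between(1, 1)   # 0.0, a single value has no width and so no area
```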


=== Symmetric Density Curve ===

mean and median are equal, both cut the area under the curve in half

in the bell curve, the mode is also equal to the mean and median

picture a bimodal symmetric curve

the area under the left half is equal to the area under the right half

=== asymmetric density curve ===

aka skewed

mean will be towards the long tail from the median

long tail to the right => right-skewed distribution

and vice versa

==== Probability Distribution ====

Distribution\\
Frequency Distribution\\
Binomial Distribution\\
Percentile\\
Density Curve\\
Probability Distribution\\
Cumulative Probability Distribution\\


frequency - table of actual results

relative frequency - division to create percentage

probability = theoretical probability = relative frequency of the entire population

$$ \text{probability} = \frac{\text{number of successful outcomes}}{\text{number of possible outcomes}}$$

$$ \text{relative frequency} = \frac{\text{number of successful outcomes}}{\text{number of trials}}$$
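
A small simulation shows the relative frequency approaching the theoretical probability as the number of trials grows. This is a sketch: the event (rolling a six), the number of trials, and the fixed seed are arbitrary choices.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Theoretical probability of rolling a six: 1 successful outcome out of 6 possible
probability = 1 / 6

# Relative frequency: successful outcomes divided by number of trials
trials = 60_000
rolls = [rng.randint(1, 6) for _ in range(trials)]
relative_frequency = rolls.count(6) / trials
```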

==== Binomial Random Variable ====

each trial has boolean result, ie, success or failure

trial results are independent

fixed number of trials

same probability in each trial

==== Geometric Random Variable ====

number of trials until the first success


==== Binomial Distribution ====

uses factorial in the formula

is a discrete function
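
The factorials enter through the binomial coefficient "n choose k". A minimal sketch of the probability of exactly k successes in n trials, using ''math.comb'' from the standard library; the function name is made up.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    # comb(n, k) = n! / (k! * (n - k)!)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Chance of exactly 2 sixes in 12 rolls of a fair die
p_two_sixes = binomial_pmf(2, 12, 1 / 6)

# The function is discrete: it is defined only for k = 0, 1, ..., n,
# and the probabilities over those values add up to 1
total = sum(binomial_pmf(k, 12, 1 / 6) for k in range(13))
```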

==== Geometric Distribution ====

right-skewed

==== Random Variable: Binomial vs Geometric ====

|                                | Binomial | Geometric |
| each trial has boolean result  |    x     |     x     |
| trial results are independent  |    x     |     x     | 
| same probability in each trial |    x     |     x     |
| fixed number of trials         |    x     |    not    |

Examples

Binomial: How many sixes in 12 rolls of the die?

Geometric: How many rolls of a die until one six?
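
The geometric example can be sketched as follows, counting the roll on which the first six appears (so k starts at 1). The function name is made up, and the sums are cut off at 500 rolls, which is an approximation of the infinite sum.

```python
def geometric_pmf(k, p):
    """P(first success happens on trial k): k - 1 failures, then one success."""
    return (1 - p) ** (k - 1) * p

p_six = 1 / 6
first_roll = geometric_pmf(1, p_six)   # 1/6
third_roll = geometric_pmf(3, p_six)   # (5/6)^2 * (1/6)

# Truncated sums: the probabilities approach 1 and the mean approaches 1/p = 6 rolls
total = sum(geometric_pmf(k, p_six) for k in range(1, 501))
expected_rolls = sum(k * geometric_pmf(k, p_six) for k in range(1, 501))
```

Each probability is smaller than the one before it, which is why the distribution has a long tail to the right.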

===== Distribution =====

Look at past events and organize them into patterns which tell a story and 
allow us to understand how, when, and why things happen.

The graph of a distribution is a curve.

There are several kinds of distributions.

discrete vs continuous


distributions
  * linear growth, errors, offsets
    * normal, Gaussian
  * exponential growth, prices, incomes, populations
    * log-normal, a single quantity whose log is normally distributed
    * Pareto, a single quantity whose log is exponentially distributed
  * uniformly distributed
    * discrete uniform distribution, for a finite set of values, coin, die
    * continuous uniform distribution, for continuously distributed values
  * Bernoulli trials
    * Bernoulli distribution (success/failure, yes/no)
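    * Binomial distribution, number of successes in a fixed number of trials
    * negative binomial distribution, number of failures before a given number of successes
    * geometric distribution, number of failures before the first success (in the other common convention, number of trials up to and including the first success)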
  * categorical outcomes (events with K possible outcomes)
  * Poisson process (events that occur independently with a given rate)
  * absolute values of vectors with normally distributed components
  * normally distributed quantities operated with sum of squares
  * as conjugate prior distribution in Bayesian inference
  * some specialized applications

Naive Bayes Classifier in AI is based on Bayes' Theorem in probability theory.


{{https://en.wikipedia.org/wiki/Probability_distribution | 
Wikipedia: Probability distribution}}

===== Normal Distribution =====

aka Bell Curve or Gaussian Distribution.


==== Bell Curve ====
{{ ::bell_curve.png|}}

The basic shape of the //normal distribution// is given by the equation
$$y = e^{\frac{-x^2}{2}}$$

When we actually have input data, with mean $\mu$ and standard deviation $\sigma$, we will use this equation.
$$y = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
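
A direct transcription of the density into Python, as a sketch; note the negative sign in the exponent, which makes the curve fall away from the mean. The function name is made up.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

peak = normal_pdf(0)         # about 0.399, the maximum of the standard curve
left = normal_pdf(-1.5)
right = normal_pdf(1.5)      # equal to left, the curve is symmetric about the mean
```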

Characteristics:
  * The curve is symmetric about the vertical line $x = \mu$ (the y axis when $\mu = 0$).
  * The center portion is concave down and has one maximum point, at $x = \mu$.
  * To either side, at $x = \mu \pm \sigma$, lies an inflection point where the curve becomes concave up.
  * The curve stretches to the left and right, approaching the limit of zero.
  * The area under the curve totals 1.0, the total probability of all possible outcomes.


==== Median ====
==== Standard Deviation ====


Salman Khan uses problems from the ck12.org open source FlexBook: AP Statistics

Empirical Rule:  68 - 95 - 99.7
  * Values within plus-or-minus 1 standard deviation account for about 68% of a normal distribution.
  * Values within plus-or-minus 2 standard deviations account for about 95% of a normal distribution.
  * Values within plus-or-minus 3 standard deviations account for about 99.7% of a normal distribution.
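
The three percentages can be reproduced from the error function in Python's ''math'' module, assuming a normal distribution. The function name is made up.

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Share of a normal distribution within plus-or-minus k standard deviations."""
    return erf(k / sqrt(2))

one_sigma = within_k_sigma(1)    # about 0.683
two_sigma = within_k_sigma(2)    # about 0.954
three_sigma = within_k_sigma(3)  # about 0.997
```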

==== Standard Normal Distribution ====

a distribution where

$$\mu = 0 \text{ and } \sigma = 1$$

z-score = number of standard deviations away from the mean

$$ \text{z-score} = \frac{x - \mu}{\sigma}$$

allows comparison of values on different scales and distributions, like LSAT and MCAT

Standard Normal Table, aka z-table - a table based on the Standard Normal Distribution

gives cumulative probability for any z-score

where cumulative probability is the area under the curve to the left of the z-score

How to find the cumulative probability of a value in the distribution. 
  * Calculate the z-score
  * Look up the z-score in the standard normal table
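
The two steps above can be sketched in Python, with ''statistics.NormalDist'' standing in for the printed z-table. The test score, mean, and standard deviation are hypothetical values.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma

# Hypothetical test: mean 150, standard deviation 10, and a score of 170
z = z_score(170, 150, 10)          # 2.0

# NormalDist() with no arguments is the standard normal (mu = 0, sigma = 1);
# cdf gives the area under the curve to the left of z
cumulative = NormalDist().cdf(z)   # about 0.977
```

A score two standard deviations above the mean is therefore higher than roughly 97.7% of all scores.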

===== Normal vs Pareto =====

Walter Scheidel: The Great Leveler\\
https://press.princeton.edu/books/paperback/9780691183251/the-great-leveler

According to Scheidel, the only things that can level wealth inequality are:
  * war
  * revolution
  * state collapse
  * plague

normal
  * iq
  * conscientiousness
  * openness

pareto
  * productivity
  * creative output

Price's Law
  * Derek de Solla Price
  * 1960s study: a vanishingly small proportion of scientists operating in a given domain produce half the output
  * a tiny number of people produce almost all of everything
  * aka square root law: the square root of the number of people in a domain produces half the output
  * if you have 10 employees, 3 of them produce half the output
  * if you have 100 employees, 10 of them produce half the output
  * if you have 10,000 employees, 100 of them produce half the output
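
The employee figures are just rounded square roots. A tiny sketch of the arithmetic, with a made-up function name:

```python
from math import sqrt

def price_law_count(people):
    """Rounded square root: the number of people said to produce half the output."""
    return round(sqrt(people))

counts = {n: price_law_count(n) for n in (10, 100, 10_000)}
# 10 -> 3, 100 -> 10, 10,000 -> 100
```

The share of people producing half the output shrinks as the group grows: 30% of 10, 10% of 100, and 1% of 10,000.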

The Matthew Principle
  * To those who have everything, more will be given.
  * From those who have nothing, all will be taken.
  * (An economic principle, borrowing language from the New Testament.)

iq, conscientiousness, openness are normally distributed and are good predictors of long-term performance
but creative output is NOT normally distributed

2017 Personality 21: Biology & Traits: Performance Prediction\\
https://www.youtube.com/watch?v=Q7GKmznaqsQ&t=0s

2017 Personality 18: Biology & Traits: Openness, Intelligence, Creativity\\
https://www.youtube.com/watch?v=D7Kn5p7TP_Y&t=5393s

[[https://www.youtube.com/watch?v=fjs2gPa5sD0 |
YouTube: Jordan Peterson - IQ and the Job Market]]

[[https://www.youtube.com/watch?v=jSo5v5t4OQM | 
YouTube: Jordan Peterson - Controversial Facts about IQ]]

[[https://www.youtube.com/watch?v=h02w5E7FGlY |
YouTube: Jordan Peterson - The Big IQ Controversy]]


The SAT, GRE, the LSAT, all of those are IQ tests.
They are more crystallized than fluid.

IQ
  * 145 to 160 - to be the best.  1 in 10,000
  * 116 to 130 - 95-86 percentile
  * 115-110 - 85-73 percentile
  * below 87 - no jobs
  * below 83 - 10% of pop., cannot join the army


[[https://www.youtube.com/watch?v=7HRmfIEWtyo |
YouTube: Udacity: IQ score distribution]]

{{ :iq_normal_distribution.png?400|}}


[[Python Programming for Statistics and Probability]]

[[https://docs.google.com/spreadsheets/d/1EOx-bmlxPC3ALXfaJTQ83ml8lPFt5zJI4MQN8Hp1d5g/edit#gid=1427780055
| spreadsheet of usa deaths 2015 to 2020]]