====== Statistics ====== See also [[Probability]], where we try to predict future events based on statistics. Parent [[Statistics and Probability]]. Tell a story: good or bad. Stretch or squeeze the axes. sports stats\\ nba.com Data * collecting * analyzing * presenting Variability * 0,0,0,0,0 * 0,235,17,5,318 Statistical Questions: to answer, you need to collect data with variability Population and Sample * subset * Simple random sample - random number generator, for example * Stratified sample - by age or gender * Clustered sample - randomly choose clusters (classrooms in school), then take the whole cluster * Voluntary - likely to introduce bias, non-random * Convenience - non-random * Block design - variation of stratified sample Bias * voluntary response sampling (non random) * response bias (persons do not want to answer truthfully) * undercoverage (missing certain groups) * convenience sampling * nonresponse (not everyone sampled actually responds) * wording of a survey (influences the answers) Correlation vs causality types of statistical studies * Sample Study - select sample from population * Observational Study - looking for correlation, not causality * Experiment - look for causality, use control group "margin of error" Experiment Design * explanatory variable * response variable * blind * double-blind * triple-blind, people analyzing the data don't know which is which * matched pairs experiment, switch the control and treatment groups after a period of time ? Correlation Coeffecient (r) : Calculated as \begin{align} r = \frac{1}{n-1} \sum \left ( \frac{x_i - \overline{x} }{s_x} \right ) \left ( \frac{y_i - \overline{y} }{s_y} \right ) \end{align} === conventions === \begin{align} y &= actual y \\ \hat{y} &= predicted y \\ \overline{y} &= mean of all y's \\ \end{align} === Coefficient of Determination === Represented by r-squared. What percentage of the total variation in y is described by the variation in x? What percentage of the variation described by the line? How good is the li{ne as a predictor? Total variation is the squared error of y from the mean of y. \begin{align} r^2 &= 1 - \frac{SE_{LINE}}{SE\bar{y}} \\ SE_{LINE} &= \text{variation described by the line} \\ SE\bar{y} &= \text{total variation in y from the mean} \\ \end{align} SSE_{LINE} && \text{variation described by the line} Squared error for the line, the lower the value, the better the fit. Coefficient of determination normalizes the squared error by making it a percentage and a probability. The smaller the squared error, the higher the coefficient, the greater the probabiliy the line is a good fit. ? Root mean square error (RMSD) : Standard deviation of residuals ? Covariance : expected value can be * arithmetic mean * probability weighted sum, or probability weighted integral (in a continuous distribution) ? Coefficient : can mean multiplier, factor, scalar ===== Confidence Intervals ===== Khan Unit: Confidence Intervals 95% confidence based on 2 standard deviations the margin of error is 2 standard deviations the confidence interval is the $\hat{p}$ plus or minus the margin of error we want to know the proportion of the population that favors a candidate sample the population * N = 100,000 * n = 100 * take multiple samples * calc the proportion for all of the samples * assume the mean of the sample proportions = the proportion of the population * assume the sample proportions are normally distributed, the sampling distribution * the standard deviation can be calculated by a formula \begin{align} \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \\ \end{align} calculate $\hat{p}$ as the proportion of the sample that favors your candidate from that $\hat{p}$, n and 95%, calculate the margin of error === ratio, proportion, percentage, percentage proportion === ratio - comparison of two quantities part-to-part ratio proportion - equality of two ratios, can be used for interpolation percentage - a fraction with 100 in the denominator percentage proportion - an equality of two ratios where the second ratio has a denominator of 100 ==== discrete, Binomial, two-point, bernoulli distributions ==== binomial - the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: * success (with probability p) or * failure (with probability q = 1 − p). A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance. two-point - a random variable which can take 1 of 2 possible values bernoulli - a two-point distributions where the two possible outcomes are 0 and 1 * a random variable which takes the value 0 or 1 * success, yes, true, one * failure, no, false, zero ==== Random Variables ==== variable x = a single value random variable X = a number of values, results of experiments, to calculate P(X) ===== Distribution ===== median, range, IQR box-and-whisker plot box plot vs dot plot ===== Frequency Distribution ===== dot plot mean absolute deviation (MAD) central tendancy * mean, aka arithmetic mean, $\mu$ = average * median = middle number of rank ordered set * mode = most frequent number spread * range = max - min * interquartile range (IQR) = diff between top half median and bottom half median * variance, sigma squared, $\sigma^2$ = $\frac{\sum_{x=1}^{N}(x_i-\mu)^2}{N}$ * standard deviation, sigma, $\sigma$ = $\sqrt{\sigma^2}$ with a tight distribution, mean and standard deviation work best with a skewed distribution, median and interquartile range work best do algebra on the variance formula \begin{align} \mu &= \frac{\sum_{x=1}^{N}(x_i)}{N} &&\text{mean formula}\\ \sigma^2 &= \frac{\sum_{x=1}^{N}(x_i-\mu)^2}{N} &&\text{variance formula 1}\\ \sigma^2 &= \frac{\sum_{x=1}^{N}(x_i)^2}{N} - \mu^2 &&\text{variance formula 2}\\ \sigma^2 &= \frac{\sum_{x=1}^{N}(x_i)^2}{N} - \frac{\left [\sum_{x=1}^{N}(x_i)\right ]^2}{N^2} &&\text{variance formula 3}\\ \end{align} === Population vs Sample === We can work with an entire population.\\ Or we can work with a sample from the population.\\ The math is the same except for two things. - The symbols are different. * For a population, we use * $\mu$ for the mean, * $\sigma$ for standard deviation, and * $N$ for the number of points. * For a sample, we use * $\bar{x}$ for the mean, * $s_x$ for standard deviation, and * $n$ for the number of points. - The divisor in the standard deviation formula is different. * For a population, we divide by N. * For a sample, we divide by n-1. ==== Relative Frequency ==== y axis is percent instead of raw frequency count ==== Density Curve ==== Make histogram bars more and more narrow until the top becomes a line. The data points can take on any value in a continuum, as opposed to being lumped into coarse buckets. Area under the curve is 100%. The curve will never go negative. Measure the area under the density curve between two values, to get the percentage of data points falling between those two values. This can sometimes be estimated by calculating the area of the rectangle. You cannot calculate the percent of a single value, because there is no rectangle. The line has no width. === Symmetric Density Curve === mean and median are equal, both cut the area under the curve in half in the bell curve, the mode is also equal to the mean and median picture a bimodal symmetric curve the area under the left half is equal to the area under the right half === asymmetric density curve === aka skewed mean will be towards the long tail from the median long tail to the right => right-skewed distribution and vice versa ==== Probability Distribution ==== Distribution\\ Frequency Distribution\\ Binomial Distribution\\ Percentile\\ Density Curve\\ Probability Distribution\\ Cumulative Probability Distribution\\ frequency - table of actual results relative frequency - division to create percentage probability = theoretical probability = relative frequency of the entire population $$ probability = \frac{\text{number of successful outcomes}}{\text{number of possible outcomes}}$$ $$ relative frequency = \frac{\text{number of successful outcomes}}{\text{number of trials}}$$ ==== Binomial Random Variable ==== each trial has boolean result, ie, success or failure trial results are independent fixed number of trials same probability in each trial ==== Geometric Random Variable ==== number of trials until success ==== Binomial Distribution ==== uses factorial in the formula is discreet function ==== Geometric Distribution ==== right-skewed ==== Random Variable: Binomial vs Geometric ==== | | Binomial | Geometric | | each trial has boolean result | x | x | | trial results are independent | x | x | | same probability in each trial | x | x | | fixed number of trials | x | not | Examples Binomial: How many sixes in 12 rolls of the die? Geometric: How many rolls of a die until one six? ===== Distribution ===== Look at past events and organize them into patterns which tell a story and allow us to understand how, when, and why things happen. The graph of a distribution is a curve. There are several kinds of distributions. discrete vs continuous distributions * linear growth, errors, offsets * normal, Gaussian * exponential growth, prices, incomes, populations * log-normal, a single quantity whose log is normally distributed * Pareto, a single quantity whose log is exponentially distributed * uniformly distributed * discrete uniform distribution, for a finite set of values, coin, die * continuous uniform distribution, for continously distributed values * Bernoulli trials * Bernoulli distribution (success/failure, yes/no) * Binomial distribution, ? * negative binomial distribution, * geometric distribution, number of failures before the first success * categorical outcomes (events with K possible outcomes) * Poisson process (events that occur independently with a given rate) * absolute values of vectors with normally distributed components * normally distributed quantities opereated with sum of squares * as conjugate prior distribution in Bayesian inference * some specialized applications * Naive Bayes Classifier in AI is based on Bayes' Theorem in probability theory. {{https://en.wikipedia.org/wiki/Probability_distribution | Wikipedia: Probability distribution}} ===== Normal Distribution ===== aka Bell Curve or Gauss Distribution. ==== Bell Curve ==== {{ ::bell_curve.png|}} The _normal distribution_ is given by the equation $$e^{\frac{-x^2}{2}}$$ When we actually have input data, we will use this equation. $$y = \frac{1}{\sigma \sqrt[]{2\pi }}e^{\frac{(x-\mu )^2}{2\sigma ^2}}$$ Characteristics: * The curve is symmetric about the y axis. * The center portion is a convex parabola and has one maximum point. * To either side lies an inflection point where the line becomes a concave parabola. * The line stretches to the left and right, approaching the limit of zero. * The area under the curve totals 1.0, the total probability of any prediction. ==== Median ==== ==== Standard Deviation ==== Salmon Khan uses problems from ck12.org open source flex book: AP Statistics Empirical Rule: 68 - 95 - 99.7 * Values within plus-or-minus 1 standard deviation account for 68% of the universe. * Values within plus-or-minus 2 standard deviation account for 95% of the universe. * Values within plus-or-minus 3 standard deviation account for 99.7% of the universe. ==== Standard Normal Distribution ==== a distribution where $$\mu = 0 \text{ and } \sigma = 1$$ z-score = number of standard deviations away from the mean $$ \text{z-score} = \frac{x - \mu}{\sigma}$$ allows comparison of values on different scales and distributions, like LSAT and MCAT Standard Normal Table, aka z-table - a table based on the Standard Normal Distribution gives cumulative probability for any z-score where cumulative probability is the area under the curve to the left of the z-score How to find the cumulative probability of a value in the distribution. * Calculate the z-score * Look up the z-score in the standard normal table ===== Normal vs Pareto ===== Walter Scheidel: The Great Leveler https://press.princeton.edu/books/paperback/9780691183251/the-great-leveler the only things that can level wealth inequality are: war revolution state collapse plague nomal iq conscientiousness openness pareto productivity creative output Price's Law Derek De Sola Price 1960s study: a vanishingly small proportion of scientests operating in a given domain, produce half the output a tiny number of people produce almost all of everything aka square root law The square root of a number of people in a domain produce half the output if you have 10 employees, 3 of them produce half the output if you have 100 employees, 10 of them produce half the output if you have 10,000 employees, 100 of them produce half the output The Matthew Principle From those who have everything, more will be given. From those who have nothing, all will be taken. (An economic principle, copying language from the new testament.) iq, conscientiousness, openness are normally distributed and are good predictors of long-term performance but creative output is NOT normally distributed 2017 Personality 21: Biology & Traits: Performance Prediction https://www.youtube.com/watch?v=Q7GKmznaqsQ&t=0s 2017 Personality 18: Biology & Traits: Openness, Intelligence, Creativity https://www.youtube.com/watch?v=D7Kn5p7TP_Y&t=5393s [[https://www.youtube.com/watch?v=fjs2gPa5sD0 | YouTube: Jordan Petersen - IQ and the Job Market]] [[https://www.youtube.com/watch?v=jSo5v5t4OQM | YouTube: Jordan Peterson - Controversial Facts about IQ]] [[https://www.youtube.com/watch?v=h02w5E7FGlY | YouTube: Jordan Peterson - The Big IQ Controversy]] The SAT, GRE, the LSAT, all of those are IQ tests. They are more crystallized than fluid. IQ 145 to 160 - to be the best. 1 in 10,000 116 to 130 - 95-86 percentile 115-110 - 85-73 percentile below 87 - no jobs below 83 - 10% of pop. , cannot join the army [[https://www.youtube.com/watch?v=7HRmfIEWtyo | YouTube: Udacity: IQ score distribution]] {{ :iq_normal_distribution.png?400|}} [[Python Programming for Statistics and Probability]] [[https://docs.google.com/spreadsheets/d/1EOx-bmlxPC3ALXfaJTQ83ml8lPFt5zJI4MQN8Hp1d5g/edit#gid=1427780055 | spreadsheet of usa deaths 2015 to 2020]]