Originally by "piccolojunior" on the College Confidential forums; reformatted/reorganized/etc. by Dillon Cower. Comments/suggestions/corrections: email@example.com
• Mean = $\bar{x}$ (sample mean) = $\mu$ (population mean) = sum of all elements ($\sum x$) divided by number of elements ($n$) in a set = $\frac{\sum x}{n}$. The mean is used for quantitative data. It is a measure of center.
• Median: Also a measure of center; better fits skewed data. To calculate, sort the data points and choose the middle value (with an even number of points, average the two middle values).
• Variance: For each value ($x$) in a set of data, take the difference between it and the mean ($x - \mu$ or $x - \bar{x}$) and square that difference; repeat for each value and add up the squares. Divide the final sum by $n$ (number of elements) if you want the population variance ($\sigma^2$), or divide by $n - 1$ for the sample variance ($s^2$).
Thus:
Population variance = $\sigma^2 = \frac{\sum (x - \mu)^2}{n}$.
Sample variance = $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$.
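As a quick check of the center and variance formulas above, here is a minimal Python sketch using the standard library's `statistics` module. The data set is made up purely for illustration.

```python
import statistics

# Made-up data set, purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # sum of all elements / n
median = statistics.median(data)      # middle of the sorted data
pop_var = statistics.pvariance(data)  # population variance: divides by n
samp_var = statistics.variance(data)  # sample variance: divides by n - 1

print(mean, median, pop_var, round(samp_var, 4))
```

Note that the sample variance comes out larger than the population variance for the same data, since it divides the same sum of squares by n − 1 instead of n.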
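• Standard deviation: A measure of spread; the square root of the variance. Population standard deviation = $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$. Sample standard deviation = $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.
– The standard deviation of the sample mean $\bar{x}$ for samples of size $n$ follows from the population standard deviation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
• Dotplots, stemplots: Good for small sets of data.
• Histograms: Good for larger sets of quantitative data (use a bar chart for categorical data).
• Shape of a distribution: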
– Skewed: If a distribution is skewed left, it has fewer values to the left, and thus appears to tail off to the left; the opposite holds for a skewed-right distribution. If skewed right, median < mean.
– Symmetric: The distribution appears to be symmetrical.
– Uniform: Looks like a flat line or perfect rectangle.
– Bell-shaped: A type of symmetry representing a normal curve. Note: No data is perfectly normal; instead, say that the distribution is approximately normal.
• Z-score = standard score = normal score = $z$ = number of standard deviations past the mean; used for normal distributions. A negative z-score means that the value is below the mean, whereas a positive z-score means that it is above the mean. For a population, $z = \frac{x - \mu}{\sigma}$. For a sample (i.e. when a sample size is given), $z = \frac{x - \bar{x}}{s}$.
• With a normal distribution, when we want to find the percentage of all values less than a certain value (x), we calculate x's z-score (z) and look it up in the Z-table. This is also the area under the normal curve to the left of x. Remember to multiply by 100 to get the actual percent. For example, look up z = 1 in the table; a value of roughly p = 0.8413 should be found. Multiply by 100: (0.8413)(100) = 84.13%.
– If we want the percentage of all values greater than x, then we take the complement of that = 1 − p.
• The area under the entire normal curve is always 1.
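The Z-table lookup can be reproduced in a few lines. This is only a sketch: it uses Python's `statistics.NormalDist` in place of the printed table, and the function name `percent_below` and the example numbers (mean 100, standard deviation 10) are made up for illustration.

```python
from statistics import NormalDist

def percent_below(x, mu, sigma):
    """Return (z, p): the z-score of x and the area to the left of x."""
    z = (x - mu) / sigma
    p = NormalDist().cdf(z)  # standard normal area to the left of z
    return z, p

# Hypothetical example: x = 110 in a normal population with mu = 100, sigma = 10.
z, p = percent_below(x=110, mu=100, sigma=10)
p_above = 1 - p  # complement: proportion of values greater than x

print(z, round(p, 4), round(p_above, 4))
```

With z = 1 this matches the table value of roughly 0.8413 quoted above.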
• Bivariate data: 2 variables.
– Shape of the points (linear, etc.)
– Strength: Closeness of fit or the correlation coefficient ($r$). Strong, weak, or none.
– Direction: Whether the association is positive or negative.
• It probably isn't worth spending the time finding $r$ by hand.
• Least-Squares Regression Line (LSRL): $\hat{y} = a + bx$. (The hat is important.)
• $r^2$ = The percent of variation in y-values that can be explained by the LSRL, or how well the line fits the data.
• Residual = observed − predicted. This is basically how far away (positive or negative) the observed value ($y$) for a certain x is from the point on the LSRL for that x.
• ALWAYS read what they put on the axes so you don't get confused.
• If you see a (non-random) pattern in the residual points (think residual scatterplot), then it's safe to say that the LSRL doesn't fit the data.
• Outliers lie outside the overall pattern. Influential points, which significantly change the LSRL (slope and intercept), are outliers that deviate from the rest of the points in the x direction (as in, the x-value is an outlier).
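To make the LSRL, r, r² and residual definitions concrete, here is a rough sketch that computes them by hand in Python. The function name `lsrl` and the five data points are assumptions for illustration; on the exam you would normally let the calculator's linear regression do this.

```python
from math import sqrt

def lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the line passes through (x_bar, y_bar)
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return a, b, r

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = lsrl(xs, ys)
# Residual = observed - predicted, one per data point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b, r ** 2, residuals)
```

The residuals of a least-squares line always sum to (approximately) zero, so what matters in a residual plot is whether they show a pattern, not their total.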
• Exponential regression: $\hat{y} = a b^x$. (Anything raised to $x$ is exponential.)
• Power regression: $\hat{y} = a x^b$.
• We cannot extrapolate (predict outside of the scatterplot's range) with these.
• Correlation DOES NOT imply causation. Just because San Franciscans tend to be liberal doesn't mean that living in San Francisco causes one to become a liberal.
• Lurking variables either show a common response or confound.
• Cause: x causes y, no lurking variables.
• Common response: The lurking variable affects both the explanatory (x) and response (y) variables. For example: When we want to find whether more hours of sleep explains higher GPAs, we must recognize that a student's courseload can affect his/her hours of sleep and GPA.
• Confounding: The lurking variable affects only the response (y).
• Studies: They're all studies, but observational ones don't impose a treatment whereas experiments do. From an observational study we cannot do anything more than conclude a correlation or tendency (as in, NO CAUSATION).
• Observational studies do not impose a treatment.
• Experimental studies do impose a treatment.
• Some forms of bias:
– Voluntary response: e.g. letting volunteers call in.
– Undercoverage: Not reaching all types of people because, for example, they don't have a telephone number for a survey.
– Non-response: Questionnaires which allow people to not respond.
– Convenience sampling: Choosing a sample that is easy but likely non-random and thus biased.
• Simple Random Sample (SRS): A certain number of people are chosen from a population so that each person (and each possible group of that size) has an equal chance of being selected.
• Stratified Random Sampling: Break the population into strata (groups), then do an SRS within each stratum. DO NOT confuse with a pure SRS, which does NOT break anything up.
• Cluster Sampling: Break the population up into clusters, then randomly select n clusters and poll all people in those clusters.
• In experiments, we must have:
– Control/placebo (fake drug) group
– Randomization of sample
– Ability to replicate the experiment in similar conditions
• Double blind: Neither the subject nor the administrator of the treatment knows which one is a placebo and which is the real drug being tested.
• Matched pairs: Refers to having each person do both treatments. Randomly select which half of the group does the treatments in a certain order. Have the other half do the treatments in the other order.
• Block design: Eliminate confounding due to race, gender, and other lurking variables by breaking the experimental group into groups (blocks) based on these categories, and compare only within each sub-group.
• Use a random number table or, on your calculator: RandInt(lower bound #, upper bound #, how many #'s to generate)
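The difference between an SRS, a stratified sample and a cluster sample is easier to see side by side. The sketch below uses Python's `random` module; the function names, the population of 100 numbered people, and the "juniors"/"seniors" strata are all hypothetical, and the last line only imitates what the calculator's RandInt does.

```python
import random

def simple_random_sample(population, n, rng):
    # Every group of n people is equally likely to be chosen.
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    # strata maps a stratum name to its members; take an SRS inside each one.
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly pick whole clusters, then poll everyone in them.
    chosen = rng.sample(clusters, n_clusters)
    return [person for cluster in chosen for person in cluster]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical people numbered 1..100

srs = simple_random_sample(population, 10, rng)
strata = {"juniors": population[:50], "seniors": population[50:]}
stratified = stratified_sample(strata, 5, rng)
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
clustered = cluster_sample(clusters, 2, rng)

# Roughly what RandInt(1, 6, 3) does: three random integers from 1 to 6.
rand_ints = [rng.randint(1, 6) for _ in range(3)]
```

The stratified sample is guaranteed to contain people from every stratum, while the cluster sample contains everyone from only a few clusters.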
• Probabilities are ≥ 0 and ≤ 1.
• Complement = $1 - P(A)$ and is written $P(A^c)$.
• Disjoint (aka mutually exclusive) events have no common outcomes.
• Independent events don't affect each other.
• $P(A \text{ and } B) = P(A) \cdot P(B)$ (for independent events)
• $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
• $P(B \text{ given } A) = \frac{P(A \text{ and } B)}{P(A)}$
• $P(B \text{ given } A) = P(B)$ means independence.
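A small worked example of these rules, sketched in Python with exact fractions. The setup (one roll of a fair die, A = "even", B = "greater than 3") is made up for illustration.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}  # even roll
B = {x for x in outcomes if x > 3}       # roll greater than 3

def prob(event):
    # Equally likely outcomes: favorable outcomes / total outcomes.
    return Fraction(len(event), len(outcomes))

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)
p_a_or_b = p_a + p_b - p_a_and_b  # general addition rule
p_b_given_a = p_a_and_b / p_a     # conditional probability
independent = p_b_given_a == p_b  # independence check

print(p_a_and_b, p_a_or_b, p_b_given_a, independent)
```

Here P(B given A) differs from P(B), so A and B are not independent, and P(A) · P(B) would give the wrong value for P(A and B).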
• Discrete random variable: Defined probabilities for certain values of x. Sum of probabilities should equal 1. Usually shown in a probability distribution table.
• Continuous random variable: Involves a density curve (area under it is 1), and you define intervals for certain probabilities and/or z-scores.
• Expected value = sum of the probability of each possible outcome times the outcome value (or payoff) = $\mu_X = P(x_1) x_1 + P(x_2) x_2 + \ldots + P(x_n) x_n$.
• Variance = $\sum (x_i - \mu_X)^2 P(x_i)$ over all values of x.
• Standard deviation = $\sqrt{\text{variance}} = \sqrt{\sum (x_i - \mu_X)^2 P(x_i)}$
• Means of two different variables can be added or subtracted: $\mu_{X \pm Y} = \mu_X \pm \mu_Y$. Variances, NOT standard deviations, combine: for independent variables, $\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y$. (Square the standard deviation to get the variance.)
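The expected value and variance formulas for a discrete random variable can be checked with a short Python sketch. The probability distribution table here (values 0, 1, 2) is made up for illustration.

```python
from math import sqrt

# Hypothetical probability distribution table.
values = [0, 1, 2]
probs = [0.25, 0.5, 0.25]

# The probabilities of a discrete random variable should sum to 1.
total_prob = sum(probs)

# Expected value: each outcome times its probability, summed.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: squared distance from the mean, weighted by probability.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
std_dev = sqrt(variance)

print(mean, variance, std_dev)
```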
• Binomial distribution: $n$ is fixed, the probabilities of success and failure are constant, and each trial is independent.
• $p$ = probability of success
• $q$ = probability of failure = $1 - p$
• Mean = $np$
• Standard deviation = $\sqrt{npq}$. Treating the distribution as approximately normal with this mean and standard deviation will only work if $np \geq 10$ and $nq \geq 10$.
• Use binompdf(n, p, x) for a specific probability (exactly x successes).
• Use binomcdf(n, p, x) to sum up all probabilities up to x successes (including x as well). To restate this, it is the probability of getting x or fewer successes out of n trials.
– The c in binomcdf stands for cumulative.
• Geometric distributions: This distribution can answer two questions. Either a) the probability of getting the first success on the nth trial, or b) the probability of getting the first success within n trials.
– Probability of first having success on the nth trial = $p q^{n-1}$. On the calculator: geometpdf(p, n).
– Probability of first having success on or before the nth trial = sum of the probability of having first success on trial $x$ for every value of $x$ from 1 to $n$ = $pq^0 + pq^1 + \ldots + pq^{n-1} = \sum_{i=1}^{n} p q^{i-1}$. On the calculator: geometcdf(p, n).
– Mean = $\frac{1}{p}$
– Standard deviation = $\sqrt{\frac{q}{p^2}}$
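To see what the calculator functions are doing, here is a sketch of the four formulas in Python. The function names simply mirror the calculator commands; they are re-implementations written for illustration, not the calculator's own code.

```python
from math import comb

def binompdf(n, p, x):
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(x or fewer successes in n trials): sum of binompdf from 0 to x."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

def geometpdf(p, n):
    """P(first success occurs on trial n) = p * q^(n-1)."""
    return p * (1 - p) ** (n - 1)

def geometcdf(p, n):
    """P(first success occurs on or before trial n)."""
    return sum(geometpdf(p, k) for k in range(1, n + 1))

# Hypothetical example: fair coin, success = heads.
exactly_5_of_10 = binompdf(10, 0.5, 5)
at_most_5_of_10 = binomcdf(10, 0.5, 5)
first_head_on_3rd = geometpdf(0.5, 3)
first_head_by_3rd = geometcdf(0.5, 3)
```

A handy check on the geometric sum: the probability of a first success on or before trial n is also 1 − qⁿ, the complement of n failures in a row.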
• A statistic describes a sample. (s, s)
• A parameter describes a population. (p, p)
• $\hat{p}$ is a sample proportion whereas $p$ is a population proportion.
• Some conditions:
– Population size is ≥ 10 × sample size
– $np$ and $nq$ must both be ≥ 10
• Variability = spread of data
• Bias = accuracy (closeness to true value)
• $\hat{p}$ = successes / size of sample
• Mean = $\mu_{\hat{p}} = p$
• Standard deviation: $\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$
• $H_0$ is the null hypothesis.
• $H_a$ or $H_1$ is the alternative hypothesis.
• Confidence intervals follow the formula: estimator ± margin of error.
• To calculate a Z-interval: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
• The p-value represents the chance, assuming the null hypothesis is true, that we would observe a value at least as extreme as what our sample gives us (i.e. how ordinary it is to see that value purely through randomness).
• If the p-value is less than the alpha level (usually 0.05, but watch for what they specify), then the result is statistically significant, and thus we reject the null hypothesis.
• Type I error (α): We reject the null hypothesis when it's actually true.
• Type II error (β): We fail to reject (and thus accept) the null hypothesis when it is actually false.
• Power of the test = 1 − β, or our ability to reject the null hypothesis when it is false.
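Here is a rough sketch of a Z-interval and a p-value decision in Python. It assumes a one-sample z-test with known population standard deviation and a two-sided alternative; the function names and the numbers (sample mean 50, σ = 10, n = 25, null value 46) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Estimator +/- margin of error, with margin = z* * sigma / sqrt(n)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

def two_sided_p_value(x_bar, mu_0, sigma, n):
    """Chance of a sample mean at least this far from mu_0 if H0 is true."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers, for illustration only.
low, high = z_interval(x_bar=50, sigma=10, n=25)
p_value = two_sided_p_value(x_bar=50, mu_0=46, sigma=10, n=25)

alpha = 0.05
reject_null = p_value < alpha  # statistically significant if True

print(round(low, 2), round(high, 2), round(p_value, 4), reject_null)
```

In this example the 95% interval does not contain 46 and the p-value is below 0.05, so both approaches lead to rejecting the null hypothesis.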
• T-distributions: These are very similar to Z-distributions and are typically used with small sample sizes or when the population standard deviation isn't known.
• To calculate a T-interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
• Degrees of freedom (df) = sample size − 1 = $n - 1$
• To perform a hypothesis test with a T-distribution:
– Calculate your test statistic: $t = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}}$ (as written in the FRQ formulas packet) $= \frac{\bar{x} - \mu}{s / \sqrt{n}}$
– Either use the T-table provided (unless given, use a probability of .05 aka confidence level of 95%) or use the T-test on your calculator to get a $t^*$ (critical t) value to compare against your t value.
– If your t value is larger (in absolute value) than $t^*$, then reject the null hypothesis.
– You may also find the closest probability that fits your df and t value; if it is below 0.05 (or whatever alpha level is given), reject the null hypothesis.
• Be sure to check for normality first; some guidelines:
– If $n < 15$, the data should look close to normal with no outliers.
– If $15 \leq n < 40$, it's okay unless there are outliers or strong skewness.
– If $n \geq 40$, it's okay.
• Two-sample T-test:
– $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
– Use the smaller $n$ out of the two sample sizes when calculating the df (df = smaller $n$ − 1).
– Null hypothesis can be any of the following:
∗ $H_0: \mu_1 = \mu_2$
∗ $H_0: \mu_1 - \mu_2 = 0$
∗ $H_0: \mu_2 - \mu_1 = 0$
– Use 2-SampTTest on your calculator.
• For two-sample T-test confidence intervals:
– $\mu_1 - \mu_2$ is estimated by $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
– Use 2-SampTInt on your calculator.
• Remember ZAP TAX (Z for Probability, T for Samples ($\bar{X}$)).
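The one-sample and two-sample t statistics above can be sketched in Python as follows. The function names and data are made up for illustration, the two-sample df uses the conservative "smaller n minus 1" rule from these notes, and the critical value is typed in from a t-table rather than computed.

```python
from math import sqrt
from statistics import mean, variance

def one_sample_t(data, mu_0):
    """t = (x_bar - mu_0) / (s / sqrt(n)), with df = n - 1."""
    n = len(data)
    t = (mean(data) - mu_0) / sqrt(variance(data) / n)
    return t, n - 1

def two_sample_t(a, b):
    """t = (x_bar1 - x_bar2) / sqrt(s1^2/n1 + s2^2/n2), conservative df."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    df = min(len(a), len(b)) - 1
    return t, df

# Made-up samples for illustration.
t1, df1 = one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], mu_0=4)
t2, df2 = two_sample_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Critical value read from a t-table: df = 7, two-sided alpha = 0.05.
t_star = 2.365
reject_null = abs(t1) > t_star

print(round(t1, 4), df1, round(t2, 4), df2, reject_null)
```

Here |t| for the one-sample test is smaller than t*, so we fail to reject the null hypothesis at the 0.05 level.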
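• Confidence interval for two proportions:
– $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$
– Use 2-PropZInt on your calculator.
• Hypothesis test for two proportions:
– $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled proportion (total successes / total sample size) and $\hat{q} = 1 - \hat{p}$.
– Use 2-PropZTest on your calculator.
• Remember: Proportion is for categorical variables.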
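A minimal sketch of the two-proportion interval and test statistic in Python. The function names and counts (60 successes out of 100 versus 45 out of 100) are hypothetical; note that the test pools the two samples while the interval does not.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """(p1-hat - p2-hat) +/- z* times the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

def two_prop_z_test(x1, n1, x2, n2):
    """z statistic using the pooled proportion under H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts, for illustration only.
low, high = two_prop_z_interval(60, 100, 45, 100)
z = two_prop_z_test(60, 100, 45, 100)

print(round(low, 4), round(high, 4), round(z, 3))
```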
• Chi-square ($\chi^2$):
– Used for counted data.
– Used when we want to test independence, homogeneity, and "goodness of fit" to a distribution.
– The formula is: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$.
– Degrees of freedom = $(r - 1)(c - 1)$, where $r$ = # rows and $c$ = # columns.
– To calculate the expected value for a cell from an observed table: $\frac{(\text{row total})(\text{column total})}{\text{table total}}$
– Large $\chi^2$ values are evidence against the null hypothesis, which states that the percentages of observed and expected match (as in, any differences are attributed to chance).
– On your calculator: For independence/homogeneity, put the 2-way table in matrix A and perform a $\chi^2$-Test. The expected values will go into whatever matrix they are specified to go in.
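What the calculator's χ²-Test does with a two-way table can be sketched in Python like this. The function name `chi_square` and the 2×2 table of counts are made up for illustration, and the critical value is typed in from a chi-square table.

```python
def chi_square(table):
    """Return (statistic, df, expected) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count for each cell = (row total)(column total) / table total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected

# Hypothetical observed counts.
observed = [[10, 20],
            [30, 40]]
stat, df, expected = chi_square(observed)

# Critical value read from a chi-square table: df = 1, alpha = 0.05.
critical = 3.841
reject_null = stat > critical

print(round(stat, 4), df, expected, reject_null)
```

The statistic here is well below the critical value, so the differences between observed and expected counts are consistent with chance.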
• Regression inference is the same thing as what we did earlier, just with us looking at the $a$ and $b$ in $\hat{y} = a + bx$.