Agenda
• Sampling distribution
• Central Limit Theorem
• Confidence intervals
• Hypothesis formulation
• Null and alternative hypothesis
• Type I and Type II errors
• Hypothesis testing
– One-tailed vs. two-tailed test
– Test of mean
– Test of proportion
– Test of variance
• Examples
Concepts of sampling distribution
• Why do we need sampling?
• We analyse the sample and make inferences about the population.
• Sample statistic vs. population parameter
• Sampling distribution – the distribution of a particular sample statistic over all possible samples that can be drawn from a population, e.g. the sampling distribution of the mean.
Sampling Distribution: CLT
• If samples of size n are drawn from a population that has a mean µ and standard deviation σ:
• The sampling distribution of the sample mean follows a normal distribution with:
• Mean: µ
• Standard deviation: σ / √n (also called the Standard Error)
• The corresponding z-score transformation is:
z = (x̄ − µ) / (σ / √n)
• If the population is normal, this holds true even for smaller sample sizes.
• However, if the population is not normal, this holds true only for sufficiently large sample sizes.
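As a quick check of these formulas, the following sketch draws repeated samples from a deliberately non-normal (skewed) population and compares the mean and spread of the sample means with µ and σ / √n. The population, sample size, and seed are all illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed (non-normal) population with known mean and standard deviation
population = rng.exponential(scale=2.0, size=100_000)
mu, sigma = population.mean(), population.std()

# Draw many samples of size n and record each sample mean
n = 40
sample_means = np.array([
    rng.choice(population, size=n).mean() for _ in range(5_000)
])

print(f"population mean {mu:.3f} vs mean of sample means {sample_means.mean():.3f}")
print(f"standard error sigma/sqrt(n) {sigma / np.sqrt(n):.3f} "
      f"vs std of sample means {sample_means.std():.3f}")
```

Despite the skewed population, the two printed pairs should agree closely, matching the CLT's claim.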
Central Limit Theorem
• "The sampling distribution of the mean of any independent random variable will be normal."
• This applies to both discrete and continuous distributions.
• The random variable should have a well-defined mean and variance (standard deviation).
• It is applicable even when the original variable is not normally distributed.
• Assumptions:
• The data must be randomly sampled.
• The sample values must be independent of each other.
• The 10% condition: when the sample is drawn without replacement, the sample size n should be no more than 10% of the population.
• The sample size must be sufficiently large.
– In general, a sample size of 30 is considered sufficient.
– If the population is skewed, a fairly large sample size is needed.
– For a symmetric population, even small samples are acceptable.
Central Limit Theorem (contd.)
Assume a die is rolled in sets of 4 trials and the face values are recorded.
This is repeated for a month (30 days).
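A minimal simulation of this setup, assuming a fair six-sided die (the seed is arbitrary): each day's mean of 4 rolls is one draw from the sampling distribution of the mean, whose theoretical centre is 3.5 with standard error √(35/12) / √4 ≈ 0.854.

```python
import random
import statistics

random.seed(42)

# One month: each day, roll a fair die 4 times and record the mean face value
daily_means = [
    statistics.mean(random.randint(1, 6) for _ in range(4))
    for _ in range(30)
]

# Theory: mean of the sampling distribution = 3.5,
# standard error = sigma / sqrt(n) = sqrt(35/12) / sqrt(4) ~ 0.854
print(statistics.mean(daily_means), statistics.stdev(daily_means))
```

With only 30 daily means the histogram is rough, but repeating over many "months" it visibly approaches a normal curve centred at 3.5.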
Null and Alternative Hypothesis
All statistical conclusions are made in reference to the null hypothesis.
We either reject the null hypothesis or fail to reject it; we do not "accept" the null hypothesis.
From the start, we assume the null hypothesis to be true; later, the assumption is either rejected or we fail to reject it.
• When we reject the null hypothesis, we can conclude that the alternative hypothesis is supported.
• If we fail to reject the null hypothesis, it does not mean that we have proven the null hypothesis is true.
– Failure to reject the null hypothesis does not equate to proving that it is true.
– It merely upholds our assumption, i.e. the status quo.
Types of hypothesis tests
• Single sample, or two or more samples
• One-tailed or two-tailed
• Tests of mean, proportion, or variance
Example problem – Single sample z-test of mean
• You are the manager of a fast-food restaurant. You want to determine whether the population mean waiting time has changed from its past value of 4.5 minutes. You can assume that the population standard deviation is 1.2 minutes. You select a sample of 25 orders in an hour; the sample mean is 5.1 minutes. Use the relevant hypothesis test to determine whether the population mean has changed from the past value of 4.5 minutes.
Steps to solve the problem:
• One-tailed or two-tailed?
• What are H0 and Ha?
• Determine the critical value Z and the test statistic Zstat
• Draw the normal curve
• Reject or fail to reject H0?
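Following those steps for the restaurant problem (a two-tailed test with H0: µ = 4.5 and Ha: µ ≠ 4.5), a standard-library-only sketch of the calculation:

```python
from math import sqrt, erf

mu0, sigma, n, xbar, alpha = 4.5, 1.2, 25, 5.1, 0.05

# Test statistic: z = (xbar - mu0) / (sigma / sqrt(n)) = 0.6 / 0.24 = 2.5
z_stat = (xbar - mu0) / (sigma / sqrt(n))

# Standard normal CDF via the error function (no scipy needed)
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

# Two-tailed p-value
p_value = 2 * (1 - phi(abs(z_stat)))

print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the mean waiting time has changed from 4.5 minutes")
```

Here z = 2.5 exceeds the two-tailed critical value of 1.96 (p ≈ 0.0124 < 0.05), so we reject H0.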
Hypothesis Tests using Python
z-test
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided')
Link to refer -
https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html
t-test
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
Link to refer -
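A sketch of `scipy.stats.ttest_ind` using made-up waiting times for two hypothetical restaurant branches (the data, means, and seed are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical waiting times (minutes) at two branches of a restaurant
branch_a = rng.normal(loc=4.5, scale=1.2, size=30)
branch_b = rng.normal(loc=5.3, scale=1.2, size=30)

# Independent two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(branch_a, branch_b, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Set `equal_var=False` to run Welch's t-test when the equal-variance assumption is doubtful.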
Chi-square (χ²) test
scipy.stats.chisquare(f_obs, f_exp=None)
Link to refer -
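For instance, a goodness-of-fit test of whether a die is fair, with made-up counts from 60 rolls:

```python
from scipy import stats

# Observed face counts from 60 rolls of a die (illustrative data)
observed = [8, 9, 13, 7, 12, 11]
expected = [10] * 6            # a fair die: 60 / 6 = 10 per face

# chi2 = sum((obs - exp)^2 / exp) = 2.8 here, with 5 degrees of freedom
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The large p-value (well above 0.05) means we fail to reject the hypothesis that the die is fair.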
F-test
alpha = 0.05  # or whatever you want your alpha to be
p_value = 1 - scipy.stats.f.cdf(F, df1, df2)  # one-tailed p-value, with F = larger sample variance / smaller sample variance
if p_value < alpha:  # reject the null hypothesis that Var(X) == Var(Y)
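Putting that snippet together into a runnable one-tailed sketch (the samples and seed are made up; the F statistic is formed as the larger sample variance over the smaller one):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0, 1.0, size=40)   # standard deviation 1
y = rng.normal(0, 2.0, size=40)   # standard deviation 2

# F statistic: ratio of sample variances, larger over smaller
var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
F = max(var_x, var_y) / min(var_x, var_y)
df1 = df2 = len(x) - 1

alpha = 0.05
p_value = 1 - stats.f.cdf(F, df1, df2)   # one-tailed p-value
if p_value < alpha:
    print("Reject the null hypothesis that Var(X) == Var(Y)")
```

Because the two populations really do have different variances here, the test should reject H0.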
Hypothesis Testing Using Python
Two Sample Testing
Some important functions:
1. t_statistic, p_value = ttest_ind(group1, group2)
2. u, p_value = mannwhitneyu(group1, group2)
3. t_statistic, p_value = ttest_1samp(post - pre, 0)
4. z_statistic, p_value = wilcoxon(post - pre)
5. levene(pre, post)
6. shapiro(post)
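An illustrative paired pre/post example exercising functions 3 to 6 above (the scores are made up; `pre` and `post` are measurements on the same ten subjects):

```python
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon, shapiro, levene

# Hypothetical pre/post scores for the same 10 subjects (paired data)
pre = np.array([72, 68, 75, 70, 64, 69, 77, 71, 66, 73])
post = np.array([75, 70, 78, 74, 66, 72, 80, 73, 69, 76])

# Paired t-test: is the mean of the differences zero?
t_stat, p_t = ttest_1samp(post - pre, 0)

# Non-parametric alternative on the same differences
w_stat, p_w = wilcoxon(post - pre)
print(f"paired t: p = {p_t:.4f}, wilcoxon: p = {p_w:.4f}")

# Assumption checks: normality of differences, equality of variances
print(shapiro(post - pre))
print(levene(pre, post))
```

Every subject improved here, so both tests report a very small p-value and we reject the hypothesis of no change.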
ANOVA – One-Way Classification
• The samples drawn from different populations are independent and random.
• The response variables of all the populations are normally distributed.
• The variances of all the populations are equal.
Hypothesis of One-Way ANOVA
• H0: µ1 = µ2 = µ3 = µ4 = … = µk
– All population means are equal
• H1: Not all of the population means are equal
– For at least one pair, the population means are unequal.
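One common implementation of this test is `scipy.stats.f_oneway`; a sketch with three made-up independent groups, one of which has a clearly higher mean:

```python
from scipy.stats import f_oneway

# Hypothetical response measurements from three independent groups
group1 = [23, 25, 21, 27, 24]
group2 = [30, 31, 28, 33, 29]   # noticeably higher mean
group3 = [22, 24, 26, 23, 25]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: not all group means are equal")
```

ANOVA only says that at least one pair of means differs; a post-hoc test (e.g. Tukey's HSD) is needed to identify which pair.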
Solutions and examples in the next part.