Mallory Wang
2024-07-23
So far, we’ve talked about constructing confidence intervals and performing hypothesis tests for both population proportions and population means.
One crucial thing to note is that everything we’ve done has been in the context of a single population.
Sometimes, as data scientists, we may want to test claims about the differences between two populations.
E.g. Is the average monthly income in Santa Barbara different from the average monthly income in San Francisco?
E.g. Is the proportion of people who test positive for a disease in one country different than the proportion that test positive in a second country?
Statistically: we are imagining two populations, Population 1 and Population 2, governed by parameters \(\theta_1\) and \(\theta_2\), respectively, and trying to test claims about the relationship between \(\theta_1\) and \(\theta_2\).
The trick statisticians use, when the parameters of interest are the population means \(\mu_1\) and \(\mu_2\), is to think in terms of the difference \(\mu_1 - \mu_2.\)
We do this because it effectively reduces our two-parameter problem to a one-parameter problem, involving only the parameter \(\delta := \mu_1 - \mu_2\).
Now, we will need a point estimator of \(\delta\). Just as we previously used \(\bar{X}\) to estimate \(\mu\), we now use \(\bar{X}_1-\bar{X}_2\) to estimate \(\mu_1 - \mu_2\).
We will ultimately need access to the sampling distribution of \(\bar{X}_1-\bar{X}_2\).
Just as before, we have to check conditions to determine the sampling distribution of \(\bar{X}_1-\bar{X}_2\):
Independence within each sample: The observations within each sample are independent (e.g. we have a random sample from each of the two populations).
Independence between the samples: The two samples are independent of one another, such that observations in one sample tell us nothing about the observations in the other sample (and vice versa).
Normality: When the sample sizes are small, we require that the observations in each sample come from a normally distributed population. We can relax this condition more and more for larger and larger sample sizes. (You need to check normality for both groups.)
When the conditions are satisfied, the sampling distribution of \(\bar{X}_1-\bar{X}_2\) will be nearly normal with mean \[\mu_1 -\mu_2\] and standard error \[SE = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]
This holds only if we know the population standard deviations \(\sigma_1\) and \(\sigma_2\), which is unlikely. Instead, we can use the sample standard deviations \(s_1\) and \(s_2\), which gives \[SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
We will use technology to calculate the degrees of freedom. When we can’t use technology, we will use the smaller of \(n_1-1\) and \(n_2-1\) for the degrees of freedom.
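As a minimal sketch of the two formulas above, the estimated standard error and the conservative fallback for the degrees of freedom can be written in a few lines of Python. The function names and the input values are hypothetical, chosen only for illustration.

```python
import math

def standard_error(s1, n1, s2, n2):
    """Estimated standard error of x̄1 - x̄2 from sample SDs and sample sizes."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def conservative_df(n1, n2):
    """Fallback degrees of freedom when software is unavailable."""
    return min(n1 - 1, n2 - 1)

# Hypothetical (made-up) inputs, for illustration only
se = standard_error(2.0, 25, 3.0, 36)   # sqrt(4/25 + 9/36) = sqrt(0.41)
df = conservative_df(25, 36)            # min(24, 35) = 24
print(se, df)
```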
We will still follow the 5 steps of a hypothesis test:
Step 1: Write the hypothesis statements
Step 2: Check conditions for determining the distribution of your sample statistic
Step 3: Calculate test statistic
Step 4: Use test statistic to determine decision of test (either using comparison to critical value or calculate p-value)
Step 5: Write a conclusion in the context of the problem
The null hypothesis is
\[H_0: \mu_1-\mu_2 = 0\] The alternative is one of
\[H_A: \mu_1-\mu_2 < 0\] \[H_A: \mu_1-\mu_2 \neq 0\] \[H_A: \mu_1-\mu_2 > 0\]
Independence within each sample
Independence between samples
Normality
We determine the test statistic as before, starting from the generic form:
\[\frac{\text{sample statistic} - \text{null value}}{\text{standard error}}\]
In this situation, we have:
\[\text{TS} = \frac{\bar{x}_1-\bar{x}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]
As in the one-mean problem, this test statistic follows a \(t\) distribution with degrees of freedom calculated using technology.
As we did with one mean, the decision will be based on the kind of test (two-sided, lower-tailed, or upper-tailed). We will calculate the p-value to measure the strength of the evidence against the null hypothesis, comparing it to the significance level, \(\alpha\).
Finally, we write a conclusion in the context of the problem that includes our 5 elements:
significance level,
decision of the test,
as a function of the null hypothesis. Then we write an interpretation that is
not too strong, and
in affirmation or not in affirmation of the alternative hypothesis.
Worked-Out Example 1
Gaucho Gourmande has two locations: one in Goleta and one in Santa Barbara. The owner would like to determine whether the average revenues generated by the two locations are equal or not. To that end, he computes the net revenue generated by the Goleta location over 30 days and the net revenue generated by the Santa Barbara location over 35 days (assume all of the necessary independence conditions hold), and produces the following information:
\[\begin{array}{r|cc} & \text{Sample Average} & \text{Sample Standard Deviation} \\ \hline \textbf{Goleta} & \$13 & \$3.45 \\ \textbf{Santa Barbara} & \$15 & \$4.23 \end{array}\]
Test the owner’s claims at an \(\alpha = 0.05\) level of significance, using a two-sided alternative.
Our first step should be to figure out what “Population 1” and “Population 2” are in the context of the problem.
Let “Goleta Location” be Population 1 and “Santa Barbara Location” be Population 2.
In this way, \[ \overline{X} = 13; \quad s_X = 3.45; \quad \overline{Y} = 15; \quad s_Y = 4.23 \]
Additionally, \(n_1 = 30\) and \(n_2 = 35\).
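Now, let’s compute the value of the test statistic. Here we take the difference as Santa Barbara minus Goleta, \(\overline{Y} - \overline{X}\); for a two-sided test, the sign of the test statistic does not affect the conclusion. \[ \mathrm{TS} = \frac{\overline{Y} - \overline{X}}{\sqrt{ \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}}} = \frac{15 - 13}{\sqrt{\frac{3.45^2}{30} + \frac{4.23^2}{35} }} = 2.10 \]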
We should next figure out the degrees of freedom: \[\begin{align*} \mathrm{df} & = \mathrm{round}\left\{ \frac{ \left[ \left( \frac{s_X^2}{n_1} \right) + \left( \frac{s_Y^2}{n_2} \right) \right]^2 }{ \frac{\left( \frac{s_X^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{s_Y^2}{n_2} \right)^2}{n_2 - 1} } \right\} \\ & = \mathrm{round}\left\{ \frac{ \left[ \left( \frac{3.45^2}{30} \right) + \left( \frac{4.23^2}{35} \right) \right]^2 }{ \frac{\left( \frac{3.45^2}{30} \right)^2}{30 - 1} + \frac{\left( \frac{4.23^2}{35} \right)^2}{35 - 1} } \right\} = 63 \end{align*}\]
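The arithmetic for the test statistic and the degrees of freedom can be checked with a short Python sketch. The variable names are my own; the numbers are the summary statistics given in the example.

```python
import math

# Summary statistics from the example (Goleta = X, Santa Barbara = Y)
x_bar, s_x, n_1 = 13.0, 3.45, 30
y_bar, s_y, n_2 = 15.0, 4.23, 35

# Per-sample variance contributions s^2 / n
v_x = s_x**2 / n_1
v_y = s_y**2 / n_2

# Test statistic with null value 0
ts = (y_bar - x_bar) / math.sqrt(v_x + v_y)

# Degrees of freedom from the formula above, rounded to the nearest integer
df_exact = (v_x + v_y) ** 2 / (v_x**2 / (n_1 - 1) + v_y**2 / (n_2 - 1))
df = round(df_exact)

print(f"TS = {ts:.2f}, df = {df}")  # TS = 2.10, df = 63
```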
At this point, we could either proceed using critical values or using p-values.
Let’s use p-values, for practice.
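Our p-value is computed as \[ p\text{-value} = 2 \cdot P(T_{63} \geq 2.10) \approx 0.0397 \] where \(T_{63}\) denotes a random variable following a \(t\) distribution with 63 degrees of freedom.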
This is below our level of significance \(\alpha = 0.05\) meaning we would reject the null.
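One way to sketch the two-sided p-value calculation for this example is with SciPy's `t` distribution, assuming SciPy is available. The inputs are the rounded test statistic and degrees of freedom from the example.

```python
from scipy import stats

ts = 2.10   # test statistic from the example (rounded)
df = 63     # degrees of freedom from the example

# Two-sided p-value: probability of a t value at least as extreme as |TS|
p_value = 2 * stats.t.sf(abs(ts), df)

alpha = 0.05
reject_null = p_value < alpha
print(p_value, reject_null)
```

Here `stats.t.sf` is the survival function, \(P(T > t)\), so doubling it gives the area in both tails.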
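If we wanted to instead use critical values: at \(\alpha = 0.05\) with 63 degrees of freedom, the two-sided critical value is approximately 2.00. Since \(|\mathrm{TS}| = 2.10\) exceeds it, we would again reject the null.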
At a 5% level of significance, there was sufficient evidence to reject the owner’s claim that the average revenues generated by the two locations are equal, in favor of the alternative that they are not equal.
Unsurprisingly, we can adapt the above procedure to account for one-sided alternatives as well.
For instance, suppose we wish to test \[ \left[ \begin{array}{rr} H_0: & \mu_1 = \mu_2 \\ H_A: & \mu_1 < \mu_2 \end{array} \right.\]
Again, we rephrase things as: \[ \left[ \begin{array}{rr} H_0: & \mu_2 - \mu_1 = 0 \\ H_A: & \mu_2 - \mu_1 > 0 \end{array} \right.\] which is now a familiar upper-tailed test on the difference \(\mu_2 - \mu_1\), with null value \(0.\)
Specifically, we would take the same test statistic (which would still follow the same distribution under the null) and use the decision rule \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if } \mathrm{TS} > c \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \] where \(c\) is the appropriate quantile of the approximate t distribution (with degrees of freedom given by the Satterthwaite Approximation).
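The decision rule above translates almost directly into code. This is a minimal sketch: the function name mirrors the notation in the formula, and the critical value passed in is a hypothetical number used only for illustration.

```python
def decision(ts, c):
    """Upper-tailed decision rule: reject H0 only when TS exceeds the critical value c."""
    return "reject H0" if ts > c else "fail to reject H0"

# Hypothetical test statistics and critical value, for illustration only
print(decision(2.10, 1.67))  # reject H0
print(decision(1.20, 1.67))  # fail to reject H0
```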
A similar result holds for the lower-tailed test; I encourage you to work it out on your own.
As with the confidence interval for one mean, we have the generic form
\[\hat{\theta} \pm c \cdot \text{standard error}\]
So the confidence interval for \(\mu_1-\mu_2\) is \[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
Worked-Out Example 2
Since public universities are subsidized by state governments, it is usually less expensive to attend a public university in your home state than a private university. Is this still true when you must pay out-of-state tuition? A random sample of 75 private universities had an average total cost of $42,336 with standard deviation of $14,004. A similar random sample of 80 public universities had an average out-of-state total cost of $32,240 with standard deviation $7,730. Calculate the 90% confidence interval for the difference in population mean total cost for students at private schools and the population mean total cost for out-of-state students at public schools.
These are both random samples, so the observations within each sample are independent of one another.
There is no reason why the two samples would not be independent of each other.
Given \(n_1 = 75 \geq 30\) and \(n_2 = 80 \geq 30\), the samples are large enough to relax the normality assumption and use the CLT.
To calculate the degrees of freedom:
\[df =\min(n_1-1,n_2-1) = \min(74, 79) = 74\]
The \(t^*\) critical value for 90% confidence (\(\alpha = 0.1\)) with 74 degrees of freedom is 1.6657.
\[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
\[(42336 - 32240) \pm 1.6657 \sqrt{\frac{14004^2}{75}+\frac{7730^2}{80}}\]
\[10096 \pm 3054\]
\[[\$7,042, \$13,150]\]
We are 90% confident that the true difference in population mean total cost for students at private schools and population mean total cost for out-of-state students at public schools is between $7,042 and $13,150.
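The interval arithmetic can be reproduced with a short Python sketch. The variable names are my own; the critical value \(t^* = 1.6657\) is taken from the example as given rather than recomputed.

```python
import math

# Summary statistics from the example (1 = private, 2 = public out-of-state)
x_bar_1, s_1, n_1 = 42336, 14004, 75
x_bar_2, s_2, n_2 = 32240, 7730, 80
t_star = 1.6657  # critical value from the example (df = 74, 90% confidence)

point_estimate = x_bar_1 - x_bar_2
margin = t_star * math.sqrt(s_1**2 / n_1 + s_2**2 / n_2)

lower = point_estimate - margin
upper = point_estimate + margin
print(f"{point_estimate} ± {margin:.0f} -> [{lower:.0f}, {upper:.0f}]")
```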