PSTAT 5A: Lecture 19

Two-Sample Tests

Mallory Wang

2024-07-23

Multiple Populations

  • So far, we’ve talked about constructing confidence intervals and performing hypothesis tests for both population proportions and population means.

  • One crucial thing to note is that everything we’ve done has been in the context of a single population

  • Sometimes, as Data Scientists, we may want to test claims about the differences between two populations

    • E.g. Is the average monthly income in Santa Barbara different from the average monthly income in San Francisco?

    • E.g. Is the proportion of people who test positive for a disease in one country different than the proportion that test positive in a second country?

Two Populations

Two Populations

  • Statistically: we are imagining two populations, Population 1 and Population 2, governed by parameters \(\theta_1\) and \(\theta_2\), respectively, and trying to test claims about the relationship between \(\theta_1\) and \(\theta_2\).

    • For example, we could consider two populations with means \(\mu_1\) and \(\mu_2\), respectively, and try to make claims about whether or not \(\mu_1\) and \(\mu_2\) are equal.
  • The trick Statisticians use is to think in terms of the difference \(\mu_1 - \mu_2.\)

    • For example, if our null hypothesis is that \(\mu_1 = \mu_2\), this can be rephrased as \(H_0: \ \mu_1 - \mu_2 = 0\).

Two Populations

  • The reason we do this is because we have now effectively reduced our two-parameter problem into a one-parameter problem, involving only the parameter \(\delta := \mu_1 - \mu_2\).

  • Now, we will need a point estimator of \(\delta\). Just as before, we used \(\bar{X}\) to estimate \(\mu\), we now use \(\bar{X}_2-\bar{X}_1\) to estimate \(\mu_1 - \mu_2\).

Two Populations

  • We will ultimately need access to the sampling distribution of \(\bar{X}_1-\bar{X}_2\).

  • Just as before, we have to check conditions to determine the sampling distribution of \(\bar{X}_1-\bar{X}_2\):

  1. Independence within each sample: The observations within each sample are independent (e.g. we have a random sample from each of the two populations).

  2. Independence between the samples: he two samples are independent of one another such that observations in one sample tell us nothing about the observations in the other sample (and vice versa).

  3. Normality: When the sample sizes are small, we require that the observations in each sample come from a normally distributed population. We can relax this condition more and more for larger and larger sample sizes. (You need to check normality for both groups.)

Sampling Distribution of \(\bar{X}_1-\bar{X}_2\)

When the conditions are satisfied, the sampling distribution of \(\bar{X}_1-\bar{X}_2\) will be nearly normal with mean \[\mu_1 -\mu_2\] and standard error \[SE = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}\]

Sampling Distribution of \(\bar{X}_1-\bar{X}_2\)

This is only if we know the population standard deviation \(\sigma_1\) and \(\sigma_2\), which is unlikely. Instead, we can use the sample standard deviations \(s_1\) and \(s_2\) so we have \[SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]

We will use technology to calculate the degrees of freedom. When we can’t use technologies, we will use the smaller of \(n_1-1\) and \(n_2-1\) for the degrees of freedom.

Hypothesis Test for \(\mu_1-\mu_2\)

Hypothesis Test for \(\mu_1-\mu_2\)

We will still follow the 5 steps to hypothesis test:

Step 1: Write the hypothesis statements

Step 2: Check conditions for determining the distribution of your sample statistic

Step 3: Calculate test statistic

Step 4: Use test statistic to determine decision of test (either using comparison to critical value or calculate p-value)

Step 5: Write a conclusion in the context of the problem

Step 1: Write the hypothesis statements

The null hypothesis is

\[H_0: \mu_1-\mu_2 = 0\] The alternative is one of

\[H_A: \mu_1-\mu_2 < 0\] \[H_A: \mu_1-\mu_2 \neq 0\] \[H_A: \mu_1-\mu_2 > 0\]

Step 2: Check conditions for your sampling distribution

  1. Independence within each sample

  2. Independence between samples

  3. Normality

Step 3: Calculate test statistic

Similar to how we determined our test statistic before, with the generic form:

\[\frac{\text{sample statistic} - \text{null value}}{\text{standard error}}\]

In this situation, we have:

\[\text{TS} = \frac{\bar{x}_1-\bar{x}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

And similar to the one mean problem, this test statistic follows a \(t\) distribution with degrees of freedom calculated from technology.

Step 4: Make decision on test and Step 5: Write a conclusion

  • As we did with one mean, the decision will be based on the kind of test (two sided, lower tail, or greater tail tests). We will calculate the p-value to measure the strength of the evidence against the null hypothesis, comparing it to the significance level, \(\alpha\).

  • Finally, we write a conclusion in the context of the problem which include our 5 elements:

    1. significance level,

    2. decision of the test,

    3. as a function of the null hypothesis. Then we write an interpretation that is

    4. not too strong, and

    5. in affirmation or not in affirmation of the alternative hypothesis.

Worked-Out Example

Worked-Out Example 1

Gaucho Gourmande has two locations: one in Goleta and one in Santa Barbara. The owner would like to determine whether the average revenue generated by the two locations are equal or not. To that end, he computes the net revenue generated by the Goleta location over 30 days and also computes the net revenue generated by the Santa Barbara location over 35 days (assume all of the necessary independence conditions hold), and produced the following information:

\[\begin{array}{r|cc} & \text{Sample Average} & \text{Sample Standard Deviation} \\ \hline \textbf{Goleta} & \$13 & \$3.45 \\ \textbf{Santa Barbara} & \$15 & \$4.23 \end{array}\]

Test the owner’s claims at an \(\alpha = 0.05\) level of significance, using a two-sided alternative.

Solutions

  • Our first step should be to figure out what “Population 1” and “Population 2” are in the context of the problem.

  • Let “Goleta Location” be Population 1 and “Santa Barbara Location” be Population 2.

    • It is perfectly acceptable to swap these two, but just be sure you remain consistent throughout the problem!
    • Also, I will expect you to explicitly write out your definitions of the populations (like above), even if the problem doesn’t explicitly ask you to do so.
  • In this way, \[ \overline{X} = 13; \quad s_X = 3.45; \quad \overline{Y} = 15; \quad s_Y = 4.23 \]

  • Additionally, \(n_1 = 30\) and \(n_2 = 35\).

Solutions

  • Now, let’s compute the value of the test statistic. \[ \mathrm{TS} = \frac{\overline{Y} - \overline{X}}{\sqrt{ \frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}}} = \frac{15 - 13}{\sqrt{\frac{3.45^2}{30} + \frac{4.23^2}{35} }} = 2.10 \]

  • We should next figure out the degrees of freedom: \[\begin{align*} \mathrm{df} & = \mathrm{round}\left\{ \frac{ \left[ \left( \frac{s_X^2}{n_1} \right) + \left( \frac{s_Y^2}{n_2} \right) \right]^2 }{ \frac{\left( \frac{s_X^2}{n_1} \right)^2}{n_1 - 1} + \frac{\left( \frac{s_Y^2}{n_2} \right)^2}{n_2 - 1} } \right\} \\ & = \mathrm{round}\left\{ \frac{ \left[ \left( \frac{3.45^2}{30} \right) + \left( \frac{4.23^2}{35} \right) \right]^2 }{ \frac{\left( \frac{3.45^2}{30} \right)^2}{30 - 1} + \frac{\left( \frac{4.23^2}{35} \right)^2}{35 - 1} } \right\} = 63 \end{align*}\]

Solutions

  • At this point, we could either proceed using critical values or using p-values.

  • Let’s use p-values, for practice.

  • Our p-value is computed as

import scipy.stats as sps
2*sps.t.cdf(-2.10, 63)
np.float64(0.03973904581390475)
  • This is below our level of significance \(\alpha = 0.05\) meaning we would reject the null.

  • If we wanted to instead use critical values:

-sps.t.ppf(0.05, 63)
np.float64(1.669402221706813)
  • This means our critical value is 1.67; since \(|\mathrm{TS}| = |2.10| = 2.10 > 1.67\), we would again reject at an \(\alpha = 0.05\) level of significance.

Solutions

At a 5% level of significance, there was sufficient evidence to reject the owner’s claims that the revenue generated by the two locations are equal, in favor of the alternative that the revenue generated by the two locations are not equal.

Extensions

  • Unsurprisingly, we can adapt the above procedure to account for one-sided alternatives as well.

  • For instance, suppose we wish to test \[ \left[ \begin{array}{rr} H_0: & \mu_1 = \mu_2 \\ H_A: & \mu_1 < \mu_2 \end{array} \right.\]

  • Again, we rephrase things as: \[ \left[ \begin{array}{rr} H_0: & \mu_2 - \mu_1 = 0 \\ H_A: & \mu_2 - \mu_1 > 0 \end{array} \right.\] which is now a familiar upper-tailed test on \(\delta = \mu_2 - \mu_1\) and \(\mu_0 = 0.\)

Extensions

  • Specifically, we would take the same test statistic (which would still follow the same distribution under the null) and use the decision rule \[ \texttt{decision}(\mathrm{TS}) = \begin{cases} \texttt{reject } H_0 & \text{if } \mathrm{TS} > c \\ \texttt{fail to reject } H_0 & \text{otherwise}\\ \end{cases} \] where \(c\) is the appropriate quantile of the approximate t distribution (with degrees of freedom given by the Satterthwaite Approximation).

  • A similar result holds for the lower-tailed test- I encourage you to work it out on your own.

Confidence Interval for \(\mu_1-\mu_2\)

Confidence Interval for \(\mu_1-\mu_2\)

As with the confidence interval for one mean, we had the generic form

\[\hat{\theta} \pm c \cdot \text{standard error}\]

So confidence interval for \(\mu_1-\mu_2\) is \[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]

Worked-Out Example

Worked-Out Example 2

Since public universities are subsidized by state governments, it is usually less expensive to attend a public university in your home state than a private university. Is this still true when you must pay out-of-state tuition? A random sample of 75 private universities had an average total cost of $42,336 with standard deviation of $14,004. A similar random sample of 80 public universities had an average out-of-state total cost of $32,240 with standard deviation $7,730. Calculate the 90% confidence interval for the difference in population mean total cost for students at private schools and the population mean total cost for out-of-state students at public schools.

Solutions

  • These are both random samples so the observations with each sample are independent of one another.

  • There is no reason why the two samples would not be independent between each other.

  • Given \(n_1 = 75 \geq 30\) and \(n_2 = 80 \geq 30\), we have large enough sample to relax normality assumption and use CLT.

  • To calculate the degrees of freedom:

\[df =\min(n_1-1,n_2-1) = \min(74, 79) = 74\] - Calculate \(t^*\) critical value given \(\alpha = 0.1\) which is 1.6657

Solutions

\[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]

\[(42336 - 32240) \pm 1.6657 \sqrt{\frac{14004^2}{75}+\frac{7730^2}{80}}\]

\[10096 \pm 3054\]

\[[\$7,042, \$13,150]\]

We are 90% confident that the true difference in population mean total cost for students at private schools and population mean total cost for out-of-state students at public schools is between $7,042 and $13,150.