2 Sampling Distribution of \(\bar X\)

If you take a sample from some population, say the heights of 5 randomly selected students in the lab, and calculate the mean of that sample, the result is known as a sample mean. Suppose you took a second sample of 5 randomly selected students and again calculated the sample mean height. It is unlikely that the first sample mean would be equal to the second sample mean. It is also unlikely that either of these sample means would be equal to the population mean height of all students in the lab today.

The sample mean is a statistic and we can find a distribution for a statistic so that properties such as the mean and variance of the sample means can be estimated. The distribution of all possible values that a statistic can take is called the sampling distribution of that statistic.

Let's consider an example of finding the sampling distribution of the sample mean, \(\bar X\), and the sampling distribution of the sample variance, \(S^2\), in a finite population. Because we are looking at a finite population, we will be able to list all values that \(\bar X\) and \(S^2\) can theoretically take.

Imagine we have a bag containing six balls, each with a unique number 1 to 6 printed on them, that we want to sample from. Let \(X\) be the random variable denoting the value printed on the ball that is drawn from the bag. \(X\) can only take the value 1, 2, 3, 4, 5 or 6 and there is a probability of \(\frac{1}{6}\) that any one of these values is drawn in a random draw.

The probability distribution for \(X\) is summarised in Table 2.1.

Table 2.1: Probability distribution for \(X\).
\(x_i\) \(P(X=x_i)\)
1 \(\frac{1}{6}\)
2 \(\frac{1}{6}\)
3 \(\frac{1}{6}\)
4 \(\frac{1}{6}\)
5 \(\frac{1}{6}\)
6 \(\frac{1}{6}\)

What is the mean of the random variable \(X\)?

\(\mu_X=\)

The mean of a random variable in a finite population can be calculated using,

\[ \mu=\frac{\sum_{i=1}^N X_i}{N} \]

Since the random variable \(X\) can only take the values \(\{1,2,3,4,5,6\}\), it comes from a finite population with six elements i.e. \(N=6\). Therefore, we can calculate the mean as,

\[ \begin{aligned} \mu_X&=\frac{1+2+3+4+5+6}{6}\\ &=3.5 \end{aligned} \]

What is the variance of the random variable \(X\)?

\(\sigma^2_X=\)

The variance of a random variable coming from a finite population can be calculated using,

\[ \sigma^2=\frac{1}{N}\sum_{i=1}^N X_i^2-(\mu)^2 \]

NOTE that the denominator is \(N\) (and not \(N-1\)) because this is a population variance and we know the population mean \(\mu\) and don't have to estimate it.

In this case, we can calculate the variance of \(X\) as,

\[ \begin{aligned} \sigma^2_X&=\frac{1}{6}(1^2+2^2+3^2+4^2+5^2+6^2)-(3.5)^2\\ &=\frac{35}{12}\approx2.9167 \end{aligned} \]
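The two hand calculations above can also be checked in R. The following is a minimal sketch using only base R; the object names x, N, mu_X and sigma2_X are chosen here for illustration and do not appear elsewhere in this section.

```r
# Population values printed on the six balls
x <- 1:6
N <- length(x)

# Population mean: sum of the values divided by N
mu_X <- sum(x) / N

# Population variance: mean of the squares minus the square of the mean
sigma2_X <- sum(x^2) / N - mu_X^2

mu_X      # 3.5
sigma2_X  # 2.916667

# var() divides by N - 1, so it has to be rescaled to give the
# population variance
var(x) * (N - 1) / N  # 2.916667
```

Note that var(x) on its own returns 3.5 here, because R's var() always uses the \(N-1\) denominator.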

2.1 Sampling with replacement

Because \(X\) comes from a finite population, we can list all possible outcomes of sampling from the bag. If we were to sample two balls from the bag with replacement, then there are \(6\times 6=36\) different ordered combinations of values that we might see in a single sample. To list these, we can use the function expand.grid().

The function expand.grid() works by providing as the arguments the vectors of different outcomes that can occur. In this case, all the possible values that \(X\) can take are \(\{1,2,3,4,5,6\}\). The function then returns a data frame which shows all combinations of the different outcomes provided in the arguments. Just like when setting up a data frame, you can call the columns anything you like. Here, because we are drawing a ball from the bag on two occasions, the columns have been named Draw1 and Draw2.

samples_w <- expand.grid(Draw1 = 1:6, Draw2 = 1:6)
head(samples_w)
Draw1 Draw2
1 1
2 1
3 1
4 1
5 1
6 1

Note that the first row of samples_w shows that one of the outcomes of drawing two balls from the bag is that the first ball has the value 1 and the second ball also has the value 1. It is clear then that expand.grid() shows the possible outcomes of sampling with replacement.

Because we are interested in finding the sampling distributions of the mean, \(\bar X\), and of the variance, \(S^2\), it makes sense to calculate what these are for each possible sample from the bag.

Run the code below to calculate the mean and variance for each draw of two balls from the bag, then update the data frame samples_w to contain this information, as well as the sample these values came from.

sample_w_means <- apply(X = samples_w, MARGIN = 1, FUN = mean)

sample_w_variances <- apply(X = samples_w, MARGIN = 1, FUN = var)

samples_w <- cbind(samples_w, Mean = sample_w_means, Variance = sample_w_variances)

The information in the data frame samples_w is summarised in Table 2.2.

Table 2.2: Mean and variance for every combination of \((x_1, x_2)\) when sampling with replacement, six numbered balls from a bag.
\((x_1,\,x_2)\) Mean Variance \((x_1,\,x_2)\) Mean Variance \((x_1,\,x_2)\) Mean Variance
(1, 1) 1.0 0.0 (1, 3) 2.0 2.0 (1, 5) 3.0 8.0
(2, 1) 1.5 0.5 (2, 3) 2.5 0.5 (2, 5) 3.5 4.5
(3, 1) 2.0 2.0 (3, 3) 3.0 0.0 (3, 5) 4.0 2.0
(4, 1) 2.5 4.5 (4, 3) 3.5 0.5 (4, 5) 4.5 0.5
(5, 1) 3.0 8.0 (5, 3) 4.0 2.0 (5, 5) 5.0 0.0
(6, 1) 3.5 12.5 (6, 3) 4.5 4.5 (6, 5) 5.5 0.5
(1, 2) 1.5 0.5 (1, 4) 2.5 4.5 (1, 6) 3.5 12.5
(2, 2) 2.0 0.0 (2, 4) 3.0 2.0 (2, 6) 4.0 8.0
(3, 2) 2.5 0.5 (3, 4) 3.5 0.5 (3, 6) 4.5 4.5
(4, 2) 3.0 2.0 (4, 4) 4.0 0.0 (4, 6) 5.0 2.0
(5, 2) 3.5 4.5 (5, 4) 4.5 0.5 (5, 6) 5.5 0.5
(6, 2) 4.0 8.0 (6, 4) 5.0 2.0 (6, 6) 6.0 0.0

We can then use the information contained in samples_w (or in Table 2.2) to construct the sampling distribution for the mean, \(\bar X\). The sampling distribution will tell us all the values that \(\bar X\) can take and the probability with which it takes each value.

Run the code below to create the sampling distribution of \(\bar X\). This is done by dividing the number of times \(\bar X\) takes each of its possible values by 36 (because we know that there are 36 possible combinations of two balls from the bag and that each combination is equally likely, so any combination has a probability of occurring of \(\frac{1}{36}\)).

The function fractions() is used to force R to show these calculated probabilities as a fraction, rather than as a number with many decimal places. The fractions() function is part of a package called MASS so it is necessary to load this into your R session using the code library(MASS).

library(MASS)

samplingdist_mean_w <- fractions(table(samples_w$Mean)/36)
samplingdist_mean_w

   1  1.5    2  2.5    3  3.5    4  4.5    5  5.5    6 
1/36 1/18 1/12  1/9 5/36  1/6 5/36  1/9 1/12 1/18 1/36 

For example, \(\bar X\) is equal to 1.5 on two occasions (if the balls drawn are (2, 1) or (1, 2)). We can then say that \(P\big(\bar X=1.5\big)=\frac{2}{36}=\frac{1}{18}\).

What is the mean of the sampling distribution for \(\bar X\)?

\(\mu_{\bar X}=\)

The mean can be found by finding the expected value of the sampling distribution for \(\bar X\). The expected value of a discrete random variable \(X\) with \(N\) discrete values can be found using,

\[ \mu_X=E[X]=\sum_{i=1}^N x_iP(X=x_i) \]

For the case of \(\bar X\), the mean can then be calculated as,

\[ \begin{aligned} \mu_{\bar X}&=1\times\frac{1}{36}+1.5\times\frac{1}{18}+2\times\frac{1}{12}+\dots+6\times\frac{1}{36}\\ &=\frac{7}{2}=3.5 \end{aligned} \]

We can do exactly the same thing using R code. First, we need to extract all the individual values that \(\bar X\) can take by using the function unique(). This will return a vector of all the unique values that appear in the vector given to it as the argument.

We then want to multiply each of these values by the corresponding probability that \(\bar X\) takes that value, as shown in the sampling distribution samplingdist_mean_w, and then sum together these products using the sum() function.

means_w <- unique(samples_w$Mean)
mean_Xbar_w <- sum(means_w*samplingdist_mean_w)
mean_Xbar_w
[1] 7/2

Again, we see that \(\mu_{\bar X}=\frac{7}{2}=3.5\).
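One caveat about the calculation above: unique() returns values in the order they first appear, whereas table() sorts them. The two orders happen to agree for these samples, so the products line up correctly, but that is not guaranteed in general. The sketch below is one possible alternative (not the method used in this section) that reads the values and the probabilities from the same table so that they cannot fall out of step. It uses only base R, and the names xbar, dist_xbar, xbar_values and xbar_probs are introduced here for illustration.

```r
# Rebuild the 36 equally likely ordered samples (with replacement)
samples_w <- expand.grid(Draw1 = 1:6, Draw2 = 1:6)
xbar <- rowMeans(samples_w)

# Tabulate the sample means and convert the counts to probabilities
dist_xbar <- table(xbar) / nrow(samples_w)

# Take the values and the probabilities from the same table,
# so each value is paired with its own probability
xbar_values <- as.numeric(names(dist_xbar))
xbar_probs <- as.numeric(dist_xbar)

mean_xbar <- sum(xbar_values * xbar_probs)
mean_xbar  # 3.5
```

The result is a plain decimal rather than a fraction because fractions() is not applied in this version.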

Note that the value of \(\mu_{\bar X}\) is the same as the mean of the random variable \(X\), \(\mu_X\). This makes sense because,

\[ \begin{aligned} \mu_{\bar X}=E[\bar X]&=E\bigg[\frac{\sum_{i=1}^n X_i}{n}\bigg]\\ &=\frac{1}{n}\sum_{i=1}^n E[X_i]\\ &=\frac{1}{n}\sum_{i=1}^n\mu_X\\ &=\frac{1}{n}\cdot n\mu_X=\mu_X \end{aligned} \]

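This result follows from the linearity of expectation, so it holds whether or not the \(X_i\) are independent; it only requires that each \(X_i\) has mean \(\mu_X\).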

What is the variance of the sampling distribution for \(\bar X\)?

\(\sigma^2_{\bar X}=\)

In order to find the variance of a random variable \(X\) with a discrete distribution with \(N\) discrete values, we can use the formula,

\[ \sigma^2_X=Var(X)=\sum_{i=1}^N x_i^2P(X=x_i)-(\mu_X)^2 \]

So for the variance of the mean, \(\bar X\), we can calculate this by,

\[ \begin{aligned} \sigma^2_{\bar X}&=\bigg(1^2\times\frac{1}{36}+1.5^2\times\frac{1}{18}+\dots +6^2\times\frac{1}{36}\bigg)-(3.5)^2\\ &=\frac{35}{24}\approx1.4583 \end{aligned} \]

We can also calculate the same thing in R. To do this, the unique values that \(\bar X\) can take are extracted from samples_w using the function unique(). These are then squared and multiplied by the corresponding probability that \(\bar X\) takes this value. The sum() function is used to sum all of these products, and then \((\mu_{\bar X})^2\), stored in mean_Xbar_w, is subtracted.

means_w <- unique(samples_w$Mean)
var_Xbar_w <- sum(means_w^2*samplingdist_mean_w)-(mean_Xbar_w)^2
var_Xbar_w
[1] 35/24

This confirms that \(\sigma^2_{\bar X}=\frac{35}{24}\approx 1.4583\).

Note that the variance of the sampling distribution for the mean is the variance of the random variable \(X\) divided by the size of our sample (\(n=2\)). That is, \(\sigma^2_{\bar X}=\frac{\sigma_X^2}{n}\). We can prove this is true since,

\[ \begin{aligned} \sigma_{\bar X}^2=Var(\bar X)&=Var\bigg(\frac{\sum_{i=1}^n X_i}{n}\bigg)\\ &=\frac{1}{n^2}\sum_{i=1}^nVar(X_i)\\ &=\frac{1}{n^2}\sum_{i=1}^n\sigma_X^2\\ &=\frac{1}{n^2}\cdot n\sigma_X^2=\frac{\sigma_X^2}{n} \end{aligned} \]

provided each \(X_i\) is independent.
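The relationship \(\sigma^2_{\bar X}=\frac{\sigma^2_X}{n}\) can be checked for other sample sizes by listing every possible sample, just as was done for \(n=2\). The following is a rough sketch in base R under the same setup (six balls, sampling with replacement); the sample sizes 2 to 4 and all object names are chosen here purely for illustration.

```r
x <- 1:6
sigma2_X <- mean(x^2) - mean(x)^2  # population variance, 35/12

for (n in 2:4) {
  # All 6^n equally likely ordered samples of size n, with replacement
  samples <- expand.grid(rep(list(x), n))
  xbar <- rowMeans(samples)

  # Every sample is equally likely, so averaging over the rows gives
  # the exact variance of the sampling distribution of the mean
  var_xbar <- mean(xbar^2) - mean(xbar)^2

  cat("n =", n, " Var(Xbar) =", var_xbar,
      " sigma^2 / n =", sigma2_X / n, "\n")
}
```

For each \(n\), the two printed values should agree, which is what the derivation above predicts when the draws are independent.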

We are also able to find the sampling distribution for the variance, \(S^2\). We again find the number of times the sample variance is equal to each of the distinct values it can take, and divide this by 36. The function fractions() is again used to show these probabilities as fractions. Note that we don't need to run library(MASS) this time, because the MASS package is already loaded into our R session.

samplingdist_var_w <- fractions(table(samples_w$Variance)/36)
samplingdist_var_w

   0  0.5    2  4.5    8 12.5 
 1/6 5/18  2/9  1/6  1/9 1/18 

Here we can see that \(P(S^2=12.5)=\frac{1}{18}\) which we could have calculated from Table 2.2 since the sample variance is equal to 12.5 for only two outcomes (when the balls drawn are (6, 1) or (1, 6)), so the probability \(S^2\) takes the value 12.5 is equal to \(\frac{2}{36}=\frac{1}{18}\).

What is the mean of the sampling distribution for \(S^2\)?

\(\mu_{S^2}=\)

In order to find the mean of a variable \(X\) with a discrete distribution with \(N\) discrete values, we can use,

\[ \mu_X=E[X]=\sum_{i=1}^N x_iP(X=x_i) \]

For the sample variance, we can then calculate the expected value as,

\[ \begin{aligned} \mu_{S^2}&=0\times\frac{1}{6}+0.5\times\frac{5}{18}+2\times\frac{2}{9}+\dots+12.5\times\frac{1}{18}\\ &=\frac{35}{12}\approx 2.9167 \end{aligned} \]

This can be found in R using the following code.

vars_w <- unique(samples_w$Variance)
mean_S2_w <- sum(vars_w*samplingdist_var_w)
mean_S2_w
[1] 35/12

This again shows that \(\mu_{S^2}=\frac{35}{12}\approx 2.9167\). Note that this is equal to the variance of the random variable \(X\), \(\sigma^2_X\).

2.2 Sampling without replacement

If we instead sampled two balls from the bag without replacement, the sampling distributions for the sample mean and for the sample variance would be different, because the possible samples of two balls we could end up with are different.

When sampling without replacement, there are fewer outcomes we can end up with. Now there are only \({6\choose 2}=15\) possible outcomes of two balls from the bag of six. We can list all of these possible outcomes with the code below.

Here, we use the function srs() (which stands for simple random sample) to list these outcomes. This function takes the arguments,

  • popvalues =: this is a vector of all values seen in the population.

  • n =: this is the size of the sample to be taken without replacement from the population.

The srs() function is part of the PASWR2 package, so it is necessary to run the code library(PASWR2). The function dimnames() is used to set the names of columns in samples_wo to "Draw1" and "Draw2".

library(PASWR2)

samples_wo <- srs(popvalues = 1:6, n = 2)
dimnames(samples_wo) <- list(NULL, c("Draw1", "Draw2"))
head(samples_wo)
     Draw1 Draw2
[1,]     1     2
[2,]     1     3
[3,]     1     4
[4,]     1     5
[5,]     1     6
[6,]     2     3
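If the PASWR2 package is not available, the same list of unordered pairs can be produced with the base R function combn(). This is offered as a possible alternative rather than the approach used in this section, and it assumes that srs() lists each unordered pair exactly once, as the output above suggests. The name samples_wo_alt is introduced here for illustration.

```r
# combn() returns one combination per column, so transpose the result
# to get one sample per row
samples_wo_alt <- t(combn(1:6, 2))
colnames(samples_wo_alt) <- c("Draw1", "Draw2")

nrow(samples_wo_alt)  # 15
head(samples_wo_alt)
```

The rows come out in the order (1, 2), (1, 3), ..., (5, 6), and a ball is never paired with itself, which is what distinguishes this list from the one produced by expand.grid().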

Then, we can calculate the sample mean and sample variance for each possible sample of two balls from the bag. This is done using the apply() function in the code below. Then, samples_wo is converted into a data frame that includes these values alongside the samples that they came from.

sample_wo_means <- apply(X = samples_wo, MARGIN = 1, FUN = mean)

sample_wo_variances <- apply(X = samples_wo, MARGIN = 1, FUN = var)

samples_wo <- data.frame(samples_wo, Mean = sample_wo_means,
                         Variance = sample_wo_variances)

The information in samples_wo is summarised in Table 2.3.

Table 2.3: Mean and variance for every combination of \((x_1, x_2)\) when sampling without replacement, six numbered balls from a bag.
\((x_1,\,x_2)\) Mean Variance
(1, 2) 1.5 0.5
(1, 3) 2.0 2.0
(1, 4) 2.5 4.5
(1, 5) 3.0 8.0
(1, 6) 3.5 12.5
(2, 3) 2.5 0.5
(2, 4) 3.0 2.0
(2, 5) 3.5 4.5
(2, 6) 4.0 8.0
(3, 4) 3.5 0.5
(3, 5) 4.0 2.0
(3, 6) 4.5 4.5
(4, 5) 4.5 0.5
(4, 6) 5.0 2.0
(5, 6) 5.5 0.5

Then we can use the sample means contained in samples_wo to find the sampling distribution of \(\bar X\). The code below creates this distribution by dividing the number of times \(\bar X\) takes any of its unique values by 15 (because there are 15 possible combinations of two draws from the bag, each with equal probability of occurring). The fractions() function from the MASS package has again been used to show these probabilities as rational numbers.

samplingdist_mean_wo <- fractions(table(samples_wo$Mean)/15)
samplingdist_mean_wo

 1.5    2  2.5    3  3.5    4  4.5    5  5.5 
1/15 1/15 2/15 2/15  1/5 2/15 2/15 1/15 1/15 

This tells us that, when sampling two balls from the bag of six without replacement, the probability that the mean of that sample is equal to 3 for example, is \(\frac{2}{15}\) i.e. \(P(\bar X=3)=\frac{2}{15}\).

What is the mean of the sampling distribution for \(\bar X\), when sampling without replacement?

\(\mu_{\bar X}=\)

In order to find the mean of a discrete random variable \(X\) with \(N\) discrete values, we can use the formula,

\[ \mu_X=E[X]=\sum_{i=1}^N x_iP(X=x_i) \]

For the sample mean \(\bar X\), the mean can then be found as,

\[ \begin{aligned} \mu_{\bar X}&=1.5\times\frac{1}{15}+2\times\frac{1}{15}+2.5\times\frac{2}{15}+\dots+5.5\times\frac{1}{15}\\ &=\frac{7}{2}=3.5 \end{aligned} \]

This can also be found in R using the following code.

means_wo <- unique(samples_wo$Mean)
mean_Xbar_wo <- sum(means_wo*samplingdist_mean_wo)
mean_Xbar_wo
[1] 7/2

This shows that \(\mu_{\bar X}=\frac{7}{2}=3.5\). Note that the expected value of the sampling distribution for \(\bar X\) is equal to the mean of the distribution for random variable \(X\) i.e. \(\mu_{\bar X}=E[\bar X]=\mu_X\).

What is the variance of the sampling distribution for \(\bar X\), when sampling without replacement?

\(\sigma^2_{\bar X}=\)

The formula used to find the variance of a discrete random variable \(X\) with \(N\) discrete values is,

\[ \sigma^2_X=Var(X)=\sum_{i=1}^N x_i^2P(X=x_i)-(\mu_X)^2 \]

Therefore, the variance of the sampling distribution for \(\bar X\) can be calculated as,

\[ \begin{aligned} \sigma^2_{\bar X}&=\bigg(1.5^2\times\frac{1}{15}+2^2\times\frac{1}{15}+\dots +5.5^2\times\frac{1}{15}\bigg)-(3.5)^2\\ &=\frac{7}{6}\approx1.1667 \end{aligned} \]

The same can be found in R using the code below.

means_wo <- unique(samples_wo$Mean)
var_Xbar_wo <- sum(means_wo^2*samplingdist_mean_wo)-(mean_Xbar_wo)^2
var_Xbar_wo
[1] 7/6

This shows us again that \(\sigma^2_{\bar X}=\frac{7}{6}\approx1.1667\).

Another way we can find the variance of the sampling distribution for \(\bar X\) when sampling without replacement, is to use the formula,

\[ \sigma^2_{\bar X}=\frac{\sigma_X^2}{n}\cdot\frac{N-n}{N-1} \]
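As a quick check of this formula, take \(N=6\) as the population size and \(n=2\) as the sample size, with \(\sigma^2_X=\frac{35}{12}\) from earlier. Substituting these values gives,

\[ \begin{aligned} \sigma^2_{\bar X}&=\frac{35/12}{2}\cdot\frac{6-2}{6-1}\\ &=\frac{35}{24}\cdot\frac{4}{5}\\ &=\frac{7}{6}\approx1.1667 \end{aligned} \]

which agrees with the value found directly from the sampling distribution.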

We can also find the sampling distribution for the sample variance, \(S^2\). This is done in the following code by counting the number of times \(S^2\) takes each unique value and dividing that total by 15. Once again, fractions() shows these values as rational numbers.

samplingdist_var_wo <- fractions(table(samples_wo$Variance)/15)
samplingdist_var_wo

 0.5    2  4.5    8 12.5 
 1/3 4/15  1/5 2/15 1/15 

This distribution states that when sampling without replacement, the probability that the variance of a sample of two balls out of six is equal to 8, for example, is \(\frac{2}{15}\) i.e. \(P(S^2=8)=\frac{2}{15}\).

What is the expected value of the sampling distribution for \(S^2\), when sampling without replacement?

\(E[S^2]=\)

The formula to use here is,

\[ \mu_X=E[X]=\sum_{i=1}^N x_iP(X=x_i) \]

The expected value of the sample variance is then,

\[ \begin{aligned} E[S^2]&=0.5\times\frac{1}{3}+2\times\frac{4}{15}+4.5\times\frac{1}{5}+\dots+12.5\times\frac{1}{15}\\ &=\frac{7}{2}= 3.5 \end{aligned} \]

We can find the same result in R using the following code.

vars_wo <- unique(samples_wo$Variance)
mean_S2_wo <- sum(vars_wo*samplingdist_var_wo)
mean_S2_wo
[1] 7/2

This shows us once again that \(E[S^2]=\frac{7}{2}=3.5\).

The relationship between the expected value of the sample variance, when sampling without replacement, and the variance of the random variable \(X\) can be summarised with the formula,

\[ E[S^2]=\frac{N}{N-1}\cdot\sigma_X^2 \]
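Checking this with the values from the example, where \(N=6\) and \(\sigma^2_X=\frac{35}{12}\),

\[ \begin{aligned} E[S^2]&=\frac{6}{6-1}\cdot\frac{35}{12}\\ &=\frac{210}{60}\\ &=\frac{7}{2}=3.5 \end{aligned} \]

which matches the expected value calculated from the sampling distribution of \(S^2\).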

We can summarise all of the results we've seen so far about the sampling distributions of \(\bar X\) and \(S^2\), when sampling with or without replacement, in Table 2.4.

Table 2.4: Summary of properties of the distribution for \(X\) and sampling distributions for \(\bar X\) and \(S^2\).
Sampling method \(\mu_X\) \(\sigma^2_X\) \(E[\bar X]\) \(\sigma^2_{\bar X}\) \(E[S^2]\)
With replacement 3.5 2.9167 3.5 \(=\mu_X\) 1.4583 \(=\frac{\sigma_X^2}{n}\) 2.9167 \(=\sigma_X^2\)
Without replacement 3.5 2.9167 3.5 \(=\mu_X\) 1.1667 \(=\frac{\sigma_X^2}{n}\cdot\frac{N-n}{N-1}\) 3.5 \(=\frac{N}{N-1}\cdot\sigma_X^2\)

The expected value of the sample mean is \(\mu_X\), regardless of whether we are sampling with or without replacement. Both the variance of the sample mean and the expected value of the sample variance change though, depending on whether sampling with or without replacement has been used.
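The sampling-distribution columns of Table 2.4 can be reproduced in a few lines of base R. This is only a sketch: the helper function summarise_scheme() and the other object names are hypothetical and introduced here for illustration, and combn() is used in place of srs() to list the samples without replacement. Because every sample in each list is equally likely, averaging over the rows gives the exact expected values.

```r
x <- 1:6

# Hypothetical helper: summarise the sampling distributions of the
# sample mean and sample variance over a matrix of equally likely samples
summarise_scheme <- function(samples) {
  xbar <- apply(samples, 1, mean)
  s2 <- apply(samples, 1, var)
  c(E_Xbar = mean(xbar),
    Var_Xbar = mean(xbar^2) - mean(xbar)^2,
    E_S2 = mean(s2))
}

with_repl <- as.matrix(expand.grid(x, x))  # 36 ordered pairs
without_repl <- t(combn(x, 2))             # 15 unordered pairs

results <- rbind(With = summarise_scheme(with_repl),
                 Without = summarise_scheme(without_repl))
round(results, 4)
```

The first row should show 3.5, 1.4583 and 2.9167, and the second row 3.5, 1.1667 and 3.5, in line with the table.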


To follow an example of creating the sampling distribution for the mean and for the variance from a finite population, see Section 6.4 Sampling Distribution of \(\bar X\) in Probability and Statistics with R.