2 Sampling Distribution of $\bar{X}$

Suppose we were to take multiple random samples, each of the same size, from a distribution. We could calculate the mean of each sample, i.e. each sample mean, and be left with many different values of $\bar{X}$. The distribution of all of these sample means is called the sampling distribution of $\bar{X}$.

2.1 Central Limit Theorem

An important result regarding the sampling distribution of $\bar{X}$ is that, provided the sample size is sufficiently large, the sampling distribution of $\bar{X}$ will approximately follow a normal distribution. For this to be true, the random variable $X$ does not need to follow a normal distribution itself (as it does in Section 6.5.1.1 of Probability and Statistics with R). This result is known as the Central Limit Theorem.

Formally, if a random variable $X$ follows any distribution with known mean $\mu$ and finite standard deviation $\sigma$, then

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \to N(0, 1) \quad \text{as } n \to \infty.$$

This can be rephrased to say that the sampling distribution of $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$ (where the second parameter is the standard deviation), provided the size of the sample taken, $n$, is sufficiently large.
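The $\sigma/\sqrt{n}$ term comes from the variance of a mean of independent observations. A short derivation, assuming the sample consists of $n$ independent draws $X_1, \dots, X_n$ from the distribution of $X$ (the $X_i$ notation is introduced here for illustration):

```latex
\begin{align*}
E[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i)
  = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \\
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

These two facts hold for any sample size; what the Central Limit Theorem adds is that the shape of the distribution becomes approximately normal as $n$ grows.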

The term "sufficiently large" should make you wonder: how big does a sample need to be to be considered sufficiently large? This actually depends on the distribution that the random variable $X$ follows.

Let's look at an example of sampling from different distributions to determine how big a sample we need for the sampling distribution of $\bar{X}$ to be considered approximately normal. Specifically, we will sample from the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions. Note that both of these distributions have mean and standard deviation approximately equal to 15 (you should try to show this yourself using results from Stats 2R Probability).

If we were to take samples of size $n = 2$ from the $\text{Unif}(-11, 41)$ distribution, what distribution would $\bar{X}$ follow, according to the Central Limit Theorem?

If $X \sim \text{Unif}(-11, 41)$, then we can show that

  • $E[X] = 15 = \mu$
  • $\text{sd}(X) \approx 15 = \sigma$

We can then use these values to show that, according to the Central Limit Theorem,

$\bar{X} \sim N(15, 15/\sqrt{2})$
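As a check on your own working, the values for the Uniform case can be obtained from the standard results for a $\text{Unif}(a, b)$ distribution, here with $a = -11$ and $b = 41$ (the symbols $a$ and $b$ are introduced for illustration):

```latex
\begin{align*}
E[X] &= \frac{a + b}{2} = \frac{-11 + 41}{2} = 15, \\
\operatorname{sd}(X) &= \frac{b - a}{\sqrt{12}} = \frac{52}{\sqrt{12}} \approx 15.01, \\
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}} \approx \frac{15}{\sqrt{2}} \approx 10.61.
\end{align*}
```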


If we were to take samples of size $n = 2$ from the $\text{Expo}(1/15)$ distribution, what distribution would $\bar{X}$ follow, according to the Central Limit Theorem?

If $X \sim \text{Expo}(1/15)$, then we can show that

  • $E[X] = 15 = \mu$
  • $\text{sd}(X) = 15 = \sigma$

We can then use these values to show that, according to the Central Limit Theorem,

$\bar{X} \sim N(15, 15/\sqrt{2})$
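As a check on your own working, the Exponential case follows from the standard results for an $\text{Expo}(\lambda)$ distribution, here with rate $\lambda = 1/15$ (the symbol $\lambda$ is introduced for illustration):

```latex
\begin{align*}
E[X] &= \frac{1}{\lambda} = 15, \\
\operatorname{sd}(X) &= \frac{1}{\lambda} = 15, \\
\operatorname{sd}(\bar{X}) &= \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{2}} \approx 10.61.
\end{align*}
```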


To see if the sampling distribution of $\bar{X}$ is indeed approximately normal, with mean $\mu = 15$ and standard deviation $\sigma/\sqrt{n} = 15/\sqrt{2}$, we want to draw $m = 10{,}000$ random samples of size $n = 2$ from both the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions and use these samples to calculate 10,000 values of the sample mean, $\bar{x}$. We can then plot the distribution of these 10,000 sample means and compare it to the $N(15, 15/\sqrt{2})$ distribution for both cases.

To start with, let's set up two placeholder vectors, means_unif_2 and means_expo_2, which will be used to store each of the 10,000 means calculated from samples taken from the Uniform and Exponential distributions respectively. This is done using the function numeric(), which creates a vector of zeros with length equal to the value you provide as the argument, in this case m = 10,000. The zeros are overwritten as the means are calculated.

m <- 10000

means_unif_2 <- numeric(m)
means_expo_2 <- numeric(m)
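To see what numeric() returns, here is a minimal sketch using a short vector. The name placeholder and the value 7.5 are arbitrary choices for illustration and are not part of the lab code.

```r
# numeric(k) returns a vector of k zeros, ready to be overwritten
placeholder <- numeric(3)
print(placeholder)

# Each position can then be filled in by index, as a for loop would do
placeholder[2] <- 7.5
print(placeholder)
```

Creating the full-length vector first means R does not have to grow the vector on every pass through a loop.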

To draw the samples from each distribution and calculate the mean, a for loop is used. For the Uniform distribution, runif() is used to draw a random sample of size $n = 2$ from the $\text{Unif}(-11, 41)$ distribution, and then the function mean() is used to calculate the mean of these two values. The for loop stores this mean as one of the values in the vector means_unif_2, and repeats until 10,000 random samples have been drawn and used to calculate a mean.

The same process is used to draw the samples from the Exponential distribution, except the function rexp() is used to draw the random samples (for more details on R's probability functions and flow control, see Lab 3: Section 4 and Lab 3: Section A.3 respectively).

for(i in 1:m){
  means_unif_2[i] <- mean(runif(n = 2, min = -11, max = 41))
}

for(i in 1:m){
  means_expo_2[i] <- mean(rexp(n = 2, rate = 1/15))
}
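The same simulation can also be written without an explicit loop. The sketch below uses replicate() as one possible alternative; it is not the approach used in the rest of this lab, and the names means_unif_2_alt and means_expo_2_alt and the seed value are assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed, only so this sketch is reproducible

m <- 10000

# replicate() evaluates the expression m times and collects the results
means_unif_2_alt <- replicate(m, mean(runif(n = 2, min = -11, max = 41)))
means_expo_2_alt <- replicate(m, mean(rexp(n = 2, rate = 1/15)))

length(means_unif_2_alt)  # one mean per simulated sample
mean(means_unif_2_alt)    # should be close to 15
mean(means_expo_2_alt)    # should be close to 15
```

Both versions draw a fresh sample of size 2 on every repetition, so they give sampling distributions of the same form.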

Before we plot either of these sampling distributions, the sample means can be stored within a data frame. The following code creates the data frame, means_2, which has a column storing the value of the mean calculated for each sample, and a second column stating which distribution this sample mean was from.

means_2 <- data.frame(mean = c(means_unif_2, means_expo_2),
                      distribution = rep(c("Uniform", "Exponential"), each = m))

means_2$distribution <- factor(x = means_2$distribution,
                               levels = c("Uniform", "Exponential"))

head(means_2)
      mean  distribution
 6.3830075  Uniform
19.7496570  Uniform
26.5513493  Uniform
-0.1553435  Uniform
28.2117881  Uniform
-6.9693977  Uniform
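Because the samples are random, the values printed by head() will differ each time the code is run. If you want repeatable output, one option is to call set.seed() before sampling. A small sketch of the idea follows; the seed value 123 and the names first_run and second_run are arbitrary choices for illustration.

```r
# Setting the same seed before sampling gives the same random draws
set.seed(123)
first_run <- mean(runif(n = 2, min = -11, max = 41))

set.seed(123)
second_run <- mean(runif(n = 2, min = -11, max = 41))

# TRUE: both runs produced exactly the same sample mean
identical(first_run, second_run)
```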

Now, we are ready to plot the two sampling distributions of $\bar{X}$ from the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions. This can be done using the ggplot2 package (see Lab 4: Section 5) using the following code.

library(ggplot2)  # load ggplot2 if it is not already attached

ggplot(data = means_2) +
  geom_density(aes(x = mean), fill = "skyblue") +
  facet_grid(distribution ~ .) +
  coord_cartesian(xlim = c(-11, 41)) +
  labs(title = "Simulated sampling distributions of the sample mean",
       subtitle = "Sample size n = 2",
       x = expression(bar(x)), y = "Density") +
  stat_function(fun = dnorm, args = list(mean = 15, sd = 15/sqrt(2)), col = "black",
                linetype = 2, fill = "black", alpha = 0.2, geom = "area")

Let's take a look at each layer in turn.

  • geom_density(): this tells ggplot2 that we want to create a kernel density estimate of the means stored within the data frame means_2, along the x-axis.
  • facet_grid(): because means_2 stores the means for samples taken from both the Uniform and Exponential distributions, we want two different plots of the sampling distribution of $\bar{X}$, one for each. Here we specify that we want each row of plots to be for a different distribution.
  • coord_cartesian(): we can specify the range of values we want to show along the x-axis and limit this range to a sensible one to show the full scope of each distribution.
  • labs(): this allows us to give the plots some axis labels and a title.
  • stat_function(): this superimposes a defined function on top of the density plots already created. According to the Central Limit Theorem, the sampling distribution of $\bar{X}$ from both the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions should be approximately $N(15, 15/\sqrt{2})$ when the sample size is $n = 2$, so this is the distribution we want to show.

Using stat_function() as a layer in a ggplot2 plot allows you to draw any defined function above the rest of the figure. This is useful when you want to include plots of data which are not from the data frame provided in the data = argument of the ggplot() function. The arguments that stat_function() can be supplied with are:

  • fun =: this is the name of the function in R that will calculate the values to be plotted against different values along the x-axis. If you wish to plot a probability distribution, then use the name of the function that calculates the corresponding values of the probability density function (PDF) e.g. for the normal distribution, use dnorm, or for the exponential distribution, use dexp.
  • args =: this is a list of all the argument values to be provided to the R function named in the fun = argument. Different functions will require different arguments, so check that the ones you provide make sense with the function you've stated in the fun = argument.
  • geom =: this is the name of the geometric object you want to use to show the data. Examples include "area" to plot the area under the curve, "point" to plot points representing the value at different points along the x-axis, or "polygon" to plot the area between the minimum value along the x-axis, the maximum value and the area under the curve.
  • col =: this is the colour to use for the line on the plot representing the function.
  • linetype =: this is an integer value indicating which style of line you would like to plot.
  • fill =: this is the colour to be used to fill in the area under the curve. In order for this to be used, the argument geom = must be set to a geometric object that has an area to fill, such as "area".
  • alpha =: this sets the opacity of the fill colour. You can set this argument to be any value between 0 and 1 - a value of 1 means the colour will be fully opaque and a value of 0 means the colour will be transparent.
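To see how fun = and args = work together, the following base R sketch mimics the idea by hand: the function is evaluated over a grid of x values, with the entries of args passed as extra arguments. This is only a conceptual illustration and not ggplot2's actual implementation; the names x_grid, curve_fun, curve_args and y_values, and the grid of 101 points, are assumptions made for the example.

```r
# Grid of x values spanning the plotted range (101 points, an assumed size)
x_grid <- seq(from = -11, to = 41, length.out = 101)

# The same function and argument list as in the stat_function() call above
curve_fun  <- dnorm
curve_args <- list(mean = 15, sd = 15/sqrt(2))

# Equivalent to dnorm(x_grid, mean = 15, sd = 15/sqrt(2))
y_values <- do.call(curve_fun, c(list(x_grid), curve_args))

# The normal curve is highest at its mean
x_grid[which.max(y_values)]
```

Swapping curve_fun for dexp and curve_args for list(rate = 1/15) would evaluate the Exponential density in the same way.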

The code above produces the following two plots.


Figure 2.1: Simulated sampling distributions of the sample mean for samples of size n = 2 from a Unif(−11,41) and an Expo(1/15) distributions with superimposed normal distributions.

Do you think the sampling distribution of $\bar{X}$, when samples of size $n = 2$ are taken from the $\text{Unif}(-11, 41)$ distribution, is approximately normal?


What about the sampling distribution of $\bar{X}$ when samples of size $n = 2$ are taken from the $\text{Expo}(1/15)$ distribution? Is it approximately normal?


It seems that a sample size of $n = 2$ is not sufficiently large to give a sampling distribution of $\bar{X}$ which is approximately normal when we are sampling from the Exponential distribution. For the normal approximation given by the Central Limit Theorem to be reasonable, then, we need to take larger samples from the $\text{Expo}(1/15)$ distribution.
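The visual impression can be backed up with a number. One rough way, sketched below, is to compute the sample skewness of the simulated means: a normal distribution is symmetric, so its skewness is 0. The helper skewness(), the vector names and the seed are hypothetical and introduced only for this illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility
m <- 10000

# replicate() repeats the expression m times and collects the sample means
means_unif <- replicate(m, mean(runif(n = 2, min = -11, max = 41)))
means_expo <- replicate(m, mean(rexp(n = 2, rate = 1/15)))

# Sample skewness: average cubed deviation divided by the cubed sd
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

skewness(means_unif)  # close to 0: the distribution is symmetric
skewness(means_expo)  # clearly positive: a long right tail remains
```

Skewness only measures asymmetry, so a value near 0 does not by itself show that a distribution is normal.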

Let's take samples of size $n = 10$ and $n = 40$ from both the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions, to investigate how the sample size affects the sampling distribution of $\bar{X}$.

In the code below, we take 10,000 random samples of size n=10 from the Uniform and Exponential distributions and then save the mean of each sample in either the vector means_unif_10, if the sample was taken from the Uniform distribution, or means_expo_10, if the sample was from the Exponential distribution.

means_unif_10 <- numeric(m)
means_expo_10 <- numeric(m)

for(i in 1:m){
  means_unif_10[i] <- mean(runif(n = 10, min = -11, max = 41))
}

for(i in 1:m){
  means_expo_10[i] <- mean(rexp(n = 10, rate = 1/15))
}

Similarly, we can take samples of size n=40 from each distribution and save the mean of each sample in a vector using the following code.

means_unif_40 <- numeric(m)
means_expo_40 <- numeric(m)

for(i in 1:m){
  means_unif_40[i] <- mean(runif(n = 40, min = -11, max = 41))
}

for(i in 1:m){
  means_expo_40[i] <- mean(rexp(n = 40, rate = 1/15))
}
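The loops for each sample size differ only in the value of n and the sampling function, so the repeated code could be wrapped in a function. The sketch below is one possible way to do this; simulate_means() is a hypothetical helper, not part of the lab, and the result names and seed are assumptions made for illustration.

```r
# Hypothetical helper: draw m samples of size n using the random number
# generator 'rdist' and return the m sample means
simulate_means <- function(m, n, rdist, ...) {
  means <- numeric(m)
  for (i in seq_len(m)) {
    means[i] <- mean(rdist(n, ...))  # extra arguments are passed to rdist
  }
  means
}

set.seed(42)  # arbitrary seed for reproducibility
m <- 10000

means_unif_40_alt <- simulate_means(m, n = 40, rdist = runif, min = -11, max = 41)
means_expo_40_alt <- simulate_means(m, n = 40, rdist = rexp, rate = 1/15)

sd(means_unif_40_alt)  # should be near 15/sqrt(40)
sd(means_expo_40_alt)  # should be near 15/sqrt(40)
```

The explicit loops used in this lab do exactly the same work; the function only removes the repetition.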

Complete the code below to draw 10,000 samples of size $n = 100$ from the $\text{Unif}(-11, 41)$ distribution, and another 10,000 samples of size $n = 100$ from the $\text{Expo}(1/15)$ distribution.

means_unif_100 <- numeric(m)
means_expo_100 <- numeric(m)

for(i in ____){
  means_unif_100[____] <- mean(____(n = ____, min = -11, max = 41))
}

for(i in ____){
  means_expo_100[____] <- mean(____(n = ____, rate = ____))
}

One completed version of this code is:

means_unif_100 <- numeric(m)
means_expo_100 <- numeric(m)

for(i in 1:m){
  means_unif_100[i] <- mean(runif(n = 100, min = -11, max = 41))
}

for(i in 1:m){
  means_expo_100[i] <- mean(rexp(n = 100, rate = 1/15))
}

Now that we have all of the sample means saved in various vectors, we will need to compile them all into a data frame in order for us to be able to plot their respective distributions using ggplot2. The code below creates the data frame means, which stores each sample mean, along with the distribution the sample was taken from and the size of the sample.

Note that we are including the sample means from the samples of size n=2 in this data frame for completeness.

means <- data.frame(mean = c(means_unif_2, means_expo_2,
                             means_unif_10, means_expo_10,
                             means_unif_40, means_expo_40,
                             means_unif_100, means_expo_100),
                    distribution = rep(rep(c("Uniform", "Exponential"), each = m),
                                       times = 4),
                    size = rep(c("n = 2", "n = 10", "n = 40", "n = 100"), each = 2*m))

means$distribution <- factor(means$distribution, levels = c("Uniform", "Exponential"))

means$size <- factor(means$size, levels = c("n = 2", "n = 10", "n = 40", "n = 100"))
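The nested rep() calls can be hard to read, so it is worth checking that the labels line up as intended. The sketch below rebuilds only the two label columns with a tiny number of samples and counts each combination; m_small, label_check and counts are names introduced for this illustration.

```r
m_small <- 3  # tiny value so the pattern is easy to see (illustration only)

# Same rep() pattern as above, with m replaced by m_small
label_check <- data.frame(
  distribution = rep(rep(c("Uniform", "Exponential"), each = m_small), times = 4),
  size = rep(c("n = 2", "n = 10", "n = 40", "n = 100"), each = 2 * m_small)
)

# Every combination of sample size and distribution should appear m_small times
counts <- table(label_check$size, label_check$distribution)
counts
```

Running table(means$size, means$distribution) on the full data frame should likewise show m in every cell.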

We can then plot the two sampling distributions of $\bar{X}$ from the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions when the sample size is $n = 10$ using the code below. We have to subset the data from means to show only the means which are from a sample size of $n = 10$. Also, because the sample size has changed, the distribution that the Central Limit Theorem states $\bar{X}$ will follow in both cases is now $N(15, 15/\sqrt{10})$, so this is the distribution we want to superimpose using stat_function().

ggplot(data = subset(means, subset = (size == "n = 10"))) +
  geom_density(aes(x = mean), fill = "skyblue") +
  facet_grid(distribution ~ .) +
  coord_cartesian(xlim = c(-1, 31)) +
  labs(title = "Simulated sampling distributions of the sample mean",
       subtitle = "Sample size n = 10",
       x = expression(bar(x)), y = "Density") +
  stat_function(fun = dnorm, args = list(mean = 15, sd = 15/sqrt(10)), geom = "area",
                col = "black", linetype = 2, fill = "black", alpha = 0.2)

Figure 2.2: Simulated sampling distributions of the sample mean for samples of size n = 10 from a Unif(−11,41) and an Expo(1/15) distributions with superimposed normal distributions.

We can also show the sampling distributions of $\bar{X}$ from the Uniform and Exponential distributions when the sample size is $n = 40$. This is done using the code below. Again, note that the distribution the Central Limit Theorem states the sampling distribution of $\bar{X}$ will follow is now $N(15, 15/\sqrt{40})$ because the sample size has changed.

ggplot(data = subset(means, subset = (size == "n = 40"))) +
  geom_density(aes(x = mean), fill = "skyblue") +
  facet_grid(distribution ~ .) +
  coord_cartesian(xlim = c(6, 24)) +
  labs(title = "Simulated sampling distributions of the sample mean",
       subtitle = "Sample size n = 40",
       x = expression(bar(x)), y = "Density") +
  stat_function(fun = dnorm, args = list(mean = 15, sd = 15/sqrt(40)), geom = "area",
                col = "black", linetype = 2, fill = "black", alpha = 0.2)

Figure 2.3: Simulated sampling distributions of the sample mean for samples of size n = 40 from a Unif(−11,41) and an Expo(1/15) distributions with superimposed normal distributions.
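The narrowing of the plots as the sample size grows can also be checked numerically. The sketch below compares the standard deviation predicted by the Central Limit Theorem, 15/sqrt(n), with the standard deviation of simulated Exponential sample means for each sample size used in this lab. The names sizes, sd_theory and sd_simulated and the seed are assumptions made for illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility
m <- 10000
sizes <- c(2, 10, 40, 100)

# Standard deviation of the sample mean predicted by the CLT: sigma / sqrt(n)
sd_theory <- 15 / sqrt(sizes)

# Standard deviation of m simulated Exponential sample means for each n;
# replicate() repeats the sampling expression m times
sd_simulated <- sapply(sizes, function(n) {
  sd(replicate(m, mean(rexp(n = n, rate = 1/15))))
})

data.frame(n = sizes,
           sd_theory = round(sd_theory, 3),
           sd_simulated = round(sd_simulated, 3))
```

The two columns should agree closely at every sample size, even at n = 2 where the shape is clearly not normal: the sample size changes how normal the shape looks, not whether the standard deviation equals 15/sqrt(n).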

Complete the code below to plot the sampling distribution of $\bar{X}$ for the samples of size $n = 100$ taken from the $\text{Unif}(-11, 41)$ and $\text{Expo}(1/15)$ distributions. Superimpose the normal distribution that each sampling distribution follows according to the Central Limit Theorem.

ggplot(data = subset(means, subset = (size == "n = 100"))) +
  ____(aes(x = mean), fill = "skyblue") +
  facet_grid(____) +
  coord_cartesian(xlim = c(9, 21)) +
  labs(title = "Simulated sampling distributions of the sample mean",
       subtitle = "Sample size n = 100",
       x = expression(bar(x)), y = "Density") +
  stat_function(fun = dnorm, args = list(mean = ____, sd = ____),
                geom = ____, col = "black", linetype = 2, fill = "black",
                alpha = 0.2)

One completed version of this code is:

ggplot(data = subset(means, subset = (size == "n = 100"))) +
  geom_density(aes(x = mean), fill = "skyblue") +
  facet_grid(distribution ~ .) +
  coord_cartesian(xlim = c(9, 21)) +
  labs(title = "Simulated sampling distributions of the sample mean",
       subtitle = "Sample size n = 100",
       x = expression(bar(x)), y = "Density") +
  stat_function(fun = dnorm, args = list(mean = 15, sd = 15/sqrt(100)),
                geom = "area", col = "black", linetype = 2, fill = "black",
                alpha = 0.2)


Section 6.5.1.2, "Second Case: Sampling Distribution of $\bar{X}$ When $X$ Is Not a Normal Random Variable", of Probability and Statistics with R details a similar example of finding the distribution of $\bar{X}$ through sampling.