4 Functions
We have already seen and used many R functions. Now we are going to learn about some more statistics-specific functions, as well as how to write our own functions, which will give us a lot of flexibility in what we can do in R.
4.1 Probability functions
There is a range of functions that make working with statistical distributions a lot easier. With them we can generate random numbers from a distribution, calculate cumulative probabilities, compute densities and return quantiles.
The name of the distribution you are using is going to be part of the function name, so to keep things simple, let's start by looking at the normal distribution. The four functions you can use with the normal distribution are:
- rnorm(n, mean = 0, sd = 1): this returns a random sample of size n from a \(N(0,\,1)\) distribution.
- pnorm(q, mean = 0, sd = 1): this computes the probability \(\mathbb{P}(X\leq q)\), where \(X\sim N(0,\,1)\).
- dnorm(x, mean = 0, sd = 1): this computes the value of the probability density function, \(f(x)\).
- qnorm(p, mean = 0, sd = 1): this computes the quantile \(x\) such that \(\mathbb{P}(X\leq x)=p\), where \(X\sim N(0,\,1)\).
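As a minimal sketch of the four functions in action, the code below calls each one with the default standard normal arguments. The seed value and the object names are arbitrary choices made for this illustration.

```r
# Fix the random seed so the sample can be reproduced (the value 1 is arbitrary)
set.seed(1)

# rnorm: a random sample of size 5 from N(0, 1)
sample_values <- rnorm(n = 5)

# pnorm: P(X <= 0) for X ~ N(0, 1); by symmetry this is 0.5
prob_below_zero <- pnorm(q = 0)

# dnorm: the density f(0), which equals 1 / sqrt(2 * pi)
density_at_zero <- dnorm(x = 0)

# qnorm: the x with P(X <= x) = 0.5, i.e. the median, which is 0
median_value <- qnorm(p = 0.5)

print(sample_values)
print(c(prob_below_zero, density_at_zero, median_value))
```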
Note that norm appears in each function name because we are using the normal distribution. This part of the function name can be changed based on the distribution we want to use. For example, rt(), rbinom() and rf() return random samples from the \(t\), binomial and \(F\) distributions, respectively, provided suitable arguments are given. Remember to use the help() function to see what arguments are needed.
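For instance, a short sketch of these three functions might look as follows. The parameter values (degrees of freedom, number of trials, success probability) and the seed are arbitrary and chosen only for illustration.

```r
# Arbitrary seed so the samples can be reproduced
set.seed(2)

# Four draws from a t distribution with 10 degrees of freedom
t_sample <- rt(n = 4, df = 10)

# Four draws from a Binomial(20, 0.3) distribution
binom_sample <- rbinom(n = 4, size = 20, prob = 0.3)

# Four draws from an F distribution with 3 and 12 degrees of freedom
f_sample <- rf(n = 4, df1 = 3, df2 = 12)

print(t_sample)
print(binom_sample)
print(f_sample)
```

Running help(rbinom), for example, lists the arguments that the binomial functions expect.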
The arguments mean = 0 and sd = 1 are the default values, so leaving them out of any of the functions for the normal distribution above means you will be using the \(N(0,\,1)\) distribution. You can change the mean or standard deviation of the distribution by changing the values assigned to the arguments.
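The sketch below illustrates this with pnorm(). The numbers 110, 100 and 10 are arbitrary; they are picked so that 110 lies exactly one standard deviation above the mean, which makes the result directly comparable with the standard normal case.

```r
# Leaving out mean and sd uses the defaults, i.e. the N(0, 1) distribution
p_default <- pnorm(1)

# Writing the defaults out explicitly gives the same answer
p_explicit <- pnorm(1, mean = 0, sd = 1)

# Changing the arguments: P(X <= 110) for X ~ N(100, 10^2).
# 110 is one standard deviation above the mean, so this matches pnorm(1)
p_shifted <- pnorm(110, mean = 100, sd = 10)

print(c(p_default, p_explicit, p_shifted))
```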
The values that pnorm(q), qnorm(p) and dnorm(x) return are summarised graphically in Figure 4.1.
For example, we know that the 97.5th percentile (the 0.975 quantile) of the \(N(0,\,1)\) distribution is roughly 1.96. We can double check this using the following code.
qnorm(0.975)
[1] 1.959964
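Since qnorm() and pnorm() are inverses of each other, one way to sanity-check a quantile is to feed it back into pnorm(). The following is a small sketch of that idea; the object names are illustrative only.

```r
# The 0.975 quantile of the standard normal distribution
upper_quantile <- qnorm(0.975)

# Feeding the quantile back into pnorm() recovers the probability 0.975
prob_back <- pnorm(upper_quantile)

# By symmetry, the 0.025 quantile is the negative of the 0.975 quantile
lower_quantile <- qnorm(0.025)

print(c(upper_quantile, prob_back, lower_quantile))
```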
Choose the correct function and complete the code for the following scenarios.
- You want to construct a 90% confidence interval so need to know the 95th quantile of the standard normal distribution.
  ____(____, mean = 0, sd = 1)
- How would you find the value of \(x\) such that \(\mathbb{P}(X\leq x)=0.45\), where \(X\sim N(100,\,4^2)\)?
  ____(0.45, ____)
- You want to know the proportion of the \(N(0,\,2^2)\) distribution that lies below -2. That is, \(\mathbb{P}(X\leq -2)\), where \(X\sim N(0,\,2^2)\).
  ____(____, mean = 0, sd = ____)
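One possible way to complete the three scenarios is sketched below. The object names are illustrative, and other equivalent calls (for example, relying on the default arguments) would work just as well.

```r
# Scenario 1: the 0.95 quantile of the standard normal distribution
z_95 <- qnorm(0.95, mean = 0, sd = 1)

# Scenario 2: the x with P(X <= x) = 0.45 for X ~ N(100, 4^2);
# sd takes the standard deviation (4), not the variance (16)
x_45 <- qnorm(0.45, mean = 100, sd = 4)

# Scenario 3: P(X <= -2) for X ~ N(0, 2^2)
p_below <- pnorm(-2, mean = 0, sd = 2)

print(c(z_95, x_45, p_below))
```

Note that scenarios 1 and 2 ask for a value of \(x\) given a probability, so they use qnorm(), while scenario 3 asks for a probability given a value, so it uses pnorm().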
The normal distribution is not the only one we can use with the four functions introduced above. Table 4.1 shows some other distributions that can be used. To use a different distribution, simply change the norm part of the function name to the distribution's R name, and change the arguments mean = and sd = to the relevant arguments for your chosen distribution.
Distribution | R Name | Arguments |
---|---|---|
Normal | norm | mean = 0 : the mean with default value 0. sd = 1 : the standard deviation with default value 1. |
Binomial | binom | size = : the number of trials. prob = : the probability of success for each trial. |
Exponential | exp | rate = 1 : the value of \(\theta\) with default value 1. |
Geometric | geom | prob = : the probability of success in each trial. |
Hypergeometric | hyper | m = : the number of objects of type I in the population. n = : the number of objects not of type I in the population. k = : the size of the sample taken from the population. |
Negative Binomial | nbinom | size = : the number of successful trials you want to observe. prob = : the probability of success in each trial. |
Poisson | pois | lambda = : the value of \(\lambda\). |
Student's t | t | df = : the degrees of freedom. |
Uniform | unif | min = 0 : the lower limit of the distribution with default value 0. max = 1 : the upper limit of the distribution with default value 1. |
Chi-square | chisq | df = : the degrees of freedom. |
F | f | df1 = : the first degrees of freedom. df2 = : the second degrees of freedom. |
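To see how the naming pattern carries over, the sketch below pairs each of the four prefixes (p, d, q and r) with a different distribution from the table. All parameter values and the seed are arbitrary and serve only as an illustration.

```r
# Arbitrary seed so the uniform sample can be reproduced
set.seed(3)

# p + binom: P(X <= 3) for X ~ Binomial(10, 0.5), which is 176/1024
p_binom <- pbinom(q = 3, size = 10, prob = 0.5)

# d + pois: P(X = 2) for X ~ Poisson(4), which is 8 * exp(-4)
d_pois <- dpois(x = 2, lambda = 4)

# q + exp: the median of an exponential distribution with rate 2, log(2) / 2
q_exp <- qexp(p = 0.5, rate = 2)

# r + unif: three random draws between 10 and 20
u_sample <- runif(n = 3, min = 10, max = 20)

print(c(p_binom, d_pois, q_exp))
print(u_sample)
```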
To read more, see Section 1.15 Probability Functions in Probability and Statistics with R.
4.2 Creating functions
So far we have used a number of built-in functions in R. It is also possible to create your own functions, specifying the arguments they require and what they do.
The setup for creating your own function using function() is as follows.
fname <- function(argument1){
  expression
}
- fname: this is the name you want to give to your function. It can be anything you choose, but try not to use the names of functions that already exist.
- argument1: this is the name of the first argument to be given when you use your function. You can add as many arguments as are necessary for your function to work.
- expression: this is what you want your function to evaluate. You write this out using the argument names you have specified.
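As a concrete sketch of this setup, the function below standardises a vector of values. The name standardise and its arguments values, centre and spread are hypothetical, invented purely for this illustration; they are not part of base R.

```r
# Hypothetical example: `standardise` and its argument names are illustrative
standardise <- function(values, centre = 0, spread = 1){
  # The expression is written in terms of the argument names;
  # the value of the last evaluated line is what the function returns
  (values - centre) / spread
}

# Subtract 100 from each value and divide by 10
z_scores <- standardise(c(90, 100, 110), centre = 100, spread = 10)
print(z_scores)
```

Here standardise plays the role of fname, values is the first argument, and the single line inside the braces is the expression.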
For example, we can write a function, normal(), that:
- draws n random values from a normal distribution with mean m and standard deviation s
- orders them from smallest to largest
- for each random value \(x\), returns the probability \(\mathbb{P}(X\leq x)\) and the value of the probability density function at this value, \(f(x)\)
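normal <- function(n, m = 0, s = 1){
  # Draw n random values from N(m, s^2)
  x <- rnorm(n = n, mean = m, sd = s)
  # Order the values from smallest to largest
  x <- sort(x)
  # Cumulative probability P(X <= x) for each value
  prob <- pnorm(q = x, mean = m, sd = s)
  # Density f(x) for each value
  pdf <- dnorm(x = x, mean = m, sd = s)
  # Return the three vectors as the columns of a matrix
  cbind(x, prob, pdf)
}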
Here we have specified default values of m = 0 and s = 1 in our function. This means that if someone uses the function normal() without specifying values for these arguments, then it will automatically use m = 0 and s = 1.
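A quick way to convince ourselves of this is to call the function with and without the default values under the same random seed. The sketch below repeats the definition of normal() so that it runs on its own; the seed value is arbitrary.

```r
# Definition of normal() repeated so that this example is self-contained
normal <- function(n, m = 0, s = 1){
  x <- rnorm(n = n, mean = m, sd = s)
  x <- sort(x)
  prob <- pnorm(q = x, mean = m, sd = s)
  pdf <- dnorm(x = x, mean = m, sd = s)
  cbind(x, prob, pdf)
}

# Only n is supplied, so m = 0 and s = 1 are used automatically
set.seed(4)
default_draws <- normal(3)

# Resetting the seed and writing the defaults out gives the same matrix
set.seed(4)
explicit_draws <- normal(3, m = 0, s = 1)

print(identical(default_draws, explicit_draws))
```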
Note that any vectors we create within the expression for the function are not saved to your Environment tab.
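The sketch below illustrates this behaviour with a small hypothetical function, sum_of_squares(), invented for this example. It assumes your workspace does not already contain an object called squares.

```r
# Hypothetical helper: `squares` only exists while the function is running
sum_of_squares <- function(values){
  squares <- values^2
  sum(squares)
}

total <- sum_of_squares(c(1, 2, 3))
print(total)

# Look for `squares` in the global environment only (inherits = FALSE stops R
# from also searching attached packages); it was never saved there
squares_in_workspace <- exists("squares", envir = globalenv(), inherits = FALSE)
print(squares_in_workspace)
```

Only the value returned by the function, stored here as total, is kept. If you need an intermediate vector later, return it from the function and assign the result to a name.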
We can then use this new function normal() to see this information for 5 random values from the \(N(200,\,40^2)\) distribution.
normal(n = 5, m = 200, s = 40)
x prob pdf
[1,] 194.7766 0.4480522 0.009888883
[2,] 201.2908 0.5128713 0.009968366
[3,] 204.4489 0.5442796 0.009912060
[4,] 234.8700 0.8083276 0.006820707
[5,] 268.3736 0.9563059 0.002314091
The arguments you use within a function can be either 'named' or 'positional'.
Named arguments are ones where you use the name given in the setup of the function. For example, in our function normal(), the argument for the mean is called m. You can change the value of the mean using m = 4 within the normal() function, for example.
Positional arguments are ones where you don't use the name of the argument given in the setup of the function. For example, we could replicate the above output (given the same random seed) using the following code, where the argument names have not been used.
normal(5, 200, 40)
x prob pdf
[1,] 194.7766 0.4480522 0.009888883
[2,] 201.2908 0.5128713 0.009968366
[3,] 204.4489 0.5442796 0.009912060
[4,] 234.8700 0.8083276 0.006820707
[5,] 268.3736 0.9563059 0.002314091
The important thing to remember when using positional arguments is that they must be given in the same order that their names are specified in the setup of the function, so that R knows how to match up the values correctly. For example, in the normal() function, the mean needs to be the second argument specified and the standard deviation needs to be the third.
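The following sketch compares named and positional calls to normal(). The definition is repeated so that the code runs on its own, and the seed value is arbitrary; resetting it before each call makes the random draws identical.

```r
# Definition of normal() repeated so that this example is self-contained
normal <- function(n, m = 0, s = 1){
  x <- rnorm(n = n, mean = m, sd = s)
  x <- sort(x)
  prob <- pnorm(q = x, mean = m, sd = s)
  pdf <- dnorm(x = x, mean = m, sd = s)
  cbind(x, prob, pdf)
}

# Named arguments
set.seed(5)
named <- normal(n = 5, m = 200, s = 40)

# Positional arguments, given in the order n, m, s
set.seed(5)
positional <- normal(5, 200, 40)

# Named arguments can be written in any order
set.seed(5)
reordered <- normal(s = 40, n = 5, m = 200)

print(identical(named, positional))
print(identical(named, reordered))
```

By contrast, normal(5, 40, 200) would run without error but would use a mean of 40 and a standard deviation of 200, which is why the order matters for positional arguments.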
For more information on writing your own functions in R, see Section 1.17 Creating Functions in Probability and Statistics with R.