3 Data Structures
R uses various different types of 'data structures' which are really just ways to store information of varying types. A lot of the data structures R uses are defined in terms of vectors, so it is important you are familiar with constructing and manipulating vectors in R (see S2S Lab 1).
Some of the common data structures we will become familiar with are arrays, matrices, factors, data frames and lists.
Each of the data structures presented here is covered using additional examples in Section 1.9 R Data Structures of Probability and Statistics with R.
3.1 Arrays
Arrays are defined as multidimensional arrangements of elements. This means that rather than storing data in a one dimensional vector, you can spread the elements of this vector across multiple dimensions.
This sounds quite complicated so let's look at an example. First, let's create a long vector of the numbers 1 up to 24 to give us elements to populate this array with.
We can then use the `array()` function to turn the vector `vect` into an array. The `array()` function takes the following arguments:
- `data =`: this is the vector of elements that we want to populate the array with.
- `dim =`: this is another vector giving the number of rows first, then the number of columns and finally the number of 'layers'.
We can turn `vect` into an array, called `A1`, using the following code.
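The code itself is not reproduced at this point, so here is a minimal sketch that gives the output shown below. It assumes `vect` was created with `1:24`, which matches the description of "the numbers 1 up to 24".

```r
# Vector of the numbers 1 to 24 (assumed construction of vect)
vect <- 1:24

# Arrange the 24 values into 2 rows, 4 columns and 3 layers
A1 <- array(data = vect, dim = c(2, 4, 3))
A1
```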
, , 1
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
, , 2
[,1] [,2] [,3] [,4]
[1,] 9 11 13 15
[2,] 10 12 14 16
, , 3
[,1] [,2] [,3] [,4]
[1,] 17 19 21 23
[2,] 18 20 22 24
Here we have populated `A1` with the values 1 to 24, so that it has dimensions 2 \(\times\) 4 \(\times\) 3. This means that we have created 3 layers where each layer is a 2 \(\times\) 4 matrix.
The values from `vect` are entered into `A1` going down the columns first, then moving from left to right before moving onto the next layer. This order of entering elements is called column-major order, since columns are filled in first.
You can learn more about creating arrays in Section 1.9.1 Arrays and Matrices of Probability and Statistics with R.
3.2 Matrices
Creating matrices
Matrices can be thought of as two dimensional arrays, i.e. they have no dimension saying how many layers they should contain. Therefore, matrices can also be created using the `array()` function by providing the `dim =` argument with a vector of length two. This vector then corresponds to the number of rows and columns, respectively, that the matrix has.
To create a matrix called `M1`, which is populated with the elements from `vect` and has 6 rows and 4 columns, we can use the following code.
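A minimal sketch of such a call, again assuming `vect` holds the numbers 1 to 24, is:

```r
vect <- 1:24   # assumed definition of vect

# A vector of length two in dim = gives a matrix: 6 rows, 4 columns
M1 <- array(data = vect, dim = c(6, 4))
M1
```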
[,1] [,2] [,3] [,4]
[1,] 1 7 13 19
[2,] 2 8 14 20
[3,] 3 9 15 21
[4,] 4 10 16 22
[5,] 5 11 17 23
[6,] 6 12 18 24
Since `M1` has dimensions 6 \(\times\) 4, it contains 24 elements, the same number of elements as the vector `vect`. If instead we had defined dimensions which gave fewer than 24 elements for `M1`, then `array()` would fill in the elements of `M1` with the elements of `vect` in column-major order until there were no spaces left to fill. This would mean that not all the values from `vect` would appear in `M1`.
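As an illustration, a 6 \(\times\) 2 version could be created as sketched here. The dimensions are inferred from the output below, and the name `M1_short` is only illustrative.

```r
vect <- 1:24   # assumed definition of vect

# Only 12 cells are available, so only the first 12 values of vect are used
M1_short <- array(data = vect, dim = c(6, 2))
M1_short
```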
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9
[4,] 4 10
[5,] 5 11
[6,] 6 12
If we had defined the dimensions so that `M1` contained more than 24 elements, then the values of `vect` would be repeated for as long as necessary until all of the elements of `M1` had a value.
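A sketch of this recycling behaviour, using 6 \(\times\) 5 dimensions inferred from the output below (the name `M1_long` is only illustrative):

```r
vect <- 1:24   # assumed definition of vect

# 30 cells but only 24 values: vect starts again from 1 in the fifth column
M1_long <- array(data = vect, dim = c(6, 5))
M1_long
```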
[,1] [,2] [,3] [,4] [,5]
[1,] 1 7 13 19 1
[2,] 2 8 14 20 2
[3,] 3 9 15 21 3
[4,] 4 10 16 22 4
[5,] 5 11 17 23 5
[6,] 6 12 18 24 6
Use the `array()` function and the `letters` vector to create a 5 \(\times\) 5 matrix containing the letters of the alphabet in column-major order, up to "y".
It is also possible to create matrices using the `matrix()` function. This has the advantage of allowing you to specify whether the elements should be filled in using column-major order or row-major order (where the elements are filled in from left to right along rows and then from top to bottom). The arguments that the `matrix()` function can be given are:
- `data =`: this is the vector of elements that we want to fill in the matrix with.
- `nrow =`: this is the number of rows the matrix should contain.
- `ncol =`: this is the number of columns the matrix should contain.
- `byrow =`: this takes the value `TRUE` or `FALSE` and states whether the elements should be entered in row-major order (`TRUE`) or column-major order (`FALSE`). By default, the value is `FALSE`, so elements will be entered in column-major order if you leave out this argument.
Only one of `nrow =` or `ncol =` needs to be included in the `matrix()` function because R will automatically calculate the value of the unspecified argument from the length of the vector given to `data =`.
We can create the same matrix as `M1` using the `matrix()` function. Let's call it `M2`.
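One way this could be written is sketched here. Whether the original call supplied `nrow =` or `ncol =` is not shown, so `nrow = 6` is an assumption; `ncol = 4` would give the same matrix.

```r
vect <- 1:24   # assumed definition of vect

# byrow is left at its default (FALSE), so filling is column-major
M2 <- matrix(data = vect, nrow = 6)
M2
```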
[,1] [,2] [,3] [,4]
[1,] 1 7 13 19
[2,] 2 8 14 20
[3,] 3 9 15 21
[4,] 4 10 16 22
[5,] 5 11 17 23
[6,] 6 12 18 24
If we wanted to fill in the elements in row-major order, then we could instead use the following code.
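A minimal sketch of the row-major version (the name `M2_byrow` is only illustrative):

```r
vect <- 1:24   # assumed definition of vect

# byrow = TRUE fills along each row before moving down to the next
M2_byrow <- matrix(data = vect, nrow = 6, byrow = TRUE)
M2_byrow
```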
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
[6,] 21 22 23 24
Create a 5 \(\times\) 5 matrix containing the first 25 letters of the alphabet using `matrix()`. Fill in the elements in row-major order.
To check the dimensions of a matrix (or an array), we use the function `dim()`. For matrices, this will return a vector of length 2 where the first value is the number of rows and the second value is the number of columns.
For example, we can see that `M1` has 6 rows and 4 columns using the following code.
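A self-contained sketch, rebuilding `M1` as above under the assumption that `vect` is `1:24`:

```r
vect <- 1:24                              # assumed definition of vect
M1 <- array(data = vect, dim = c(6, 4))   # the 6 x 4 matrix from earlier

# Returns the number of rows followed by the number of columns
dim(M1)
```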
[1] 6 4
Naming rows and columns
It is possible to name the rows and columns of a matrix. This is useful if it holds data corresponding to different variables that you want to label. For example, we might want to represent the following table, showing the number of births recorded in four different cities in Scotland in the years 2017, 2018 and 2019, as a matrix.
City | 2017 | 2018 | 2019 |
---|---|---|---|
Edinburgh | 5033 | 4899 | 4683 |
Glasgow | 6852 | 6548 | 6553 |
Aberdeen | 2402 | 2337 | 2260 |
Dundee | 1493 | 1488 | 1417 |
We can create a matrix containing these values and then name the rows and columns using the following code.
data <- c(5033, 4899, 4683, 6852, 6548, 6553, 2402, 2337, 2260, 1493, 1488, 1417)
births <- matrix(data = data, nrow = 4, byrow = TRUE)
cities <- c("Edinburgh", "Glasgow", "Aberdeen", "Dundee")
years <- c("2017", "2018", "2019")
dimnames(births) <- list(cities, years)
births
2017 2018 2019
Edinburgh 5033 4899 4683
Glasgow 6852 6548 6553
Aberdeen 2402 2337 2260
Dundee 1493 1488 1417
Here, we have used the function `dimnames()`. By itself, `dimnames()` will extract the row and column names of a matrix, but we can also set these names by using the assignment operator `<-`. We have put the two vectors on the right hand side of the `<-` operator, meaning we want the row and column names to be these two vectors (the `list()` function that these vectors are wrapped in is covered in more detail in Section 3.5).
We can now use either the row/column names or the row/column numbers to extract particular elements from the matrix. We do this using square brackets, `[ ]`, similar to with vectors, but we now need to specify the row and column we are interested in.
For example, if we wanted to extract the number of births in Dundee in 2017, we can run either of the following lines of code.
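The two lines are not reproduced here. A sketch of what they might look like follows, with `births` rebuilt in one call so that the example runs on its own:

```r
# Rebuild the births matrix from the code above
births <- matrix(data = c(5033, 4899, 4683, 6852, 6548, 6553,
                          2402, 2337, 2260, 1493, 1488, 1417),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Edinburgh", "Glasgow", "Aberdeen", "Dundee"),
                                 c("2017", "2018", "2019")))

births["Dundee", "2017"]   # using the row and column names
births[4, 1]               # using the row and column numbers
```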
[1] 1493
[1] 1493
Note that when we use the row/column names, they are in quotation marks because they are saved as character vectors in R. It is important that the row you are interested in is stated first in `[ ]`, and then the column.
What is the code you would use to show the number of births in Glasgow in all 3 years?
Dimension reduction
When extracting an entire row or column using `[ ]`, the object that R returns is a vector rather than a matrix. This means that we can't use some functions that only work for matrices (or arrays).
For example, the code below returns the value `NULL` when using the `dim()` function. `dim()` should return the dimensions of an array, but since the extracted row for births in Edinburgh is a vector, there are no dimensions to return.
NULL
If we want to know how many elements are in a vector, we use the function `length()`.
[1] 3
We can force the output from subsetting a matrix/array to remain a matrix/array by including a third argument, `drop = FALSE`, within the square brackets.
[1] 1 3
Now we can see that the row is seen as a 1 \(\times\) 3 matrix by R.
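The three outputs in this subsection could be reproduced with calls along the following lines. This is a sketch that rebuilds `births` so it runs on its own; the name `edinburgh` is only illustrative.

```r
births <- matrix(data = c(5033, 4899, 4683, 6852, 6548, 6553,
                          2402, 2337, 2260, 1493, 1488, 1417),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Edinburgh", "Glasgow", "Aberdeen", "Dundee"),
                                 c("2017", "2018", "2019")))

edinburgh <- births["Edinburgh", ]         # a whole row is returned as a vector
dim(edinburgh)                             # NULL: a vector has no dimensions
length(edinburgh)                          # 3 elements, one per year
dim(births["Edinburgh", , drop = FALSE])   # 1 3: the row is kept as a matrix
```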
Calculating statistics
We can apply a function across the rows or columns of a matrix, for example to calculate the mean or standard deviation, using the function `apply()`. The arguments that can be given to `apply()` include:
- `X =`: this is the matrix (or array) we want to apply the function to.
- `MARGIN =`: this tells R whether we want to apply the function to the rows or the columns. A value of `1` means the function will be applied to each row and `2` means the function will be applied to each column.
- `FUN =`: this is the function we want to apply. It can be things like the mean (`mean`), median (`median`), or standard deviation (`sd`).
For example, if we wanted to know the mean births for each city across the three years, we can use the following code.
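A sketch of such a call, with `births` rebuilt so the example is self-contained (the name `city_means` is only illustrative):

```r
births <- matrix(data = c(5033, 4899, 4683, 6852, 6548, 6553,
                          2402, 2337, 2260, 1493, 1488, 1417),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Edinburgh", "Glasgow", "Aberdeen", "Dundee"),
                                 c("2017", "2018", "2019")))

# MARGIN = 1 applies mean() to each row, i.e. to each city
city_means <- apply(X = births, MARGIN = 1, FUN = mean)
city_means
```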
Edinburgh Glasgow Aberdeen Dundee
4871.667 6651.000 2333.000 1466.000
What is the standard deviation for the number of births in 2019?
Vector/Matrix multiplication
R can be used to carry out vector and matrix multiplication. The operator used for this is `%*%`. For example, for the following matrix and vector,
\[ \boldsymbol{X}=\begin{bmatrix}2&4&-1\\3&2&2\\1&2&-1\end{bmatrix},\,\,\,\,\boldsymbol{y}=\begin{bmatrix}1\\1\\3\end{bmatrix} \]
The solution to \(\boldsymbol{X}\times\boldsymbol{y}\) can be found using the following code.
X <- matrix(data = c(2, 4, -1, 3, 2, 2, 1, 2, -1), nrow = 3, byrow = TRUE)
y <- matrix(data = c(1, 1, 3), nrow = 3, byrow = TRUE)
X%*%y
[,1]
[1,] 3
[2,] 11
[3,] 0
If instead we wanted to calculate \(\boldsymbol{y}^\intercal\times\boldsymbol{X}\) we would first have to transpose `y`. Matrices can be transposed using the function `t()`.
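A minimal sketch of this calculation, repeating the definitions of `X` and `y` so that it runs on its own (the name `yX` is only illustrative):

```r
X <- matrix(data = c(2, 4, -1, 3, 2, 2, 1, 2, -1), nrow = 3, byrow = TRUE)
y <- matrix(data = c(1, 1, 3), nrow = 3, byrow = TRUE)

# t(y) is a 1 x 3 matrix, so the product with the 3 x 3 matrix X is 1 x 3
yX <- t(y) %*% X
yX
```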
[,1] [,2] [,3]
[1,] 8 12 -2
R can also be used to solve a system of equations. For example, we have just seen that,
\[ \begin{aligned} (2\times 1)+(4\times 1)+(-1\times 3)&=3\\ (3\times 1)+(2\times 1)+(2\times 3)&=11\\ (1\times 1)+(2\times 1)+(-1\times 3)&=0 \end{aligned} \]
But suppose we didn't know the vector \(\boldsymbol{y}\), and instead were given the system of equations,
\[ \begin{aligned} 2x+4y-z&=3\\ 3x+2y+2z&=11\\ x+2y-z&=0 \end{aligned} \]
This can also be represented as,
\[ \boldsymbol{Xy}=\boldsymbol{z},\,\mbox{ where } \boldsymbol{X}=\begin{bmatrix}2&4&-1\\3&2&2\\1&2&-1\end{bmatrix},\,\,\,\,\boldsymbol{y}=\begin{bmatrix}x\\y\\z\end{bmatrix},\mbox{ and }\boldsymbol{z}=\begin{bmatrix}3\\11\\0\end{bmatrix} \]
We can then use R to solve this system of equations using the `solve()` function. We need to give `solve()` the matrix of coefficients, \(\boldsymbol{X}\), and the vector \(\boldsymbol{z}\).
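A sketch of the call follows. Storing \(\boldsymbol{z}\) as a one-column matrix named `z` is an assumption, chosen because the output below is printed as a matrix; the name `solution` is only illustrative.

```r
X <- matrix(data = c(2, 4, -1, 3, 2, 2, 1, 2, -1), nrow = 3, byrow = TRUE)
z <- matrix(data = c(3, 11, 0), nrow = 3)   # right hand side of the system

# solve(a, b) finds the vector that satisfies a %*% solution = b
solution <- solve(X, z)
solution
```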
[,1]
[1,] 1
[2,] 1
[3,] 3
This then gives the solution \(\boldsymbol{y}=\begin{bmatrix}1&1&3\end{bmatrix}^\intercal\), which is what we expected to see.
It is also possible to use the `solve()` function to simply find the inverse of a matrix. This is done by providing it with only one argument: the matrix to be inverted. For example, the code below returns the inverse of the matrix \(\boldsymbol{X}\).
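A minimal sketch, with `X` defined again so that it runs on its own (the name `X_inv` is only illustrative):

```r
X <- matrix(data = c(2, 4, -1, 3, 2, 2, 1, 2, -1), nrow = 3, byrow = TRUE)

# With a single argument, solve() returns the inverse of the matrix
X_inv <- solve(X)
X_inv

# Multiplying X by its inverse should give the identity matrix (up to rounding)
X %*% X_inv
```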
[,1] [,2] [,3]
[1,] -1.50 0.50 2.50
[2,] 1.25 -0.25 -1.75
[3,] 1.00 0.00 -2.00
We can see that,
\[ \boldsymbol{X}^{-1}=\begin{bmatrix}-1.50&0.50&2.50\\1.25&-0.25&-1.75\\1.00&0.00&-2.00\end{bmatrix} \]
There are additional details and examples of using matrices in R in Sections 1.9.1 Arrays and Matrices and 1.9.2 Vector and Matrix Operations of Probability and Statistics with R.
3.3 Factors
Factors are similar to vectors in R, however they have additional information and are used to store categorical data, for example someone's gender or marriage status. They record the "levels" of the categorical variable stored within the vector which each numerical value corresponds to.
For example, suppose you are interested in the qualification level of several university alumni. You might use a simple encoding of `1` = "Bachelor's degree", `2` = "Master's degree" and `3` = "PhD" to record these data.
This might give us data that looks like the following vector `degree1`.
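The vector could have been created as sketched here. The values are taken from the output below, and the name `degree1` is taken from the later `factor()` call.

```r
# 1 = Bachelor's degree, 2 = Master's degree, 3 = PhD
degree1 <- c(1, 1, 2, 1, 3)
degree1
```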
[1] 1 1 2 1 3
By itself, `degree1` is not very informative about what level of degree each student was awarded. We can fill in the rest of this information by changing `degree1` from a vector to a factor.
To create a factor we use the function `factor()`. This function can be given the following arguments:
- `x =`: the vector of data which we want to define categories for.
- `levels =`: this is a vector of all possible values that the elements in `x` can take.
- `labels =`: this is a vector containing the names of each level of the category.
For example, to convert the vector `degree1` into a factor called `degree_factor1` we use:
degree_factor1 <- factor(x = degree1, levels = 1:3,
labels = c("Bachelor's", "Master's", "PhD"))
degree_factor1
[1] Bachelor's Bachelor's Master's Bachelor's PhD
Levels: Bachelor's Master's PhD
If instead the degree data had been recorded as a character vector stating the level of degree awarded, we can still turn this into a factor so that R knows this is categorical data and there are only three levels we are interested in.
degree2 <- c("Bachelor's", "Bachelor's", "Master's", "Bachelor's", "PhD" )
degree_factor2 <- factor(x = degree2, levels = c("Bachelor's", "Master's", "PhD"))
degree_factor2
[1] Bachelor's Bachelor's Master's Bachelor's PhD
Levels: Bachelor's Master's PhD
You can also change the labels of the levels used within a pre-existing factor using the `levels()` function.
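A sketch of how this could look. Which factor the original relabelled is not shown, so applying it to `degree_factor2` is an assumption.

```r
degree2 <- c("Bachelor's", "Bachelor's", "Master's", "Bachelor's", "PhD")
degree_factor2 <- factor(x = degree2, levels = c("Bachelor's", "Master's", "PhD"))

# The new labels replace the existing levels in the same order
levels(degree_factor2) <- c("BSc", "MSc", "PhD")
degree_factor2
```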
[1] BSc BSc MSc BSc PhD
Levels: BSc MSc PhD
The results from a survey asking students whether statistics is the best subject are shown below. They were given a choice of "Agree", "Disagree" and "Unsure".
Student | Answer |
---|---|
Student 1 | Agree |
Student 2 | Agree |
Student 3 | Agree |
Student 4 | Unsure |
Student 5 | Disagree |
Create and print a factor, called `survey`, which contains the answers of these five students as well as the levels of response they could have given.
Creating factors in R is covered in Section 1.9.3 Factors of Probability and Statistics with R.
3.4 Data frames
Data frames in R are very similar to matrices. The key difference however is that whilst all elements in a matrix must be of the same "mode" (e.g. numeric, character, logical), each column in a data frame can be of a different mode. If you needed to store a numeric vector, a logical vector and a character vector that all relate to the same subjects for example, then a data frame is the way to do this.
Data frames are a very common type of data structure used within R. Most of the data you will see saved in packages, or the data you will use for fitting statistical models will be saved in a data frame.
To create a data frame, we can use the function `data.frame()`. The only arguments needed are the pre-existing vectors, which all need to be of the same length, that you want to save within the data frame. Some additional arguments include:
- `stringsAsFactors =`: this takes the value `TRUE` or `FALSE` and tells R whether any character vectors should be turned into factors. If this argument is excluded, R (version 4.0.0 and later) takes the default value to be `FALSE`, so character vectors will be kept as they are.
- `row.names =`: this can be a vector of names you wish to use for the rows of the data frame. By default, R will just number the rows starting from 1.
We can create a data frame storing information about students' performance in a course using the following code.
percentage <- c(84, 76, 90, 53, 6, 67)
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
performance <- data.frame(percentage, grade, pass)
performance
percentage | grade | pass |
---|---|---|
84 | A | TRUE |
76 | A | TRUE |
90 | A | TRUE |
53 | C | TRUE |
6 | H | FALSE |
67 | B | TRUE |
We can see that within the data frame `performance`, `percentage` is a numeric vector, `grade` is a character vector and `pass` is a logical vector by using the function `str()`.
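A self-contained sketch of this check follows. Here `stringsAsFactors = FALSE` is written out explicitly, which is an addition to the original call, so that the result does not depend on the R version.

```r
percentage <- c(84, 76, 90, 53, 6, 67)
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
performance <- data.frame(percentage, grade, pass, stringsAsFactors = FALSE)

# str() summarises the structure: one line per column with its mode
str(performance)
```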
'data.frame': 6 obs. of 3 variables:
$ percentage: num 84 76 90 53 6 67
$ grade : chr "A" "A" "A" "C" ...
$ pass : logi TRUE TRUE TRUE TRUE FALSE TRUE
We could change `grade` to be a factor by adding `stringsAsFactors = TRUE` within the `data.frame()` function. We can also create a vector of students' IDs and use this to name the rows of the data frame in the following code.
ids <- c("ST002", "ST014", "ST089", "ST060", "ST034", "ST056")
performance <- data.frame(percentage, grade, pass,
stringsAsFactors = TRUE, row.names = ids)
performance
ID | percentage | grade | pass |
---|---|---|---|
ST002 | 84 | A | TRUE |
ST014 | 76 | A | TRUE |
ST089 | 90 | A | TRUE |
ST060 | 53 | C | TRUE |
ST034 | 6 | H | FALSE |
ST056 | 67 | B | TRUE |
Now if we use the `str()` function, we can see that `grade` is treated as a factor.
'data.frame': 6 obs. of 3 variables:
$ percentage: num 84 76 90 53 6 67
$ grade : Factor w/ 4 levels "A","B","C","H": 1 1 1 3 4 2
$ pass : logi TRUE TRUE TRUE TRUE FALSE TRUE
Elements from data frames can be extracted in a couple of ways: we can use square brackets, `[ ]`, or we can use the dollar sign operator, `$`.
For example, if we wanted to extract just the vector `pass` from the data frame `performance`, we can use any of the following lines of code.
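The original lines are not reproduced here. Three equivalent calls that would give the output below are sketched next, with `performance` rebuilt so the example runs on its own.

```r
percentage <- c(84, 76, 90, 53, 6, 67)
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
performance <- data.frame(percentage, grade, pass)

performance$pass       # dollar sign operator with the column name
performance[, "pass"]  # square brackets: all rows, column chosen by name
performance[, 3]       # square brackets: all rows, column chosen by position
```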
[1] TRUE TRUE TRUE TRUE FALSE TRUE
[1] TRUE TRUE TRUE TRUE FALSE TRUE
[1] TRUE TRUE TRUE TRUE FALSE TRUE
When we use square brackets, we also need to specify which rows we want to extract as the first entry (before the comma) within the square brackets. In the code above, we haven't specified any rows, so R shows us all of the rows from `performance`.
Write code to extract only the percentage and the associated grade for the student with ID `ST014`.
An alternative way to easily extract columns from a data frame is to use the `attach()` function. The only argument needed here is the data frame you want to attach to something called the 'search path' in R. This just means that you no longer need to type in the name of the data frame to access its columns.
Before we do this, we are going to run the following code to remove the original vectors we created from the Environment tab using the `rm()` function. We do this so that R doesn't just show us these pre-existing vectors directly, but instead looks within the data frame `performance`.
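A sketch of the whole sequence follows. It rebuilds `performance` first so that it is self-contained, and it ends with `detach()` so that the search path is left as it was.

```r
percentage <- c(84, 76, 90, 53, 6, 67)
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
performance <- data.frame(percentage, grade, pass, stringsAsFactors = TRUE)

rm(percentage, grade, pass)   # remove the stand-alone vectors
attach(performance)           # add the data frame to the search path
grade                         # now found inside performance
detach(performance)           # tidy up once finished
```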
[1] A A A C H B
Levels: A B C H
The `attach()` function is useful if you are going to be using the same data frame over and over, but it is good practice to 'detach' it using the `detach()` function once you no longer need the data frame.
When conducting statistical analysis, you won't often need to create your own data frame of information from scratch - it will most likely already exist in some format somewhere! One place where data might be stored is in the packages you can install and load into R.
We have already installed and loaded the package `PASWR2` in Section 2. We can now see a list of the data frames stored in the `PASWR2` package by using the following code.
We can see more information about any of these data sets using the `help()` function. For example, if we wanted to know what the data in the data frame `RAT` related to, we could use the following code.
To view an extract of this data frame, we can use the function `head()` and provide as an argument the name of the data frame. This will show us the first 6 rows of a data frame by default.
survival.time |
---|
152 |
152 |
115 |
109 |
137 |
88 |
To save a data frame from a package in our own Environment tab, we use the function `data()`. This will read in the data frame and means we can use it as we would any other data frame that we had created ourselves.
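The four steps described in this part (listing, getting help, previewing and loading) might look like the sketch below. The exact calls used originally are not shown. The `requireNamespace()` guard and the explicit `print()` calls are additions so that the block also runs as a script on a machine where `PASWR2` may not be installed.

```r
# Only run the example if the PASWR2 package is available
if (requireNamespace("PASWR2", quietly = TRUE)) {
  library(PASWR2)
  print(data(package = "PASWR2"))         # list the data sets in the package
  print(help("RAT", package = "PASWR2"))  # documentation for the RAT data frame
  print(head(RAT))                        # first 6 rows by default
  data("RAT", package = "PASWR2")         # load RAT into the Environment
}
```

At the console the guard and the `print()` wrappers are not needed, since each call prints its result when typed on its own.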
The package `PASWR2` contains a data set called `WAIT`. What do the wait times saved in this data set relate to?
Write code to first view the top 5 rows of the data frame `WAIT` and then load it into your Environment.
All of this information is covered with further examples in Section 1.9.5 Data Frames of Probability and Statistics with R.
3.5 Lists
Lists are objects in R that bring together elements of different modes (for example character, numeric or logical vectors or even matrices or arrays) into the same object. Lists are created using the `list()` function, which has no required arguments. Instead, each element is given a name along with its contents, which could be a vector, a matrix or an array.
For example, we could save information about the movie Titanic in a list using the following code.
titanic <- list(director = "James Cameron",
actors = c("Leonardo DiCaprio", "Kate Winslet"),
runtime = "3 hours 14 minutes",
release.date = "23/01/1998",
budget = 200000000,
gross.profit = 2222985568,
production.companies = c("Twentieth Century Fox",
"Paramount Pictures",
"Lightstorm Entertainment"))
titanic
$director
[1] "James Cameron"
$actors
[1] "Leonardo DiCaprio" "Kate Winslet"
$runtime
[1] "3 hours 14 minutes"
$release.date
[1] "23/01/1998"
$budget
[1] 2e+08
$gross.profit
[1] 2222985568
$production.companies
[1] "Twentieth Century Fox" "Paramount Pictures"
[3] "Lightstorm Entertainment"
The elements of a list can be accessed using either double square brackets, `[[ ]]`, or the `$` operator (when the elements are named). For example, if we wanted to extract the release date from `titanic`, then we can use any of the following lines of code.
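The original lines are not reproduced here. Three equivalent calls are sketched next, using a cut-down copy of `titanic` that holds only its first four elements so that the example stays short.

```r
# Cut-down copy of the titanic list (first four elements only)
titanic <- list(director = "James Cameron",
                actors = c("Leonardo DiCaprio", "Kate Winslet"),
                runtime = "3 hours 14 minutes",
                release.date = "23/01/1998")

titanic$release.date        # by name with the $ operator
titanic[["release.date"]]   # by name with double square brackets
titanic[[4]]                # by position: release.date is the fourth element
```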
[1] "23/01/1998"
[1] "23/01/1998"
[1] "23/01/1998"
We could also be more specific and extract a particular entry from one of the elements of the list using single square brackets, `[ ]`, after the double square brackets or `$` operator. For example, if we wanted to know who the second billed actor is, then we can use any of the following lines of code.
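A sketch of three such calls, using a cut-down copy of `titanic` with only its first two elements so that it runs on its own:

```r
# Cut-down copy of the titanic list (first two elements only)
titanic <- list(director = "James Cameron",
                actors = c("Leonardo DiCaprio", "Kate Winslet"))

titanic$actors[2]       # $ picks the element, [2] picks its second entry
titanic[["actors"]][2]  # same, using double square brackets and the name
titanic[[2]][2]         # same, using the position of actors in the list
```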
[1] "Kate Winslet"
[1] "Kate Winslet"
[1] "Kate Winslet"
If you are unsure of the names of all of the elements of a list, then the `names()` function is useful.
[1] "director" "actors" "runtime"
[4] "release.date" "budget" "gross.profit"
[7] "production.companies"
You can see other examples of how lists can be used in Section 1.9.4 Lists of Probability and Statistics with R.