A Additional Material
A.1 cut()
For numeric variables in a data frame, it can sometimes be useful to split the values into intervals and create a new factor whose levels are those intervals. For example, we might want to identify high, mid and low levels of hdl in chol_full.
The function that can do this in R is cut(). The arguments that cut() takes are:
x = : this is the numeric variable that you want to split into different levels.
breaks = : this is the number of levels you want to split the numeric vector into (a vector of cut points can be given instead).
include.lowest = : this takes the values TRUE or FALSE, indicating whether the lowest value in the numeric variable should be included in the first level. By default this is set to FALSE, so the smallest value is not included.
We can split hdl into three levels using the following code.
cut(x = chol_full$hdl, breaks = 3, include.lowest = TRUE)
[1] [24.9,55.7] [24.9,55.7] (55.7,86.3] [24.9,55.7] (86.3,117] [24.9,55.7]
[7] (55.7,86.3] (55.7,86.3] [24.9,55.7] [24.9,55.7] (55.7,86.3] (86.3,117]
[13] [24.9,55.7] (55.7,86.3]
Levels: [24.9,55.7] (55.7,86.3] (86.3,117]
This tells us that the lowest level is the range [24.9, 55.7], the middle level is (55.7, 86.3] and the highest level is (86.3, 117]. We can also see which level each row falls into: the first two rows are in the low level for hdl, the third row is in the middle level, and so on.
We can then add this as a new factor variable, hdl_level, and represent each level with the labels "low", "mid" and "high" using the code below.
chol_full$hdl_level <- factor(cut(x = chol_full$hdl, breaks = 3, include.lowest = TRUE),
labels = c("low", "mid", "high"))
head(chol_full)
id | ldl | hdl | trig | age | gender | smoke | weight | height | bmi | hdl_level |
---|---|---|---|---|---|---|---|---|---|---|
P912 | 175 | 25 | 148 | 39 | female | no | 90.77 | 1.69 | 31.78110 | low |
P215 | 196 | 36 | 92 | 32 | female | no | 75.06 | 1.75 | 24.50939 | low |
P063 | 139 | 65 | NA | 42 | male | NA | 73.99 | 1.84 | 21.85432 | mid |
P117 | 162 | 37 | 139 | 30 | female | ex-smoker | 86.25 | 1.83 | 25.75473 | low |
P613 | 140 | 117 | 59 | 42 | female | ex-smoker | 76.95 | 1.81 | 23.48829 | high |
P332 | 147 | 51 | 126 | 65 | female | ex-smoker | 57.66 | 1.75 | 18.82776 | low |
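The same idea can be tried without the full data set. The sketch below is only an illustration: the standalone vector hdl holds the six hdl values shown in the table above, the labels are passed straight to cut() through its labels = argument, and the cut points in the second call are made up for the example.

```r
# hdl values taken from the first six rows of the table above
hdl <- c(25, 36, 65, 37, 117, 51)

# Split into three equal-width intervals and label them in one step
hdl_level <- cut(x = hdl, breaks = 3, include.lowest = TRUE,
                 labels = c("low", "mid", "high"))
hdl_level

# Count how many values fall into each level
table(hdl_level)

# Alternatively, give the cut points directly
# (these thresholds are invented for illustration only)
hdl_custom <- cut(x = hdl, breaks = c(0, 40, 60, Inf),
                  labels = c("low", "mid", "high"))
hdl_custom
```

Giving breaks = a single number produces intervals of equal width, whereas a vector of cut points lets you choose where each level starts and ends.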
A.2 Merging data frames
Sometimes information relating to the same subjects or observations might be stored in two separate data frames. When this is the case, the two data frames can be combined using the merge() function. merge() takes the following arguments:
x = : this is the first of the two data frames you want to merge together.
y = : this is the second of the two data frames you want to merge.
by.x = : this specifies which column in the first data frame should be used to merge. This is usually an identifying variable such as subjects' names or ID codes.
by.y = : this specifies which column in the second data frame should be used to merge.
all = : this takes the values TRUE or FALSE, indicating whether all rows from both data frames should be included.
all.x = : this takes the values TRUE or FALSE, indicating whether all rows in the first data frame should be kept, with NA filled in for the columns of the second data frame where there is no match.
all.y = : this takes the values TRUE or FALSE, indicating whether all rows in the second data frame should be kept, with NA filled in for the columns of the first data frame where there is no match.
Only the arguments x = and y = are required. When the by arguments are left out of the function, R will automatically merge on the columns which share the same name in the two data frames. When any of the all arguments are left out, they default to FALSE, so only rows with a match in both data frames are kept in the final merged data frame.
The file treatment.csv contains information on whether patients in a study testing a new treatment for high cholesterol were given the new drug or a placebo drug with no effect. Some of the patients in this new study are subjects from the chol data frame. We can read in the file treatment.csv and merge it with chol_full in order to see all the information available on a subject.
To start with, we need to read in treatment.csv and save this as a data frame called treatment.
Then we can merge chol_full and treatment into a single data frame called patients using the following code.
patients <- merge(x = chol_full, y = treatment, by.x = "id", by.y = "patient_id",
all = TRUE)
head(patients[, -c(1:3)])
trig | age | gender | smoke | weight | height | bmi | hdl_level | treatment |
---|---|---|---|---|---|---|---|---|
120 | 48 | male | current | 99.02 | 1.70 | 34.26298 | mid | NA |
NA | 42 | male | NA | 73.99 | 1.84 | 21.85432 | mid | Treatment |
NA | NA | NA | NA | NA | NA | NA | NA | Treatment |
67 | 48 | female | no | 74.90 | 1.61 | 28.89549 | mid | Treatment |
NA | NA | NA | NA | NA | NA | NA | NA | Treatment |
139 | 30 | female | ex-smoker | 86.25 | 1.83 | 25.75473 | low | Placebo |
Because the column showing the patient ID has a different name in chol_full and treatment, we have had to specify what it is called in each data frame here using by.x = and by.y = (make sure to check the contents of your data frames to notice things like this!). The argument all = TRUE means that we are keeping all information from both data frames, regardless of whether a patient only appears in chol_full or only in treatment. This is why, in the excerpt of patients above, there are rows where the values of all variables except treatment are NA.
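To see how the all arguments change the result, the sketch below merges two small toy data frames. The names chol_toy and treat_toy and every value in them are invented for illustration; only the merge() arguments match those described above.

```r
# Toy data frames; all IDs and values here are invented for illustration
chol_toy <- data.frame(id = c("P001", "P002", "P003"),
                       hdl = c(40, 55, 62))
treat_toy <- data.frame(patient_id = c("P002", "P003", "P004"),
                        treatment = c("Placebo", "Treatment", "Treatment"))

# Default (all = FALSE): only patients found in both data frames are kept
inner <- merge(x = chol_toy, y = treat_toy,
               by.x = "id", by.y = "patient_id")

# all = TRUE: every patient from either data frame, with NA where unmatched
full <- merge(x = chol_toy, y = treat_toy,
              by.x = "id", by.y = "patient_id", all = TRUE)

# all.x = TRUE: every row of chol_toy is kept, matched or not
left <- merge(x = chol_toy, y = treat_toy,
              by.x = "id", by.y = "patient_id", all.x = TRUE)

inner
full
left
```

In each result the identifying column takes its name from by.x =, so it is called id rather than patient_id.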
The file class.csv contains information on the average primary class size in the years 2016 - 2022. Read this file into R and save it as a data frame called class.
Merge the information from the data frames education and class together into a new data frame called primary, showing all variables from education and the average class size for primary schools only. Look carefully at which column names these two data frames have in common.
To read the file class.csv into R, we can use the following code.
In order to merge the two data frames, we want to use the function merge(). The data frame we provide to the argument x = is education and the data frame for the y = argument is class.
Because we want to match up the rows with the same year and the same level of education, we need to give the argument by = a vector of these two variables. We can use the argument by =, rather than by.x = and by.y =, because the columns have the same names in both data frames.
Finally, since we only want to show the rows for primary schools, we can specify all.y = TRUE. This will keep all the rows from the second data frame, class, and drop the rows from the first data frame which don't have a matching row in the second. For example, because class contains no information on the average class size in secondary schools in 2016, this row from education will not appear in primary.
year | level | schools | teachers | pupils | size |
---|---|---|---|---|---|
2016 | Primary | 2031 | 23920 | 396697 | 23.5 |
2017 | Primary | 2019 | 24477 | 400312 | 23.5 |
2018 | Primary | 2012 | NA | 400276 | 23.5 |
2019 | Primary | 2004 | 25027 | 398794 | 23.5 |
2020 | Primary | 2005 | 25651 | 393957 | 23.1 |
2021 | Primary | 2001 | 25807 | 390313 | 23.2 |
2022 | Primary | 1994 | 25451 | 388920 | 23.3 |
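The merge described above can be sketched on cut-down versions of the two data frames. This is an illustration under stated assumptions rather than the exact solution: education_toy and class_toy are hypothetical names, the Primary figures are copied from the table above, and the Secondary figures are invented.

```r
# Cut-down stand-in for education
# (Primary values from the table above; Secondary values are invented)
education_toy <- data.frame(
  year    = c(2016, 2016, 2017, 2017),
  level   = c("Primary", "Secondary", "Primary", "Secondary"),
  schools = c(2031, 360, 2019, 358)
)

# Cut-down stand-in for class: primary schools only
class_toy <- data.frame(
  year  = c(2016, 2017),
  level = c("Primary", "Primary"),
  size  = c(23.5, 23.5)
)

# Match on both year and level; keep every row of the second data frame
primary <- merge(x = education_toy, y = class_toy,
                 by = c("year", "level"), all.y = TRUE)
primary
```

Because the Secondary rows of education_toy have no partner in class_toy, they are dropped from the result.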
For more information on merging data sets, see Section 1.11.4 Merging Data Frames of Probability and Statistics with R.
A.3 Flow control
Flow control is the term used to describe the order in which your code is carried out. Usually this just happens line by line, but there may be cases where you want to repeat and update certain code over and over. To save you having to type out a lot of repetitive code, there are functions in R which can help you to repeat certain operations.
The for() function allows you to repeat lines of code for a given number of repetitions. Using the for() function is sometimes referred to as a for loop because, once it has run the lines of code, it loops back to the beginning and runs them all again. The setup is:
for(name in vector){statements}
name: this is the name you want to use for the index in each iteration of the repeated statements. Most commonly the name i is used.
vector: this is a vector which is the same length as the number of times you want to repeat the statements.
statements: these are the lines of code you want to repeat a number of times.
In order to fully understand how the for() function works, let's look at an example. We could use the for() function to print the numbers 1 up to 5.
for(i in 1:5){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Here, we have used i as the name argument and the vector is 1:5. The : operator returns the sequence of numbers from 1 to 5 in this case.
The for loop will first of all give i the value 1 and then run the statements within { }, so here it will run print(1). print() is a function we haven't seen yet, but it just 'prints' its arguments to the R console, so we will see the value 1 written there.
Next, the for loop will update the value of i to the next one given in the vector argument, i.e. it will assign the value 2 to i. Then it will run print(2) and update the value of i again. This will repeat until i has taken all the values given in the vector argument.
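The vector given to a for loop does not have to be a sequence of numbers, and the statements can store results instead of printing them. The following is a minimal sketch of both ideas; the names vars and squares are hypothetical and chosen only for this example.

```r
# The index can step through any vector, here a character vector
vars <- c("ldl", "hdl", "trig")
for(v in vars){
  print(v)
}

# Store a result from each iteration instead of printing it
squares <- numeric(5)        # empty numeric vector of length 5
for(i in 1:5){
  squares[i] <- i^2          # fill in position i on iteration i
}
squares
```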
Complete the following code to sum together the numbers 1 to 12.
sum <-
for(i in
){
sum <- sum +
}
We start by creating the vector sum, to which each value 1, 2, ..., 12 can be added. Initially it needs to take the value 0.
Within the function for(), we want i to take each value 1, 2, ..., 12 in turn, so we need to provide a vector of these values (1:12). sum should then be updated each time i takes a new value, by adding it on to the old value of sum.
This for loop starts with sum having the value 0. It will first assign 1 to i and execute the code sum <- 0 + 1, meaning sum now has the value 1.
i will then be updated to take the value 2 and the for loop will run the code sum <- 1 + 2, i.e. sum has the value 3.
i will then be updated to take the value 3 and the for loop will run the code sum <- 3 + 3, i.e. sum has the value 6.
\(\vdots\)
This repeats until finally i is assigned the value 12 and the for loop updates sum for the last time.
Using the above code, what is the value of \(1+2+3+...+12\)?
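Following the description above, one way the completed loop might look is sketched below. The final line is an extra check added for this example, using the closed form n(n + 1)/2 for the sum of the first n whole numbers.

```r
# Start the running total at 0
sum <- 0

# Add each of the values 1, 2, ..., 12 to the running total in turn
for(i in 1:12){
  sum <- sum + i
}
sum

# Cross-check: closed form n * (n + 1) / 2 with n = 12
12 * 13 / 2
```

Naming a variable sum does not stop the built-in function sum() from working, because R looks for a function when the name is followed by brackets, but choosing a different name avoids confusion in longer scripts.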
You might not always know how many times you want to repeat certain lines of code. When this is the case, the while() function can be used to repeat code while a given condition is satisfied. Once this condition is no longer TRUE, the loop will stop. The while() function has a similar setup to for loops.
while(condition){statements}
condition: this is a logical statement that can take the values TRUE or FALSE. When it is TRUE, the statements will be repeated. When it is FALSE, the statements will not be evaluated.
statements: these are the lines of code you want to repeat a number of times.
We could use a while loop to calculate the sum of the numbers 0, 1, 2, 3, 4, 5, ... up until their total first goes over 100. The code below does this by first creating the vectors i, which lists all the values we want to sum together, and total, to keep track of the sum of all the values in i. Initially i only contains the value 0 and total is also set to 0.
Because we only want to sum together the values in i until total first goes over 100, the condition we provide within while() is total < 100, which means the while loop will continue to run only while the value of total is less than 100.
Because i starts as only 0 and total is set to 0, the first iteration of the while loop updates i to be \(\begin{bmatrix}0&1 \end{bmatrix}^\intercal\) and then updates total to be 1 (since this is the sum of all the values currently in i). The second iteration then extends i to be \(\begin{bmatrix}0&1&2 \end{bmatrix}^\intercal\) and updates total again to be \(0+1+2=3\).
Within the while loop, we continuously extend i by adding the next number in the sequence to the end of it. This is done using the function max(), which looks at all values in a vector and returns the maximum. We then update total to be the sum of all the values in i. This is done by using the function sum(), which sums together all values in a numeric vector. This continues until total is no longer less than 100 (here it jumps from 91 to 105, so it never equals 100 exactly), at which point the while loop stops.
The final line prints the final versions of i and total, so we can see that the vector i contains the values 0 up to 14 and that the sum of all these values, given by total, is 105.
i <- 0
total <- 0
while(total < 100){
i <- c(i, max(i)+1)
total <- sum(i)
}
list("i" = i, "total" = total)
$i
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
$total
[1] 105
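If only the final total is needed, the same loop can be written without storing every value. The sketch below is an alternative formulation, not the code used above: here i is a single counter rather than a growing vector, and each new value is added straight on to total.

```r
i <- 0        # the most recent number added
total <- 0    # running total so far

while(total < 100){
  i <- i + 1            # move on to the next number
  total <- total + i    # add it to the running total
}

list("i" = i, "total" = total)
```

This version stops at the same point, with i equal to the last number added, and avoids copying the whole vector on every iteration.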
If statements are another useful form of flow control. They have a very similar setup to the for() and while() functions.
if(condition){statements} else {statements}
condition: this is a logical statement that can take the values TRUE or FALSE. If the condition is TRUE then the statements in the { } immediately after will be evaluated.
else: this is an optional part of the if statement. If there is code you wish to run when the condition is FALSE, it is placed in the { } after this statement.
For example, we can write code to tell us whether a random value generated from the standard normal distribution lies in the region [-1, 1] using the following code.
x <- rnorm(n = 1, mean = 0, sd = 1)
x
if(x >= -1 & x <= 1){
print("x is in the region [-1, 1]")
} else {
print("x is not in the region [-1, 1]")
}
[1] 0.7808893
[1] "x is in the region [-1, 1]"
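More than two cases can be handled by chaining conditions with else if. The following is a small sketch with x fixed at an arbitrary value so that the result is the same every time it is run; the variable region is a hypothetical name used only here.

```r
# x is fixed (an arbitrary value) so that the result is reproducible
x <- 1.5

if(x < -1){
  region <- "below -1"
} else if(x <= 1){
  # only reached when x >= -1, so this covers the interval [-1, 1]
  region <- "in [-1, 1]"
} else {
  region <- "above 1"
}
region
```

The conditions are checked from top to bottom and only the first block whose condition is TRUE is run.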
What would the value of y be after running the following if statement? Try to answer without running the code yourself.
Running the code above updates the value of y to be -16.
You can read more about for loops, while loops and if statements in Section 1.16 Flow Control of Probability and Statistics with R.