2 Working With Data

2.1 Checking variable types

Once a data frame has been read into R, it is always a good idea to examine its contents using the str() function to see the structure of the data object. We have already seen the str() function in Lab 2 but as a reminder, it shows us the type of vector each column in a data frame is saved as.

Running the following code tells us that the four variables ldl, hdl, trig and age are all integer vectors and that id, gender and smoke are character vectors.

str(chol)
'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: chr  "female" "female" "male" "female" ...
 $ smoke : chr  "no" "no" NA "ex-smoker" ...

Because gender only takes the values "female" or "male" in this case, and smoke is categorised into three levels, "no", "ex-smoker" and "current", it makes sense to treat both these variables as factors instead of character vectors.

We can use what we learned in Lab 2 to change these variables into factors.

chol$gender <- factor(x = chol$gender, levels = c("female", "male"))
chol$smoke <- factor(x = chol$smoke, levels = c("no", "ex-smoker", "current"))

Now using str() to check the type of vector each column is saved as shows us that gender and smoke are both now factors.

str(chol)
'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 1 1 2 2 1 1 ...
 $ smoke : Factor w/ 3 levels "no","ex-smoker",..: 1 1 NA 2 2 2 1 3 1 2 ...

What type of variable is schools saved as in the education data frame?

Using the str() function shows us that schools is saved as an integer variable.

str(education)
'data.frame':   21 obs. of  5 variables:
 $ year    : int  2016 2016 2016 2017 2017 2017 2018 2018 2018 2019 ...
 $ level   : chr  "ELC" "Primary" "Secondary" "ELC" ...
 $ schools : int  2514 2031 359 2532 2019 360 2544 2012 357 2576 ...
 $ teachers: int  985 23920 22957 921 24477 23150 821 NA 23317 798 ...
 $ pupils  : int  96961 396697 280983 95893 400312 281993 96549 400276 286152 96375 ...

Write some code to change the variables year and level in education to be factor variables.

education$year <- factor(x = education$year,
                         levels = c("2016", "2017", "2018", "2019",
                                    "2020", "2021", "2022"))

education$level <- factor(x = education$level,
                          levels = c("ELC", "Primary", "Secondary"))

Refer to Section 1.11 Working with Data of Probability and Statistics with R to learn more about checking the setup of a data set.

2.2 Dealing with NA values

Data sets will often have missing values for a variety of different reasons; maybe because of human error, maybe because information was not disclosed or maybe because of a failed experiment for example. When data is correctly read into R these unknown values will be denoted by NA. In order to conduct analysis or perform calculations on your data, you may wish to remove these missing values from your data set. Always think about whether this is an appropriate thing to do.

One way in which we can remove missing values from a data set is to use the function na.omit(). This will return the data frame with any 'incomplete cases' removed. That is, any rows that have NA as the value for any variable will be removed from the data frame.

Looking at chol, we can see that there are missing values in rows 3, 7 and 9.

id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
3 P063 139 65 NA 42 male NA
4 P117 162 37 139 30 female ex-smoker
5 P613 140 117 59 42 female ex-smoker
6 P332 147 51 126 65 female ex-smoker
7 P951 82 81 NA 57 male no
8 P004 165 63 120 48 male current
9 P725 149 49 NA 32 female no
10 P901 95 54 157 55 female ex-smoker
11 P103 169 59 67 48 female no
12 P843 174 117 168 41 female no
13 P753 91 52 146 69 female current

If we run the following code, then these rows are removed from the data frame and we are left with only the 'complete cases'.

na.omit(chol)
id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
4 P117 162 37 139 30 female ex-smoker
5 P613 140 117 59 42 female ex-smoker
6 P332 147 51 126 65 female ex-smoker
8 P004 165 63 120 48 male current
10 P901 95 54 157 55 female ex-smoker
11 P103 169 59 67 48 female no
12 P843 174 117 168 41 female no
13 P753 91 52 146 69 female current

Note that na.omit() preserves the original row labels. This means that there are no rows labelled 3, 7 or 9 in the resulting data frame because they have been completely removed.

complete.cases() is another useful function that can be used to remove rows that have NA values. This returns a logical vector, the same length as the number of rows of the data frame, that indicates whether a row contains any NA values (FALSE), or whether it is 'complete' (TRUE).

complete.cases(chol)
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[13]  TRUE

Again we can see that the rows with missing values in chol are rows 3, 7 and 9 (since the third, seventh and ninth values in the output above are all FALSE). We can then use this logical vector to extract the rows which are complete from chol.

chol[complete.cases(chol), ]
id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
4 P117 162 37 139 30 female ex-smoker
5 P613 140 117 59 42 female ex-smoker
6 P332 147 51 126 65 female ex-smoker
8 P004 165 63 120 48 male current
10 P901 95 54 157 55 female ex-smoker
11 P103 169 59 67 48 female no
12 P843 174 117 168 41 female no
13 P753 91 52 146 69 female current

Here, using na.omit() and complete.cases() have returned the same output.

Which rows in education have missing values?

Using complete.cases() shows us that rows 8 and 16 of education are incomplete and therefore contain NA values.

complete.cases(education)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Write code to remove all rows in education which contain NA values.

Incomplete observations can be removed from education using any of the following lines of code.

na.omit(education)
education[complete.cases(education), ]
education[!is.na(education$teachers), ]

In the case where we only want to know which entries of a vector or specific variable in a data frame are NA, we can use the function is.na(). For example, if missing values in the trig variable were not of concern but we wanted to identify missing values in the smoke column, we could use the following code.

is.na(chol$smoke)
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE

We can see that only the third row has the value NA for smoke, since the third element in the output from is.na() above is TRUE. In order to remove the row where smoke has a missing value, we can use the following code to index the chol data frame.

chol[!is.na(chol$smoke), ]
id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
4 P117 162 37 139 30 female ex-smoker
5 P613 140 117 59 42 female ex-smoker
6 P332 147 51 126 65 female ex-smoker
7 P951 82 81 NA 57 male no
8 P004 165 63 120 48 male current
9 P725 149 49 NA 32 female no
10 P901 95 54 157 55 female ex-smoker
11 P103 169 59 67 48 female no
12 P843 174 117 168 41 female no
13 P753 91 52 146 69 female current

Note that we use ! in front of is.na() so that the logical vector returned has the value TRUE when values are complete and FALSE when values are missing i.e. NA.


You can look at further examples of dealing with missing data in Section 1.11.1 Dealing with NA Values of Probability and Statistics with R.

2.3 Sorting data frames

When investigating your data sets, you may want to order the values of a particular variable in increasing or decreasing order. This is easily done using the sort() function.

For example, we can view the ages of all subjects in chol, in increasing order, using the code below.

sort(chol$age)
 [1] 30 32 32 39 41 42 42 48 48 55 57 65 69

Note that if we wanted to view these ages in decreasing order, we would add the argument decreasing = TRUE to the sort() function.

What is the largest value for pupils from the education data frame?

sort(education$pupils, decreasing = TRUE)[1]
[1] 400312

The downside of using sort() is that we can only see the values from one variable of a data frame. If instead we wanted to order all subjects in chol from the youngest to the oldest and still see the values of all the other variables, we can use the function order().

order() will return a vector showing which row has the smallest value, then the second smallest value and so on. For example, the following code shows us that the fourth subject in chol is the youngest and the thirteenth subject is the eldest.

order(chol$age)
 [1]  4  2  9  1 12  3  5  8 11 10  7  6 13

We can then use this vector to index the full data frame chol and see all the variables for each subject at once.

chol[order(chol$age), ]
id ldl hdl trig age gender smoke
4 P117 162 37 139 30 female ex-smoker
2 P215 196 36 92 32 female no
9 P725 149 49 NA 32 female no
1 P912 175 25 148 39 female no
12 P843 174 117 168 41 female no
3 P063 139 65 NA 42 male NA
5 P613 140 117 59 42 female ex-smoker
8 P004 165 63 120 48 male current
11 P103 169 59 67 48 female no
10 P901 95 54 157 55 female ex-smoker
7 P951 82 81 NA 57 male no
6 P332 147 51 126 65 female ex-smoker
13 P753 91 52 146 69 female current

In the output above, note that there are multiple subjects aged 32, 42 and 48. After ordering by age, R automatically shows these subjects with the same age in order of increasing row number. We could however add a second or third argument to order() to order the rows by another variable in the case where there are repeated values of the first variable.

For example, the following code orders all the subjects in chol by age first, and then for any subjects that are the same age, they will then be sorted in order of increasing ldl.

chol[order(chol$age, chol$ldl), ]
id ldl hdl trig age gender smoke
4 P117 162 37 139 30 female ex-smoker
9 P725 149 49 NA 32 female no
2 P215 196 36 92 32 female no
1 P912 175 25 148 39 female no
12 P843 174 117 168 41 female no
3 P063 139 65 NA 42 male NA
5 P613 140 117 59 42 female ex-smoker
8 P004 165 63 120 48 male current
11 P103 169 59 67 48 female no
10 P901 95 54 157 55 female ex-smoker
7 P951 82 81 NA 57 male no
6 P332 147 51 126 65 female ex-smoker
13 P753 91 52 146 69 female current

Write code to sort the observations from education in decreasing order of the number of pupils.

We need to include the argument decreasing = TRUE within the function order() so that the observations are ordered from largest number of pupils to the smallest number of pupils. We can use the order() function within square brackets to show all variables in the data frame in order of decreasing number of pupils.

education[order(education$pupils, decreasing = TRUE), ]
year level schools teachers pupils
5 2017 Primary 2019 24477 400312
8 2018 Primary 2012 NA 400276
11 2019 Primary 2004 25027 398794
2 2016 Primary 2031 23920 396697
14 2020 Primary 2005 25651 393957
17 2021 Primary 2001 25807 390313
20 2022 Primary 1994 25451 388920
21 2022 Secondary 358 24874 309133
18 2021 Secondary 357 24782 306811
15 2020 Secondary 357 24077 300954
12 2019 Secondary 358 23522 292063
9 2018 Secondary 357 23317 286152
6 2017 Secondary 360 23150 281993
3 2016 Secondary 359 22957 280983
1 2016 ELC 2514 985 96961
7 2018 ELC 2544 821 96549
10 2019 ELC 2576 798 96375
4 2017 ELC 2532 921 95893
19 2022 ELC 2606 734 92615
16 2021 ELC 2630 NA 91603
13 2020 ELC 2587 729 90126

Look at Section 1.11.3 Sorting a Data Frame by One or More of Its Columns of Probability and Statistics with R to learn more about sorting and ordering data sets.

2.4 Subsetting

When we want to only view particular elements of a data frame, this is known as subsetting the data. This is useful if you're dealing with extremely large data sets and only want to analyse female subjects, or subjects who are all from the same country for example. Subsetting the data means that you would extract only these subjects that you are actually interested in.

A useful function for extracting elements of a data frame is the function subset() (which we first saw in Lab 1). This allows us to extract the elements of a data frame which meet particular conditions. The arguments that subset() takes are:

  • x =: this is the data frame that we want to extract particular elements from.
  • subset =: this is a logical statement which determines the elements to keep in the subsetted data frame.
  • select =: this shows the column or columns from the data frame which the logical statement should be applied to.

For example, if we wanted to view the subjects in chol who have an LDL of greater than 170, then we can use the following code.

subset(x = chol, subset = ldl > 170, select = ldl)
ldl
1 175
2 196
12 174

This shows us that there are three patients with LDL greater than 170 (subset = ldl > 170) and we can also see the values of LDL for these patients (select = ldl).

If we wanted to see the values of the other variables in the data frame for only those patients with LDL greater than 170, then we can simply leave out the select = argument.

subset(x = chol, subset = ldl > 170)
id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
12 P843 174 117 168 41 female no

Note that it is also possible to subset a data frame using logical statements within square brackets, [ ]. We could return the same output as above by indexing the chol using the following code.

chol[chol$ldl > 170, ]
id ldl hdl trig age gender smoke
1 P912 175 25 148 39 female no
2 P215 196 36 92 32 female no
12 P843 174 117 168 41 female no

Write some code to subset education to show the number of schools that have a collective total of more than 310,000 pupils in the years 2020, 2021 or 2022.

The data frame that we want to subset is education, so this is what we'll feed in to the argument x =.

Since the question asks us to look for a collective total of more than 310,000 pupils, this means we want to only see the rows where the value for pupils is greater than 310,000. We also only want to see rows from the years 2020, 2021 or 2022. Because year is a factor, we need to specify each level that we are interested in. This means that we are looking for rows in which pupils > 310000 AND year == "2020" or year == "2021" or year == "2021". This is quite a lengthy logical statement in the following code.

The question also asks us to only show the number of schools for which these statements are true i.e. the column schools. To do this, we simply feed this variable to the select = argument.

subset(x = education,
       subset = pupils > 310000 & year == "2020" | 
         pupils > 310000 & year == "2021" | 
         pupils > 310000 & year == "2022",
       select = schools)

A way we can shorten the logical statement in the subset = argument is to use the operator %in%. This will search for values in a vector and return the rows in which any of these values appear.

subset(x = education,
       subset = pupils > 310000 & year %in% c("2020", "2021", "2022"),
       select = schools)
schools
14 2005
17 2001
20 1994

You can read more about subsetting data frames in Section 1.12 Using Logical Operators with Data Frames in Probability and Statistics with R.

2.5 Summarising data

Data sets will often contain a lot of information which is not easy to interpret at a glance. It is therefore useful to be able to summarise the data they contain, in appropriate ways for each different type of variable.

One of the simplest functions to help summarise a data frame is the summary() function.

summary(chol)
      id                 ldl             hdl           trig      
 Length:13          Min.   : 82.0   Min.   : 25   Min.   : 59.0  
 Class :character   1st Qu.:139.0   1st Qu.: 49   1st Qu.: 99.0  
 Mode  :character   Median :149.0   Median : 54   Median :132.5  
                    Mean   :144.9   Mean   : 62   Mean   :122.2  
                    3rd Qu.:169.0   3rd Qu.: 65   3rd Qu.:147.5  
                    Max.   :196.0   Max.   :117   Max.   :168.0  
                                                  NA's   :3      
      age           gender         smoke  
 Min.   :30.00   female:10   no       :6  
 1st Qu.:39.00   male  : 3   ex-smoker:4  
 Median :42.00               current  :2  
 Mean   :46.15               NA's     :1  
 3rd Qu.:55.00                            
 Max.   :69.00                            
                                          

The output from summary() shows information for each column in the data frame you provide as the argument. For numerical variables, we are shown summary statistics such as the minimum value, the mean or the 3rd quartile. For factor variables, we are shown how many observations there are in each level of the factor. If there are any NA values in a column, the total number of these will also be shown for each variable.

When a data frame contains categorical variables, a neater way to summarise the counts of the different levels is in contingency tables. These show counts of how many times each level of a categorical variable appeared in the data frame. The function to create contingency tables in R is table(). The only argument that table() needs is the factor variable you want to summarise.

For example, we can quickly show counts of how many subjects in chol fall into each of the three levels of the smoke variable using the following code.

table(chol$smoke)

       no ex-smoker   current 
        6         4         2 

If we wanted to further split these counts by the variable gender, then we simply add this as a second argument to the table() function.

smoke_counts <- table(chol$smoke, chol$gender)
smoke_counts
           
            female male
  no             5    1
  ex-smoker      4    0
  current        1    1

We can easily compute the sums of rows or columns in a table using the function margin.table(). Here we need to provide margin.table() with the following arguments:

  • x =: this is the table you want to sum over.
  • margin =: this tells R whether you want to sum over rows (set the value to 1), or columns (set the value to 2).

For example, we can use the table smoke_counts, created above, to count the number of female and male subjects for whom we know their smoking status, using margin.table().

margin.table(x = smoke_counts, margin = 2)

female   male 
    10      2 

Another useful function to use with tables is prop.table(). This takes the same arguments as margin.table() but shows row or column proportions, rather than sums.

For example, to calculate the proportions of current smokers, ex-smokers and non-smokers that are female and male, we can use the following code.

prop.table(x = smoke_counts, margin = 1)
           
               female      male
  no        0.8333333 0.1666667
  ex-smoker 1.0000000 0.0000000
  current   0.5000000 0.5000000

Suppose you wanted to calculate summary statistics for one variable in a data frame, but have it split by the levels of a different categorical variable.

The function in R which calculates a summary statistic for one numeric variable, split by the levels of a factor is tapply(). The arguments that tapply() can take are as follows:

  • X =: this is the numeric variable that you want to apply the function calculating some summary statistic to.
  • INDEX =: this is a list containing the categorical variable (or variables) you want to split the calculation of the summary statistic across.
  • FUN =: this is the name of the function you want to apply to the numeric variable. Examples include mean, median, max, min, mode, sd etc.

In the case where we are interested in knowing the mean HDL for subjects who were current smokers, subjects who were ex-smokers and subjects who were non-smokers, we can use tapply().

tapply(X = chol$hdl, INDEX = list(chol$smoke), FUN = mean)
       no ex-smoker   current 
 61.16667  64.75000  57.50000 

We can see, for example, that the mean HDL for non-smokers is 61.17.

The list provided to the INDEX = argument can contain more than one categorical variable. For example, we can calculate the mean HDL of females and males for each level of the smoke variable using the following code.

tapply(X = chol$hdl, INDEX = list(chol$smoke, chol$gender), FUN = mean)
          female male
no         57.20   81
ex-smoker  64.75   NA
current    52.00   63

Now we can see that the mean HDL for females who are non-smokers is 57.20. The mean HDL for males who are ex-smokers is NA because there are no males included in chol who are ex-smokers.

What is the mean total number of teachers in primary schools across all years?

In order to find this value we want to use the function tapply(). teachers is the column we want to calculate the mean for, but make sure to split this by the different levels in the level column.

teachers contains some NA values, which when passed to the function mean will return another NA value unless you provide to tapply() the additional argument na.rm = TRUE. This tells R to ignore the NA values when calculating the mean and only use those rows which have a numerical value.

tapply(X = education$teachers, INDEX = list(education$level), FUN = mean, na.rm = TRUE)
       ELC    Primary  Secondary 
  831.3333 25055.5000 23811.2857 

To read more on creating tables and summarising data in R, see Sections 1.13 Tables and 1.14 Summarizing Functions in Probability and Statistics with R.

2.6 Creating variables

In the case where we have another vector or data frame that we wish to join to an existing one, we can do this using one of the functions cbind() or rbind().

  • cbind() combines the vectors or data frames together by making additional columns, whereas

  • rbind() combines them by adding the new vector or data frame as additional rows.

Let's see an example to understand how this works. The file measurements.csv contains information on the heights and weights of all 13 patients in the original chol data frame. We can begin by reading it in to the Environment tab using the following code.

measurements <- read.csv(file = "measurements.csv")

We can then add measurements to chol as two additional columns and save the resulting data frame as chol_full using the code below.

chol_full <- cbind(chol, measurements)
head(chol_full)
id ldl hdl trig age gender smoke weight height
P912 175 25 148 39 female no 90.77 1.69
P215 196 36 92 32 female no 75.06 1.75
P063 139 65 NA 42 male NA 73.99 1.84
P117 162 37 139 30 female ex-smoker 86.25 1.83
P613 140 117 59 42 female ex-smoker 76.95 1.81
P332 147 51 126 65 female ex-smoker 57.66 1.75

Another way to easily create a new variable in a data frame is to use the $ operator. We can simply add the name of the data frame to the left of $ and our new variable name to the right. Then we can set this variable to be any pre-existing vector, or calculate a new vector based on variables from the data frame.

For example, if we wanted to create a new variable, bmi, in chol which shows the BMI of each patient, then we can use the following code.

chol_full$bmi <- chol_full$weight/(chol_full$height)^2
head(chol_full)
id ldl hdl trig age gender smoke weight height bmi
P912 175 25 148 39 female no 90.77 1.69 31.78110
P215 196 36 92 32 female no 75.06 1.75 24.50939
P063 139 65 NA 42 male NA 73.99 1.84 21.85432
P117 162 37 139 30 female ex-smoker 86.25 1.83 25.75473
P613 140 117 59 42 female ex-smoker 76.95 1.81 23.48829
P332 147 51 126 65 female ex-smoker 57.66 1.75 18.82776

In the education data frame, create a new variable called ratio which calculates the pupil to teacher ratio in each level of education. That is,

\[\mbox{ratio}=\textstyle\frac{\mbox{puils}}{\mbox{teachers}}\]

education$ratio <- education$pupils/education$teachers

Now suppose that information on a fourteenth subject is known but has not been included in the original chol data frame. This data is shown in Table 2.1 below.

Table 2.1: Cholesterol data for the fourteenth patient.
id ldl hdl trig age gender smoke weight height
P461 148 78 120 41 male current 84.05 1.79

In this case we can add the new subject as an additional row using the rbind() function.

First, we need to create a data frame containing the information for this subject. In order for us to add this data frame as a row to chol_full, it needs to have the same number of variables. Therefore, we also need to calculate the BMI for this subject and call it bmi. We can do all this with the following code.

subject <- data.frame(id = "P461", ldl = 148, hdl = 78, trig = 120, age = 41, 
                      gender = "male", smoke = "current", weight = 84.05, height = 1.79)

subject$bmi <- subject$weight/(subject$height)^2

Now we can add this subject to chol_full using the code below.

chol_full <- rbind(chol_full, subject)
tail(chol_full)
id ldl hdl trig age gender smoke weight height bmi
9 P725 149 49 NA 32 female no 65.37 1.67 23.43935
10 P901 95 54 157 55 female ex-smoker 80.34 1.62 30.61271
11 P103 169 59 67 48 female no 74.90 1.61 28.89549
12 P843 174 117 168 41 female no 63.78 1.77 20.35813
13 P753 91 52 146 69 female current 71.58 1.62 27.27481
14 P461 148 78 120 41 male current 84.05 1.79 26.23202

Note that tail() is a function very similar to head(), but rather than showing the first 6 rows by default, it shows the last 6.


Sections 1.11.2 Creating New Variables in a Data Frame and 1.13 Tables of Probability and Statistics with R describe how to create new variables.

See Appendix A.1 to learn how to create a new variable in a data frame by breaking an exisiting variable into different levels.