2 Working With Data

2.1 Checking variable types

Once a data frame has been read into R, it is always a good idea to examine its contents using the str() function to see the structure of the data object. We have already seen the str() function in Lab 2 but as a reminder, it shows us the type of vector each column in a data frame is saved as.

Running the following code tells us that the four variables ldl, hdl, trig and age are all integer vectors and that id, gender and smoke are character vectors.

str(chol)

'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: chr  "female" "female" "male" "female" ...
 $ smoke : chr  "no" "no" NA "ex-smoker" ...

Because gender only takes the values "female" or "male" in this case, and smoke is categorised into three levels, "no", "ex-smoker" and "current", it makes sense to treat both these variables as factors instead of character vectors.

We can use what we learned in Lab 2 to change these variables into factors.

chol$gender <- factor(x = chol$gender, levels = c("female", "male"))
chol$smoke <- factor(x = chol$smoke, levels = c("no", "ex-smoker", "current"))

Now using str() to check the type of vector each column is saved as shows us that gender and smoke are both now factors.

str(chol)

'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 1 1 2 2 1 1 ...
 $ smoke : Factor w/ 3 levels "no","ex-smoker",..: 1 1 NA 2 2 2 1 3 1 2 ...

What type of variable is schools saved as in the education data frame?

Using the str() function shows us that schools is saved as an integer variable.

str(education)

'data.frame':   21 obs. of  5 variables:
 $ year    : int  2016 2016 2016 2017 2017 2017 2018 2018 2018 2019 ...
 $ level   : chr  "ELC" "Primary" "Secondary" "ELC" ...
 $ schools : int  2514 2031 359 2532 2019 360 2544 2012 357 2576 ...
 $ teachers: int  985 23920 22957 921 24477 23150 821 NA 23317 798 ...
 $ pupils  : int  96961 396697 280983 95893 400312 281993 96549 400276 286152 96375 ...

Write some code to change the variables year and level in education to be factor variables.

education$year <- factor(x = education$year,
                         levels = c("2016", "2017", "2018", "2019",
                                    "2020", "2021", "2022"))

education$level <- factor(x = education$level,
                          levels = c("ELC", "Primary", "Secondary"))

Refer to Section 1.11 Working with Data of Probability and Statistics with R to learn more about checking the setup of a data set.

2.2 Dealing with `NA` values

Data sets will often have missing values for a variety of different reasons; maybe because of human error, maybe because information was not disclosed or maybe because of a failed experiment for example. When data is correctly read into R these unknown values will be denoted by NA. In order to conduct analysis or perform calculations on your data, you may wish to remove these missing values from your data set. Always think about whether this is an appropriate thing to do.

One way in which we can remove missing values from a data set is to use the function na.omit(). This will return the data frame with any 'incomplete cases' removed. That is, any rows that have NA as the value for any variable will be removed from the data frame.

Looking at chol, we can see that there are missing values in rows 3, 7 and 9.

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
3	P063	139	65	NA	42	male	NA
4	P117	162	37	139	30	female	ex-smoker
5	P613	140	117	59	42	female	ex-smoker
6	P332	147	51	126	65	female	ex-smoker
7	P951	82	81	NA	57	male	no
8	P004	165	63	120	48	male	current
9	P725	149	49	NA	32	female	no
10	P901	95	54	157	55	female	ex-smoker
11	P103	169	59	67	48	female	no
12	P843	174	117	168	41	female	no
13	P753	91	52	146	69	female	current

If we run the following code, then these rows are removed from the data frame and we are left with only the 'complete cases'.

na.omit(chol)

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
4	P117	162	37	139	30	female	ex-smoker
5	P613	140	117	59	42	female	ex-smoker
6	P332	147	51	126	65	female	ex-smoker
8	P004	165	63	120	48	male	current
10	P901	95	54	157	55	female	ex-smoker
11	P103	169	59	67	48	female	no
12	P843	174	117	168	41	female	no
13	P753	91	52	146	69	female	current

Note that na.omit() preserves the original row labels. This means that there are no rows labelled 3, 7 or 9 in the resulting data frame because they have been completely removed.

complete.cases() is another useful function that can be used to remove rows that have NA values. This returns a logical vector, the same length as the number of rows of the data frame, that indicates whether a row contains any NA values (FALSE), or whether it is 'complete' (TRUE).

complete.cases(chol)

 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[13]  TRUE

Again we can see that the rows with missing values in chol are rows 3, 7 and 9 (since the third, seventh and ninth values in the output above are all FALSE). We can then use this logical vector to extract the rows which are complete from chol.

chol[complete.cases(chol), ]

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
4	P117	162	37	139	30	female	ex-smoker
5	P613	140	117	59	42	female	ex-smoker
6	P332	147	51	126	65	female	ex-smoker
8	P004	165	63	120	48	male	current
10	P901	95	54	157	55	female	ex-smoker
11	P103	169	59	67	48	female	no
12	P843	174	117	168	41	female	no
13	P753	91	52	146	69	female	current

Here, using na.omit() and complete.cases() have returned the same output.

Which rows in education have missing values?

Using complete.cases() shows us that rows 8 and 16 of education are incomplete and therefore contain NA values.

complete.cases(education)

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Write code to remove all rows in education which contain NA values.

Incomplete observations can be removed from education using any of the following lines of code.

na.omit(education)
education[complete.cases(education), ]
education[!is.na(education$teachers), ]

In the case where we only want to know which entries of a vector or specific variable in a data frame are NA, we can use the function is.na(). For example, if missing values in the trig variable were not of concern but we wanted to identify missing values in the smoke column, we could use the following code.

is.na(chol$smoke)

 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE

We can see that only the third row has the value NA for smoke, since the third element in the output from is.na() above is TRUE. In order to remove the row where smoke has a missing value, we can use the following code to index the chol data frame.

chol[!is.na(chol$smoke), ]

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
4	P117	162	37	139	30	female	ex-smoker
5	P613	140	117	59	42	female	ex-smoker
6	P332	147	51	126	65	female	ex-smoker
7	P951	82	81	NA	57	male	no
8	P004	165	63	120	48	male	current
9	P725	149	49	NA	32	female	no
10	P901	95	54	157	55	female	ex-smoker
11	P103	169	59	67	48	female	no
12	P843	174	117	168	41	female	no
13	P753	91	52	146	69	female	current

Note that we use ! in front of is.na() so that the logical vector returned has the value TRUE when values are complete and FALSE when values are missing i.e. NA.

You can look at further examples of dealing with missing data in Section 1.11.1 Dealing with NA Values of Probability and Statistics with R.

2.3 Sorting data frames

When investigating your data sets, you may want to order the values of a particular variable in increasing or decreasing order. This is easily done using the sort() function.

For example, we can view the ages of all subjects in chol, in increasing order, using the code below.

sort(chol$age)

 [1] 30 32 32 39 41 42 42 48 48 55 57 65 69

Note that if we wanted to view these ages in decreasing order, we would add the argument decreasing = TRUE to the sort() function.

What is the largest value for pupils from the education data frame?

sort(education$pupils, decreasing = TRUE)[1]

[1] 400312

The downside of using sort() is that we can only see the values from one variable of a data frame. If instead we wanted to order all subjects in chol from the youngest to the oldest and still see the values of all the other variables, we can use the function order().

order() will return a vector showing which row has the smallest value, then the second smallest value and so on. For example, the following code shows us that the fourth subject in chol is the youngest and the thirteenth subject is the eldest.

order(chol$age)

 [1]  4  2  9  1 12  3  5  8 11 10  7  6 13

We can then use this vector to index the full data frame chol and see all the variables for each subject at once.

chol[order(chol$age), ]

	id	ldl	hdl	trig	age	gender	smoke
4	P117	162	37	139	30	female	ex-smoker
2	P215	196	36	92	32	female	no
9	P725	149	49	NA	32	female	no
1	P912	175	25	148	39	female	no
12	P843	174	117	168	41	female	no
3	P063	139	65	NA	42	male	NA
5	P613	140	117	59	42	female	ex-smoker
8	P004	165	63	120	48	male	current
11	P103	169	59	67	48	female	no
10	P901	95	54	157	55	female	ex-smoker
7	P951	82	81	NA	57	male	no
6	P332	147	51	126	65	female	ex-smoker
13	P753	91	52	146	69	female	current

In the output above, note that there are multiple subjects aged 32, 42 and 48. After ordering by age, R automatically shows these subjects with the same age in order of increasing row number. We could however add a second or third argument to order() to order the rows by another variable in the case where there are repeated values of the first variable.

For example, the following code orders all the subjects in chol by age first, and then for any subjects that are the same age, they will then be sorted in order of increasing ldl.

chol[order(chol$age, chol$ldl), ]

	id	ldl	hdl	trig	age	gender	smoke
4	P117	162	37	139	30	female	ex-smoker
9	P725	149	49	NA	32	female	no
2	P215	196	36	92	32	female	no
1	P912	175	25	148	39	female	no
12	P843	174	117	168	41	female	no
3	P063	139	65	NA	42	male	NA
5	P613	140	117	59	42	female	ex-smoker
8	P004	165	63	120	48	male	current
11	P103	169	59	67	48	female	no
10	P901	95	54	157	55	female	ex-smoker
7	P951	82	81	NA	57	male	no
6	P332	147	51	126	65	female	ex-smoker
13	P753	91	52	146	69	female	current

Write code to sort the observations from education in decreasing order of the number of pupils.

We need to include the argument decreasing = TRUE within the function order() so that the observations are ordered from largest number of pupils to the smallest number of pupils. We can use the order() function within square brackets to show all variables in the data frame in order of decreasing number of pupils.

education[order(education$pupils, decreasing = TRUE), ]

	year	level	schools	teachers	pupils
5	2017	Primary	2019	24477	400312
8	2018	Primary	2012	NA	400276
11	2019	Primary	2004	25027	398794
2	2016	Primary	2031	23920	396697
14	2020	Primary	2005	25651	393957
17	2021	Primary	2001	25807	390313
20	2022	Primary	1994	25451	388920
21	2022	Secondary	358	24874	309133
18	2021	Secondary	357	24782	306811
15	2020	Secondary	357	24077	300954
12	2019	Secondary	358	23522	292063
9	2018	Secondary	357	23317	286152
6	2017	Secondary	360	23150	281993
3	2016	Secondary	359	22957	280983
1	2016	ELC	2514	985	96961
7	2018	ELC	2544	821	96549
10	2019	ELC	2576	798	96375
4	2017	ELC	2532	921	95893
19	2022	ELC	2606	734	92615
16	2021	ELC	2630	NA	91603
13	2020	ELC	2587	729	90126

Look at Section 1.11.3 Sorting a Data Frame by One or More of Its Columns of Probability and Statistics with R to learn more about sorting and ordering data sets.

2.4 Subsetting

When we want to only view particular elements of a data frame, this is known as subsetting the data. This is useful if you're dealing with extremely large data sets and only want to analyse female subjects, or subjects who are all from the same country for example. Subsetting the data means that you would extract only these subjects that you are actually interested in.

A useful function for extracting elements of a data frame is the function subset() (which we first saw in Lab 1). This allows us to extract the elements of a data frame which meet particular conditions. The arguments that subset() takes are:

x =: this is the data frame that we want to extract particular elements from.
subset =: this is a logical statement which determines the elements to keep in the subsetted data frame.
select =: this shows the column or columns from the data frame which the logical statement should be applied to.

For example, if we wanted to view the subjects in chol who have an LDL of greater than 170, then we can use the following code.

subset(x = chol, subset = ldl > 170, select = ldl)

	ldl
1	175
2	196
12	174

This shows us that there are three patients with LDL greater than 170 (subset = ldl > 170) and we can also see the values of LDL for these patients (select = ldl).

If we wanted to see the values of the other variables in the data frame for only those patients with LDL greater than 170, then we can simply leave out the select = argument.

subset(x = chol, subset = ldl > 170)

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
12	P843	174	117	168	41	female	no

Note that it is also possible to subset a data frame using logical statements within square brackets, [ ]. We could return the same output as above by indexing the chol using the following code.

chol[chol$ldl > 170, ]

	id	ldl	hdl	trig	age	gender	smoke
1	P912	175	25	148	39	female	no
2	P215	196	36	92	32	female	no
12	P843	174	117	168	41	female	no

Write some code to subset education to show the number of schools that have a collective total of more than 310,000 pupils in the years 2020, 2021 or 2022.

The data frame that we want to subset is education, so this is what we'll feed in to the argument x =.

Since the question asks us to look for a collective total of more than 310,000 pupils, this means we want to only see the rows where the value for pupils is greater than 310,000. We also only want to see rows from the years 2020, 2021 or 2022. Because year is a factor, we need to specify each level that we are interested in. This means that we are looking for rows in which pupils > 310000 AND year == "2020" or year == "2021" or year == "2021". This is quite a lengthy logical statement in the following code.

The question also asks us to only show the number of schools for which these statements are true i.e. the column schools. To do this, we simply feed this variable to the select = argument.

subset(x = education,
       subset = pupils > 310000 & year == "2020" | 
         pupils > 310000 & year == "2021" | 
         pupils > 310000 & year == "2022",
       select = schools)

A way we can shorten the logical statement in the subset = argument is to use the operator %in%. This will search for values in a vector and return the rows in which any of these values appear.

subset(x = education,
       subset = pupils > 310000 & year %in% c("2020", "2021", "2022"),
       select = schools)

	schools
14	2005
17	2001
20	1994

You can read more about subsetting data frames in Section 1.12 Using Logical Operators with Data Frames in Probability and Statistics with R.

2.5 Summarising data

Data sets will often contain a lot of information which is not easy to interpret at a glance. It is therefore useful to be able to summarise the data they contain, in appropriate ways for each different type of variable.

One of the simplest functions to help summarise a data frame is the summary() function.

summary(chol)

      id                 ldl             hdl           trig      
 Length:13          Min.   : 82.0   Min.   : 25   Min.   : 59.0  
 Class :character   1st Qu.:139.0   1st Qu.: 49   1st Qu.: 99.0  
 Mode  :character   Median :149.0   Median : 54   Median :132.5  
                    Mean   :144.9   Mean   : 62   Mean   :122.2  
                    3rd Qu.:169.0   3rd Qu.: 65   3rd Qu.:147.5  
                    Max.   :196.0   Max.   :117   Max.   :168.0  
                                                  NA's   :3      
      age           gender         smoke  
 Min.   :30.00   female:10   no       :6  
 1st Qu.:39.00   male  : 3   ex-smoker:4  
 Median :42.00               current  :2  
 Mean   :46.15               NA's     :1  
 3rd Qu.:55.00                            
 Max.   :69.00

The output from summary() shows information for each column in the data frame you provide as the argument. For numerical variables, we are shown summary statistics such as the minimum value, the mean or the 3rd quartile. For factor variables, we are shown how many observations there are in each level of the factor. If there are any NA values in a column, the total number of these will also be shown for each variable.

When a data frame contains categorical variables, a neater way to summarise the counts of the different levels is in contingency tables. These show counts of how many times each level of a categorical variable appeared in the data frame. The function to create contingency tables in R is table(). The only argument that table() needs is the factor variable you want to summarise.

For example, we can quickly show counts of how many subjects in chol fall into each of the three levels of the smoke variable using the following code.

table(chol$smoke)


       no ex-smoker   current 
        6         4         2

If we wanted to further split these counts by the variable gender, then we simply add this as a second argument to the table() function.

smoke_counts <- table(chol$smoke, chol$gender)
smoke_counts

           
            female male
  no             5    1
  ex-smoker      4    0
  current        1    1

We can easily compute the sums of rows or columns in a table using the function margin.table(). Here we need to provide margin.table() with the following arguments:

x =: this is the table you want to sum over.
margin =: this tells R whether you want to sum over rows (set the value to 1), or columns (set the value to 2).

For example, we can use the table smoke_counts, created above, to count the number of female and male subjects for whom we know their smoking status, using margin.table().

margin.table(x = smoke_counts, margin = 2)


female   male 
    10      2

Another useful function to use with tables is prop.table(). This takes the same arguments as margin.table() but shows row or column proportions, rather than sums.

For example, to calculate the proportions of current smokers, ex-smokers and non-smokers that are female and male, we can use the following code.

prop.table(x = smoke_counts, margin = 1)

           
               female      male
  no        0.8333333 0.1666667
  ex-smoker 1.0000000 0.0000000
  current   0.5000000 0.5000000

Suppose you wanted to calculate summary statistics for one variable in a data frame, but have it split by the levels of a different categorical variable.

The function in R which calculates a summary statistic for one numeric variable, split by the levels of a factor is tapply(). The arguments that tapply() can take are as follows:

X =: this is the numeric variable that you want to apply the function calculating some summary statistic to.
INDEX =: this is a list containing the categorical variable (or variables) you want to split the calculation of the summary statistic across.
FUN =: this is the name of the function you want to apply to the numeric variable. Examples include mean, median, max, min, mode, sd etc.

In the case where we are interested in knowing the mean HDL for subjects who were current smokers, subjects who were ex-smokers and subjects who were non-smokers, we can use tapply().

tapply(X = chol$hdl, INDEX = list(chol$smoke), FUN = mean)

       no ex-smoker   current 
 61.16667  64.75000  57.50000

We can see, for example, that the mean HDL for non-smokers is 61.17.

The list provided to the INDEX = argument can contain more than one categorical variable. For example, we can calculate the mean HDL of females and males for each level of the smoke variable using the following code.

tapply(X = chol$hdl, INDEX = list(chol$smoke, chol$gender), FUN = mean)

          female male
no         57.20   81
ex-smoker  64.75   NA
current    52.00   63

Now we can see that the mean HDL for females who are non-smokers is 57.20. The mean HDL for males who are ex-smokers is NA because there are no males included in chol who are ex-smokers.

What is the mean total number of teachers in primary schools across all years?

In order to find this value we want to use the function tapply(). teachers is the column we want to calculate the mean for, but make sure to split this by the different levels in the level column.

teachers contains some NA values, which when passed to the function mean will return another NA value unless you provide to tapply() the additional argument na.rm = TRUE. This tells R to ignore the NA values when calculating the mean and only use those rows which have a numerical value.

tapply(X = education$teachers, INDEX = list(education$level), FUN = mean, na.rm = TRUE)

       ELC    Primary  Secondary 
  831.3333 25055.5000 23811.2857

To read more on creating tables and summarising data in R, see Sections 1.13 Tables and 1.14 Summarizing Functions in Probability and Statistics with R.

2.6 Creating variables

In the case where we have another vector or data frame that we wish to join to an existing one, we can do this using one of the functions cbind() or rbind().

cbind() combines the vectors or data frames together by making additional columns, whereas
rbind() combines them by adding the new vector or data frame as additional rows.

Let's see an example to understand how this works. The file measurements.csv contains information on the heights and weights of all 13 patients in the original chol data frame. We can begin by reading it in to the Environment tab using the following code.

measurements <- read.csv(file = "measurements.csv")

We can then add measurements to chol as two additional columns and save the resulting data frame as chol_full using the code below.

chol_full <- cbind(chol, measurements)
head(chol_full)

id	ldl	hdl	trig	age	gender	smoke	weight	height
P912	175	25	148	39	female	no	90.77	1.69
P215	196	36	92	32	female	no	75.06	1.75
P063	139	65	NA	42	male	NA	73.99	1.84
P117	162	37	139	30	female	ex-smoker	86.25	1.83
P613	140	117	59	42	female	ex-smoker	76.95	1.81
P332	147	51	126	65	female	ex-smoker	57.66	1.75

Another way to easily create a new variable in a data frame is to use the $ operator. We can simply add the name of the data frame to the left of $ and our new variable name to the right. Then we can set this variable to be any pre-existing vector, or calculate a new vector based on variables from the data frame.

For example, if we wanted to create a new variable, bmi, in chol which shows the BMI of each patient, then we can use the following code.

chol_full$bmi <- chol_full$weight/(chol_full$height)^2
head(chol_full)

id	ldl	hdl	trig	age	gender	smoke	weight	height	bmi
P912	175	25	148	39	female	no	90.77	1.69	31.78110
P215	196	36	92	32	female	no	75.06	1.75	24.50939
P063	139	65	NA	42	male	NA	73.99	1.84	21.85432
P117	162	37	139	30	female	ex-smoker	86.25	1.83	25.75473
P613	140	117	59	42	female	ex-smoker	76.95	1.81	23.48829
P332	147	51	126	65	female	ex-smoker	57.66	1.75	18.82776

In the education data frame, create a new variable called ratio which calculates the pupil to teacher ratio in each level of education. That is,

\[\mbox{ratio}=\textstyle\frac{\mbox{puils}}{\mbox{teachers}}\]

education$ratio <- education$pupils/education$teachers

Now suppose that information on a fourteenth subject is known but has not been included in the original chol data frame. This data is shown in Table 2.1 below.

Table 2.1: Cholesterol data for the fourteenth patient.
id	ldl	hdl	trig	age	gender	smoke	weight	height
P461	148	78	120	41	male	current	84.05	1.79

In this case we can add the new subject as an additional row using the rbind() function.

First, we need to create a data frame containing the information for this subject. In order for us to add this data frame as a row to chol_full, it needs to have the same number of variables. Therefore, we also need to calculate the BMI for this subject and call it bmi. We can do all this with the following code.

subject <- data.frame(id = "P461", ldl = 148, hdl = 78, trig = 120, age = 41, 
                      gender = "male", smoke = "current", weight = 84.05, height = 1.79)

subject$bmi <- subject$weight/(subject$height)^2

Now we can add this subject to chol_full using the code below.

chol_full <- rbind(chol_full, subject)
tail(chol_full)

	id	ldl	hdl	trig	age	gender	smoke	weight	height	bmi
9	P725	149	49	NA	32	female	no	65.37	1.67	23.43935
10	P901	95	54	157	55	female	ex-smoker	80.34	1.62	30.61271
11	P103	169	59	67	48	female	no	74.90	1.61	28.89549
12	P843	174	117	168	41	female	no	63.78	1.77	20.35813
13	P753	91	52	146	69	female	current	71.58	1.62	27.27481
14	P461	148	78	120	41	male	current	84.05	1.79	26.23202

Note that tail() is a function very similar to head(), but rather than showing the first 6 rows by default, it shows the last 6.

Sections 1.11.2 Creating New Variables in a Data Frame and 1.13 Tables of Probability and Statistics with R describe how to create new variables.

See Appendix A.1 to learn how to create a new variable in a data frame by breaking an exisiting variable into different levels.