2 Working With Data
2.1 Checking variable types
Once a data frame has been read into R, it is always a good idea to examine its contents using the str() function to see the structure of the data object. We have already seen the str() function in Lab 2 but, as a reminder, it shows us the type of vector each column in a data frame is saved as.
Running the following code tells us that the four variables ldl, hdl, trig and age are all integer vectors and that id, gender and smoke are character vectors.
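The call itself is not reproduced here. As a minimal sketch, assuming chol is rebuilt by hand from the values printed in Section 2.2 (in the lab it would be read from a course file), the check could look like this:

```r
# Hand-built copy of chol, typed in from the table printed in Section 2.2.
# This is an assumption for illustration; the lab reads chol from a file.
chol <- data.frame(
  id     = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
             "P004", "P725", "P901", "P103", "P843", "P753"),
  ldl    = as.integer(c(175, 196, 139, 162, 140, 147, 82, 165, 149, 95, 169, 174, 91)),
  hdl    = as.integer(c(25, 36, 65, 37, 117, 51, 81, 63, 49, 54, 59, 117, 52)),
  trig   = as.integer(c(148, 92, NA, 139, 59, 126, NA, 120, NA, 157, 67, 168, 146)),
  age    = as.integer(c(39, 32, 42, 30, 42, 65, 57, 48, 32, 55, 48, 41, 69)),
  gender = c("female", "female", "male", "female", "female", "female", "male",
             "male", "female", "female", "female", "female", "female"),
  smoke  = c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
             "current", "no", "ex-smoker", "no", "no", "current"),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

# Show the type of vector each column is saved as
str(chol)
```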
'data.frame': 13 obs. of 7 variables:
$ id : chr "P912" "P215" "P063" "P117" ...
$ ldl : int 175 196 139 162 140 147 82 165 149 95 ...
$ hdl : int 25 36 65 37 117 51 81 63 49 54 ...
$ trig : int 148 92 NA 139 59 126 NA 120 NA 157 ...
$ age : int 39 32 42 30 42 65 57 48 32 55 ...
$ gender: chr "female" "female" "male" "female" ...
$ smoke : chr "no" "no" NA "ex-smoker" ...
Because gender only takes the values "female" or "male" in this case, and smoke is categorised into three levels, "no", "ex-smoker" and "current", it makes sense to treat both these variables as factors instead of character vectors.
We can use what we learned in Lab 2 to change these variables into factors.
chol$gender <- factor(x = chol$gender, levels = c("female", "male"))
chol$smoke <- factor(x = chol$smoke, levels = c("no", "ex-smoker", "current"))
Now using str() to check the type of vector each column is saved as shows us that gender and smoke are both now factors.
'data.frame': 13 obs. of 7 variables:
$ id : chr "P912" "P215" "P063" "P117" ...
$ ldl : int 175 196 139 162 140 147 82 165 149 95 ...
$ hdl : int 25 36 65 37 117 51 81 63 49 54 ...
$ trig : int 148 92 NA 139 59 126 NA 120 NA 157 ...
$ age : int 39 32 42 30 42 65 57 48 32 55 ...
$ gender: Factor w/ 2 levels "female","male": 1 1 2 1 1 1 2 2 1 1 ...
$ smoke : Factor w/ 3 levels "no","ex-smoker",..: 1 1 NA 2 2 2 1 3 1 2 ...
What type of variable is schools saved as in the education data frame?
Using the str() function shows us that schools is saved as an integer variable.
'data.frame': 21 obs. of 5 variables:
$ year : int 2016 2016 2016 2017 2017 2017 2018 2018 2018 2019 ...
$ level : chr "ELC" "Primary" "Secondary" "ELC" ...
$ schools : int 2514 2031 359 2532 2019 360 2544 2012 357 2576 ...
$ teachers: int 985 23920 22957 921 24477 23150 821 NA 23317 798 ...
$ pupils : int 96961 396697 280983 95893 400312 281993 96549 400276 286152 96375 ...
Write some code to change the variables year and level in education to be factor variables.
Refer to Section 1.11 Working with Data of Probability and Statistics with R to learn more about checking the setup of a data set.
2.2 Dealing with NA
Data sets will often have missing values for a variety of reasons: human error, information that was not disclosed, or a failed experiment, for example. When data is correctly read into R these unknown values will be denoted by NA. In order to conduct analysis or perform calculations on your data, you may wish to remove these missing values from your data set. Always think about whether this is an appropriate thing to do.
One way in which we can remove missing values from a data set is to use the function na.omit(). This will return the data frame with any 'incomplete cases' removed. That is, any rows that have NA as the value for any variable will be removed from the data frame.
Looking at chol, we can see that there are missing values in rows 3, 7 and 9.
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
3 | P063 | 139 | 65 | NA | 42 | male | NA |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
7 | P951 | 82 | 81 | NA | 57 | male | no |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
9 | P725 | 149 | 49 | NA | 32 | female | no |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
If we run the following code, then these rows are removed from the data frame and we are left with only the 'complete cases'.
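The call is not shown at this point. A minimal sketch, assuming a hand-built, cut-down copy of chol that keeps only id, trig and smoke (the two columns containing NA plus the identifier):

```r
# Cut-down copy of chol (assumption: values typed in from the table above).
chol <- data.frame(
  id    = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
            "P004", "P725", "P901", "P103", "P843", "P753"),
  trig  = c(148, 92, NA, 139, 59, 126, NA, 120, NA, 157, 67, 168, 146),
  smoke = c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
            "current", "no", "ex-smoker", "no", "no", "current"),
  stringsAsFactors = FALSE
)

# Drop every row that has NA in at least one column
chol_complete <- na.omit(chol)
chol_complete
```

With the full chol data frame the same call keeps the same ten rows, with all seven columns.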
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
Note that na.omit() preserves the original row labels. This means that there are no rows labelled 3, 7 or 9 in the resulting data frame because they have been completely removed.
complete.cases() is another useful function that can be used to remove rows that have NA values. This returns a logical vector, the same length as the number of rows of the data frame, that indicates whether a row contains any NA values (FALSE), or whether it is 'complete' (TRUE).
[1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
[13] TRUE
Again we can see that the rows with missing values in chol are rows 3, 7 and 9 (since the third, seventh and ninth values in the output above are all FALSE). We can then use this logical vector to extract the rows which are complete from chol.
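Both steps can be sketched as follows, again assuming a hand-built copy of chol reduced to the two columns that contain missing values:

```r
# Cut-down copy of chol (assumption: only trig and smoke are kept).
chol <- data.frame(
  trig  = c(148, 92, NA, 139, 59, 126, NA, 120, NA, 157, 67, 168, 146),
  smoke = c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
            "current", "no", "ex-smoker", "no", "no", "current"),
  stringsAsFactors = FALSE
)

# TRUE for rows with no missing values, FALSE otherwise
keep <- complete.cases(chol)
keep

# Use the logical vector as a row index to keep only the complete rows
chol[keep, ]
```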
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
Here, na.omit() and complete.cases() have returned the same output.
Which rows in education have missing values?
Write code to remove all rows in education which contain NA.
In the case where we only want to know which entries of a vector or specific variable in a data frame are NA, we can use the function is.na(). For example, if missing values in the trig variable were not of concern but we wanted to identify missing values in the smoke column, we could use the following code.
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE
We can see that only the third row has the value NA for smoke, since the third element in the output from is.na() above is TRUE. In order to remove the row where smoke has a missing value, we can use the following code to index the chol data frame.
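A minimal sketch of the check and the indexing step, assuming a hand-built copy of chol that keeps only id and smoke:

```r
# Cut-down copy of chol (assumption: values typed in from the table in this section).
chol <- data.frame(
  id    = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
            "P004", "P725", "P901", "P103", "P843", "P753"),
  smoke = c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
            "current", "no", "ex-smoker", "no", "no", "current"),
  stringsAsFactors = FALSE
)

# TRUE where smoke is missing
smoke_missing <- is.na(chol$smoke)
smoke_missing

# Negate with ! so that rows with a known smoking status are kept
chol_smoke_known <- chol[!smoke_missing, ]
chol_smoke_known
```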
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
7 | P951 | 82 | 81 | NA | 57 | male | no |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
9 | P725 | 149 | 49 | NA | 32 | female | no |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
Note that we use ! in front of is.na() so that the logical vector returned has the value TRUE when values are complete and FALSE when values are missing, i.e. NA.
You can look at further examples of dealing with missing data in Section 1.11.1 Dealing with NA Values of Probability and Statistics with R.
2.3 Sorting data frames
When investigating your data sets, you may want to order the values of a particular variable in increasing or decreasing order. This is easily done using the sort() function.
For example, we can view the ages of all subjects in chol, in increasing order, using the code below.
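The call is not reproduced at this point. A minimal sketch, assuming chol is reduced to a hand-typed age column:

```r
# Ages of the 13 subjects, in their original row order (typed in from the chol table).
chol <- data.frame(age = c(39, 32, 42, 30, 42, 65, 57, 48, 32, 55, 48, 41, 69))

# Increasing order is the default
ages_up <- sort(chol$age)
ages_up

# decreasing = TRUE reverses the direction
ages_down <- sort(chol$age, decreasing = TRUE)
```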
[1] 30 32 32 39 41 42 42 48 48 55 57 65 69
Note that if we wanted to view these ages in decreasing order, we would add the argument decreasing = TRUE to the sort() function.
What is the largest value for pupils from the education data frame?
The downside of using sort() is that we can only see the values from one variable of a data frame. If instead we wanted to order all subjects in chol from the youngest to the oldest and still see the values of all the other variables, we can use the function order().
order() will return a vector showing which row has the smallest value, then the second smallest value and so on. For example, the following code shows us that the fourth subject in chol is the youngest and the thirteenth subject is the eldest.
[1] 4 2 9 1 12 3 5 8 11 10 7 6 13
We can then use this vector to index the full data frame chol and see all the variables for each subject at once.
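The two steps, computing the ordering and then using it as a row index, can be sketched as follows. The data frame is a hand-built copy of chol with only id and age, which is an assumption made to keep the example short:

```r
# Cut-down copy of chol (assumption: only id and age are kept).
chol <- data.frame(
  id  = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
          "P004", "P725", "P901", "P103", "P843", "P753"),
  age = c(39, 32, 42, 30, 42, 65, 57, 48, 32, 55, 48, 41, 69),
  stringsAsFactors = FALSE
)

# Row numbers, from youngest subject to oldest
age_order <- order(chol$age)
age_order

# Index the rows with that vector; leave the column index empty to keep all columns
chol_by_age <- chol[age_order, ]
chol_by_age
```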
id | ldl | hdl | trig | age | gender | smoke | |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
9 | P725 | 149 | 49 | NA | 32 | female | no |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
3 | P063 | 139 | 65 | NA | 42 | male | NA |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
7 | P951 | 82 | 81 | NA | 57 | male | no |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
In the output above, note that there are multiple subjects aged 32, 42 and 48. After ordering by age, R automatically shows these subjects with the same age in order of increasing row number. We could however add a second or third argument to order() to order the rows by another variable in the case where there are repeated values of the first variable.
For example, the following code orders all the subjects in chol by age first, and then for any subjects that are the same age, they will then be sorted in order of increasing ldl.
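A minimal sketch of ordering by two variables, assuming a hand-built copy of chol that keeps only id, ldl and age:

```r
# Cut-down copy of chol (assumption: values typed in from the chol table).
chol <- data.frame(
  id  = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
          "P004", "P725", "P901", "P103", "P843", "P753"),
  ldl = c(175, 196, 139, 162, 140, 147, 82, 165, 149, 95, 169, 174, 91),
  age = c(39, 32, 42, 30, 42, 65, 57, 48, 32, 55, 48, 41, 69),
  stringsAsFactors = FALSE
)

# Order by age first; ties in age are broken by increasing ldl
age_ldl_order <- order(chol$age, chol$ldl)
chol_by_age_ldl <- chol[age_ldl_order, ]
chol_by_age_ldl
```

The only change from ordering by age alone is that subject 9 (LDL 149) now comes before subject 2 (LDL 196) among the 32-year-olds; the other tied pairs were already in increasing LDL order.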
id | ldl | hdl | trig | age | gender | smoke | |
4 | P117 | 162 | 37 | 139 | 30 | female | ex-smoker |
9 | P725 | 149 | 49 | NA | 32 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
3 | P063 | 139 | 65 | NA | 42 | male | NA |
5 | P613 | 140 | 117 | 59 | 42 | female | ex-smoker |
8 | P004 | 165 | 63 | 120 | 48 | male | current |
11 | P103 | 169 | 59 | 67 | 48 | female | no |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker |
7 | P951 | 82 | 81 | NA | 57 | male | no |
6 | P332 | 147 | 51 | 126 | 65 | female | ex-smoker |
13 | P753 | 91 | 52 | 146 | 69 | female | current |
Write code to sort the observations from education in decreasing order of the number of pupils.
We need to include the argument decreasing = TRUE within the function order() so that the observations are ordered from largest number of pupils to the smallest number of pupils. We can use the order() function within square brackets to show all variables in the data frame in order of decreasing number of pupils.
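One way to write this, sketched under the assumption that education is rebuilt by hand with only the year, level and pupils columns (values taken from the output in this section):

```r
# Cut-down copy of education (assumption: schools and teachers are left out).
education <- data.frame(
  year   = rep(2016:2022, each = 3),
  level  = rep(c("ELC", "Primary", "Secondary"), times = 7),
  pupils = c(96961, 396697, 280983, 95893, 400312, 281993, 96549, 400276, 286152,
             96375, 398794, 292063, 90126, 393957, 300954, 91603, 390313, 306811,
             92615, 388920, 309133),
  stringsAsFactors = FALSE
)

# Largest number of pupils first
education_sorted <- education[order(education$pupils, decreasing = TRUE), ]
head(education_sorted)
```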
year | level | schools | teachers | pupils | |
5 | 2017 | Primary | 2019 | 24477 | 400312 |
8 | 2018 | Primary | 2012 | NA | 400276 |
11 | 2019 | Primary | 2004 | 25027 | 398794 |
2 | 2016 | Primary | 2031 | 23920 | 396697 |
14 | 2020 | Primary | 2005 | 25651 | 393957 |
17 | 2021 | Primary | 2001 | 25807 | 390313 |
20 | 2022 | Primary | 1994 | 25451 | 388920 |
21 | 2022 | Secondary | 358 | 24874 | 309133 |
18 | 2021 | Secondary | 357 | 24782 | 306811 |
15 | 2020 | Secondary | 357 | 24077 | 300954 |
12 | 2019 | Secondary | 358 | 23522 | 292063 |
9 | 2018 | Secondary | 357 | 23317 | 286152 |
6 | 2017 | Secondary | 360 | 23150 | 281993 |
3 | 2016 | Secondary | 359 | 22957 | 280983 |
1 | 2016 | ELC | 2514 | 985 | 96961 |
7 | 2018 | ELC | 2544 | 821 | 96549 |
10 | 2019 | ELC | 2576 | 798 | 96375 |
4 | 2017 | ELC | 2532 | 921 | 95893 |
19 | 2022 | ELC | 2606 | 734 | 92615 |
16 | 2021 | ELC | 2630 | NA | 91603 |
13 | 2020 | ELC | 2587 | 729 | 90126 |
Look at Section 1.11.3 Sorting a Data Frame by One or More of Its Columns of Probability and Statistics with R to learn more about sorting and ordering data sets.
2.4 Subsetting
When we want to only view particular elements of a data frame, this is known as subsetting the data. This is useful if you're dealing with extremely large data sets and only want to analyse female subjects, or subjects who are all from the same country for example. Subsetting the data means that you would extract only these subjects that you are actually interested in.
A useful function for extracting elements of a data frame is the function subset() (which we first saw in Lab 1). This allows us to extract the elements of a data frame which meet particular conditions. The arguments that subset() takes are:
x = : this is the data frame that we want to extract particular elements from.
subset = : this is a logical statement which determines the rows to keep in the subsetted data frame.
select = : this gives the column or columns of the data frame to keep in the output. If it is left out, all columns are returned.
For example, if we wanted to view the subjects in chol who have an LDL of greater than 170, then we can use the following code.
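The call is not shown at this point. A minimal sketch, assuming a hand-built copy of chol that keeps only id and ldl:

```r
# Cut-down copy of chol (assumption: values typed in from the chol table).
chol <- data.frame(
  id  = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
          "P004", "P725", "P901", "P103", "P843", "P753"),
  ldl = c(175, 196, 139, 162, 140, 147, 82, 165, 149, 95, 169, 174, 91),
  stringsAsFactors = FALSE
)

# Keep rows with LDL above 170 and show only the ldl column
high_ldl <- subset(x = chol, subset = ldl > 170, select = ldl)
high_ldl

# Leaving out select = keeps every column for the same rows
high_ldl_all <- subset(x = chol, subset = ldl > 170)
```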
ldl | |
1 | 175 |
2 | 196 |
12 | 174 |
This shows us that there are three patients with LDL greater than 170 (subset = ldl > 170) and we can also see the values of LDL for these patients (select = ldl).
If we wanted to see the values of the other variables in the data frame for only those patients with LDL greater than 170, then we can simply leave out the select = argument.
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
Note that it is also possible to subset a data frame using logical statements within square brackets, [ ]. We could return the same output as above by indexing the chol data frame using the following code.
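A minimal sketch of the square-bracket version, assuming the same kind of hand-built, cut-down copy of chol (id and ldl only):

```r
# Cut-down copy of chol (assumption: values typed in from the chol table).
chol <- data.frame(
  id  = c("P912", "P215", "P063", "P117", "P613", "P332", "P951",
          "P004", "P725", "P901", "P103", "P843", "P753"),
  ldl = c(175, 196, 139, 162, 140, 147, 82, 165, 149, 95, 169, 174, 91),
  stringsAsFactors = FALSE
)

# The logical statement goes in the row position; the empty column position keeps all columns
high_ldl_brackets <- chol[chol$ldl > 170, ]
high_ldl_brackets
```

One difference worth knowing: if the condition evaluates to NA for some row, subset() drops that row, whereas square-bracket indexing returns a row filled with NA. The ldl column has no missing values, so the two approaches agree here.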
id | ldl | hdl | trig | age | gender | smoke | |
1 | P912 | 175 | 25 | 148 | 39 | female | no |
2 | P215 | 196 | 36 | 92 | 32 | female | no |
12 | P843 | 174 | 117 | 168 | 41 | female | no |
Write some code to subset education to show the number of schools that have a collective total of more than 310,000 pupils in the years 2020, 2021 or 2022.
The data frame that we want to subset is education, so this is what we'll feed in to the argument x =.
Since the question asks us to look for a collective total of more than 310,000 pupils, this means we want to only see the rows where the value for pupils is greater than 310,000. We also only want to see rows from the years 2020, 2021 or 2022. Because year is a factor, we need to specify each level that we are interested in. This means that we are looking for rows in which pupils > 310000 AND year == "2020" or year == "2021" or year == "2022". This makes for quite a lengthy logical statement in the following code.
The question also asks us to only show the number of schools for which these statements are true, i.e. the column schools. To do this, we simply feed this variable to the select = argument.
subset(x = education,
subset = pupils > 310000 & year == "2020" |
pupils > 310000 & year == "2021" |
pupils > 310000 & year == "2022",
select = schools)
A way we can shorten the logical statement in the subset = argument is to use the operator %in%. This checks, for each row, whether the value of year matches any of the values in the vector on the right-hand side, and keeps the rows where it does.
subset(x = education,
subset = pupils > 310000 & year %in% c("2020", "2021", "2022"),
select = schools)
schools | |
14 | 2005 |
17 | 2001 |
20 | 1994 |
You can read more about subsetting data frames in Section 1.12 Using Logical Operators with Data Frames in Probability and Statistics with R.
2.5 Summarising data
Data sets will often contain a lot of information which is not easy to interpret at a glance. It is therefore useful to be able to summarise the data they contain, in appropriate ways for each different type of variable.
One of the simplest functions to help summarise a data frame is the summary() function.
id ldl hdl trig
Length:13 Min. : 82.0 Min. : 25 Min. : 59.0
Class :character 1st Qu.:139.0 1st Qu.: 49 1st Qu.: 99.0
Mode :character Median :149.0 Median : 54 Median :132.5
Mean :144.9 Mean : 62 Mean :122.2
3rd Qu.:169.0 3rd Qu.: 65 3rd Qu.:147.5
Max. :196.0 Max. :117 Max. :168.0
NA's :3
age gender smoke
Min. :30.00 female:10 no :6
1st Qu.:39.00 male : 3 ex-smoker:4
Median :42.00 current :2
Mean :46.15 NA's :1
3rd Qu.:55.00
Max. :69.00
The output from summary() shows information for each column in the data frame you provide as the argument. For numerical variables, we are shown summary statistics such as the minimum value, the mean or the 3rd quartile. For factor variables, we are shown how many observations there are in each level of the factor. If there are any NA values in a column, the total number of these will also be shown for each variable.
When a data frame contains categorical variables, a neater way to summarise the counts of the different levels is in contingency tables. These show counts of how many times each level of a categorical variable appeared in the data frame. The function to create contingency tables in R is table(). The only argument that table() needs is the factor variable you want to summarise.
For example, we can quickly show counts of how many subjects in chol fall into each of the three levels of the smoke variable using the following code.
no ex-smoker current
6 4 2
If we wanted to further split these counts by the variable gender, then we simply add this as a second argument to the table() function.
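The one-way and two-way tables can be sketched as follows. The data frame is a hand-built copy of chol with only gender and smoke, and the name smoke_counts for the two-way table is taken from the text that follows:

```r
# Cut-down copy of chol (assumption: values typed in from the chol table).
chol <- data.frame(
  gender = c("female", "female", "male", "female", "female", "female", "male",
             "male", "female", "female", "female", "female", "female"),
  smoke  = c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
             "current", "no", "ex-smoker", "no", "no", "current"),
  stringsAsFactors = FALSE
)
# Convert to factors so the levels appear in the intended order
chol$gender <- factor(x = chol$gender, levels = c("female", "male"))
chol$smoke  <- factor(x = chol$smoke, levels = c("no", "ex-smoker", "current"))

# Counts for each level of smoke (NA values are left out by default)
table(chol$smoke)

# Counts split by smoke (rows) and gender (columns)
smoke_counts <- table(chol$smoke, chol$gender)
smoke_counts
```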
female male
no 5 1
ex-smoker 4 0
current 1 1
We can easily compute the sums of rows or columns in a table using the function margin.table(). Here we need to provide margin.table() with the following arguments:
x = : this is the table you want to sum over.
margin = : this tells R whether you want row totals (set the value to 1) or column totals (set the value to 2).
For example, we can use the table smoke_counts, created above, to count the number of female and male subjects for whom we know their smoking status, using margin.table().
female male
10 2
Another useful function to use with tables is prop.table(). This takes the same arguments as margin.table() but shows row or column proportions, rather than sums.
For example, to calculate the proportions of current smokers, ex-smokers and non-smokers that are female and male, we can use the following code.
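Both helpers can be sketched together. The table smoke_counts is rebuilt here from a hand-typed copy of the gender and smoke columns, which is an assumption made so the example runs on its own:

```r
# Rebuild the two-way table from hand-typed values (assumption).
gender <- factor(c("female", "female", "male", "female", "female", "female", "male",
                   "male", "female", "female", "female", "female", "female"),
                 levels = c("female", "male"))
smoke  <- factor(c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
                   "current", "no", "ex-smoker", "no", "no", "current"),
                 levels = c("no", "ex-smoker", "current"))
smoke_counts <- table(smoke, gender)

# margin = 2 gives column totals: subjects of each gender with known smoking status
gender_totals <- margin.table(x = smoke_counts, margin = 2)
gender_totals

# margin = 1 gives row proportions: within each smoking level, the share of each gender
smoke_props <- prop.table(x = smoke_counts, margin = 1)
smoke_props
```

Because the proportions are taken within rows, each row of smoke_props sums to 1.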
female male
no 0.8333333 0.1666667
ex-smoker 1.0000000 0.0000000
current 0.5000000 0.5000000
Suppose you wanted to calculate summary statistics for one variable in a data frame, but have it split by the levels of a different categorical variable.
The function in R which calculates a summary statistic for one numeric variable, split by the levels of a factor, is tapply(). The arguments that tapply() can take are as follows:
X = : this is the numeric variable that you want to apply the function calculating some summary statistic to.
INDEX = : this is a list containing the categorical variable (or variables) you want to split the calculation of the summary statistic across.
FUN = : this is the name of the function you want to apply to the numeric variable. Examples include mean.
In the case where we are interested in knowing the mean HDL for subjects who were current smokers, subjects who were ex-smokers and subjects who were non-smokers, we can use tapply().
no ex-smoker current
61.16667 64.75000 57.50000
We can see, for example, that the mean HDL for non-smokers is 61.17.
The list provided to the INDEX = argument can contain more than one categorical variable. For example, we can calculate the mean HDL of females and males for each level of the smoke variable using the following code.
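A minimal sketch of both tapply() calls, one factor and then two, assuming a hand-built copy of chol that keeps only hdl, gender and smoke:

```r
# Cut-down copy of chol (assumption: values typed in from the chol table).
chol <- data.frame(
  hdl    = c(25, 36, 65, 37, 117, 51, 81, 63, 49, 54, 59, 117, 52),
  gender = factor(c("female", "female", "male", "female", "female", "female", "male",
                    "male", "female", "female", "female", "female", "female"),
                  levels = c("female", "male")),
  smoke  = factor(c("no", "no", NA, "ex-smoker", "ex-smoker", "ex-smoker", "no",
                    "current", "no", "ex-smoker", "no", "no", "current"),
                  levels = c("no", "ex-smoker", "current"))
)

# Mean HDL within each level of smoke
hdl_by_smoke <- tapply(X = chol$hdl, INDEX = list(chol$smoke), FUN = mean)
hdl_by_smoke

# Mean HDL within each combination of smoke (rows) and gender (columns)
hdl_by_smoke_gender <- tapply(X = chol$hdl,
                              INDEX = list(chol$smoke, chol$gender),
                              FUN = mean)
hdl_by_smoke_gender
```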
female male
no 57.20 81
ex-smoker 64.75 NA
current 52.00 63
Now we can see that the mean HDL for females who are non-smokers is 57.20. The mean HDL for males who are ex-smokers is NA because there are no males included in chol who are ex-smokers.
What is the mean total number of teachers in primary schools across all years?
In order to find this value we want to use the function tapply(). The column teachers is the one we want to calculate the mean for, but make sure to split this by the different levels in the level variable. The teachers column contains some NA values, which when passed to the function mean will return another NA value unless you provide to tapply() the additional argument na.rm = TRUE. This tells R to ignore the NA values when calculating the mean and only use those rows which have a numerical value.
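One way to write this, sketched under the assumption that education is rebuilt by hand with only the level and teachers columns:

```r
# Cut-down copy of education (assumption: values typed in from the outputs in Section 2.3).
education <- data.frame(
  level    = rep(c("ELC", "Primary", "Secondary"), times = 7),
  teachers = c(985, 23920, 22957, 921, 24477, 23150, 821, NA, 23317,
               798, 25027, 23522, 729, 25651, 24077, NA, 25807, 24782,
               734, 25451, 24874),
  stringsAsFactors = FALSE
)

# na.rm = TRUE is passed on to mean(), so NA values are skipped within each level
mean_teachers <- tapply(X = education$teachers,
                        INDEX = list(education$level),
                        FUN = mean,
                        na.rm = TRUE)
mean_teachers
```

Without na.rm = TRUE, the ELC and Primary means would both be NA, because each of those levels has one missing value for teachers.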
ELC Primary Secondary
831.3333 25055.5000 23811.2857
To read more on creating tables and summarising data in R, see Sections 1.13 Tables and 1.14 Summarizing Functions in Probability and Statistics with R.
2.6 Creating variables
In the case where we have another vector or data frame that we wish to join to an existing one, we can do this using one of the functions cbind() or rbind(). cbind() combines the vectors or data frames together by making additional columns, whereas rbind() combines them by adding the new vector or data frame as additional rows.
Let's see an example to understand how this works. The file measurements.csv contains information on the heights and weights of all 13 patients in the original chol data frame. We can begin by reading it in to the Environment tab using the following code.
We can then add measurements to chol as two additional columns and save the resulting data frame as chol_full using the code below.
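A minimal sketch of the cbind() step. Since measurements.csv is not available here, both data frames are typed in by hand for the first six patients only, which is an assumption made for illustration:

```r
# First six rows of chol, cut down to id and ldl (hand-typed assumption).
chol <- data.frame(
  id  = c("P912", "P215", "P063", "P117", "P613", "P332"),
  ldl = c(175, 196, 139, 162, 140, 147),
  stringsAsFactors = FALSE
)

# Stand-in for the contents of measurements.csv, same six patients in the same order.
measurements <- data.frame(
  weight = c(90.77, 75.06, 73.99, 86.25, 76.95, 57.66),
  height = c(1.69, 1.75, 1.84, 1.83, 1.81, 1.75)
)

# Bind the columns side by side; rows are matched by position, not by id
chol_full <- cbind(chol, measurements)
head(chol_full)
```

Because cbind() matches rows purely by position, both data frames need the same number of rows in the same order.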
id | ldl | hdl | trig | age | gender | smoke | weight | height |
P912 | 175 | 25 | 148 | 39 | female | no | 90.77 | 1.69 |
P215 | 196 | 36 | 92 | 32 | female | no | 75.06 | 1.75 |
P063 | 139 | 65 | NA | 42 | male | NA | 73.99 | 1.84 |
P117 | 162 | 37 | 139 | 30 | female | ex-smoker | 86.25 | 1.83 |
P613 | 140 | 117 | 59 | 42 | female | ex-smoker | 76.95 | 1.81 |
P332 | 147 | 51 | 126 | 65 | female | ex-smoker | 57.66 | 1.75 |
Another way to easily create a new variable in a data frame is to use the $ operator. We can simply add the name of the data frame to the left of $ and our new variable name to the right. Then we can set this variable to be any pre-existing vector, or calculate a new vector based on variables from the data frame.
For example, if we wanted to create a new variable, bmi, in chol_full which shows the BMI of each patient, then we can use the following code.
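The calculation can be sketched as follows, using the usual definition of BMI as weight in kilograms divided by height in metres squared. The data frame is a hand-typed stand-in for the first six rows of chol_full:

```r
# Stand-in for the first six rows of chol_full (hand-typed assumption).
chol_full <- data.frame(
  id     = c("P912", "P215", "P063", "P117", "P613", "P332"),
  weight = c(90.77, 75.06, 73.99, 86.25, 76.95, 57.66),
  height = c(1.69, 1.75, 1.84, 1.83, 1.81, 1.75),
  stringsAsFactors = FALSE
)

# Create the new column with $; the calculation is applied to every row at once
chol_full$bmi <- chol_full$weight / (chol_full$height)^2
head(chol_full)
```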
id | ldl | hdl | trig | age | gender | smoke | weight | height | bmi |
P912 | 175 | 25 | 148 | 39 | female | no | 90.77 | 1.69 | 31.78110 |
P215 | 196 | 36 | 92 | 32 | female | no | 75.06 | 1.75 | 24.50939 |
P063 | 139 | 65 | NA | 42 | male | NA | 73.99 | 1.84 | 21.85432 |
P117 | 162 | 37 | 139 | 30 | female | ex-smoker | 86.25 | 1.83 | 25.75473 |
P613 | 140 | 117 | 59 | 42 | female | ex-smoker | 76.95 | 1.81 | 23.48829 |
P332 | 147 | 51 | 126 | 65 | female | ex-smoker | 57.66 | 1.75 | 18.82776 |
In the education data frame, create a new variable called ratio which calculates the pupil to teacher ratio in each level of education. That is, ratio = pupils / teachers.
Now suppose that information on a fourteenth subject is known but has not been included in the original chol data frame. This data is shown in Table 2.1 below.
id | ldl | hdl | trig | age | gender | smoke | weight | height |
P461 | 148 | 78 | 120 | 41 | male | current | 84.05 | 1.79 |
In this case we can add the new subject as an additional row using the rbind() function.
First, we need to create a data frame containing the information for this subject. In order for us to add this data frame as a row to chol_full, it needs to have the same number of variables. Therefore, we also need to calculate the BMI for this subject and call it bmi. We can do all this with the following code.
subject <- data.frame(id = "P461", ldl = 148, hdl = 78, trig = 120, age = 41,
gender = "male", smoke = "current", weight = 84.05, height = 1.79)
subject$bmi <- subject$weight/(subject$height)^2
Now we can add this subject to chol_full using the code below.
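A minimal sketch of the rbind() step, assuming a hand-typed stand-in for chol_full with only four columns and five existing patients (so its row labels run from 1 to 5 rather than 9 to 13):

```r
# Stand-in for the last five rows of chol_full (hand-typed assumption).
chol_full <- data.frame(
  id     = c("P725", "P901", "P103", "P843", "P753"),
  weight = c(65.37, 80.34, 74.90, 63.78, 71.58),
  height = c(1.67, 1.62, 1.61, 1.77, 1.62),
  stringsAsFactors = FALSE
)
chol_full$bmi <- chol_full$weight / (chol_full$height)^2

# The new subject, with the same column names as chol_full
subject <- data.frame(id = "P461", weight = 84.05, height = 1.79,
                      stringsAsFactors = FALSE)
subject$bmi <- subject$weight / (subject$height)^2

# Add the new subject as an extra row, then look at the end of the data frame
chol_full <- rbind(chol_full, subject)
tail(chol_full)
```

rbind() matches data frame columns by name, so subject needs exactly the same column names as chol_full.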
id | ldl | hdl | trig | age | gender | smoke | weight | height | bmi | |
9 | P725 | 149 | 49 | NA | 32 | female | no | 65.37 | 1.67 | 23.43935 |
10 | P901 | 95 | 54 | 157 | 55 | female | ex-smoker | 80.34 | 1.62 | 30.61271 |
11 | P103 | 169 | 59 | 67 | 48 | female | no | 74.90 | 1.61 | 28.89549 |
12 | P843 | 174 | 117 | 168 | 41 | female | no | 63.78 | 1.77 | 20.35813 |
13 | P753 | 91 | 52 | 146 | 69 | female | current | 71.58 | 1.62 | 27.27481 |
14 | P461 | 148 | 78 | 120 | 41 | male | current | 84.05 | 1.79 | 26.23202 |
Note that tail() is a function very similar to head(), but rather than showing the first 6 rows by default, it shows the last 6.
Sections 1.11.2 Creating New Variables in a Data Frame and 1.13 Tables of Probability and Statistics with R describe how to create new variables.
See Appendix A.1 to learn how to create a new variable in a data frame by breaking an existing variable into different levels.