Subsetting means extracting the parts of an object that meet some condition. This is especially useful with large datasets, because it lets you focus on the rows and columns that matter for a given question. Let’s load the size-meas.csv dataset to illustrate this. We can use head() to inspect its contents:
head(size.meas)
            Clade                 Species       Specimen MiL   JL  SCm   SCL
1    Bothremydidae    Cearachelys_placidoi BSP-1976-I-160  NA 31.9 42.9 219.4
2 Carettochelyidae Carettochelys_insculpta         SDZ-sn  NA 43.3 57.0 253.3
3 Carettochelyidae Carettochelys_insculpta         CRI-14  NA 64.5 84.8 485.3
4 Carettochelyidae Carettochelys_insculpta      SMF-56626  NA 75.8 86.3 472.4
5         Chelidae       Chelodina_colliei       CRI-4632  NA 48.2 54.4 274.0
6         Chelidae       Chelodina_expansa      SMF-67838  NA 43.6 45.9 256.0
This dataset contains measurements of turtle specimens along with information about their taxonomy. Let’s check how many different clades are sampled in this dataset and how many Chelidae turtles are included. We can do that with the function unique(), or we can treat the Clade column as a factor and count its levels.
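A minimal sketch of both counts follows. It assumes a small toy data frame, size.toy, which is a hypothetical stand-in built from the six rows printed by head() above (with only three of the columns), so the numbers it produces are not those of the full dataset.

```r
# Toy stand-in for size.meas (hypothetical), built from the six rows shown by head()
size.toy <- data.frame(
  Clade = c("Bothremydidae", "Carettochelyidae", "Carettochelyidae",
            "Carettochelyidae", "Chelidae", "Chelidae"),
  Species = c("Cearachelys_placidoi", "Carettochelys_insculpta",
              "Carettochelys_insculpta", "Carettochelys_insculpta",
              "Chelodina_colliei", "Chelodina_expansa"),
  SCL = c(219.4, 253.3, 485.3, 472.4, 274.0, 256.0)
)

clades <- unique(size.toy$Clade)              # distinct clade names
n.clades <- length(clades)                    # number of clades via unique()
n.levels <- nlevels(factor(size.toy$Clade))   # same count via factor levels

# Number of rows whose Clade is "Chelidae"
n.chelidae <- nrow(size.toy[size.toy$Clade == "Chelidae", ])
```

On the real data, the same calls would be applied to size.meas instead of size.toy.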
When we checked how many rows of that data frame have Chelidae in their Clade column, we made a subset of the data frame. To subset data in R, you can use square brackets [] to specify the rows and columns you want to extract. The general syntax is data[rows, columns]. If you leave the rows or the columns position blank, R returns all rows or all columns, respectively. For example, to extract all rows of the Clade column, you can use:
Clade <- size.meas[, "Clade"]
head(Clade, n = 20) ## show first 20 elements
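A few more variations on the data[rows, columns] pattern, sketched on a hypothetical toy data frame (size.toy, built from the first rows shown by head() and not part of the original dataset code):

```r
# Toy stand-in for size.meas (hypothetical)
size.toy <- data.frame(
  Clade = c("Bothremydidae", "Carettochelyidae", "Carettochelyidae",
            "Carettochelyidae", "Chelidae", "Chelidae"),
  Species = c("Cearachelys_placidoi", "Carettochelys_insculpta",
              "Carettochelys_insculpta", "Carettochelys_insculpta",
              "Chelodina_colliei", "Chelodina_expansa"),
  SCL = c(219.4, 253.3, 485.3, 472.4, 274.0, 256.0)
)

first.three <- size.toy[1:3, ]                 # rows 1 to 3, all columns
two.cols <- size.toy[, c("Clade", "SCL")]      # all rows, two named columns
one.value <- size.toy[2, "SCL"]                # a single cell
clade.vec <- size.toy[, "Clade"]               # one column: returned as a vector
clade.df <- size.toy[, "Clade", drop = FALSE]  # one column, kept as a data frame
```

Selecting a single column from a base R data frame drops the result to a plain vector by default; drop = FALSE keeps it as a one-column data frame.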
Another way to subset data is with a logical condition, either used directly as the row index or wrapped in which(), or with the subset() function. For example, extracting all rows where the Clade is “Chelidae” returns a data frame whose first rows are:
      Clade               Species     Specimen  MiL   JL  SCm   SCL
5  Chelidae     Chelodina_colliei     CRI-4632   NA 48.2 54.4 274.0
6  Chelidae     Chelodina_expansa    SMF-67838   NA 43.6 45.9 256.0
7  Chelidae     Chelodina_expansa   QM-J-84101   NA 49.3 52.0 355.6
8  Chelidae Chelodina_longicollis   USNM-61091   NA   NA 26.9 200.7
9  Chelidae     Chelodina_parkeri  USNM-231524   NA 42.0 46.7 250.7
10 Chelidae     Chelus_fimbriatus MNHN-1897-67 39.7 56.8 63.5    NA
identical(chelidae.logical, chelidae.which) ## check if they are the same
[1] TRUE
identical(chelidae.logical, chelidae.subset)
[1] TRUE
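A minimal sketch of the three approaches compared above, run on a hypothetical toy data frame (size.toy) rather than the full dataset. The object names mirror those passed to identical(), but the exact calls are an assumption about how such objects would typically be built:

```r
# Toy stand-in for size.meas (hypothetical)
size.toy <- data.frame(
  Clade = c("Bothremydidae", "Carettochelyidae", "Carettochelyidae",
            "Carettochelyidae", "Chelidae", "Chelidae"),
  Species = c("Cearachelys_placidoi", "Carettochelys_insculpta",
              "Carettochelys_insculpta", "Carettochelys_insculpta",
              "Chelodina_colliei", "Chelodina_expansa"),
  SCL = c(219.4, 253.3, 485.3, 472.4, 274.0, 256.0)
)

# 1. Logical vector as the row index
chelidae.logical <- size.toy[size.toy$Clade == "Chelidae", ]

# 2. which() converts the logical vector into row positions first
chelidae.which <- size.toy[which(size.toy$Clade == "Chelidae"), ]

# 3. subset() evaluates the condition inside the data frame
chelidae.subset <- subset(size.toy, Clade == "Chelidae")

same.lw <- identical(chelidae.logical, chelidae.which)
same.ls <- identical(chelidae.logical, chelidae.subset)
```

The three results agree here because Clade has no missing values. If the column contained NA, the bare logical index would return extra rows filled with NA, whereas which() and subset() would drop them.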
Other functions also help when working with subsets of data. For example, intersect() returns the elements shared by two objects, and setdiff() returns the elements that are in the first object but not in the second. Let’s see how many unique species of turtles are in the Chelidae clade compared to the entire dataset.
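Before applying these functions to the turtle data, here is a minimal sketch on two small hypothetical vectors (all.toy and chelidae.toy are illustrative names, not objects from the dataset):

```r
# Hypothetical species vectors for illustration only
all.toy <- c("Chelodina_colliei", "Chelodina_expansa",
             "Carettochelys_insculpta", "Cearachelys_placidoi")
chelidae.toy <- c("Chelodina_colliei", "Chelodina_expansa")

in.both <- intersect(all.toy, chelidae.toy)      # elements present in both
only.all <- setdiff(all.toy, chelidae.toy)       # in all.toy but not in chelidae.toy
only.chelidae <- setdiff(chelidae.toy, all.toy)  # empty: every element is also in all.toy
```

Note that setdiff() is not symmetric: swapping its arguments changes the result.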
all.species <- unique(size.meas$Species) ## how many species in the entire dataset
length(all.species)
[1] 147
chelidae.species <- unique(chelidae.logical$Species) ## how many Chelidae species
length(chelidae.species)
[1] 20
intersect.species <- intersect(all.species, chelidae.species) ## species in both
length(intersect.species)
[1] 20
setdiff.species <- setdiff(all.species, chelidae.species) ## species not in Chelidae
length(setdiff.species)