Subsetting

Subsetting is a way to extract parts of an object based on some condition. This is very useful when working with large datasets, as it lets you focus on the parts of the data that meet certain criteria. Let’s load the size-meas.csv dataset (stored here as size.meas) to illustrate this. We can use head() to inspect its contents:

head(size.meas)
             Clade                 Species       Specimen MiL   JL  SCm   SCL
1    Bothremydidae    Cearachelys_placidoi BSP-1976-I-160  NA 31.9 42.9 219.4
2 Carettochelyidae Carettochelys_insculpta         SDZ-sn  NA 43.3 57.0 253.3
3 Carettochelyidae Carettochelys_insculpta         CRI-14  NA 64.5 84.8 485.3
4 Carettochelyidae Carettochelys_insculpta      SMF-56626  NA 75.8 86.3 472.4
5         Chelidae       Chelodina_colliei       CRI-4632  NA 48.2 54.4 274.0
6         Chelidae       Chelodina_expansa      SMF-67838  NA 43.6 45.9 256.0

This dataset contains measurements of different turtle specimens as well as information about their taxonomy. Let’s check how many different clades are sampled in this dataset and how many Chelidae turtles are included. The function unique() lists the distinct values in the Clade column; alternatively, we can treat the column as a factor and look at its levels. To count the Chelidae specimens, we take the length of the Clade entries that are equal to “Chelidae”.

unique(size.meas$Clade)
 [1] "Bothremydidae"    "Carettochelyidae" "Chelidae"         "Cheloniidae"     
 [5] "Chelydridae"      "Dermatemydidae"   "Dermochelyidae"   "Emydidae"        
 [9] "Geoemydidae"      "Kinosternidae"    "Pelomedusidae"    "Platysternidae"  
[13] "Podocnemididae"   "Testudinidae"     "Trionychidae"    
levels(as.factor(size.meas$Clade))
 [1] "Bothremydidae"    "Carettochelyidae" "Chelidae"         "Cheloniidae"     
 [5] "Chelydridae"      "Dermatemydidae"   "Dermochelyidae"   "Emydidae"        
 [9] "Geoemydidae"      "Kinosternidae"    "Pelomedusidae"    "Platysternidae"  
[13] "Podocnemididae"   "Testudinidae"     "Trionychidae"    
length(size.meas$Clade[which(size.meas$Clade == "Chelidae")])
[1] 35
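The calls above list the clades rather than count them. As a minimal sketch, the counts can also be obtained directly. The data frame toy below is a hypothetical stand-in built only from the six rows printed by head() earlier, so its numbers differ from those of the full dataset.

```r
# Stand-in for size.meas: only the six rows shown by head() above
toy <- data.frame(
  Clade = c("Bothremydidae", "Carettochelyidae", "Carettochelyidae",
            "Carettochelyidae", "Chelidae", "Chelidae"),
  SCL = c(219.4, 253.3, 485.3, 472.4, 274.0, 256.0),
  stringsAsFactors = FALSE
)

n.clades   <- length(unique(toy$Clade))       # number of distinct clades
n.levels   <- nlevels(as.factor(toy$Clade))   # same count via factor levels
n.chelidae <- sum(toy$Clade == "Chelidae")    # TRUE counts as 1, so sum() counts the matches
clade.counts <- table(toy$Clade)              # number of rows for every clade at once
```

Summing a logical vector is a common shortcut for counting matches, and table() gives the per-clade counts in a single call.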

When we counted the entries identified as Chelidae in the Clade column, we were already subsetting: we extracted only the elements that matched a condition. To subset data in R, you can use square brackets [] to specify the rows and columns you want to extract. The general syntax is data[rows, columns]. If you leave either the rows or the columns position blank, R returns all rows or all columns, respectively. For example, to extract all rows of the Clade column, you can use:

Clade <- size.meas[, "Clade"]
head(Clade, n = 20)  ## show first 20 elements
 [1] "Bothremydidae"    "Carettochelyidae" "Carettochelyidae" "Carettochelyidae"
 [5] "Chelidae"         "Chelidae"         "Chelidae"         "Chelidae"        
 [9] "Chelidae"         "Chelidae"         "Chelidae"         "Chelidae"        
[13] "Chelidae"         "Chelidae"         "Chelidae"         "Chelidae"        
[17] "Chelidae"         "Chelidae"         "Chelidae"         "Chelidae"        
summary(Clade)
   Length     Class      Mode 
      354 character character 
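A few more variations on the data[rows, columns] syntax may help. This is a small sketch on a hypothetical data frame toy that reuses the first six rows shown by head() above; the object names are made up for illustration.

```r
# Stand-in for size.meas: only the six rows shown by head() above
toy <- data.frame(
  Clade = c("Bothremydidae", "Carettochelyidae", "Carettochelyidae",
            "Carettochelyidae", "Chelidae", "Chelidae"),
  Specimen = c("BSP-1976-I-160", "SDZ-sn", "CRI-14",
               "SMF-56626", "CRI-4632", "SMF-67838"),
  JL  = c(31.9, 43.3, 64.5, 75.8, 48.2, 43.6),
  SCL = c(219.4, 253.3, 485.3, 472.4, 274.0, 256.0),
  stringsAsFactors = FALSE
)

first.two <- toy[1:2, ]                    # rows 1 and 2, all columns
jl.scl    <- toy[, c("JL", "SCL")]         # all rows, two columns selected by name
one.cell  <- toy[3, "SCL"]                 # a single value: row 3, column SCL
clade.vec <- toy[, "Clade"]                # one column is simplified to a plain vector
clade.df  <- toy[, "Clade", drop = FALSE]  # drop = FALSE keeps a one-column data frame
```

Selecting a single column simplifies the result to a vector, which is why summary(Clade) above reports a character vector rather than a data frame.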

You can also select rows with a logical condition, either directly inside the brackets, by wrapping the condition in which(), or with the subset() function. For example, to extract all rows where the Clade is “Chelidae”, you can use any of the following:

chelidae.logical <- size.meas[size.meas$Clade == "Chelidae", ]
head(chelidae.logical)
      Clade               Species     Specimen  MiL   JL  SCm   SCL
5  Chelidae     Chelodina_colliei     CRI-4632   NA 48.2 54.4 274.0
6  Chelidae     Chelodina_expansa    SMF-67838   NA 43.6 45.9 256.0
7  Chelidae     Chelodina_expansa   QM-J-84101   NA 49.3 52.0 355.6
8  Chelidae Chelodina_longicollis   USNM-61091   NA   NA 26.9 200.7
9  Chelidae     Chelodina_parkeri  USNM-231524   NA 42.0 46.7 250.7
10 Chelidae     Chelus_fimbriatus MNHN-1897-67 39.7 56.8 63.5    NA
chelidae.which <- size.meas[which(size.meas$Clade == "Chelidae"), ]
head(chelidae.which)
      Clade               Species     Specimen  MiL   JL  SCm   SCL
5  Chelidae     Chelodina_colliei     CRI-4632   NA 48.2 54.4 274.0
6  Chelidae     Chelodina_expansa    SMF-67838   NA 43.6 45.9 256.0
7  Chelidae     Chelodina_expansa   QM-J-84101   NA 49.3 52.0 355.6
8  Chelidae Chelodina_longicollis   USNM-61091   NA   NA 26.9 200.7
9  Chelidae     Chelodina_parkeri  USNM-231524   NA 42.0 46.7 250.7
10 Chelidae     Chelus_fimbriatus MNHN-1897-67 39.7 56.8 63.5    NA
chelidae.subset <- subset(size.meas, Clade == "Chelidae")
head(chelidae.subset)
      Clade               Species     Specimen  MiL   JL  SCm   SCL
5  Chelidae     Chelodina_colliei     CRI-4632   NA 48.2 54.4 274.0
6  Chelidae     Chelodina_expansa    SMF-67838   NA 43.6 45.9 256.0
7  Chelidae     Chelodina_expansa   QM-J-84101   NA 49.3 52.0 355.6
8  Chelidae Chelodina_longicollis   USNM-61091   NA   NA 26.9 200.7
9  Chelidae     Chelodina_parkeri  USNM-231524   NA 42.0 46.7 250.7
10 Chelidae     Chelus_fimbriatus MNHN-1897-67 39.7 56.8 63.5    NA
identical(chelidae.logical, chelidae.which)  ## check if they are the same
[1] TRUE
identical(chelidae.logical, chelidae.subset)  
[1] TRUE
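The three approaches agree here, which implies that the Clade column has no missing values. When the tested column does contain NA, they stop being interchangeable. The sketch below uses a made-up data frame toy (not taken from size.meas) to show the difference.

```r
# Made-up toy data with one missing clade label
toy <- data.frame(
  Clade = c("Chelidae", NA, "Emydidae", "Chelidae"),
  SCL   = c(274.0, 256.0, 120.5, 355.6),
  stringsAsFactors = FALSE
)

# NA == "Chelidae" evaluates to NA, and an NA row index yields a row filled with NA
by.logical <- toy[toy$Clade == "Chelidae", ]         # 3 rows: rows 1 and 4 plus one all-NA row
# which() returns only the positions that are TRUE, so the NA is dropped
by.which   <- toy[which(toy$Clade == "Chelidae"), ]  # 2 rows
# subset() also treats NA in the condition as FALSE
by.subset  <- subset(toy, Clade == "Chelidae")       # 2 rows
```

For columns that may contain missing values, which() or subset() is therefore the safer choice, or the condition can be combined with !is.na().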

There are other functions that can help when working with subsets of data. For example, you can find the elements shared by two objects using the intersect() function, or find the elements that are in one object but not in another using the setdiff() function. Let’s see how many unique species of turtles are in the Chelidae clade compared to the entire dataset.

all.species <- unique(size.meas$Species)  ## how many species in the entire dataset
length(all.species)
[1] 147
chelidae.species <- unique(chelidae.logical$Species)  ## how many Chelidae species
length(chelidae.species)
[1] 20
intersect.species <- intersect(all.species, chelidae.species)  ## species in both
length(intersect.species)
[1] 20
setdiff.species <- setdiff(all.species, chelidae.species)  ## species not in Chelidae
length(setdiff.species)
[1] 127
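Because every Chelidae species is also part of the full species list, the intersection simply returns the Chelidae species, and the two counts add up to the total (20 + 127 = 147). A small sketch with four species names taken from the head() output above illustrates this, along with the fact that setdiff() depends on the order of its arguments. The object names are hypothetical.

```r
# Four species names from the head() output above, used as a small stand-in
all.sp  <- c("Cearachelys_placidoi", "Carettochelys_insculpta",
             "Chelodina_colliei", "Chelodina_expansa")
chel.sp <- c("Chelodina_colliei", "Chelodina_expansa")

both      <- intersect(all.sp, chel.sp)  # elements present in both vectors
only.all  <- setdiff(all.sp, chel.sp)    # in all.sp but not in chel.sp
only.chel <- setdiff(chel.sp, all.sp)    # empty: chel.sp is contained in all.sp
is.chel   <- all.sp %in% chel.sp         # logical vector, usable as a row index
```

The %in% operator returns a logical vector of the same length as its left-hand side, so it can be placed directly inside square brackets to subset rows by membership.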