Lesson 5 - Vector Subsetting

August 07, 2018

Individual elements within a vector can be accessed by using square brackets [ ] with the vector name. To extract a single element, put the element number in square brackets following the vector name (note the element positions are 1-based):

> jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)
> jersey_numbers[1]
Pierce 
    34 

If the vector elements are named, the element name can be used instead of the element number:

> jersey_numbers["Pierce"]
Pierce 
    34 
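One thing the examples above don't show is what happens when the position or name doesn't exist. The sketch below is my own addition, and "Davis" is a made-up name that is not in the vector. As far as I can tell, R returns NA in both cases rather than raising an error:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Position 6 does not exist in this five-element vector
out_of_range <- jersey_numbers[6]

# "Davis" is a hypothetical name that is not in the vector
unknown_name <- jersey_numbers["Davis"]

# Both lookups produce a single missing value (NA) instead of an error
is.na(out_of_range)
is.na(unknown_name)
```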

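To extract multiple elements from a vector, pass a vector of element positions to the square brackets. The values of the position vector correspond to the elements to be extracted. (Whole numbers created with c(), such as c(1,3,4), are stored as numeric class rather than integer class; R accepts either for subsetting.) Here we will extract the first, third, and fourth elements of the jersey_numbers vector: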

> jersey_numbers[c(1,3,4)]
Pierce  Rondo  Allen 
    34      9     20  
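A small sketch to check how the position vector is stored. The variable names positions_double and positions_integer are my own, not part of the lesson. Without the L suffix, c(1, 3, 4) is numeric, but subsetting gives the same result either way:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Without the L suffix, whole numbers are stored as doubles ("numeric" class)
positions_double <- c(1, 3, 4)
# With the L suffix, they are stored as integers
positions_integer <- c(1L, 3L, 4L)

class(positions_double)   # "numeric"
class(positions_integer)  # "integer"

# Both position vectors extract the same elements
same_result <- identical(jersey_numbers[positions_double],
                         jersey_numbers[positions_integer])
same_result  # TRUE
```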

The positions can be listed in any order:

> jersey_numbers[c(4,1,3)]
 Allen Pierce  Rondo 
    20     34      9  

Multiple elements can also be extracted via label names. Pass in a character class vector to the square brackets. The label names can be in any order:

> jersey_numbers[c("Perkins", "Rondo")]
Perkins   Rondo 
     43       9  
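The character vector doesn't have to be typed inline. Here's a minimal sketch that stores the names in a variable first (guards is a hypothetical variable name of my own choosing):

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Hypothetical variable holding the names we want to look up
guards <- c("Rondo", "Allen")

# Elements come back in the order the names were given
guard_numbers <- jersey_numbers[guards]
guard_numbers

# names() shows the labels attached to the result
names(guard_numbers)
```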

The negative sign operator - can be used to specify elements that should not be extracted:

> jersey_numbers
 Pierce Garnett   Rondo   Allen Perkins 
     34       5       9      20      43 
> jersey_numbers[-4] #All elements, except the fourth
 Pierce Garnett   Rondo Perkins 
     34       5       9      43 
> jersey_numbers[-c(4,5)] #All elements, except the fourth and fifth
 Pierce Garnett   Rondo 
     34       5       9  
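Two details about negative positions that I wanted to verify for myself. This is a sketch of my own, not from the lesson: negating a whole vector should be the same as listing each negative position, and positive and negative positions can't be mixed in one subset. I use tryCatch() here so the error doesn't stop the script:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# -c(4, 5) negates every value, so it is the same as c(-4, -5)
dropped_a <- jersey_numbers[-c(4, 5)]
dropped_b <- jersey_numbers[c(-4, -5)]
identical(dropped_a, dropped_b)  # TRUE

# Mixing a negative and a positive position raises an error;
# tryCatch() turns that error into the string "error" for illustration
mixed <- tryCatch(jersey_numbers[c(-1, 2)], error = function(e) "error")
mixed
```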

Integer Sequences

So far, we've been building vectors of element positions with the c() function. To create a vector of consecutive integers, there's a programming shortcut: the colon operator :. Here are a few examples:

> 1:10 #Vector of integers from 1 to 10
 [1]  1  2  3  4  5  6  7  8  9 10
> sequence_vector <- 20:25 #Vector of integers assigned to a variable
> sequence_vector
[1] 20 21 22 23 24 25
> 15:9 #Vector of integers in reverse.
[1] 15 14 13 12 11 10  9
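A quick sketch to confirm what the colon produces. The seq() call is my own addition for the case where the step isn't 1, which the colon can't express directly:

```r
# The colon operator builds an integer class vector
consecutive <- 20:25
class(consecutive)  # "integer"

# Same values as typing each integer out with the L suffix
identical(consecutive, c(20L, 21L, 22L, 23L, 24L, 25L))  # TRUE

# For a step other than 1, base R's seq() takes a "by" argument
odd_numbers <- seq(1, 9, by = 2)
odd_numbers  # 1 3 5 7 9
```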

I suspect the colon : shortcut gets used frequently in R. Back to the jersey_numbers vector. Let's extract the first three elements, and then the last three elements in reverse:

> jersey_numbers[1:3]
 Pierce Garnett   Rondo 
     34       5       9 
> jersey_numbers[5:3]
Perkins   Allen   Rondo 
     43      20       9 
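The 5:3 above hard-codes the vector length. Here's a sketch of my own that gets the last three elements in reverse without assuming the length is five. Note the parentheses around n - 2: the colon binds more tightly than the minus sign, so n:n - 2 would mean (n:n) - 2.

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Number of elements in the vector
n <- length(jersey_numbers)

# Count down from the last position to the third-from-last
last_three_reversed <- jersey_numbers[n:(n - 2)]
last_three_reversed

# An alternative using base R helpers: take the last three, then reverse them
alternative <- rev(utils::tail(jersey_numbers, 3))
alternative
```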

Logical Class Vectors

Here is a common pattern for extracting vector elements using a logical (boolean) class vector. To begin, we'll find the elements of jersey_numbers that are less than 10:

> jersey_numbers < 10
 Pierce Garnett   Rondo   Allen Perkins 
  FALSE    TRUE    TRUE   FALSE   FALSE  

A vector of those logical values lets us extract just the elements that are TRUE (jersey number is less than 10):

> jersey_numbers[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
Garnett   Rondo 
      5       9  

The same can be done programmatically:

> single_digits <- jersey_numbers < 10
> jersey_numbers[single_digits]
Garnett   Rondo 
      5       9  

Alternatively, the above could be written as a single statement:

> jersey_numbers[jersey_numbers < 10]
Garnett   Rondo 
      5       9  
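The same pattern extends to more than one condition. This is a sketch of my own, with made-up variable names: the & operator combines two logical vectors element by element, and which() reports the positions that are TRUE instead of the values.

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Combine two conditions: greater than 5 AND less than 40
in_range <- jersey_numbers[jersey_numbers > 5 & jersey_numbers < 40]
in_range  # Pierce 34, Rondo 9, Allen 20

# which() returns the positions of the TRUE values
single_digit_positions <- which(jersey_numbers < 10)
single_digit_positions  # positions 2 and 3

# The same logical vector can subset the names instead of the values
single_digit_names <- names(jersey_numbers)[jersey_numbers < 10]
single_digit_names  # "Garnett" "Rondo"
```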

Recycling

Our jersey_numbers vector has five elements. What happens if we try to extract elements using a vector of logicals that itself only has two elements?

> jersey_numbers
 Pierce Garnett   Rondo   Allen Perkins 
     34       5       9      20      43 
> jersey_numbers[c(TRUE, FALSE)]
 Pierce   Rondo Perkins 
     34       9      43  

The code above returned the first, third, and fifth elements with no warnings or errors. When R gets to the second (last) element of the logical vector, it "recycles" the vector by going back to the first element. The first element of jersey_numbers is paired with TRUE and the second element is paired with FALSE. The logical vector elements are recycled and repeated for the third and fourth jersey_numbers elements. The logical vector elements are recycled one more time for the fifth jersey_numbers element, which is paired with TRUE. Essentially, this recycling example is equivalent to this:

> jersey_numbers[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
 Pierce   Rondo Perkins 
     34       9      43 
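To see the recycled vector itself, base R's rep_len() can stretch the two-element pattern to the length of jersey_numbers. This is my own sketch of the idea; I'm not claiming it's how R implements recycling internally:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# The short logical vector from the example above
pattern <- c(TRUE, FALSE)

# Repeat the pattern until it is as long as jersey_numbers
expanded <- rep_len(pattern, length(jersey_numbers))
expanded  # TRUE FALSE TRUE FALSE TRUE

# Subsetting with the short pattern matches subsetting with the expanded one
identical(jersey_numbers[pattern], jersey_numbers[expanded])  # TRUE
```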

Dave's Thoughts

There was a good mix of the familiar and unfamiliar for me in this post. Generically speaking, accessing elements/items of a collection/array/list with square brackets is very familiar. I've been using similar, if not identical, syntax with C# and VB6/VBA for many years. The negative sign syntax to exclude elements is nice. I bet a lot of SQL developers would be jealous. If we want to select every column in a table except for one, standard SQL has nothing similar to accomplish the task. We're relegated to typing out the names of every column. Sigh...

The ability to access a subset of vector elements without having to use looping constructs feels like a pretty big deal. Loops are generally easy to read and interpret, but they can be poor in terms of performance. I'm reminded of LINQ from the .NET Framework. It allows you to "query" programming objects to get a subset of them--without iterative looping.
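For comparison, here is roughly what the single-digit example might look like with an explicit loop next to the one-line subset. This is just a sketch of my own to illustrate the point (looped and vectorized are made-up names), not a benchmark:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Loop version: visit each position and keep elements less than 10
looped <- integer(0)
for (i in seq_along(jersey_numbers)) {
  if (jersey_numbers[i] < 10) {
    # c() appends the element and carries its name along
    looped <- c(looped, jersey_numbers[i])
  }
}
looped

# Subsetting version: one expression, no explicit loop
vectorized <- jersey_numbers[jersey_numbers < 10]
vectorized
```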
