Lesson 13 - Subsetting Data Frames

In the last lesson, we learned about data frames and how to create them with the data.frame() function. To recap, a data frame is a structure in R for working with a dataset. It consists of rows and columns, which are called observations and variables respectively. The variables can be a mix of different classes. Here is a data frame example that was used in Lesson 12:

> #Vectors of data for our basketball team.
> name <- c("Larry Bird","Robert Parish","Dennis Johnson","Cedric Maxwell","Gerald Henderson","Kevin McHale","Danny Ainge","M.L. Carr")
> position <- c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF")
> starter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,FALSE)
> jersey <- c(33L, 00L, 3L, 31L, 43L, 32L, 44L, 30L)
> 
> #Create a data frame.
> Celtics <- data.frame(name, position, starter, jersey)
> Celtics
              name position starter jersey
1       Larry Bird       PF    TRUE     33
2    Robert Parish        C    TRUE      0
3   Dennis Johnson       SG    TRUE      3
4   Cedric Maxwell       SF    TRUE     31
5 Gerald Henderson       PG    TRUE     43
6     Kevin McHale       PF   FALSE     32
7      Danny Ainge       SG   FALSE     44
8        M.L. Carr       SF   FALSE     30

A single element of a data frame can be referenced using single brackets in this pattern: dataframe[r,c], where r is the row number, and c is the column number. c can also be the name of the corresponding column, enclosed in quotes.

> #What is Cedric Maxwell's jersey number?
> Celtics[4, 4]
[1] 31
> #Variable name of the column also works here.
> Celtics[4, "jersey"]
[1] 31

All the values of a column can be selected by omitting the row number. As above, either the column number or the variable name of the column can be used. Note the return is a vector (or more specifically, a factor).

> #Select a "column" by variable name or index number.
> Celtics[, "position"]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[, 2]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> class(Celtics[, 2])
[1] "factor"
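
If the vector return is not what you want, here is a minimal sketch of one way around it. It assumes a small made-up data frame called team (a hypothetical stand-in, not the Celtics data frame above) and uses the drop argument of the single bracket operator to keep the result as a data frame.

```r
# Hypothetical two-row data frame, built only for this illustration.
team <- data.frame(name = c("Larry Bird", "Robert Parish"),
                   position = c("PF", "C"),
                   stringsAsFactors = TRUE)

# Default behavior: a single column is simplified to a vector (here, a factor).
col_vector <- team[, "position"]

# drop = FALSE keeps the result as a one-column data frame.
col_frame <- team[, "position", drop = FALSE]

class(col_vector)
class(col_frame)
```

The first class() call should report a factor and the second a data frame.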

As you may have guessed, an observation can be selected by specifying the observation/row number, and omitting the column number/name.

> #Select an observation by row number.
> Celtics[3,]
            name position starter jersey
3 Dennis Johnson       SG    TRUE      3

To subset multiple rows and/or columns, use a vector of numbers (or names). In the next two examples, we subset the third and fourth columns of rows five and six, followed by the "name" and "position" of rows one and two.

> #Select multiple observations & variables.
> #Result is a data frame.
> Celtics[c(5,6), c(3, 4)]
  starter jersey
5    TRUE     43
6   FALSE     32
> Celtics[c(1,2), c("name", "position")]
           name position
1    Larry Bird       PF
2 Robert Parish        C
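
Rows can also be picked with a vector of logicals instead of row numbers. The sketch below assumes a small made-up data frame named team (hypothetical, for illustration only) and builds the logical vector from the columns themselves.

```r
# Hypothetical three-row data frame for illustration.
team <- data.frame(name = c("Larry Bird", "Kevin McHale", "Danny Ainge"),
                   starter = c(TRUE, FALSE, FALSE),
                   jersey = c(33L, 32L, 44L),
                   stringsAsFactors = FALSE)

# Rows where starter is FALSE, all columns.
bench <- team[!team$starter, ]

# Rows where the jersey number is above 32, two columns only.
high_numbers <- team[team$jersey > 32, c("name", "jersey")]

bench
high_numbers
```

Both results are data frames, just like the numeric examples above.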

Columns can also be subsetted with single brackets and either the variable name of the column, or the column number. Here the result is a data frame.

> #Select all the variables of a column. 
> #The next two statements are equivalent.
> Celtics["starter"]
  starter
1    TRUE
2    TRUE
3    TRUE
4    TRUE
5    TRUE
6   FALSE
7   FALSE
8   FALSE
> Celtics[3]
  starter
1    TRUE
2    TRUE
3    TRUE
4    TRUE
5    TRUE
6   FALSE
7   FALSE
8   FALSE
> #Result is a data frame.
> class(Celtics["starter"])
[1] "data.frame"

Subsetting columns with double brackets or the $ shortcut outputs a vector (again, here it is more specifically a factor).

> #All of these are equivalent. 
> Celtics$position
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[["position"]]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[[2]]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> #Result is a vector.
> class(Celtics$position)
[1] "factor"

Extending Data Frames

Adding columns to a data frame is easy--easy compared to adding rows. We'll get to that. To add a column, first create a vector. The class doesn't matter. But the number of elements does--it has to match the number of observations in the data frame. Now that we have our vector, here are some options to add it as a new column to a data frame: use the $ shortcut, use double brackets with the new column name, or bind the vector to the data frame with cbind().

> #Create a vector...
> points <- c(24.2, 19.0, 13.2, 11.9, 11.6, 18.4, 5.4, 3.1)
> #...add it as a new column of variables.
> Celtics$Points <- points
> Celtics
              name position starter jersey Points
1       Larry Bird       PF    TRUE     33   24.2
2    Robert Parish        C    TRUE      0   19.0
3   Dennis Johnson       SG    TRUE      3   13.2
4   Cedric Maxwell       SF    TRUE     31   11.9
5 Gerald Henderson       PG    TRUE     43   11.6
6     Kevin McHale       PF   FALSE     32   18.4
7      Danny Ainge       SG   FALSE     44    5.4
8        M.L. Carr       SF   FALSE     30    3.1

> #These two options produce the same result as the above.
> Celtics[["Points"]] <- points
> #With cbind(), name the column explicitly; otherwise it is named "points".
> Celtics <- cbind(Celtics, Points = points)
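
To see why the number of elements matters, here is a rough sketch of what happens when the lengths don't line up. The four-row team data frame and the three-element vector are made up for this illustration, and tryCatch() is used only so the error is captured instead of stopping the script.

```r
# Hypothetical data frame with four observations.
team <- data.frame(name = c("Bird", "Parish", "Johnson", "Maxwell"),
                   stringsAsFactors = FALSE)

# Try to add a column that has only three values.
result <- tryCatch({
  team$points <- c(24.2, 19.0, 13.2)
  "column added"
}, error = function(e) conditionMessage(e))

# result holds the error message; the data frame is unchanged.
result
names(team)
```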

Adding an observation to a data frame is a bit more work. We can't create a new row as a vector and add it to the data frame. This makes sense because the elements of a vector all share the same class, whereas a row of a data frame can hold values of different classes. Instead, we have to create a data frame with one or more rows and combine the data frames with rbind(). Let's create a data frame with one row for another player and try to bind it to our existing data frame.

> new_row <- data.frame("Quinn Buckner", "PG", FALSE, 28, 4.1)
> #Note: names must match!
> Celtics <- rbind(Celtics, new_row)
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

The above attempt didn't work, and the error message is rather clear. The names for each data frame must match. That is easily remedied with the names() function:

> #Sync the names.
> names(new_row) <- names(Celtics)
> Celtics <- rbind(Celtics, new_row)
> Celtics
              name position starter jersey Points
1       Larry Bird       PF    TRUE     33   24.2
2    Robert Parish        C    TRUE      0   19.0
3   Dennis Johnson       SG    TRUE      3   13.2
4   Cedric Maxwell       SF    TRUE     31   11.9
5 Gerald Henderson       PG    TRUE     43   11.6
6     Kevin McHale       PF   FALSE     32   18.4
7      Danny Ainge       SG   FALSE     44    5.4
8        M.L. Carr       SF   FALSE     30    3.1
9    Quinn Buckner       PG   FALSE     28    4.1
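
Another way to avoid the name mismatch is to name the columns when the new row is created. This is a small sketch with a made-up two-column roster data frame (hypothetical, not the full Celtics data frame), so names() never has to be called afterward.

```r
# Hypothetical roster with two columns.
roster <- data.frame(name = c("Larry Bird", "Robert Parish"),
                     jersey = c(33L, 0L),
                     stringsAsFactors = FALSE)

# Name the columns inline so they already match the roster.
new_row <- data.frame(name = "Quinn Buckner", jersey = 28L,
                      stringsAsFactors = FALSE)

roster <- rbind(roster, new_row)
roster
```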

Dave's Thoughts

As someone without a lot of R experience, knowing the difference between single and double brackets doesn't come naturally to me yet. From my reading, I have inferred the importance of understanding whether the return type of subsetting a data frame is a vector, or another data frame. I was a little surprised (and maybe even discouraged) to see how much effort is needed to add a row to a data frame. In the .NET Framework, there is a DataTable object, which has a NewRow() function that makes adding rows pretty easy. I'm curious to see if there's something else in R that is similar.


Lesson 12 - Data Frames

Those of you with a strong background in databases will see some familiar concepts in this post. For lesson 12, we will consider datasets and structures in R that can accommodate them. Thinking back on what's been covered so far, we know vectors and matrices can't mix data of different classes. A list can contain different classes--it can contain just about anything, including another list. But it's not practical for working with datasets.

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call "observations", and of columns, which are called "variables". (Observations may also be referred to as "instances". Variables may also be referred to as "properties".) The data frame in R is designed for datasets. As the R documentation tells us, data frames are "used as the fundamental data structure by most of R's modeling software".

The function we'll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.

To begin, we need some data for our dataset. Here are four vectors of data for a randomly selected basketball team:

> name <- c("Larry Bird","Robert Parish","Dennis Johnson","Cedric Maxwell","Gerald Henderson","Kevin McHale","Danny Ainge","M.L. Carr")
> position <- c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF")
> starter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,FALSE)
> jersey <- c(33L, 00L, 3L, 31L, 43L, 32L, 44L, 30L)

Now let's pass those vectors as arguments to data.frame() to create our first data frame:

> #Parameters for data.frame are vectors.
> Celtics <- data.frame(name, position, starter, jersey)
> #Names of the columns are taken from the variable names of the vectors.
> Celtics
              name position starter jersey
1       Larry Bird       PF    TRUE     33
2    Robert Parish        C    TRUE      0
3   Dennis Johnson       SG    TRUE      3
4   Cedric Maxwell       SF    TRUE     31
5 Gerald Henderson       PG    TRUE     43
6     Kevin McHale       PF   FALSE     32
7      Danny Ainge       SG   FALSE     44
8        M.L. Carr       SF   FALSE     30

The output of the Celtics data frame shows our observations (rows) and variables (columns) of data. By default, the variable names are taken from the vector names used in the data.frame() function. If desired, the variable names of an existing data frame can be changed with the names() function. (This should be familiar by now.) Variable names can also be specified inline with the data.frame() function.

> #Use names() function, or...
> names(Celtics) <- c("Name", "Position", "Starter", "Jersey Number")
> #...specify column names inline.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey)
> Celtics
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30
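
Notice that data.frame() turned "Jersey Number" into Jersey.Number. As a hedged aside, the sketch below shows the check.names argument of data.frame(), which can be set to FALSE to keep the name as typed. The players data frame is made up for this illustration.

```r
name <- c("Larry Bird", "Robert Parish")
jersey <- c(33L, 0L)

# check.names = FALSE keeps the space in the column name.
players <- data.frame(Name = name, "Jersey Number" = jersey,
                      check.names = FALSE)

names(players)

# A name with a space needs double brackets (or backticks with $).
players[["Jersey Number"]]
```

Names with spaces are less convenient to type, which may be why the default replaces the space with a period.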

The structure of a data frame looks similar to that of a list, displaying the number of observations and variables. Take a close look and you'll see the Names and Positions are not character vectors. They're factors.

> #Output of str() looks similar to that of a list.
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : Factor w/ 8 levels "Cedric Maxwell",..: 6 8 3 1 4 5 2 7
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30

The data.frame() function converts strings to factors by default (in R versions before 4.0.0; from 4.0.0 on, the default is FALSE). This behavior can be overridden by setting the stringsAsFactors parameter to FALSE:

> #Force string data to a character vector.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey,
+     stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : chr  "PF" "C" "SG" "SF" ...
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30

Now the Names are all character strings. However, I think I'd actually like the Positions to be a factor--there are only five possible values. Let's fix that by changing position to a factor and recreating the data frame:

> #Change position from a character vector to a factor.
> position <- factor(c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF"))
> #Force string data to a character vector.
> Celtics <- data.frame(Name = name, Position = position, 
+                       Starter = starter, "Jersey Number" = jersey,
+                       stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30
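
Recreating the data frame isn't the only option. Here is a sketch, using a made-up three-row players data frame, that converts a single column to a factor after the data frame already exists.

```r
# Hypothetical data frame with character columns.
players <- data.frame(Name = c("Larry Bird", "Robert Parish", "Dennis Johnson"),
                      Position = c("PF", "C", "SG"),
                      stringsAsFactors = FALSE)

# Replace one column with a factor version of itself.
players$Position <- factor(players$Position)

str(players)
```

Name stays a character vector and only Position becomes a factor.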

Our sample dataset is awfully small. Outputting the entirety of the data frame is of no concern. For large datasets, the head() and tail() functions should come in handy. They output the first or last parts of a data frame (six rows by default):

> #head() and tail() functions return the first 
> #or last parts of the data frame.
> head(Celtics)
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32

> tail(Celtics)
              Name Position Starter Jersey.Number
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30
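
Both functions accept an n argument to control how many rows come back. A quick sketch with a made-up ten-row data frame named numbers:

```r
# Hypothetical ten-row data frame.
numbers <- data.frame(id = 1:10, square = (1:10)^2)

# First three rows and last two rows.
first_three <- head(numbers, n = 3)
last_two <- tail(numbers, n = 2)

first_three
last_two
```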

We'll close with the dim() function. It retrieves the dimensions of the data frame (the number of rows, then the number of columns).

> #Retrieve the dimension of the data frame.
> dim(Celtics)
[1] 8 4
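
If only one of the two numbers is needed, nrow() and ncol() return them individually. A small sketch with a made-up roster data frame:

```r
# Hypothetical data frame with three observations of two variables.
roster <- data.frame(name = c("Larry Bird", "Robert Parish", "Dennis Johnson"),
                     jersey = c(33L, 0L, 3L))

dim(roster)   # rows, then columns
nrow(roster)  # number of observations
ncol(roster)  # number of variables
```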

Dave's Thoughts

It feels like everything we've learned so far has led to this--the data frame. Thus far, it's the closest thing to a database table that we've seen.


Lesson 11 - List Subsetting

Despite the similarities we've seen between vectors and lists, subsetting is considerably different. Let's revisit the division list from the last post. It has elements for "Name", (number of) "Teams", and "Conference". There are also five list elements--one for each team in the division.

> division <- list(Name = "Atlantic", Teams = 5L, Conference = "Eastern",
+                  list(City = "Boston", Nickname = "Celtics", Championships = 17),
+                  list(City = "New York", Nickname = "Knicks", Championships = 2),
+                  list(City = "Philadelphia", Nickname = "76ers", Championships = 1),
+                  list(City = "Brooklyn", Nickname = "Nets", Championships = 0),
+                  list(City = "Toronto", Nickname = "Raptors", Championships = 0)
+ )
> str(division)
List of 8
 $ Name      : chr "Atlantic"
 $ Teams     : int 5
 $ Conference: chr "Eastern"
 $           :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $           :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $           :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $           :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $           :List of 3
  ..$ City         : chr "Toronto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0

Subsetting the list with single brackets [] for the first element appears to return "Atlantic". But if we take a closer look using the str() function, we see R actually returned a list:

> #Appears to return "Atlantic" as a character class.
> division[1]
$Name
[1] "Atlantic"

> #str shows us the return is actually a list of 1 element.
> str(division[1])
List of 1
 $ Name: chr "Atlantic"

Subsetting the list with double brackets [[]] for the first element also returns "Atlantic". But this time, the str() function shows us a character vector is returned:

> #Subsetting with double brackets returns a character class.
> division[[1]]
[1] "Atlantic"
> #Verify with str()
> str(division[[1]])
 chr "Atlantic"

That's the big takeaway from this lesson: subsetting lists with single brackets [] returns a list, whereas double brackets [[]] return a single element. The above examples subsetted by ordinal position. Subsetting by name follows the same rules:

> #Subsetting by name.
> division["Name"]
$Name
[1] "Atlantic"

> str(division["Name"])
List of 1
 $ Name: chr "Atlantic"
> division[["Name"]]
[1] "Atlantic"
> str(division[["Name"]])
 chr "Atlantic"
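
Why does the distinction matter? Here is a minimal sketch, using a made-up two-element division list, of what happens when the result is used in arithmetic. tryCatch() is used only to capture the error so the script keeps running.

```r
# Hypothetical list for illustration.
division <- list(Name = "Atlantic", Teams = 5L)

# Double brackets return the integer itself, so math works.
doubled <- division[["Teams"]] * 2L

# Single brackets return a list, and a list can't be multiplied.
attempt <- tryCatch(division["Teams"] * 2L,
                    error = function(e) "error: single brackets returned a list")

doubled
attempt
```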

Now let's subset a list within a list. Note that even if double brackets [[]] are used, the returned data is a list (because the corresponding element is a list itself):

> #Subset the Celtics
> division[[4]]
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

> #Subset the Knicks
> division[[5]]
$City
[1] "New York"

$Nickname
[1] "Knicks"

$Championships
[1] 2

A list can be subsetted to return more than one element. Here are the first and third elements ("Name" and "Conference") of the division list. The result is itself a list:

> division[c(1,3)]
$Name
[1] "Atlantic"

$Conference
[1] "Eastern"

> str(division[c(1,3)])
List of 2
 $ Name      : chr "Atlantic"
 $ Conference: chr "Eastern"

Here we will subset to return a single element of a list within a list. All of the statements below are equivalent. The use of double brackets [[]] ensures the class returned matches that of the corresponding element, which in this case is numeric. Note that the $ shortcut can only be used with a named list, and is otherwise equivalent to using double brackets [[]].

> #How many championships have the Celtics won?
> #Each statement is equivalent.
> division[[4]][[3]]
[1] 17
> division[[c(4,3)]]
[1] 17
> division[[4]]$Championships
[1] 17
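
The same double bracket idea can be applied to every nested list at once. The sketch below assumes a made-up teams list holding just two of the nested lists, and uses sapply() (not covered in these lessons so far) to pull one element out of each.

```r
# Hypothetical list of two nested team lists.
teams <- list(
  list(City = "Boston", Nickname = "Celtics", Championships = 17),
  list(City = "New York", Nickname = "Knicks", Championships = 2)
)

# Extract one element from each nested list with double brackets.
nicknames <- sapply(teams, function(team) team[["Nickname"]])
titles <- sapply(teams, function(team) team[["Championships"]])

nicknames
sum(titles)
```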

A vector of logicals can be used to subset a list. But this is a single brackets [] operation only. Here are the last five elements of the division list:

> division[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]
[[1]]
[[1]]$City
[1] "Boston"

[[1]]$Nickname
[1] "Celtics"

[[1]]$Championships
[1] 17


[[2]]
[[2]]$City
[1] "New York"

[[2]]$Nickname
[1] "Knicks"

[[2]]$Championships
[1] 2


[[3]]
[[3]]$City
[1] "Philadelphia"

[[3]]$Nickname
[1] "76ers"

[[3]]$Championships
[1] 1


[[4]]
[[4]]$City
[1] "Brooklyn"

[[4]]$Nickname
[1] "Nets"

[[4]]$Championships
[1] 0


[[5]]
[[5]]$City
[1] "Toronto"

[[5]]$Nickname
[1] "Raptors"

[[5]]$Championships
[1] 0
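
Typing out eight logicals by hand gets tedious. As a sketch, the logical vector can be computed instead. This assumes a shortened, made-up version of the division list and uses sapply() with is.list() to flag which elements are themselves lists.

```r
# Hypothetical shortened division list: two plain elements, two nested lists.
division <- list(Name = "Atlantic", Teams = 2L,
                 list(City = "Boston", Nickname = "Celtics"),
                 list(City = "New York", Nickname = "Knicks"))

# TRUE for each element that is a list, FALSE otherwise.
is_team <- sapply(division, is.list)

# Single brackets with the computed logical vector.
teams_only <- division[is_team]

is_team
length(teams_only)
```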

Subsetting with logicals doesn't work with double brackets [[]]:

> division[[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]]
Error in division[[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]] : 
  attempt to select less than one element in integerOneIndex
> division[[F]][[F]][[F]][[T]][[T]][[T]][[T]][[T]]
Error in division[[F]] : 
  attempt to select less than one element in integerOneIndex

We can add elements to an existing list. A character vector of teams that are no longer in the Atlantic Division is created and added to the division list with the name "FormerTeams".

> former_teams <- c("Buffalo Braves", "Charlotte Hornets", "Miami Heat", "Orlando Magic", "Washington Bullets/Wizards")
> division$FormerTeams <- former_teams
> str(division)
List of 9
 $ Name       : chr "Atlantic"
 $ Teams      : int 5
 $ Conference : chr "Eastern"
 $            :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $            :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $            :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $            :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $            :List of 3
  ..$ City         : chr "Toronto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0
 $ FormerTeams: chr [1:5] "Buffalo Braves" "Charlotte Hornets" "Miami Heat" "Orlando Magic" ...
> #Could also have done this:
> division[["FormerTeams"]] <- former_teams

Adding an element to a list within a list works similarly to above. Here we add the name of the head coach to the fourth element in the division list:

> #Add element to an embedded list.
> division[[4]][["Coach"]] <- "Brad Stevens"
> division[[4]]
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

$Coach
[1] "Brad Stevens"

> 
> #Could also have done this:
> division[[4]]$Coach <- "Brad Stevens"

Dave's Thoughts

Parts of this lesson were a little tricky IMO, but after distinguishing between single brackets [] and double brackets [[]], it all eventually made sense.

Once again we saw some examples that used the $ programming shortcut. Several times I've wondered why anyone would *not* use it--it's so much more convenient than the alternative. After giving it some thought, I have an educated guess that seems to be correct based on some code tests. Here is a brief sample that demonstrates:

> football_team <- list(Name = "UCF", NickName = "Knights", NationalChampionships = 1L)
> 
> #Reference an element by $ and name, or double brackets and name:
> football_team$Name
[1] "UCF"
> football_team[["NationalChampionships"]]
[1] 1
> 
> #Store an element name in a variable.
> element_name <- "NickName"
> #Reference by double brackets with variable works.
> football_team[[element_name]]
[1] "Knights"
> #Reference by $ and variable fails.
> football_team$element_name
NULL

In short, to use the $ programming shortcut it appears you have to know the element name at design time (when you write the code), and/or know that it will exist at run time.
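
One place this comes up is a loop, where the element name is only known while the code runs. Here is a sketch that walks the same football_team list by name; the described vector is just a made-up place to collect the results.

```r
football_team <- list(Name = "UCF", NickName = "Knights", NationalChampionships = 1L)

# Collect a short description of each element.
described <- character(0)
for (element_name in names(football_team)) {
  # Double brackets accept a variable holding the name; $ would not.
  described[element_name] <- paste(element_name, "=", football_team[[element_name]])
}

described
```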


Lesson 10 - Lists

In previous lessons, we've noted vectors and matrices consist of data elements of the same class. R will coerce data elements to a single class if we attempt to create a vector or matrix with data elements of differing classes. Lists, on the other hand, can hold data elements of different classes, such as the integer, character, or logical class. In fact, a list can hold almost anything in R, including vectors, matrices, and many more! To no one's surprise, lists can be created with the list() function:

> #Coercion with a vector
> c("Boston", "Celtics", 17)
[1] "Boston"  "Celtics" "17"     
> 
> #No coercion with a list.
> #Output is notably different from that of a vector.
> #Note the double square brackets.
> list("Boston", "Celtics", 17)
[[1]]
[1] "Boston"

[[2]]
[1] "Celtics"

[[3]]
[1] 17
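
To confirm what happened to the number 17 in each case, the classes can be checked directly. This is a small sketch; as_vector and as_list are made-up names, and sapply() is used to apply class() to each element of the list.

```r
as_vector <- c("Boston", "Celtics", 17)
as_list <- list("Boston", "Celtics", 17)

# A vector has one class for all of its elements.
class(as_vector)

# A list keeps the class of each element.
element_classes <- sapply(as_list, class)
element_classes
```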

If there's ever a doubt that something is a list, is.list() and/or class() will tell us with certainty:

> team <- list("Boston", "Celtics", 17)
> is.list(team)
[1] TRUE
> class(team)
[1] "list"

As with vectors and matrices, we can use the names() function to name each element of a list. The output changes accordingly:

> #Set the names of a list.
> names(team) <- c("City", "Nickname", "Championships")
> 
> #Double brackets are replaced with names.
> #Note the leading $ character.
> team
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

Like vectors, elements of a list can be referenced by ordinal position or name. There's also a shortcut using the $ character. The format is listName$elementName:

> #Reference a list element by its ordinal position or name.
> team[1]
$City
[1] "Boston"

> team["Nickname"]
$Nickname
[1] "Celtics"

> #Now we can also reference a single list element
> #as follows:
> team$Nickname
[1] "Celtics"
> class(team$City)
[1] "character"
> class(team$Championships)
[1] "numeric"

The list() function also allows names to be provided inline with the element values:

> #Names can also be specified inline with the list() function.
> team <- list(City = "Boston", Nickname = "Celtics", Championships = 17)
> team
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

The R documentation tells us the str() function "displays the internal structure of any R object." Lists are no exception:

> #For a list, the str() function outputs the number of 
> #elements, and the name, class, and value of each element.
> str(team)
List of 3
 $ City         : chr "Boston"
 $ Nickname     : chr "Celtics"
 $ Championships: num 17

Now let's show off a little and create a nested list (a list within a list). Here is the NBA's Atlantic Division, including the five teams that comprise it:

> division <- list(Name = "Atlantic", Teams = 5L, Conference = "Eastern",
+                  list(City = "Boston", Nickname = "Celtics", Championships = 17),
+                  list(City = "New York", Nickname = "Knicks", Championships = 2),
+                  list(City = "Philadelphia", Nickname = "76ers", Championships = 1),
+                  list(City = "Brooklyn", Nickname = "Nets", Championships = 0),
+                  list(City = "Toronto", Nickname = "Raptors", Championships = 0)
+ )
> str(division)
List of 8
 $ Name      : chr "Atlantic"
 $ Teams     : int 5
 $ Conference: chr "Eastern"
 $           :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $           :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $           :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $           :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $           :List of 3
  ..$ City         : chr "Toronto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0

It's nice the way str() indents the nested lists (the five teams). This seems well suited for displaying hierarchical data.


Dave's Thoughts

The $ shortcut is nice. For those of you following along with RStudio, you should notice the intellisense feature. Here's what it looks like for me:

[Image: RStudio intellisense]

I see a parallel between R lists and classes in OO languages. An OO class can have one or more named properties of differing data types. A class can even have another class as a property. Sounds a lot like an R list to me.

A list in R is one dimensional. Surely there's a two dimensional class object in R, right? Hmm.


Lesson 9 - Factors

Somewhat surprisingly, I had a difficult time coming up with a strict definition for a factor. The documentation indirectly refers to it as an encoded vector. It also notes "the terms ‘category’ and ‘enumerated type’ are also used for factors". I've also seen it referred to as a data structure of categorical values. The challenges of definitions aside, a factor can be created via the factor() function. The first parameter of this function is a vector, so let's start there. Here is a vector of basketball player positions ("C" for center, "F" for forward, and "G" for guard):

> positions <- c("C", "F", "G")
> positions
[1] "C" "F" "G"

So far, so good. Now let's create a factor from the vector:

> position_factor <- factor(positions)
> position_factor
[1] C F G
Levels: C F G

The output looks similar to that of a vector, although it drops the double-quotes around the data elements. And there's that last line of "Levels". More on that in a bit. Let's revisit our "definition" of a factor. As was noted, the documentation refers to it as an "encoded vector" or an "enumerated type". To that point, there are integer values behind each element of the factor. We can see this by examining the structure of our factor using the str() function:

> positions <- c("C", "F", "G")
> position_factor <- factor(positions)
> position_factor
[1] C F G
Levels: C F G
> str(position_factor)
 Factor w/ 3 levels "C","F","G": 1 2 3

The str() function shows us the factor has three levels (categories) "C", "F", and "G" with underlying integer values of 1, 2, and 3 respectively. Let's add some more players to the vector (by their position) and re-create the factor:

> #Add more players (by their position).
> positions <- c(positions, "F", "F", "G", "C", "F", "G", "G")
> positions
 [1] "C" "F" "G" "F" "F" "G" "C" "F" "G" "G"
> #Now that the vector has more elements,
> #re-create the factor.
> position_factor <- factor(positions)
> position_factor
 [1] C F G F F G C F G G
Levels: C F G

We see the position of all ten of the players, and the three Levels remain the same. The Levels are the categories (distinct values) of the data elements. If we check the structure of the factor again, all of the data elements have a corresponding integer value.

> #Re-examine the structure of the factor.
> str(position_factor)
 Factor w/ 3 levels "C","F","G": 1 2 3 2 2 3 1 2 3 3
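
The integer values and the levels can also be pulled out separately. In this sketch (codes, level_names, and rebuilt are made-up variable names), as.integer() gives the underlying integers, levels() gives the categories, and indexing one by the other rebuilds the original strings.

```r
positions <- c("C", "F", "G", "F")
position_factor <- factor(positions)

# The underlying integer values.
codes <- as.integer(position_factor)

# The levels (categories).
level_names <- levels(position_factor)

# Look up each integer in the levels to get the strings back.
rebuilt <- level_names[codes]

codes
rebuilt
```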

As I understand it, one of the benefits of this is more efficient use of memory. Consider a factor that used the full names "center", "forward", and "guard" as elements. Now imagine there were one million elements in the factor. The character values "center", "forward", and "guard" are stored in memory exactly once each, along with one million integers (as opposed to one million strings of characters).

How many players are there at each position? The summary() function comes in handy here. It shows a count of data elements by level:

> #The summary function:
> summary(position_factor)
C F G 
2 4 4 
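
The table() function gives the same counts and makes it easy to grab the count for a single level. A short sketch, with counts as a made-up variable name:

```r
positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
position_factor <- factor(positions)

# Count of data elements by level.
counts <- table(position_factor)
counts

# Count for one level, referenced by name.
counts[["F"]]
```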

Order

When the factor function is invoked, it loops through the elements of the vector to get the categories/levels. While it's not clear from the examples so far, by default, the levels are sorted and presented in alphabetical order. This is easily demonstrated by changing the order of the vector elements:

> #By default levels are sorted in alphabetical order.
> factor(c("G", "F", "F", "G", "C", "F", "G", "G"))
[1] G F F G C F G G
Levels: C F G

The factor function has an optional levels parameter. When specified, it dictates the order of the levels.

> #The levels argument can be specified to dictate their order.
> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions, levels = c("G", "F", "C"))
> position_factor
 [1] C F G F F G C F G G
Levels: G F C
> str(position_factor)
 Factor w/ 3 levels "G","F","C": 3 2 1 2 2 1 3 2 1 1

Now the levels are G, F, and C with integer values 1, 2, and 3 respectively. Another optional parameter is labels, which is used to specify the names of the levels. Let's keep that level order and add the labels "Guard", "Forward", and "Center":

> #The labels argument is used to change the names of the levels.
> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions,
+                           levels = c("G", "F", "C"),
+                           labels = c("Guard", "Forward", "Center"))
> position_factor
 [1] Center  Forward Guard   Forward Forward Guard   Center  Forward Guard   Guard  
Levels: Guard Forward Center
> str(position_factor)
 Factor w/ 3 levels "Guard","Forward",..: 3 2 1 2 2 1 3 2 1 1

NOTE: make sure the data elements for the labels are in the right order to be in sync with the levels (whether specified explicitly or sorted automatically in alphabetical order).

Levels can also be renamed after a factor is created, using the levels() function. (This is similar to the way the names() function is used for vectors.)

> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions)
> position_factor
 [1] C F G F F G C F G G
Levels: C F G
> #Name the levels of an existing factor.
> levels(position_factor) <- c("Center", "Forward","Guard")
> position_factor
 [1] Center  Forward Guard   Forward Forward Guard   Center  Forward Guard   Guard  
Levels: Center Forward Guard

So to recap, the factor() function has parameters for levels and labels, and there is a separate levels() function. Clear as mud, right?


Nominal vs Ordinal

Factor data can be nominal or ordinal. In our examples so far, it is nominal. "C", "G", and "F" (and "Center", "Guard", and "Forward" for that matter) are names that have no comparative order to each other. It's not meaningful to say a Center is greater than a Forward or a Forward is less than a Guard (keep in mind these are position names--don't let height cloud your thinking). If we try making a comparison, we get a warning message:

> position_factor[1] > position_factor[2]
[1] NA
Warning message:
In Ops.factor(position_factor[1], position_factor[2]) :
  ‘>’ not meaningful for factors

Ordinal data, on the other hand, can be compared to each other in some ranked fashion--it has order. Take bed sizes, for instance. A "Twin" bed is smaller than a "Full", which is smaller than a "Queen", which is smaller than a "King". To create a factor with ordered (ranked) levels, use the ordered parameter, which is a logical flag to indicate if the levels should be regarded as ordered (in the order given). Here's a factor of bed sizes:

> #Beds factor with ordered levels.
> beds <- c("Queen", "Full", "Twin", "King", "Twin", "Full", "Twin")
> beds_factor <- factor(beds, ordered = TRUE, 
+   levels = c("Twin", "Full", "Queen", "King"))
> beds_factor
[1] Queen Full  Twin  King  Twin  Full  Twin 
Levels: Twin < Full < Queen < King

Now we can compare individual data elements of the factor. Is one element "greater" than another?

> #Is a Queen bed "larger" than a Twin?
> beds_factor[1]
[1] Queen
Levels: Twin < Full < Queen < King
> beds_factor[3]
[1] Twin
Levels: Twin < Full < Queen < King
> beds_factor[1] > beds_factor[3]
[1] TRUE
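
The comparison also works across the whole factor at once. The sketch below assumes the same beds_factor; the names bigger_than_full, large_beds, and largest are made up for illustration.

```r
beds <- c("Queen", "Full", "Twin", "King", "Twin", "Full", "Twin")
beds_factor <- factor(beds, ordered = TRUE,
                      levels = c("Twin", "Full", "Queen", "King"))

# Compare every element against a single level name.
bigger_than_full <- beds_factor > "Full"

# Keep only the beds that rank above a Full.
large_beds <- beds_factor[bigger_than_full]

# Ordered factors also support min() and max().
largest <- max(beds_factor)
```

The result of the comparison is a logical vector, one TRUE or FALSE per data element, based on the order of the levels rather than the alphabetical order of the names.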

Dave's Thoughts

Long post, eh? A factor in some ways reminds me of an enum in C# or Visual Basic. Those structures don't contain data, though. I'm wondering if a factor can be a function parameter/argument, such that invalid values (levels) are rejected or ignored. The summary() function is nice.


Lesson 8 - Matrix Math

In the last lesson, it was noted that matrix subsetting has many commonalities with vector subsetting. The same holds true for matrix math, so this will be a short lesson. Let's begin with a matrix of numbers with four rows and six columns:

> some_numbers <- matrix(1:24, nrow=4, byrow = TRUE)
> some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

Math operations are generally done element-by-element. Here are simple examples for addition, subtraction, multiplication, and division:

> #Add 2 to each element in the matrix.
> some_numbers + 2
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    3    4    5    6    7    8
[2,]    9   10   11   12   13   14
[3,]   15   16   17   18   19   20
[4,]   21   22   23   24   25   26

> #Subtract 3 from each element in the matrix.
> some_numbers - 3
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   -2   -1    0    1    2    3
[2,]    4    5    6    7    8    9
[3,]   10   11   12   13   14   15
[4,]   16   17   18   19   20   21

> #Multiply each element of the matrix by four.
> some_numbers * 4
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    4    8   12   16   20   24
[2,]   28   32   36   40   44   48
[3,]   52   56   60   64   68   72
[4,]   76   80   84   88   92   96

> #Divide each element of the matrix by five.
> some_numbers / 5
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.2  0.4  0.6  0.8  1.0  1.2
[2,]  1.4  1.6  1.8  2.0  2.2  2.4
[3,]  2.6  2.8  3.0  3.2  3.4  3.6
[4,]  3.8  4.0  4.2  4.4  4.6  4.8

Math operations between matrices are possible too. Here, the same matrix is added to itself. Since it's the same matrix, both operands obviously have the same dimensions. The first element is added to the first element, the second element is added to the second element, etc.

> #Add two matrices.
> some_numbers + some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    4    6    8   10   12
[2,]   14   16   18   20   22   24
[3,]   26   28   30   32   34   36
[4,]   38   40   42   44   46   48

When two matrices have different dimensions (here, four rows by six columns vs. four rows by five columns), math operations result in an error:

> more_numbers <- matrix(1:20, nrow=4, byrow = TRUE)
> some_numbers + more_numbers
Error in some_numbers + more_numbers : non-conformable arrays
> some_numbers * more_numbers
Error in some_numbers * more_numbers : non-conformable arrays
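
As far as I can tell, what matters is the dimensions, not just the element count. Here is a small sketch to check that; the flipped matrix and the other variable names are made up for this example. tryCatch() is used so the error message is captured instead of stopping the script.

```r
some_numbers <- matrix(1:24, nrow = 4, byrow = TRUE)  # 4 rows, 6 columns
flipped      <- matrix(1:24, nrow = 6, byrow = TRUE)  # 6 rows, 4 columns

# Same number of elements, but different dimensions.
same_length <- length(some_numbers) == length(flipped)
same_shape  <- identical(dim(some_numbers), dim(flipped))

# The addition fails; capture the error message as a character string.
result <- tryCatch(some_numbers + flipped,
                   error = function(e) conditionMessage(e))
result
```

Checking dim() on both matrices before doing the math is a cheap way to avoid the error.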

We can perform math operations between a matrix and a vector, though. Our friend recycling lends a hand. Note the order of the operations, which runs down the columns: the first element of the vector is subtracted from the matrix element at row one, column one. The second vector element is subtracted from the matrix element at row two, column one. The third vector element is subtracted from the matrix element at row three, column one. The vector elements are then recycled, so the first vector element is subtracted from the matrix element at row four, column one, and the process continues at the top of column two.

> some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
> #Subtract a vector from the matrix.
> #Note the order of the operations...and recycling.
> some_numbers - c(1,2,3)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    3    3    3
[2,]    5    5    8    8    8   11
[3,]   10   13   13   13   16   16
[4,]   18   18   18   21   21   21
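
To convince myself of the order, here is a sketch that spells out the recycling by hand. The names recycled, by_hand, and shortcut are made up for illustration.

```r
some_numbers <- matrix(1:24, nrow = 4, byrow = TRUE)

# Repeat c(1, 2, 3) until there are 24 values, then fill a 4-row
# matrix column by column (the default fill order).
recycled <- matrix(rep(c(1, 2, 3), length.out = 24), nrow = 4)

# Subtracting the expanded matrix matches the matrix-minus-vector shortcut.
by_hand  <- some_numbers - recycled
shortcut <- some_numbers - c(1, 2, 3)
identical(by_hand, shortcut)
```

The first column of recycled is 1, 2, 3, 1, which matches the walk-through above: the vector restarts at row four.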

Dave's Thoughts

After learning about vector math, this lesson wasn't particularly revelatory. Hopefully you breezed through it quickly as I did.


Lesson 7 - Matrix Subsetting

In Lesson 5, we looked at vector subsetting, which uses square brackets [ ] in a variety of different ways to extract one or more elements from a vector. Those same concepts apply to matrix subsetting. The most basic usage is to extract a single matrix element in this manner: matrix[r, c] where r is the row number/index and c is the column number/index. To demonstrate, let's use the points_scored_by_quarter matrix from the last lesson and extract the element in row 4, column 3:

> points_scored_by_quarter <- rbind(c(2,2,6,0), 
+                                   c(12,7,6,4), 
+                                   c(7,3,4,0), 
+                                   c(2,2,3,2), 
+                                   c(6,9,2,4))
> #Add column names and row names
> colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
> rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")
> points_scored_by_quarter
        1st 2nd 3rd 4th
Perkins   2   2   6   0
Garnett  12   7   6   4
Pierce    7   3   4   0
Rondo     2   2   3   2
Allen     6   9   2   4
> points_scored_by_quarter[4,3]
[1] 3

Elements can also be extracted using row names and column names. Let's find out how many points Kevin Garnett scored in the 1st quarter:

> points_scored_by_quarter["Garnett", "1st"]
[1] 12

We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty.

> #Points scored by Kendrick Perkins.
> points_scored_by_quarter[1,]
1st 2nd 3rd 4th 
  2   2   6   0 
> points_scored_by_quarter["Perkins",]
1st 2nd 3rd 4th 
  2   2   6   0 

Conversely, we can extract a column from a matrix. Specify the column within the square brackets [ ] and omit the row. The result is a vector, thus the pivot effect--the row names are displayed in the output (not the column name).

> #Points scored by all players in the 4th quarter.
> points_scored_by_quarter[, 4]
Perkins Garnett  Pierce   Rondo   Allen 
      0       4       0       2       4 
> points_scored_by_quarter[, "4th"]
Perkins Garnett  Pierce   Rondo   Allen 
      0       4       0       2       4  

Another way to subset a matrix is to extract multiple elements from one or more rows. How many points did Ray Allen score in the first half (1st and 2nd quarters)? For that, we need row 5 with columns 1 and 2:

> points_scored_by_quarter[5, c(1,2)]
1st 2nd 
  6   9  
> #We can specify any order for rows and columns--reverse the column order.
> points_scored_by_quarter[5, c(2, 1)]
2nd 1st 
  9   6  

Drop

All of the subsetting examples so far have returned a vector. With this behavior, we lose row names, column names, or both. To have a matrix returned (including row names and column names), include the parameter drop = FALSE.

> points_scored_by_quarter[4,3, drop = FALSE]
      3rd
Rondo   3
> points_scored_by_quarter[1, c(4, 3), drop = FALSE]
        4th 3rd
Perkins   0   6
> #When extracting a column, there's no "pivot" with drop = FALSE
> points_scored_by_quarter[,4, drop = FALSE]
        4th
Perkins   0
Garnett   4
Pierce    0
Rondo     2
Allen     4
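
A quick way to see the difference is to inspect what comes back. This is a small sketch using is.matrix() and dim(); the names as_vector and as_matrix are made up for illustration.

```r
points_scored_by_quarter <- rbind(c(2,2,6,0),
                                  c(12,7,6,4),
                                  c(7,3,4,0),
                                  c(2,2,3,2),
                                  c(6,9,2,4))
colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")

as_vector <- points_scored_by_quarter[, 4]
as_matrix <- points_scored_by_quarter[, 4, drop = FALSE]

is.matrix(as_vector)  # the default result is a plain named vector
dim(as_vector)        # a vector has no dimensions
dim(as_matrix)        # five rows, one column
```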

Multiple Rows and Columns

If our subsetting specifies multiple rows and multiple columns, a matrix is returned without the need to include the drop = FALSE parameter.

> #2nd half points scored by the guards.
> points_scored_by_quarter[c("Allen", "Rondo"), c(3,4)]
      3rd 4th
Allen   2   4
Rondo   3   2
> #Points scored in all quarters by the frontcourt players.
> points_scored_by_quarter[1:3, 1:4]
        1st 2nd 3rd 4th
Perkins   2   2   6   0
Garnett  12   7   6   4
Pierce    7   3   4   0 

Rows and columns can also be specified with logical vectors.

> #Extract rows 1,2, and 5 along with columns 1 and 4.
> points_scored_by_quarter[c(TRUE, TRUE, FALSE, FALSE, TRUE), c(TRUE, FALSE, FALSE, TRUE)]
        1st 4th
Perkins   2   0
Garnett  12   4
Allen     6   4 

As with vectors, recycling applies with matrices.

> #The logical vector for columns has 2 elements. They are both used, 
> #the vector recycles, and the 2 elements are used a second time.
> points_scored_by_quarter[c(TRUE, TRUE, FALSE, FALSE, TRUE), c(TRUE, FALSE)]
        1st 3rd
Perkins   2   6
Garnett  12   6
Allen     6   2
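
The logical vector doesn't have to be typed out by hand. As a sketch (the names high_first and high_scorers are made up for illustration), a comparison on one column can pick the rows:

```r
points_scored_by_quarter <- rbind(c(2,2,6,0),
                                  c(12,7,6,4),
                                  c(7,3,4,0),
                                  c(2,2,3,2),
                                  c(6,9,2,4))
colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")

# One TRUE/FALSE per player: did he score more than 5 in the 1st quarter?
high_first <- points_scored_by_quarter[, "1st"] > 5

# Use the logical vector to select rows; all columns are kept.
high_scorers <- points_scored_by_quarter[high_first, ]
high_scorers
```

This is the same pattern used with vectors in Lesson 5, applied to the row position of the square brackets.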

Dave's Thoughts

The drop = FALSE parameter was nice to learn about. That explained a few things for me from the prior lessons. Other than that, most everything here was as expected, based on what was learned with vectors.


Lesson 6 - Matrices

We've spent the last few lessons exploring the vector. In a sense, a vector is a one dimensional array or collection. In R, a matrix is much like a vector, but with two dimensions: one for "rows" and the other for "columns". The matrix() function (among others) can be used to create a matrix:

> matrix(data = 1:18, nrow = 3)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18

Here, the numbers from 1 to 18 were used to create a matrix. The 1:18 vector becomes the data of the matrix, which is created with three rows by specifying nrow = 3. As a result, there are six columns. Notice the way the 18 elements are placed within the columns and rows. We can produce an equivalent matrix by specifying the same 18 elements, but this time with six columns:

> matrix(data = 1:18, ncol = 6)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18

In the previous examples, the matrix data was filled "by columns". Let's fill the data "by rows" using the byrow parameter. Again, notice the way the 18 elements are placed within the columns and rows.

> matrix(data = 1:18, ncol = 6, byrow = TRUE)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
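
One way to think about byrow: filling by rows gives the same matrix as filling by columns and then transposing with the t() function. A small sketch, with made-up variable names:

```r
# Fill 3 rows by 6 columns, one row at a time.
by_rows <- matrix(data = 1:18, ncol = 6, byrow = TRUE)

# Fill 6 rows by 3 columns the default way (by columns), then transpose.
via_transpose <- t(matrix(data = 1:18, nrow = 6))

identical(by_rows, via_transpose)
```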

Recycling

In the last lesson, we demonstrated recycling with vector subsetting. Recycling can occur when creating a matrix too. Here, we attempt to use the same vector 1:18 to create a matrix of eight columns and three rows. The numbers from one to eighteen are used once, the vector is recycled, and the numbers from one to six are used a second time. Note the warning message from R:

> matrix(data = 1:18, ncol = 8, nrow = 3, byrow = TRUE)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    2    3    4    5    6    7    8
[2,]    9   10   11   12   13   14   15   16
[3,]   17   18    1    2    3    4    5    6
Warning message:
In matrix(data = 1:18, ncol = 8, nrow = 3, byrow = TRUE) :
  data length [18] is not a sub-multiple or multiple of the number of columns [8]
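
For contrast, here is a sketch where the data length divides evenly into the size of the matrix. The six values are simply used twice, and in my understanding R stays quiet about it. The names small_matrix and warned are made up for illustration.

```r
# 3 rows x 4 columns = 12 cells; the 6 data values fit exactly twice.
small_matrix <- matrix(data = 1:6, nrow = 3, ncol = 4)
small_matrix

# Check for a warning explicitly: FALSE means none was raised.
warned <- tryCatch({
  matrix(data = 1:6, nrow = 3, ncol = 4)
  FALSE
}, warning = function(w) TRUE)
warned
```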

Other Functions

The cbind() function can create a matrix by combining columns. Here, four vectors are combined, with each being a column:

> cbind(1:3, 1:3, 4:6, 10:12)
     [,1] [,2] [,3] [,4]
[1,]    1    1    4   10
[2,]    2    2    5   11
[3,]    3    3    6   12

Likewise, the rbind() function creates a matrix by combining vectors, with each vector being a row:

> rbind(1:3, 1:3, 4:6, 10:12)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3
[3,]    4    5    6
[4,]   10   11   12

Just as the names() function names the elements of a vector, the colnames() function can name the columns of a matrix:

> points_scored_by_quarter <- rbind(c(2,2,6,0),
+                                   c(12,7,6,4),
+                                   c(7,3,4,0),
+                                   c(2,2,3,2),
+                                   c(6,9,2,4))
> colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
> points_scored_by_quarter
     1st 2nd 3rd 4th
[1,]   2   2   6   0
[2,]  12   7   6   4
[3,]   7   3   4   0
[4,]   2   2   3   2
[5,]   6   9   2   4

Now, with a matrix, we can name our rows with the rownames() function:

> rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")
> points_scored_by_quarter
        1st 2nd 3rd 4th
Perkins   2   2   6   0
Garnett  12   7   6   4
Pierce    7   3   4   0
Rondo     2   2   3   2
Allen     6   9   2   4

Aggregating Data

The data in points_scored_by_quarter can be aggregated by row or column. The colSums() function will show us the combined points scored by the five players for each quarter:

> #Sum of player points per quarter.
> colSums(points_scored_by_quarter)
1st 2nd 3rd 4th 
 29  23  21  10 

The rowSums() function will show us the total points scored per player. Note the pivoting effect--row names become column names:

> #Total points scored per player.
> rowSums(points_scored_by_quarter)
Perkins Garnett  Pierce   Rondo   Allen 
     10      29      14       9      21  
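
rowSums() and colSums() are special cases of a more general idea. As a sketch, the apply() function can run any function over the rows (margin 1) or columns (margin 2) of a matrix, and rowMeans() gives averages instead of totals. The variable names below are made up for illustration.

```r
points_scored_by_quarter <- rbind(c(2,2,6,0),
                                  c(12,7,6,4),
                                  c(7,3,4,0),
                                  c(2,2,3,2),
                                  c(6,9,2,4))
colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")

# Same totals as rowSums(): apply sum() to each row (margin 1).
totals_by_player <- apply(points_scored_by_quarter, 1, sum)

# Same totals as colSums(): apply sum() to each column (margin 2).
totals_by_quarter <- apply(points_scored_by_quarter, 2, sum)

# Average points per quarter for each player.
average_by_player <- rowMeans(points_scored_by_quarter)
```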

Coercion

All of the examples so far have consisted of matrices with data elements of the same class. And for good reason: it's a requirement for a matrix. R will coerce elements with mismatched classes to the same class. Here are two vectors, one of class integer and the other of class character. After combining them into a matrix via rbind(), we see the data elements in the first row have been coerced to the character class (they are enclosed in double quotes):

> row1 <- c(1L, 2L, 3L, 4L)
> row2 <- c("a", "b", "c", "d")
> new_matrix <- rbind(row1, row2)
> new_matrix
     [,1] [,2] [,3] [,4]
row1 "1"  "2"  "3"  "4" 
row2 "a"  "b"  "c"  "d" 
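
The coercion can be confirmed without relying on the quotes in the printed output. This sketch uses typeof() and as.integer(); the name row1_again is made up for illustration.

```r
row1 <- c(1L, 2L, 3L, 4L)
row2 <- c("a", "b", "c", "d")
new_matrix <- rbind(row1, row2)

# Every element of the matrix is stored as character.
typeof(new_matrix)
is.character(new_matrix["row1", ])

# The digits in the first row can be converted back to integers.
row1_again <- as.integer(new_matrix["row1", ])
row1_again
```

Converting the second row the same way would not work out as nicely, since "a" through "d" have no integer equivalent.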

Dave's Thoughts

As an RDBMS person, the matrix with its rows and columns is very relatable. We're still a step or two away from something that mimics a database table. For that, we'd need to be able to mix elements of different classes. Is there an R construct for that? Hmmmmmm... The rowSums() function is intriguing. Pivoting data with T-SQL is not easy.


Lesson 5 - Vector Subsetting

Individual elements within a vector can be accessed by using square brackets [ ] with the vector name. To extract a single element, put the element number in square brackets following the vector name (note the element positions are 1-based):

> jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)
> jersey_numbers[1]
Pierce 
    34 

If the vector elements are named, the element name can be used instead of the element number:

> jersey_numbers["Pierce"]
Pierce 
    34 

To extract multiple elements from a vector, pass in an integer class vector to the square brackets. The values of the integer vector correspond to the elements to be extracted. Here we will extract the first, third, and fourth elements of the jersey_numbers vector:

> jersey_numbers[c(1,3,4)]
Pierce  Rondo  Allen 
    34      9     20  

The values of the integer vector can be in any order:

> jersey_numbers[c(4,1,3)]
 Allen Pierce  Rondo 
    20     34      9  

Multiple elements can also be extracted via label names. Pass in a character class vector to the square brackets. The label names can be in any order:

> jersey_numbers[c("Perkins", "Rondo")]
Perkins   Rondo 
     43       9  

The negative sign operator - can be used to specify elements that should not be extracted:

> jersey_numbers
 Pierce Garnett   Rondo   Allen Perkins 
     34       5       9      20      43 
> jersey_numbers[-4] #All elements, except the fourth
 Pierce Garnett   Rondo Perkins 
     34       5       9      43 
> jersey_numbers[-c(4,5)] #All elements, except the fourth and fifth
 Pierce Garnett   Rondo 
     34       5       9  
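
The negative sign only works with element numbers, not with element names. To exclude elements by name, one approach (a sketch, with made-up variable names) is to build a logical vector with names(), the %in% operator, and the ! "not" operator:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# TRUE for each element whose name is in the exclusion list.
excluded <- names(jersey_numbers) %in% c("Allen", "Perkins")

# Flip the logical vector with ! to keep everything else.
remaining <- jersey_numbers[!excluded]
remaining
```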

Integer Sequences

So far, we've been creating integer class vectors with the c() function. To create a vector of consecutive integers, there's a programming shortcut: the colon :. Here are a few examples:

> 1:10 #Vector of integers from 1 to 10
 [1]  1  2  3  4  5  6  7  8  9 10
> sequence_vector <- 20:25 #Vector of integers assigned to a variable
> sequence_vector
[1] 20 21 22 23 24 25
> 15:9 #Vector of integers in reverse.
[1] 15 14 13 12 11 10  9

I suspect I'll find the colon : shortcut gets used frequently with R. Back to the jersey_numbers vector. Let's extract the first three elements, and then the last three elements in reverse:

> jersey_numbers[1:3]
 Pierce Garnett   Rondo 
     34       5       9 
> jersey_numbers[5:3]
Perkins   Allen   Rondo 
     43      20       9 
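
The colon only steps by one. For other step sizes, the seq() function can build the index vector, and rev() reverses a whole vector. A small sketch with made-up variable names:

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Every other element: positions 1, 3, and 5.
odd_positions <- seq(1, length(jersey_numbers), by = 2)
every_other <- jersey_numbers[odd_positions]

# All elements in reverse order.
reversed <- rev(jersey_numbers)
```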

Logical Class Vectors

Here is a design pattern for extracting vector elements using a logical (boolean) class vector. To begin, we'll find the elements of jersey_numbers that are less than 10:

> jersey_numbers < 10
 Pierce Garnett   Rondo   Allen Perkins 
  FALSE    TRUE    TRUE   FALSE   FALSE  

A vector of those logical values lets us extract just the elements that are TRUE (jersey number is less than 10):

> jersey_numbers[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
Garnett   Rondo 
      5       9  

The same can be done programmatically:

> single_digits <- jersey_numbers < 10
> jersey_numbers[single_digits]
Garnett   Rondo 
      5       9  

Alternatively, the above could be written as a single statement:

> jersey_numbers[jersey_numbers < 10]
Garnett   Rondo 
      5       9  
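
Two small extensions of the same pattern, sketched with made-up variable names: conditions can be combined with the & "and" operator, and the which() function reports the positions of the TRUE elements.

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Jersey numbers greater than 5 and less than 40.
in_between <- jersey_numbers[jersey_numbers > 5 & jersey_numbers < 40]

# Positions (not values) of the single-digit jersey numbers.
single_digit_positions <- which(jersey_numbers < 10)
```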

Recycling

Our jersey_numbers vector has five elements. What happens if we try to extract elements using a vector of logicals that itself only has two elements?

> jersey_numbers
 Pierce Garnett   Rondo   Allen Perkins 
     34       5       9      20      43 
> jersey_numbers[c(TRUE, FALSE)]
 Pierce   Rondo Perkins 
     34       9      43  

The code above returned the first, third, and fifth elements with no warnings or errors. When R runs past the second (last) element of the logical vector, it "recycles" the vector by going back to its first element. The first element of jersey_numbers is paired with TRUE and the second element is paired with FALSE. The logical vector elements are recycled and repeated for the third and fourth jersey_numbers elements. The logical vector is recycled one more time for the fifth jersey_numbers element, which is paired with TRUE. Essentially, this recycling example is equivalent to this:

> jersey_numbers[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
 Pierce   Rondo Perkins 
     34       9      43 
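
The expanded logical vector can also be generated rather than typed. As a sketch, rep() with the length.out argument repeats a vector until it reaches a given length, which mirrors what recycling does here. The names expanded, recycled_result, and explicit_result are made up for illustration.

```r
jersey_numbers <- c(Pierce = 34L, Garnett = 5L, Rondo = 9L, Allen = 20L, Perkins = 43L)

# Repeat c(TRUE, FALSE) until there is one value per jersey number.
expanded <- rep(c(TRUE, FALSE), length.out = length(jersey_numbers))
expanded

# Both forms extract the same elements.
recycled_result <- jersey_numbers[c(TRUE, FALSE)]
explicit_result <- jersey_numbers[expanded]
identical(recycled_result, explicit_result)
```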

Dave's Thoughts

There was a good mix of the familiar and unfamiliar for me in this post. Generically speaking, accessing elements/items of a collection/array/list with square brackets is very familiar. I've been using similar, if not identical syntax with C# and VB6/VBA for many years. The negative sign syntax to exclude elements is nice. I bet a lot of SQL developers would be jealous. If we want to select every column in a table, except for one, we have nothing similar to accomplish the task. We're relegated to typing out the names of every column. Sigh...

The ability to access a subset of vector elements without having to use looping constructs feels like a pretty big deal. Loops are generally easy to read and interpret, but they can be poor in terms of performance. I'm reminded of LINQ from the .NET Framework. It allows you to "query" programming objects to get a subset of them--without iterative looping.


Lesson 4 - Vector Math

Vectors of class integer or numeric can participate in mathematical operations. Operations aren't performed on the vector as a whole, though. They are performed element by element. Here is an integer vector with five elements. I'll add three to the vector:

> jersey_numbers <- c(00L, 3L, 32L, 33L, 44L)
> jersey_numbers + 3
[1]  3  6 35 36 47

We see that each element of the vector increased by a value of three. Subtraction, multiplication, and division work in the same manner (notice that division returns fractional values, so the result is of class numeric):

> jersey_numbers - 4
[1] -4 -1 28 29 40
> jersey_numbers * 5
[1]   0  15 160 165 220
> jersey_numbers / 6
[1] 0.000000 0.500000 5.333333 5.500000 7.333333
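
The class of the result depends on both operands, which the printed output doesn't always reveal. Here is a short sketch using class(); it also shows the %/% (integer division) and %% (remainder) operators, which are not covered above.

```r
jersey_numbers <- c(00L, 3L, 32L, 33L, 44L)

class(jersey_numbers + 3L)  # integer + integer stays integer
class(jersey_numbers + 3)   # 3 without the L suffix is numeric, so the result is numeric
class(jersey_numbers / 6L)  # division returns numeric, even for two integers

whole_part <- jersey_numbers %/% 6L  # integer division
remainder  <- jersey_numbers %% 6L   # remainder after division
```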

Let's imagine we are a regional sales manager. The numbers for the first quarter are in. There are 75 new customers for January, 110 for February, and 85 for March. A vector named new_customers is created:

> new_customers <- c(Jan = 75L, Feb = 110L, Mar = 85L)
> new_customers
Jan Feb Mar 
 75 110  85 

The monthly sales goal is to add 100 new customers. How did we do for Q1? For a vector with such a small number of elements, it's pretty obvious just by looking. But, vector math can also tell us. Here we'll use the > greater than operator:

> new_customers > 100
  Jan   Feb   Mar 
FALSE  TRUE FALSE 
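
With more months, scanning a row of TRUE and FALSE values gets tedious. A sketch of two follow-ups (the variable names are made up): sum() counts the TRUE values, and the logical vector can subset names() to list the months that beat the goal.

```r
new_customers <- c(Jan = 75L, Feb = 110L, Mar = 85L)

beat_goal <- new_customers > 100

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of months.
months_over_goal <- sum(beat_goal)

# Names of the months that beat the goal.
which_months <- names(new_customers)[beat_goal]
```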

From our first quarter data, we also know that we lost 90 customers in January, 80 in February, and 82 in March. A vector named customers_lost is created:

> customers_lost <- c(90L, 80L, 82L)
> customers_lost
[1] 90 80 82

Now we can determine the number of customers gained vs number of customers lost (plus/minus) for each month of the quarter by subtracting one vector from another. Each vector has the same number of elements (three), and the result is also a vector of three elements:

> net_customer_gain <- new_customers - customers_lost
> net_customer_gain
Jan Feb Mar 
-15  30   3 

The sum() function can be used to add up all the elements of a vector. Below, we get the total number of new customers and lost customers for the first quarter:

> sum(new_customers)
[1] 270
> sum(customers_lost)
[1] 252

Did we experience a net gain or loss in customers for the first quarter? Again, the sum() function can tell us:

> sum(net_customer_gain)
[1] 18

Dave's Thoughts

We haven't gotten to any looping or iterative constructs yet in R, but I suspect the sum() function is vastly more efficient than trying to do it yourself. What happens when trying math on vectors with a different number of elements? We get a warning message, along with an answer.

> c(1,2,3) + c(4,4,4,4)
[1] 5 6 7 5
Warning message:
In c(1, 2, 3) + c(4, 4, 4, 4) :
  longer object length is not a multiple of shorter object length
> c(4,5,6) < c(5,5,5,5)
[1]  TRUE FALSE FALSE  TRUE
Warning message:
In c(4, 5, 6) < c(5, 5, 5, 5) :
  longer object length is not a multiple of shorter object length

Well that's curious. How'd R determine the 4th element in each vector result? I think I have an answer. Let's discuss in the next lesson.