Lesson 15 - Beginning Graphics

Everything covered so far has been textual in nature: either code written or output displayed in the console. For this post I'll take a look at some functions for visualizing data in R. I've fretted about this post for a while. Data visualization in R is surely a huge topic, far too big for a single post. With that in mind, I've chosen to keep things as simple as possible. For each function, there will be two examples: one with the bare minimum of arguments, and another with some optional parameters that I found interesting. Examples will use the following data frame:

> Playoffs
    League TotalTeams PlayoffTeams PercentPlayoffQualifiers
1      MLB         30           10                33.333333
2      NBA         30           16                53.333333
3 NCAA-FBS        129            4                 3.100775
4 NCAA-FCS        124           24                19.354839
5    NCAAM        351           64                18.233618
6    NCAAW        349           64                18.338109
7      NHL         31           16                51.612903
8      NFL         32           12                37.500000
9     WNBA         12            8                66.666667
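
The post doesn't show how Playoffs was built, so here is one way it could be reconstructed if you want to follow along. This is only a sketch: the use of data.frame() and the percentage formula (PlayoffTeams / TotalTeams * 100) are assumptions, although the formula does reproduce the values shown above.

```r
# Assumed reconstruction of the Playoffs data frame shown above.
Playoffs <- data.frame(
  League       = c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
                   "NCAAW", "NHL", "NFL", "WNBA"),
  TotalTeams   = c(30, 30, 129, 124, 351, 349, 31, 32, 12),
  PlayoffTeams = c(10, 16, 4, 24, 64, 64, 16, 12, 8),
  stringsAsFactors = FALSE
)

# Derive the percentage column from the two count columns.
Playoffs$PercentPlayoffQualifiers <-
  Playoffs$PlayoffTeams / Playoffs$TotalTeams * 100

Playoffs
```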

Plots

A few words about plot() from the documentation:

"Generic function for plotting of R objects... For simple scatter plots, plot.default will be used. However, there are plot methods for many R objects, including functions, data.frames, density objects, etc."

Here is a plot showing the percentage of teams that qualify for the playoffs in each league:

> #Plot the percentage of teams in each 
> #league that qualify for the playoffs.
> plot(Playoffs$PercentPlayoffQualifiers)
Dave Mason - Zero To R - plot

The "Index" on the horizontal/x-axis corresponds to the row numbers in the data frame, while the vertical/y-axis corresponds to the percentage data in the PercentPlayoffQualifiers column of the data frame. Here is the plot of the same data, with the ylab and main parameters:

> plot(Playoffs$PercentPlayoffQualifiers, 
+      ylab = "Percentage", 
+      main = "Sports Leagues\n% of teams qualifying for playoffs")
Dave Mason - Zero To R - plot params

Histograms

The R documentation for the hist() function is short and succinct:

"The generic function hist computes a histogram of the given data values."

To add to that, here is a definition of a histogram from Wikipedia:

"A histogram is an accurate representation of the distribution of numerical data."

Here is a histogram showing the distribution of data for the number of teams that qualify for the playoffs in each league:

> #Histogram for the number of playoff teams.
> hist(Playoffs$PlayoffTeams)
Dave Mason - Zero To R - hist

The first bar tells us there are three leagues with 0 to 10 teams that qualify for the playoffs. The second bar tells us there are three leagues with 11 to 20 teams, etc. Here is a histogram of the same data, with the main, xlab, col, and breaks parameters:

> hist(Playoffs$PlayoffTeams, 
+      main = "Sports League Playoffs", 
+      xlab = "Number of Qualifying Teams", 
+      col = "gray", 
+      breaks = 9)
Dave Mason - Zero To R - hist params
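
To see the numbers behind the bars, hist() can be called with plot = FALSE, which returns the computed bins instead of drawing them. The sketch below re-enters the PlayoffTeams values as a standalone vector (the name playoff_teams is just for illustration). One thing worth knowing: when breaks is a single number, R treats it as a suggestion and rounds to "pretty" break points, so breaks = 9 does not guarantee exactly nine bars.

```r
# PlayoffTeams values copied from the Playoffs data frame.
playoff_teams <- c(10, 16, 4, 24, 64, 64, 16, 12, 8)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(playoff_teams, plot = FALSE)

h$breaks   # bin boundaries chosen by R
h$counts   # number of leagues falling into each bin

# Each bin includes its right edge by default, so the league with
# exactly 10 playoff teams is counted in the first bin.
sum(h$counts) == length(playoff_teams)
```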

Bar Plots

The R documentation is also brief for the barplot() function:

"Creates a bar plot with vertical or horizontal bars."

Bar plots would seem to be the same as another common data visualization: the "bar chart". Here is a bar plot showing the percentage of teams that qualify for the playoffs, by league:

> #Percentage of playoff qualifiers, by league.
> barplot(Playoffs$PercentPlayoffQualifiers)
Dave Mason - Zero To R - barplot

From left to right, each bar corresponds to a row from first to last in the source data frame. Here is the same bar plot with the ylab, main, and col parameters:

> #Percentage of playoff qualifiers, by league.
> barplot(Playoffs$PercentPlayoffQualifiers,
+         ylab = "Percentage", 
+         main = "Sports Leagues\n% of teams in playoff",
+         col = "blue")
Dave Mason - Zero To R - barplot params
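
Neither bar plot labels its bars, so you have to remember the row order to know which league is which. The names.arg parameter of barplot() takes a vector of labels, one per bar. Here is a sketch that re-enters the league names and percentages as standalone vectors (league and pct are illustrative names). The las = 2 setting turns the labels perpendicular to the axis and cex.names shrinks them, which should help the longer names fit.

```r
league <- c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
            "NCAAW", "NHL", "NFL", "WNBA")
pct <- c(10, 16, 4, 24, 64, 64, 16, 12, 8) /
       c(30, 30, 129, 124, 351, 349, 31, 32, 12) * 100

# barplot() invisibly returns the x-axis midpoint of each bar.
mids <- barplot(pct,
                names.arg = league,   # one label per bar
                las = 2,              # labels perpendicular to the axis
                cex.names = 0.8,      # slightly smaller label text
                ylab = "Percentage",
                main = "Sports Leagues\n% of teams in playoffs")
```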

Box Plots

One final trip to the R documentation, this time for boxplot():

"Produce box-and-whisker plot(s) of the given (grouped) values."

To add to that, Wikipedia tells us a box plot is...

"a method for graphically depicting groups of numerical data through their quartiles."

Here is the boxplot() function, with a vector of integers:

> #Number of playoff teams.
> boxplot(Playoffs$PlayoffTeams)
Dave Mason - Zero To R - boxplot

That visualization doesn't tell us much. Let's try it again with the formula Playoffs$PercentPlayoffQualifiers ~ Playoffs$League. The numeric values of Playoffs$PercentPlayoffQualifiers get split into groups according to the grouping values of Playoffs$League:

> #Percentage of playoff qualifiers, by league.
> boxplot(Playoffs$PercentPlayoffQualifiers ~ Playoffs$League,
+         border = c("Red", "Blue"))
Dave Mason - Zero To R - boxplot params

This time, there's some sense to be made of my boxplot example, including League names along the horizontal/x-axis. But the boxplot still remains a question mark to me. Perhaps the data doesn't lend itself to a good demonstration. Perhaps there's far more to the boxplot than a few lines of code can convey. So be it.
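
For what it's worth, part of the mystery may be the data itself: each league appears exactly once, so every per-league "box" is drawn from a single value and collapses to a flat line. The first box plot, of PlayoffTeams as a whole, has more to say. As a rough sketch, fivenum() returns the five numbers a box plot is built from (minimum, lower hinge, median, upper hinge, maximum), and boxplot.stats() reports which values get drawn as separate outlier points. The vector name playoff_teams is illustrative.

```r
# PlayoffTeams values copied from the Playoffs data frame.
playoff_teams <- c(10, 16, 4, 24, 64, 64, 16, 12, 8)

# Minimum, lower hinge, median, upper hinge, maximum.
fivenum(playoff_teams)

# stats: whisker ends, box edges, and median as drawn by boxplot().
# out:   values beyond the whiskers, drawn as individual points.
bs <- boxplot.stats(playoff_teams)
bs$stats
bs$out
```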


Dave's Thoughts

Professionally, I'm accustomed to interpreting and analyzing data non-visually in the SQL Server Management Studio query results grid. I've dabbled with a few different visualizations, including Excel charts, Google Charts, and even Tableau. I can now add R to that short list.

I've seen the topic of ethics in data science a number of times. I'm not sure if visualizations apply to that topic, but I do find it interesting that the same set of data can be displayed with a significant difference by manipulating parameters of a visualization function. If you hold the keys to the visualization tools and functions, can you have a data set paint any picture you want?


RStudio - A Brief Overview

In some upcoming lessons, we'll be looking at graphics and visualizations with R. But prior to that, I want to briefly cover RStudio, which is the development tool I've been using. The application is sectioned in quadrants, labelled below as 1, 2, 3, and 4.

Dave Mason RStudio

Quadrant 1 is the "Console". It has a default message when RStudio is initially started. Read it if you like, then enter keystrokes CTRL + L to clear the console screen. Afterwards you should see a typical command prompt, somewhat similar to a DOS or PowerShell command prompt. Type some commands and hit Enter to see the output. (You'll notice the console is actually one of two tabs in Quadrant 1. The second tab is "Terminal". To be frank, I don't know what that tab is for.)

Dave Mason RStudio Console

Quadrant 2 is where you can create and edit R source code files. You can run R code from your source file directly from here. The results are output to the console.

Dave Mason RStudio Source

Quadrant 3 displays the Global Environment. The "Environment" tab shows all the variables that have been declared and are in memory. Variables of different classes may be displayed differently. For some classes, the values are displayed. For others, metadata is shown.

Dave Mason RStudio Global Environment

The final quadrant has tabs for "Files", "Plots", "Packages", "Help", and "Viewer"--I'll focus on just two of them in this post. The Help tab shows R documentation. To invoke it, type a question mark character ? in the console, followed by the command you want help with. For instance, ?ls displays the "List Objects" documentation.

Dave Mason RStudio Help

Lastly, there is the Plots tab of quadrant 4. It displays graphical output of various R functions, such as plot(). I'll briefly examine plot and a few other functions in the next post. If you want to follow along, you should be able to generate simple visualizations like this one:

Dave Mason RStudio Plots

Lesson 14 - Subsetting Data Frames (continued)

Sorting and filtering data frames would have fit in logically with the last post, but since it had gotten a bit long, let's take a fresh look at those concepts here. To begin, let's use a preconfigured data frame of players for a totally random basketball team. There are 12 observations (players) and 15 variables:

> Celtics
            Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
Cedric   Maxwell  28       SF 80 78 31.3 0.532 0.753  5.8 2.6 0.8 0.3 2.5 2.8   11.9
Danny      Ainge  24       SG 71  3 16.3 0.460 0.821  1.6 2.3 0.6 0.1 1.0 2.0    5.4
Dennis   Johnson  29       SG 80 78 33.3 0.437 0.852  3.5 4.2 1.2 0.7 2.2 3.1   13.2
Gerald Henderson  28       PG 78 78 26.8 0.524 0.768  1.9 3.8 1.5 0.2 2.1 2.7   11.6
Greg        Kite  22        C 35  1  5.6 0.455 0.313  1.8 0.2 0.0 0.1 0.6 1.2    1.9
Kevin     McHale  26       PF 82 10 31.4 0.556 0.765  7.4 1.3 0.3 1.5 1.8 3.0   18.4
Larry       Bird  27       PF 79 77 38.3 0.492 0.888 10.1 6.6 1.8 0.9 3.0 2.5   24.2
M.L.        Carr  33       SF 60  1  9.8 0.409 0.875  1.3 0.8 0.3 0.1 0.8 1.1    3.1
Quinn    Buckner  29       PG 79  0 15.8 0.427 0.649  1.7 2.7 1.1 0.0 1.3 2.4    4.1
Robert    Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
Scott     Wedman  31       SF 68  5 13.5 0.444 0.829  2.0 1.0 0.4 0.1 0.6 1.6    4.8
Carlos     Clark  23       SG 31  0  4.1 0.365 0.889  0.5 0.5 0.3 0.0 0.4 0.4    1.7

Now that we have a dataset to work with, let's consider the order() function. As the R documentation explains, this function "returns a permutation which rearranges its first argument into ascending or descending order". We can order the "Points" vector as follows:

> Celtics$Points
 [1] 11.9  5.4 13.2 11.6  1.9 18.4 24.2  3.1  4.1 19.0  4.8  1.7
> order(Celtics$Points)
 [1] 12  5  8  9 11  2  4  1  3  6 10  7

The output above is telling us the 12th element (1.7 points) is the lowest value, followed by the 5th element (1.9 points), followed by the 8th element (3.1 points), etc. Recall from the last lesson that a data frame can be subsetted in this format: dataframe[r,c], where r and c are vectors indicating the desired rows and columns, respectively. For instance, we can subset the first three rows as follows:

> Celtics[c(1:3),]
          Name Age Position GM GS   MP   FG%   FT% REB AST STL BLK  TO  PF Points
Cedric Maxwell  28       SF 80 78 31.3 0.532 0.753 5.8 2.6 0.8 0.3 2.5 2.8   11.9
Danny    Ainge  24       SG 71  3 16.3 0.460 0.821 1.6 2.3 0.6 0.1 1.0 2.0    5.4
Dennis Johnson  29       SG 80 78 33.3 0.437 0.852 3.5 4.2 1.2 0.7 2.2 3.1   13.2

The output of order() is a vector. We can use that to subset our dataframe, ordering the observations by any variable we choose. Here again is the Celtics dataframe, ordered by Points (from least to most):

> Celtics[order(Celtics$Points),]
            Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
Carlos     Clark  23       SG 31  0  4.1 0.365 0.889  0.5 0.5 0.3 0.0 0.4 0.4    1.7
Greg        Kite  22        C 35  1  5.6 0.455 0.313  1.8 0.2 0.0 0.1 0.6 1.2    1.9
M.L.        Carr  33       SF 60  1  9.8 0.409 0.875  1.3 0.8 0.3 0.1 0.8 1.1    3.1
Quinn    Buckner  29       PG 79  0 15.8 0.427 0.649  1.7 2.7 1.1 0.0 1.3 2.4    4.1
Scott     Wedman  31       SF 68  5 13.5 0.444 0.829  2.0 1.0 0.4 0.1 0.6 1.6    4.8
Danny      Ainge  24       SG 71  3 16.3 0.460 0.821  1.6 2.3 0.6 0.1 1.0 2.0    5.4
Gerald Henderson  28       PG 78 78 26.8 0.524 0.768  1.9 3.8 1.5 0.2 2.1 2.7   11.6
Cedric   Maxwell  28       SF 80 78 31.3 0.532 0.753  5.8 2.6 0.8 0.3 2.5 2.8   11.9
Dennis   Johnson  29       SG 80 78 33.3 0.437 0.852  3.5 4.2 1.2 0.7 2.2 3.1   13.2
Kevin     McHale  26       PF 82 10 31.4 0.556 0.765  7.4 1.3 0.3 1.5 1.8 3.0   18.4
Robert    Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
Larry       Bird  27       PF 79 77 38.3 0.492 0.888 10.1 6.6 1.8 0.9 3.0 2.5   24.2

If desired, we can specify reverse order with the decreasing parameter:

> Celtics[order(Celtics$Points, decreasing = TRUE),]
            Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
Larry       Bird  27       PF 79 77 38.3 0.492 0.888 10.1 6.6 1.8 0.9 3.0 2.5   24.2
Robert    Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
Kevin     McHale  26       PF 82 10 31.4 0.556 0.765  7.4 1.3 0.3 1.5 1.8 3.0   18.4
Dennis   Johnson  29       SG 80 78 33.3 0.437 0.852  3.5 4.2 1.2 0.7 2.2 3.1   13.2
Cedric   Maxwell  28       SF 80 78 31.3 0.532 0.753  5.8 2.6 0.8 0.3 2.5 2.8   11.9
Gerald Henderson  28       PG 78 78 26.8 0.524 0.768  1.9 3.8 1.5 0.2 2.1 2.7   11.6
Danny      Ainge  24       SG 71  3 16.3 0.460 0.821  1.6 2.3 0.6 0.1 1.0 2.0    5.4
Scott     Wedman  31       SF 68  5 13.5 0.444 0.829  2.0 1.0 0.4 0.1 0.6 1.6    4.8
Quinn    Buckner  29       PG 79  0 15.8 0.427 0.649  1.7 2.7 1.1 0.0 1.3 2.4    4.1
M.L.        Carr  33       SF 60  1  9.8 0.409 0.875  1.3 0.8 0.3 0.1 0.8 1.1    3.1
Greg        Kite  22        C 35  1  5.6 0.455 0.313  1.8 0.2 0.0 0.1 0.6 1.2    1.9
Carlos     Clark  23       SG 31  0  4.1 0.365 0.889  0.5 0.5 0.3 0.0 0.4 0.4    1.7
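
order() also accepts more than one vector, using the later ones to break ties in the earlier ones. The sketch below uses a cut-down, hypothetical version of the roster (the team data frame and its three columns are made up for illustration) and sorts by Position, then by Points from highest to lowest within each position. Negating a numeric vector is a common trick for reversing just that one key.

```r
# Hypothetical, cut-down roster for illustration.
team <- data.frame(
  Name     = c("Maxwell", "Ainge", "Johnson", "Henderson", "Bird"),
  Position = c("SF", "SG", "SG", "PG", "PF"),
  Points   = c(11.9, 5.4, 13.2, 11.6, 24.2),
  stringsAsFactors = FALSE
)

# Sort by Position (ascending), then by Points (descending).
idx <- order(team$Position, -team$Points)
sorted <- team[idx, ]
sorted
```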

Our dataframe can be subsetted using vectors of logicals. Here again are the first three observations:

> Celtics[c(T,T,T,F,F,F,F,F,F,F,F,F),]
          Name Age Position GM GS   MP   FG%   FT% REB AST STL BLK  TO  PF Points
Cedric Maxwell  28       SF 80 78 31.3 0.532 0.753 5.8 2.6 0.8 0.3 2.5 2.8   11.9
Danny    Ainge  24       SG 71  3 16.3 0.460 0.821 1.6 2.3 0.6 0.1 1.0 2.0    5.4
Dennis Johnson  29       SG 80 78 33.3 0.437 0.852 3.5 4.2 1.2 0.7 2.2 3.1   13.2

The vector of logicals can also be produced by an expression. Instead of a hard-coded list of logical values (T/TRUE or F/FALSE), we can let R evaluate a condition against a column. Which players (observations) get at least 10 rebounds per game? Which players have a field goal percentage of at least 50%? Which players are 30 or older?

> Celtics$REB >= 10
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
> Celtics[Celtics$REB >= 10,]
         Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
Larry    Bird  27       PF 79 77 38.3 0.492 0.888 10.1 6.6 1.8 0.9 3.0 2.5   24.2
Robert Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
> 
> Celtics[["FG%"]] >= 0.50
 [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
> Celtics[Celtics[["FG%"]] >= 0.50,]
            Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
Cedric   Maxwell  28       SF 80 78 31.3 0.532 0.753  5.8 2.6 0.8 0.3 2.5 2.8   11.9
Gerald Henderson  28       PG 78 78 26.8 0.524 0.768  1.9 3.8 1.5 0.2 2.1 2.7   11.6
Kevin     McHale  26       PF 82 10 31.4 0.556 0.765  7.4 1.3 0.3 1.5 1.8 3.0   18.4
Robert    Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
> 
> Celtics[["Age"]] >= 30
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
> Celtics[Celtics[["Age"]] >= 30,]
         Name Age Position GM GS   MP   FG%   FT%  REB AST STL BLK  TO  PF Points
M.L.     Carr  33       SF 60  1  9.8 0.409 0.875  1.3 0.8 0.3 0.1 0.8 1.1    3.1
Robert Parish  30        C 80 79 35.8 0.546 0.745 10.7 1.7 0.7 1.5 2.3 3.3   19.0
Scott  Wedman  31       SF 68  5 13.5 0.444 0.829  2.0 1.0 0.4 0.1 0.6 1.6    4.8
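
Conditions can be combined with & (and) and | (or) to filter on more than one column at once. Here is a minimal sketch using a small hypothetical roster (team and its columns are stand-ins for the Celtics data frame). It also shows which(), which turns a logical vector into the matching row numbers.

```r
# Hypothetical, cut-down roster for illustration.
team <- data.frame(
  Name = c("Maxwell", "Ainge", "McHale", "Bird", "Parish"),
  Age  = c(28, 24, 26, 27, 30),
  REB  = c(5.8, 1.6, 7.4, 10.1, 10.7),
  stringsAsFactors = FALSE
)

# Both conditions must hold: & is the element-wise AND.
young_rebounders <- team[team$Age < 30 & team$REB >= 7, ]
young_rebounders

# which() converts a logical vector into row numbers.
big_rebounders <- which(team$REB >= 10)
big_rebounders
```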

Subset Function

We'll conclude with a quick look at the subset() function. It returns subsets of vectors, matrices or data frames which meet conditions. In the example below, the first parameter is our dataframe, followed by the subset parameter, which is a logical expression indicating the rows to be returned.

> #Players that are point guards or shooting guards ("PG" or "SG")
> subset(Celtics, subset = Celtics$Position == "PG" | Celtics$Position == "SG")
            Name Age Position GM GS   MP   FG%   FT% REB AST STL BLK  TO  PF Points
Danny      Ainge  24       SG 71  3 16.3 0.460 0.821 1.6 2.3 0.6 0.1 1.0 2.0    5.4
Dennis   Johnson  29       SG 80 78 33.3 0.437 0.852 3.5 4.2 1.2 0.7 2.2 3.1   13.2
Gerald Henderson  28       PG 78 78 26.8 0.524 0.768 1.9 3.8 1.5 0.2 2.1 2.7   11.6
Quinn    Buckner  29       PG 79  0 15.8 0.427 0.649 1.7 2.7 1.1 0.0 1.3 2.4    4.1
Carlos     Clark  23       SG 31  0  4.1 0.365 0.889 0.5 0.5 0.3 0.0 0.4 0.4    1.7
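
Inside subset(), column names can be written without the Celtics$ prefix, because the expression is evaluated within the data frame. The sketch below shows that shorter form on a small, hypothetical roster (team is an illustrative stand-in for Celtics), along with %in% to test against several values and the select parameter to keep only certain columns.

```r
# Hypothetical, cut-down roster for illustration.
team <- data.frame(
  Name     = c("Maxwell", "Ainge", "Johnson", "Henderson", "Bird"),
  Position = c("SF", "SG", "SG", "PG", "PF"),
  Points   = c(11.9, 5.4, 13.2, 11.6, 24.2),
  stringsAsFactors = FALSE
)

# Rows where Position is PG or SG; keep only the Name and Points columns.
guards <- subset(team, Position %in% c("PG", "SG"), select = c(Name, Points))
guards
```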

Dave's Thoughts

At this point, subsetting data frames by entire rows makes more sense to me than subsetting a handful of columns or a "chunk" of the data frame. The equivalent in SQL would be SELECT * FROM TABLE, which is generally frowned upon (best practice is to include the needed column names and no more). But with R, the data frame and all the data are already in memory, right? What's the harm in returning an entire row?


Lesson 13 - Subsetting Data Frames

In the last lesson, we learned about data frames and how to create them with the data.frame() function. To recap, a data frame is a structure in R for working with a dataset. It consists of rows and columns, which are called observations and variables respectively. The variables can be a mix of different classes. Here is a data frame example that was used in Lesson 12:

> #Vectors of data for our basketball team.
> name <- c("Larry Bird","Robert Parish","Dennis Johnson","Cedric Maxwell","Gerald Henderson","Kevin McHale","Danny Ainge","M.L. Carr")
> position <- c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF")
> starter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,FALSE)
> jersey <- c(33L, 00L, 3L, 31L, 43L, 32L, 44L, 30L)
> 
> #Create a data frame.
> Celtics <- data.frame(name, position, starter, jersey)
> Celtics
              name position starter jersey
1       Larry Bird       PF    TRUE     33
2    Robert Parish        C    TRUE      0
3   Dennis Johnson       SG    TRUE      3
4   Cedric Maxwell       SF    TRUE     31
5 Gerald Henderson       PG    TRUE     43
6     Kevin McHale       PF   FALSE     32
7      Danny Ainge       SG   FALSE     44
8        M.L. Carr       SF   FALSE     30

A single element of a data frame can be referenced using single brackets in this pattern: dataframe[r,c], where r is the row number, and c is the column number. c can also be the name of the corresponding column, enclosed in quotes.

> #What is Cedric Maxwell's jersey number?
> Celtics[4, 4]
[1] 31
> #Variable name of the column also works here.
> Celtics[4, "jersey"]
[1] 31

All the values of a column can be selected by omitting the row number. As above, the column number or variable name of the column can be used. Note the return is a vector (or more specifically, a factor).

> #Select a "column" by variable name or index number.
> Celtics[, "position"]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[, 2]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> class(Celtics[, 2])
[1] "factor"

As you may have guessed, an observation can be selected by specifying the observation/row number, and omitting the column number/name.

> #Select an observation by row number.
> Celtics[3,]
            name position starter jersey
3 Dennis Johnson       SG    TRUE      3

To subset multiple rows and/or columns, use a vector of numbers (or names). In the next two examples, we subset the third and fourth columns of rows five and six, followed by the "name" and "position" of rows one and two.

> #Select multiple observations & variables.
> #Result is a data frame.
> Celtics[c(5,6), c(3, 4)]
  starter jersey
5    TRUE     43
6   FALSE     32
> Celtics[c(1,2), c("name", "position")]
           name position
1    Larry Bird       PF
2 Robert Parish        C

Columns can also be subsetted with single brackets and either the variable name of the column, or the column number. Here the result is a data frame.

> #Select all the variables of a column. 
> #The next two statements are equivalent.
> Celtics["starter"]
  starter
1    TRUE
2    TRUE
3    TRUE
4    TRUE
5    TRUE
6   FALSE
7   FALSE
8   FALSE
> Celtics[3]
  starter
1    TRUE
2    TRUE
3    TRUE
4    TRUE
5    TRUE
6   FALSE
7   FALSE
8   FALSE
> #Result is a data frame.
> class(Celtics["starter"])
[1] "data.frame"

Subsetting columns with double brackets or the $ shortcut outputs a vector (again, here it is more specifically a factor).

> #All of these are equivalent. 
> Celtics$position
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[["position"]]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> Celtics[[2]]
[1] PF C  SG SF PG PF SG SF
Levels: C PF PG SF SG
> #Result is a vector.
> class(Celtics$position)
[1] "factor"
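
Since so much hinges on whether a subset comes back as a data frame or a vector, here is a quick side-by-side check using class(). The roster data frame is a small illustrative stand-in, not the Celtics data frame above. The drop = FALSE argument is worth a mention: it keeps the result of the [row, column] form as a data frame instead of simplifying it to a vector.

```r
# Small stand-in data frame for illustration.
roster <- data.frame(
  name   = c("Larry Bird", "Robert Parish", "Dennis Johnson"),
  jersey = c(33L, 0L, 3L),
  stringsAsFactors = FALSE
)

class(roster["jersey"])                  # single brackets: data frame
class(roster[["jersey"]])                # double brackets: vector
class(roster$jersey)                     # $ shortcut: vector
class(roster[, "jersey"])                # row/column form: vector
class(roster[, "jersey", drop = FALSE])  # row/column form, kept as a data frame
```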

Extending Data Frames

Adding columns to a data frame is easy--easy compared to adding rows. We'll get to that. To add a column, first create a vector. The class doesn't matter. But the number of elements does--it has to match the number of observations in the data frame. Now that we have our vector, here are some options to add it as a new column to a data frame: use the $ shortcut, use double brackets with the new column name, bind the vector to the dataframe with cbind().

> #Create a vector...
> points <- c(24.2, 19.0, 13.2, 11.9, 11.6, 18.4, 5.4, 3.1)
> #...add it as a new column of variables.
> Celtics$Points <- points
> Celtics
              name position starter jersey Points
1       Larry Bird       PF    TRUE     33   24.2
2    Robert Parish        C    TRUE      0   19.0
3   Dennis Johnson       SG    TRUE      3   13.2
4   Cedric Maxwell       SF    TRUE     31   11.9
5 Gerald Henderson       PG    TRUE     43   11.6
6     Kevin McHale       PF   FALSE     32   18.4
7      Danny Ainge       SG   FALSE     44    5.4
8        M.L. Carr       SF   FALSE     30    3.1

> #These two options are equivalent to the above.
> Celtics[["Points"]] <- points
> Celtics <- cbind(Celtics, Points = points)

Adding an observation to a data frame is a bit more work. We can't create a new row as a vector and add it to the data frame. This makes sense because vectors have elements of the same class. A row of a data frame can be of different classes. Instead, we have to create a data frame with one or more rows and combine the data frames with rbind(). Let's create a data frame with one row for another player and try to bind it to our existing data frame.

> new_row <- data.frame("Quinn Buckner", "PG", FALSE, 28, 4.1)
> #Note: names must match!
> Celtics <- rbind(Celtics, new_row)
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

The above attempt didn't work, and the error message is rather clear. The names for each data frame must match. That is easily remedied with the names() function:

> #Sync the names.
> names(new_row) <- names(Celtics)
> Celtics <- rbind(Celtics, new_row)
> Celtics
              name position starter jersey Points
1       Larry Bird       PF    TRUE     33   24.2
2    Robert Parish        C    TRUE      0   19.0
3   Dennis Johnson       SG    TRUE      3   13.2
4   Cedric Maxwell       SF    TRUE     31   11.9
5 Gerald Henderson       PG    TRUE     43   11.6
6     Kevin McHale       PF   FALSE     32   18.4
7      Danny Ainge       SG   FALSE     44    5.4
8        M.L. Carr       SF   FALSE     30    3.1
9    Quinn Buckner       PG   FALSE     28    4.1
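
One way to skip the names() step, sketched below on a small stand-in data frame (roster is illustrative), is to name the columns when creating the one-row data frame. rbind() matches data frame columns by name, so the new row then binds without complaint.

```r
# Small stand-in data frame for illustration.
roster <- data.frame(
  name   = c("Larry Bird", "Robert Parish"),
  jersey = c(33L, 0L),
  stringsAsFactors = FALSE
)

# Name the columns up front so they match the existing data frame.
new_row <- data.frame(name = "Quinn Buckner", jersey = 28L,
                      stringsAsFactors = FALSE)

roster <- rbind(roster, new_row)
roster
```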

Dave's Thoughts

As someone without a lot of R experience, knowing the difference between single and double brackets doesn't come naturally to me yet. From my reading, I have inferred the importance of understanding whether the return type of subsetting a data frame is a vector, or another data frame. I was a little surprised (and maybe even discouraged) to see how much effort is needed to add a row to a data frame. In the .NET Framework, there is a DataTable object, which has a NewRow() function that makes adding rows pretty easy. I'm curious to see if there's something else in R that is similar.


Lesson 12 - Data Frames

Those of you with a strong background in databases will see some familiar concepts in this post. For lesson 12, we will consider datasets and structures in R that can accommodate them. Thinking back on what's been covered so far, we know vectors and matrices can't mix data of different classes. A list can contain different classes--it can contain just about anything, including another list. But it's not practical for working with datasets.

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call "observations", and of columns, which are called "variables". (Observations may also be referred to as "instances". Variables may also be referred to as "properties".) The data frame in R is designed for datasets. As the R documentation tells us, data frames are "used as the fundamental data structure by most of R's modeling software".

The function we'll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.
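
As a small illustration of that more common route, read.csv() returns a data frame directly. To keep the sketch self-contained, the text parameter is used to read from a string instead of a file; the CSV content here is made up for the example. With a real file, you would pass its path as the first argument instead.

```r
# Made-up CSV content standing in for an external file.
csv_text <- "name,position,jersey
Larry Bird,PF,33
Robert Parish,C,0
Dennis Johnson,SG,3"

# read.csv() parses the text and returns a data frame.
roster <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(roster)
str(roster)
```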

To begin, we need some data for our dataset. Here are four vectors of data for a randomly selected basketball team:

> name <- c("Larry Bird","Robert Parish","Dennis Johnson","Cedric Maxwell","Gerald Henderson","Kevin McHale","Danny Ainge","M.L. Carr")
> position <- c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF")
> starter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,FALSE)
> jersey <- c(33L, 00L, 3L, 31L, 43L, 32L, 44L, 30L)

Now let's pass those vectors as arguments to data.frame() to create our first data frame:

> #Parameters for data.frame are vectors.
> Celtics <- data.frame(name, position, starter, jersey)
> #Names of the columns are taken from the variable names of the vectors.
> Celtics
              name position starter jersey
1       Larry Bird       PF    TRUE     33
2    Robert Parish        C    TRUE      0
3   Dennis Johnson       SG    TRUE      3
4   Cedric Maxwell       SF    TRUE     31
5 Gerald Henderson       PG    TRUE     43
6     Kevin McHale       PF   FALSE     32
7      Danny Ainge       SG   FALSE     44
8        M.L. Carr       SF   FALSE     30

The output of the Celtics data frame shows our observations (rows) and variables (columns) of data. By default, the variable names are taken from the names of the vectors used in the data.frame() function. If desired, the variable names of an existing data frame can be changed with the names() function. (This should be familiar by now.) Variable names can also be specified inline with the data.frame() function.

> #Use names() function, or...
> names(Celtics) <- c("Name", "Position", "Starter", "Jersey Number")
> #...specify column names inline.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey)
> Celtics
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30
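
The period in Jersey.Number comes from data.frame() tidying column names into syntactically valid R names, which is controlled by its check.names parameter. If you really want the space, a sketch like the following should keep it. Be aware that a name with a space needs backticks or double brackets whenever you refer to it.

```r
jersey <- c(33L, 0L, 3L)

# Default: names are made syntactically valid.
fixed <- data.frame("Jersey Number" = jersey)
names(fixed)

# check.names = FALSE keeps the name exactly as given.
kept <- data.frame("Jersey Number" = jersey, check.names = FALSE)
names(kept)

# Columns with spaces need backticks or double brackets.
kept$`Jersey Number`
kept[["Jersey Number"]]
```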

The structure of a data frame looks similar to that of a list, displaying the number of observations and variables. Take a close look and you'll see the Names and Positions are not character vectors. They're factors.

> #Output of str() looks similar to that of a list.
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : Factor w/ 8 levels "Cedric Maxwell",..: 6 8 3 1 4 5 2 7
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30

The data.frame() function creates strings as factors by default (in R versions before 4.0.0; from 4.0.0 onward the default is stringsAsFactors = FALSE). This behavior can be overridden by setting the stringsAsFactors parameter to FALSE:

> #Force string data to a character vector.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey,
+     stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : chr  "PF" "C" "SG" "SF" ...
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30

Now Name and Position are both character vectors. However, I think I'd actually like Position to be a factor--there are only five possible values. Let's fix that by changing position to a factor and recreating the data frame:

> #Change position from a character vector to a factor.
> position <- factor(c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF"))
> #Force string data to a character vector.
> Celtics <- data.frame(Name = name, Position = position, 
+                       Starter = starter, "Jersey Number" = jersey,
+                       stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30
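
The numbers at the end of the Position line in the str() output are the integer codes behind the factor. Here is a short sketch of how those codes relate to the levels, using the position values from the start of the post:

```r
position <- factor(c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF"))

levels(position)      # the distinct values, sorted alphabetically
as.integer(position)  # each element's index into levels(position)
table(position)       # how many players at each position
```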

Our sample data set is awfully small. Outputting the entirety of the data frame is of no concern. For large data sets, the head() and tail() functions should come in handy. They output the first or last parts of a data frame:

> #head() and tail() functions return the first 
> #or last parts of the data frame.
> head(Celtics)
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32

> tail(Celtics)
              Name Position Starter Jersey.Number
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30

We'll close with the dim() function. It retrieves the dimensions of the data frame: the number of rows, followed by the number of columns.

> #Retrieve the dimensions of the data frame.
> dim(Celtics)
[1] 8 4
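
Both head() and tail() return six rows by default, and both take an n parameter to change that. Likewise, nrow() and ncol() return the two values of dim() individually. A brief sketch on a throwaway data frame (df is illustrative):

```r
# Throwaway data frame with 10 rows and 2 columns.
df <- data.frame(id = 1:10, square = (1:10)^2)

first_three <- head(df, n = 3)   # first three rows
last_two    <- tail(df, n = 2)   # last two rows
first_three
last_two

dim(df)    # rows, then columns
nrow(df)   # rows only
ncol(df)   # columns only
```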

Dave's Thoughts

It feels like everything we've learned so far has led to this--the data frame. Thus far, it's the closest thing to a database table that we've seen.


Lesson 11 - List Subsetting

Despite the similarities we've seen between vectors and lists, subsetting is considerably different. Let's revisit the division list from the last post. It has elements for "Name", (number of) "Teams", and "Conference". There are also five list elements--one for each team in the division.

> division <- list(Name = "Atlantic", Teams = 5L, Conference = "Eastern",
+                  list(City = "Boston", Nickname = "Celtics", Championships = 17),
+                  list(City = "New York", Nickname = "Knicks", Championships = 2),
+                  list(City = "Philadelphia", Nickname = "76ers", Championships = 1),
+                  list(City = "Brooklyn", Nickname = "Nets", Championships = 0),
+                  list(City = "Toronto", Nickname = "Raptors", Championships = 0)
+ )
> str(division)
List of 8
 $ Name      : chr "Atlantic"
 $ Teams     : int 5
 $ Conference: chr "Eastern"
 $           :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $           :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $           :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $           :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $           :List of 3
  ..$ City         : chr "Toronto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0

Subsetting the list with single brackets [] for the first element appears to return "Atlantic". But if we take a closer look using the str() function, we see that R actually returned a list with one element:

> #Appears to return "Atlantic" as a character class.
> division[1]
$Name
[1] "Atlantic"

> #str shows us the return is actually a list of 1 element.
> str(division[1])
List of 1
 $ Name: chr "Atlantic"

Subsetting the list with double brackets [[]] for the first element also returns "Atlantic". But this time, the str() function shows us a class of type character is returned:

> #Subsetting with double brackets returns a character class.
> division[[1]]
[1] "Atlantic"
> #Verify with str()
> str(division[[1]])
 chr "Atlantic"

That's the big takeaway from this lesson: subsetting lists with single brackets [] returns a list, whereas double brackets [[]] return a single element. The above examples subsetted by ordinal position. Subsetting by name follows the same rules:

> #Subsetting by name.
> division["Name"]
$Name
[1] "Atlantic"

> str(division["Name"])
List of 1
 $ Name: chr "Atlantic"
> division[["Name"]]
[1] "Atlantic"
> str(division[["Name"]])
 chr "Atlantic"
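
Why does the distinction matter in practice? Here is a minimal sketch, using a shortened list I made up for the purpose (division_info is not the division list above). The extracted element can be used in arithmetic, while the one-element list cannot.

```r
# A small named list, similar in spirit to the division list.
division_info <- list(Name = "Atlantic", Teams = 5L, Conference = "Eastern")

# Single brackets keep the list wrapper.
single <- division_info["Teams"]
# Double brackets pull out the element itself.
double <- division_info[["Teams"]]

print(class(single))  # "list"
print(class(double))  # "integer"

# Arithmetic works on the extracted element...
print(double + 1L)
# ...but fails on the one-element list, so catch the error to show the message.
result <- tryCatch(single + 1L, error = function(e) conditionMessage(e))
print(result)
```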

Now let's subset a list within a list. Note that even if double brackets [[]] are used, the returned data is a list (because the corresponding element is a list itself):

> #Subset the Celtics
> division[[4]]
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

> #Subset the Knicks
> division[[5]]
$City
[1] "New York"

$Nickname
[1] "Knicks"

$Championships
[1] 2

A list can be subsetted to return more than one element. Here are the first and third elements ("Name" and "Conference") of the division list. A class of type list is returned:

> division[c(1,3)]
$Name
[1] "Atlantic"

$Conference
[1] "Eastern"

> str(division[c(1,3)])
List of 2
 $ Name      : chr "Atlantic"
 $ Conference: chr "Eastern"

Here we will subset to return a single element of a list within a list. All of the statements below are equivalent. The use of double brackets [[]] ensures the class returned matches that of the corresponding element, which in this case is numeric. Note that the $ shortcut can only be used with a named list, and is otherwise nearly equivalent to using double brackets [[]] (one difference is that $ allows partial matching of element names).

> #How many championships have the Celtics won?
> #Each statement is equivalent.
> division[[4]][[3]]
[1] 17
> division[[c(4,3)]]
[1] 17
> division[[4]]$Championships
[1] 17

A vector of logicals can be used to subset a list. But this is a single brackets [] operation only. Here are the last five elements of the division list:

> division[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]
[[1]]
[[1]]$City
[1] "Boston"

[[1]]$Nickname
[1] "Celtics"

[[1]]$Championships
[1] 17


[[2]]
[[2]]$City
[1] "New York"

[[2]]$Nickname
[1] "Knicks"

[[2]]$Championships
[1] 2


[[3]]
[[3]]$City
[1] "Philadelphia"

[[3]]$Nickname
[1] "76ers"

[[3]]$Championships
[1] 1


[[4]]
[[4]]$City
[1] "Brooklyn"

[[4]]$Nickname
[1] "Nets"

[[4]]$Championships
[1] 0


[[5]]
[[5]]$City
[1] "Totonto"

[[5]]$Nickname
[1] "Raptors"

[[5]]$Championships
[1] 0
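
Typing out a logical vector by hand gets tedious. As a sketch of one alternative, the logical vector can be computed instead. The list below is a shortened, hypothetical version of the division list (two teams only), and the names is_team, teams, and nicknames are mine.

```r
# A shortened, hypothetical version of the division list with two teams.
division_small <- list(
  Name = "Atlantic", Teams = 2L, Conference = "Eastern",
  list(City = "Boston", Nickname = "Celtics", Championships = 17),
  list(City = "New York", Nickname = "Knicks", Championships = 2)
)

# TRUE for elements that are themselves lists (the teams), FALSE otherwise.
is_team <- vapply(division_small, is.list, logical(1))
print(unname(is_team))

# Use the logical vector with single brackets to keep only the teams.
teams <- division_small[is_team]

# Pull one field out of every team.
nicknames <- vapply(teams, function(team) team$Nickname, character(1))
print(unname(nicknames))
```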

Subsetting with logicals doesn't work with double brackets [[]]:

> division[[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]]
Error in division[[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)]] : 
  attempt to select less than one element in integerOneIndex
> division[[F]][[F]][[F]][[T]][[T]][[T]][[T]][[T]]
Error in division[[F]] : 
  attempt to select less than one element in integerOneIndex

We can add elements to an existing list. A character vector of teams that are no longer in the Atlantic Division is created and added to the division list with the name "FormerTeams".

> former_teams <- c("Buffalo Braves", "Charlotte Hornets", "Miami Heat", "Orlando Magic", "Washington Bullets/Wizards")
> division$FormerTeams <- former_teams
> str(division)
List of 9
 $ Name       : chr "Atlantic"
 $ Teams      : int 5
 $ Conference : chr "Eastern"
 $            :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $            :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $            :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $            :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $            :List of 3
  ..$ City         : chr "Totonto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0
 $ FormerTeams: chr [1:5] "Buffalo Braves" "Charlotte Hornets" "Miami Heat" "Orlando Magic" ...
> #Could also have done this:
> division[["FormerTeams"]] <- former_teams

Adding an element to a list within a list works similarly to above. Here we add the name of the head coach to the fourth element in the division list:

> #Add element to an embedded list.
> division[[4]][["Coach"]] <- "Brad Stevens"
> division[[4]]
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

$Coach
[1] "Brad Stevens"

> 
> #Could also have done this:
> division[[4]]$Coach <- "Brad Stevens"

Dave's Thoughts

Parts of this lesson were a little tricky IMO, but after distinguishing between single brackets [] and double brackets [[]], it all eventually made sense.

Once again we saw some examples that used the $ programming shortcut. Several times I've wondered why anyone would *not* use it--it's so much more convenient than the alternative. After giving it some thought, I have an educated guess that seems to be correct based on some code tests. Here is a brief sample that demonstrates:

> football_team <- list(Name = "UCF", NickName = "Knights", NationalChampionships = 1L)
> 
> #Reference an element by $ and name, or double brackets and name:
> football_team$Name
[1] "UCF"
> football_team[["NationalChampionships"]]
[1] 1
> 
> #Store an element name in a variable.
> element_name <- "NickName"
> #Reference by double brackets with variable works.
> football_team[[element_name]]
[1] "Knights"
> #Reference by $ and variable fails.
> football_team$element_name
NULL

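In short, to use the $ programming shortcut it appears you have to know the element name at design time (when you write the code). The $ operator treats whatever follows it as a literal name rather than evaluating it as a variable, which is why football_team$element_name looks for an element named "element_name" and returns NULL.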
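
Here is a short sketch that pushes on this a bit further. It reuses the football_team list from above and shows two things I'm fairly confident about: $ allows partial matching of names while [[]] matches exactly by default, and looping over element names requires [[]] because the name lives in a variable.

```r
football_team <- list(Name = "UCF", NickName = "Knights", NationalChampionships = 1L)

# $ treats what follows as a literal name, never as a variable to evaluate.
element_name <- "NickName"
print(is.null(football_team$element_name))  # TRUE: no element called "element_name"

# $ also allows partial matching of names; [[ ]] matches exactly by default.
print(football_team$Nick)        # partial match finds NickName
print(football_team[["Nick"]])   # NULL, because no element is named exactly "Nick"

# Looping over names only works with [[ ]], since the name is held in a variable.
for (nm in names(football_team)) {
  cat(nm, ":", format(football_team[[nm]]), "\n")
}
```

Partial matching is convenient at the console, but it is one more reason to prefer [[]] in code that needs to be predictable.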


Lesson 10 - Lists

In previous lessons, we've noted that vectors and matrices consist of data elements of the same class. R will coerce data elements to a single class if we attempt to create a vector or matrix with data elements of differing classes. Lists, on the other hand, can hold data elements of different classes, such as the integer, character, or logical class. In fact, a list can hold most anything in R, including vectors, matrices, and many more! To no one's surprise, lists can be created with the list() function:

> #Coercion with a vector
> c("Boston", "Celtics", 17)
[1] "Boston"  "Celtics" "17"     
> 
> #No coercion with a list.
> #Output is notably different from that of a vector.
> #Note the double square brackets.
> list("Boston", "Celtics", 17)
[[1]]
[1] "Boston"

[[2]]
[1] "Celtics"

[[3]]
[1] 17
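
To back up the claim that a list can hold most anything, here is a small sketch. The list mixed and its contents are made up for illustration.

```r
# A list whose elements have different classes and shapes.
mixed <- list(
  "Celtics",                      # character
  17L,                            # integer
  c(TRUE, FALSE, TRUE),           # logical vector
  matrix(1:4, nrow = 2),          # matrix
  list(City = "Boston")           # another list
)

# Report the (first) class of each element; nothing was coerced.
element_classes <- vapply(mixed, function(x) class(x)[1], character(1))
print(element_classes)
```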

If there's ever a doubt that something is a list, is.list() and/or class() will tell us with certainty:

> team <- list("Boston", "Celtics", 17)
> is.list(team)
[1] TRUE
> class(team)
[1] "list"

As with vectors and matrices, we can use the names() function to name each element of a list. The output changes accordingly:

> #Set the names of a list.
> names(team) <- c("City", "Nickname", "Championships")
> 
> #Double brackets are replaced with names.
> #Note the leading $ character.
> team
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

Like vectors, elements of a list can be referenced by ordinal position or name. There's also a shortcut using the $ character. The format is listName$elementName:

> #Reference a list element by its ordinal position or name.
> team[1]
$City
[1] "Boston"

> team["Nickname"]
$Nickname
[1] "Celtics"

> #Now we can also reference a single list element
> #as follows:
> team$Nickname
[1] "Celtics"
> class(team$City)
[1] "character"
> class(team$Championships)
[1] "numeric"

The list() function also allows names to be provided inline with the element values:

> #Names can also be specified inline with the list() function.
> team <- list(City = "Boston", Nickname = "Celtics", Championships = 17)
> team
$City
[1] "Boston"

$Nickname
[1] "Celtics"

$Championships
[1] 17

The R documentation tells us the str() function "displays the internal structure of any R object." Lists are no exception:

> #For a list, the str() function outputs the number of 
> #elements, and the name, class, and value of each element.
> str(team)
List of 3
 $ City         : chr "Boston"
 $ Nickname     : chr "Celtics"
 $ Championships: num 17

Now let's show off a little and create a nested list (list within a list). Here is the NBA's Atlantic Division, including the five teams that comprise it:

> division <- list(Name = "Atlantic", Teams = 5L, Conference = "Eastern",
+                  list(City = "Boston", Nickname = "Celtics", Championships = 17),
+                  list(City = "New York", Nickname = "Knicks", Championships = 2),
+                  list(City = "Philadelphia", Nickname = "76ers", Championships = 1),
+                  list(City = "Brooklyn", Nickname = "Nets", Championships = 0),
+                  list(City = "Totonto", Nickname = "Raptors", Championships = 0)
+ )
> str(division)
List of 8
 $ Name      : chr "Atlantic"
 $ Teams     : int 5
 $ Conference: chr "Eastern"
 $           :List of 3
  ..$ City         : chr "Boston"
  ..$ Nickname     : chr "Celtics"
  ..$ Championships: num 17
 $           :List of 3
  ..$ City         : chr "New York"
  ..$ Nickname     : chr "Knicks"
  ..$ Championships: num 2
 $           :List of 3
  ..$ City         : chr "Philadelphia"
  ..$ Nickname     : chr "76ers"
  ..$ Championships: num 1
 $           :List of 3
  ..$ City         : chr "Brooklyn"
  ..$ Nickname     : chr "Nets"
  ..$ Championships: num 0
 $           :List of 3
  ..$ City         : chr "Totonto"
  ..$ Nickname     : chr "Raptors"
  ..$ Championships: num 0

It's nice the way str() indents the nested lists (the five teams). This seems well suited for displaying hierarchical data.


Dave's Thoughts

The $ shortcut is nice. For those of you following along with RStudio, you should notice the intellisense feature. Here's what it looks like for me:

[Screenshot: RStudio intellisense]

I see a parallel between R lists and classes in OO languages. An OO class can have one or more named properties of differing data types. A class can even have another class as a property. Sounds a lot like an R list to me.

A list in R is one dimensional. Surely there's a two dimensional class object in R, right? Hmm.


Lesson 9 - Factors

Somewhat surprisingly, I had a difficult time coming up with a strict definition for a factor. The documentation indirectly refers to it as an encoded vector. It also notes "the terms ‘category’ and ‘enumerated type’ are also used for factors". I've also seen it referred to as a data structure of categorical values. The challenges of definitions aside, a factor can be created via the factor() function. The first parameter of this function is a vector, so let's start there. Here is a vector of basketball player positions ("C" for center, "F" for forward, and "G" for guard):

> positions <- c("C", "F", "G")
> positions
[1] "C" "F" "G"

So far, so good. Now let's create a factor from the vector:

> position_factor <- factor(positions)
> position_factor
[1] C F G
Levels: C F G

The output looks similar to that of a vector, although it drops the double-quotes around the data elements. And there's that last line of "Levels". More on that in a bit. Let's revisit our "definition" of a factor. As was noted, the documentation refers to it as an "encoded vector" or an "enumerated type". To that point, there are integer values behind each element of the factor. We can see this by examining the structure of our factor using the str() function:

> positions <- c("C", "F", "G")
> position_factor <- factor(positions)
> position_factor
[1] C F G
Levels: C F G
> str(position_factor)
 Factor w/ 3 levels "C","F","G": 1 2 3

The str() function shows us the factor has three levels (categories) "C", "F", and "G" with underlying integer values of 1, 2, and 3 respectively. Let's add some more players to the vector (by their position) and re-create the factor:

> #Add more players (by their position).
> positions <- c(positions, "F", "F", "G", "C", "F", "G", "G")
> positions
 [1] "C" "F" "G" "F" "F" "G" "C" "F" "G" "G"
> #Now that the vector has more elements,
> #re-create the factor.
> position_factor <- factor(positions)
> position_factor
 [1] C F G F F G C F G G
Levels: C F G

We see the position of all ten of the players, and the three Levels remain the same. The Levels are the categories (distinct values) of the data elements. If we check the structure of the factor again, all of the data elements have a corresponding integer value.

> #Re-examine the structure of the factor.
> str(position_factor)
 Factor w/ 3 levels "C","F","G": 1 2 3 2 2 3 1 2 3 3

As I understand it, one of the benefits of this is more efficient use of memory. Consider a factor that used the full names "center", "forward", and "guard" as elements. Now imagine there were one million elements in the factor. The character values "center", "forward", and "guard" are stored in memory exactly once each, along with one million integers (as opposed to one million strings of characters).
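
A minimal sketch of the "integers plus levels" idea, using the same positions vector. The variable names codes, lvls, and rebuilt are mine.

```r
positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
position_factor <- factor(positions)

# The integer codes stored for each element.
codes <- as.integer(position_factor)
print(codes)

# The distinct levels, each stored once.
lvls <- levels(position_factor)
print(lvls)

# Indexing the levels by the codes rebuilds the original character vector.
rebuilt <- lvls[codes]
print(identical(rebuilt, positions))
```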

How many players are there at each position? The summary() function comes in handy here. It shows a count of data elements by level:

> #The summary function:
> summary(position_factor)
C F G 
2 4 4 
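
For comparison, here is a sketch using table(), which as far as I can tell gives the same counts for a factor. The name position_counts is mine.

```r
positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
position_factor <- factor(positions)

# table() counts elements per level, much like summary() does for a factor.
position_counts <- table(position_factor)
print(position_counts)

# The counts can be looked up by level name.
print(position_counts[["F"]])
```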

Order

When the factor function is invoked, it collects the distinct values in the vector to get the categories/levels. While it's not clear from the examples so far, by default, the levels are sorted and presented in alphabetical order. This is easily demonstrated by changing the order of the vector elements:

> #By default levels are sorted in alphabetical order.
> factor(c("G", "F", "F", "G", "C", "F", "G", "G"))
[1] G F F G C F G G
Levels: C F G

The factor function has an optional levels parameter. When specified, it dictates the order of the levels.

> #The levels argument can be specified to dictate their order.
> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions, levels = c("G", "F", "C"))
> position_factor
 [1] C F G F F G C F G G
Levels: G F C
> str(position_factor)
 Factor w/ 3 levels "G","F","C": 3 2 1 2 2 1 3 2 1 1

Now the levels are G, F, and C with integer values 1, 2, and 3 respectively. Another optional parameter is labels, which is used to specify the names of the levels. Let's add the labels "Guard", "Forward", and "Center":

> #The labels argument is used to change the names of the levels.
> #Specify levels too, so each label lines up with the intended level.
> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions, 
+                           levels = c("G", "F", "C"),
+                           labels = c("Guard", "Forward", "Center"))
> position_factor
 [1] Center  Forward Guard   Forward Forward Guard   Center  Forward Guard   Guard  
Levels: Guard Forward Center
> str(position_factor)
 Factor w/ 3 levels "Guard","Forward",..: 3 2 1 2 2 1 3 2 1 1

NOTE: make sure the data elements for the labels are in the right order to be in sync with the levels (whether specified explicitly or sorted automatically in alphabetical order).
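
To illustrate the NOTE above, here is a small sketch of what can go wrong. Labels are matched to levels by position, so if the levels are left to sort alphabetically (C, F, G), the labels "Guard", "Forward", "Center" end up attached to the wrong positions. The names mismatched and in_sync are mine.

```r
positions <- c("C", "F", "G")

# Without levels, the levels default to sorted order: C, F, G.
# These labels are then matched positionally, so "C" is labeled "Guard".
mismatched <- factor(positions, labels = c("Guard", "Forward", "Center"))
print(as.character(mismatched))

# Giving levels and labels in the same order keeps them in sync.
in_sync <- factor(positions,
                  levels = c("G", "F", "C"),
                  labels = c("Guard", "Forward", "Center"))
print(as.character(in_sync))
```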

Levels can also be renamed after a factor is created, via the levels() function. (This is similar to the way the names() function is used for vectors.)

> positions <- c("C", "F", "G", "F", "F", "G", "C", "F", "G", "G")
> position_factor <- factor(positions)
> position_factor
 [1] C F G F F G C F G G
Levels: C F G
> #Name the levels of an existing factor.
> levels(position_factor) <- c("Center", "Forward","Guard")
> position_factor
 [1] Center  Forward Guard   Forward Forward Guard   Center  Forward Guard   Guard  
Levels: Center Forward Guard

So to recap, the factor() function has parameters for levels and labels, and there is a separate levels() function. Clear as mud, right?


Nominal vs Ordinal

Factor data can be nominal or ordinal. In our examples so far, it is nominal. "C", "G", and "F" (and "Center", "Guard", and "Forward" for that matter) are names that have no comparative order to each other. It's not meaningful to say a Center is greater than a Forward or a Forward is less than a Guard (keep in mind these are position names--don't let height cloud your thinking). If we try making a comparison, we get a warning message:

> position_factor[1] > position_factor[2]
[1] NA
Warning message:
In Ops.factor(position_factor[1], position_factor[2]) :
  ‘>’ not meaningful for factors

Ordinal data, on the other hand, can be compared to each other in some ranked fashion--it has order. Take bed sizes, for instance. A "Twin" bed is smaller than a "Full", which is smaller than a "Queen", which is smaller than a "King". To create a factor with ordered (ranked) levels, use the ordered parameter, which is a logical flag to indicate if the levels should be regarded as ordered (in the order given). Here's a factor of bed sizes:

> #Beds factor with ordered levels.
> beds <- c("Queen", "Full", "Twin", "King", "Twin", "Full", "Twin")
> beds_factor <- factor(beds, ordered = TRUE, 
+   levels = c("Twin", "Full", "Queen", "King"))
> beds_factor
[1] Queen Full  Twin  King  Twin  Full  Twin 
Levels: Twin < Full < Queen < King

Now we can compare individual data elements of the factor. Is one element "greater" than another?

> #Is a Queen bed "larger" than a Twin?
> beds_factor[1]
[1] Queen
Levels: Twin < Full < Queen < King
> beds_factor[3]
[1] Twin
Levels: Twin < Full < Queen < King
> beds_factor[1] > beds_factor[3]
[1] TRUE
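
Comparisons aren't limited to one element at a time. Here is a sketch that compares the whole factor against a level name and then uses the result for subsetting. The name at_least_queen is mine.

```r
beds <- c("Queen", "Full", "Twin", "King", "Twin", "Full", "Twin")
beds_factor <- factor(beds, ordered = TRUE,
                      levels = c("Twin", "Full", "Queen", "King"))

# Compare every element against a level name; the result is a logical vector.
at_least_queen <- beds_factor >= "Queen"
print(at_least_queen)

# Use that logical vector to keep only the larger beds.
print(as.character(beds_factor[at_least_queen]))

# min() and max() respect the level order of an ordered factor.
print(as.character(max(beds_factor)))
print(as.character(min(beds_factor)))
```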

Dave's Thoughts

Long post, eh? A factor in some ways reminds me of an enum in C# or Visual Basic. Those structures don't contain data, though. I'm wondering if a factor can be a function parameter/argument, such that invalid values (levels) are rejected or ignored. The summary() function is nice.
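
On the question of invalid values, here is a small sketch of what factor() itself does when the levels are given explicitly. It is only a partial answer to the function parameter question; the names allowed_positions and checked are mine.

```r
allowed_positions <- c("C", "F", "G")

# "X" is not one of the allowed levels, so it becomes NA in the factor.
checked <- factor(c("C", "X", "G"), levels = allowed_positions)
print(checked)
print(is.na(checked))
```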


Lesson 8 - Matrix Math

In the last lesson, it was noted that matrix subsetting has many commonalities with vector subsetting. The same holds true for matrix math, so this will be a short lesson. Let's begin with a matrix of numbers with four rows and six columns:

> some_numbers <- matrix(1:24, nrow=4, byrow = TRUE)
> some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24

Math operations are generally done element-by-element. Here are simple examples for addition, subtraction, multiplication, and division:

> #Add 2 to each element in the matrix.
> some_numbers + 2
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    3    4    5    6    7    8
[2,]    9   10   11   12   13   14
[3,]   15   16   17   18   19   20
[4,]   21   22   23   24   25   26

> #Subtract 3 from each element in the matrix.
> some_numbers - 3
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   -2   -1    0    1    2    3
[2,]    4    5    6    7    8    9
[3,]   10   11   12   13   14   15
[4,]   16   17   18   19   20   21

> #Multiply each element of the matrix by four.
> some_numbers * 4
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    4    8   12   16   20   24
[2,]   28   32   36   40   44   48
[3,]   52   56   60   64   68   72
[4,]   76   80   84   88   92   96

> #Divide each element of the matrix by five.
> some_numbers / 5
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.2  0.4  0.6  0.8  1.0  1.2
[2,]  1.4  1.6  1.8  2.0  2.2  2.4
[3,]  2.6  2.8  3.0  3.2  3.4  3.6
[4,]  3.8  4.0  4.2  4.4  4.6  4.8

Math operations between matrices are possible too. Here, the same matrix is added to itself. Since it's the same matrix, both operands obviously have the same dimensions. The first element is added to the first element, the second element is added to the second element, etc.

> #Add two matrices.
> some_numbers + some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    4    6    8   10   12
[2,]   14   16   18   20   22   24
[3,]   26   28   30   32   34   36
[4,]   38   40   42   44   46   48

When two matrices have different dimensions (here, 4 x 6 versus 4 x 5), math operations result in an error:

> more_numbers <- matrix(1:20, nrow=4, byrow = TRUE)
> some_numbers + more_numbers
Error in some_numbers + more_numbers : non-conformable arrays
> some_numbers * more_numbers
Error in some_numbers * more_numbers : non-conformable arrays
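
It's the shape that matters, not just the element count. Here is a sketch with two made-up matrices (a and b) that hold the same number of elements but have different dimensions.

```r
a <- matrix(1:24, nrow = 4, byrow = TRUE)   # 4 x 6
b <- matrix(1:24, nrow = 6, byrow = TRUE)   # 6 x 4, same number of elements

# Same element count, different shape.
print(length(a) == length(b))
print(identical(dim(a), dim(b)))

# Element-wise addition still fails; capture the message instead of stopping.
msg <- tryCatch(a + b, error = function(e) conditionMessage(e))
print(msg)

# Transposing b gives it the same 4 x 6 shape, so the addition goes through.
total <- a + t(b)
print(dim(total))
```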

We can perform math operations between a matrix and a vector, though. Our friend recycling lends a hand. Note the order of the operations: the first element of the vector is subtracted from the matrix element at row one, column one. The second vector element is subtracted from the matrix element at row two, column one. The third vector element is subtracted from the matrix element at row three, column one. The vector elements are then recycled, so the first vector element is subtracted from the matrix element at row four, column one, and the process continues down the second column.

> some_numbers
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12
[3,]   13   14   15   16   17   18
[4,]   19   20   21   22   23   24
> #Subtract a vector from the matrix.
> #Note the order of the operations...and recycling.
> some_numbers - c(1,2,3)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    3    3    3
[2,]    5    5    8    8    8   11
[3,]   10   13   13   13   16   16
[4,]   18   18   18   21   21   21
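
The column-by-column order can be surprising. As a sketch, here is a look at the order in which the matrix elements are stored, followed by one way to subtract a different value from each column using sweep(). The name column_offsets and its values are made up for illustration.

```r
some_numbers <- matrix(1:24, nrow = 4, byrow = TRUE)

# A matrix is stored column by column, which is the order recycling follows.
first_eight <- as.vector(some_numbers)[1:8]
print(first_eight)

# To subtract a different value from each column instead, sweep() can be used.
# The second argument (2) means "columns"; the default operation is subtraction.
column_offsets <- 1:6
by_column <- sweep(some_numbers, 2, column_offsets)
print(by_column)
```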

Dave's Thoughts

After learning about vector math, this lesson wasn't particularly revelatory. Hopefully you breezed through it as quickly as I did.


Lesson 7 - Matrix Subsetting

In Lesson 5, we looked at vector subsetting, which uses square brackets [ ] in a variety of different ways to extract one or more elements from a vector. Those same concepts apply to matrix subsetting. The most basic usage is to extract a single matrix element in this manner: matrix[r, c] where r is the row number/index and c is the column number/index. To demonstrate, let's use the points_scored_by_quarter matrix from the last lesson and extract the element in row 4, column 3:

> points_scored_by_quarter <- rbind(c(2,2,6,0), 
+                                   c(12,7,6,4), 
+                                   c(7,3,4,0), 
+                                   c(2,2,3,2), 
+                                   c(6,9,2,4))
> #Add column names and row names
> colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
> rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")
> points_scored_by_quarter
        1st 2nd 3rd 4th
Perkins   2   2   6   0
Garnett  12   7   6   4
Pierce    7   3   4   0
Rondo     2   2   3   2
Allen     6   9   2   4
> points_scored_by_quarter[4,3]
[1] 3

Elements can also be extracted using row names and column names. Let's find out how many points Kevin Garnett scored in the 1st quarter:

> points_scored_by_quarter["Garnett", "1st"]
[1] 12

We can extract an entire row from a matrix. To do this, specify the desired row only within the square brackets [ ]. The placeholder where you would otherwise specify the column is left empty.

> #Points scored by Kendrick Perkins.
> points_scored_by_quarter[1,]
1st 2nd 3rd 4th 
  2   2   6   0 
> points_scored_by_quarter["Perkins",]
1st 2nd 3rd 4th 
  2   2   6   0 

Conversely, we can extract a column from a matrix. Specify the column within the square brackets [ ] and omit the row. The result is a vector, thus the pivot effect--the row names are displayed in the output (not the column name).

> #Points scored by all players in the 4th quarter.
> points_scored_by_quarter[, 4]
Perkins Garnett  Pierce   Rondo   Allen 
      0       4       0       2       4 
> points_scored_by_quarter[, "4th"]
Perkins Garnett  Pierce   Rondo   Allen 
      0       4       0       2       4  

Another way to subset a matrix is to extract multiple elements from one or more rows. How many points did Ray Allen score in the first half (1st and 2nd quarters)? For that, we need row 5 with columns 1 and 2:

> points_scored_by_quarter[5, c(1,2)]
1st 2nd 
  6   9  
> #We can specify any order for rows and columns--reverse the column order.
> points_scored_by_quarter[5, c(2, 1)]
2nd 1st 
  9   6  

Drop

All of the subsetting examples so far have returned a vector. With this behavior, we lose row names, column names, or both. To have a matrix returned (including row names and column names), include the parameter drop = FALSE.

> points_scored_by_quarter[4,3, drop = FALSE]
      3rd
Rondo   3
> points_scored_by_quarter[1, c(4, 3), drop = FALSE]
        4th 3rd
Perkins   0   6
> #When extracting a column, there's no "pivot" with drop = FALSE
> points_scored_by_quarter[,4, drop = FALSE]
        4th
Perkins   0
Garnett   4
Pierce    0
Rondo     2
Allen     4
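
A quick way to confirm what drop = FALSE changes is to check is.matrix() and dim() on the results. This sketch uses a two-row version of the matrix (named scores here to keep it short).

```r
scores <- rbind(c(2, 2, 6, 0),
                c(12, 7, 6, 4))
colnames(scores) <- c("1st", "2nd", "3rd", "4th")
rownames(scores) <- c("Perkins", "Garnett")

# Default: a single row comes back as a named vector with no dimensions.
as_vector <- scores["Garnett", ]
print(is.matrix(as_vector))
print(dim(as_vector))

# drop = FALSE keeps the result as a 1 x 4 matrix with both sets of names.
as_matrix <- scores["Garnett", , drop = FALSE]
print(is.matrix(as_matrix))
print(dim(as_matrix))
print(rownames(as_matrix))
```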

Multiple Rows and Columns

If our subsetting specifies multiple rows and multiple columns, a matrix is returned without the need to include the drop = FALSE parameter.

> #2nd half points scored by the guards.
> points_scored_by_quarter[c("Allen", "Rondo"), c(3,4)]
      3rd 4th
Allen   2   4
Rondo   3   2
> #Points scored in all quarters by the frontcourt players.
> points_scored_by_quarter[1:3, 1:4]
        1st 2nd 3rd 4th
Perkins   2   2   6   0
Garnett  12   7   6   4
Pierce    7   3   4   0 

Rows and columns can also be specified with logical vectors.

> #Extract rows 1,2, and 5 along with columns 1 and 4.
> points_scored_by_quarter[c(TRUE, TRUE, FALSE, FALSE, TRUE), c(TRUE, FALSE, FALSE, TRUE)]
        1st 4th
Perkins   2   0
Garnett  12   4
Allen     6   4 

As with vectors, recycling applies with matrices.

> #The logical vector for columns has 2 elements. They are both used, 
> #the vector recycles, and the 2 elements are used a second time.
> points_scored_by_quarter[c(TRUE, TRUE, FALSE, FALSE, TRUE), c(TRUE, FALSE)]
        1st 3rd
Perkins   2   6
Garnett  12   6
Allen     6   2
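
Logical vectors don't have to be typed by hand. As a sketch, here the logical vector comes from a comparison: which players scored more than 15 points in total? The threshold of 15 and the names totals and high_scorers are my own choices for illustration.

```r
points_scored_by_quarter <- rbind(c(2, 2, 6, 0),
                                  c(12, 7, 6, 4),
                                  c(7, 3, 4, 0),
                                  c(2, 2, 3, 2),
                                  c(6, 9, 2, 4))
colnames(points_scored_by_quarter) <- c("1st", "2nd", "3rd", "4th")
rownames(points_scored_by_quarter) <- c("Perkins", "Garnett", "Pierce", "Rondo", "Allen")

# Total points per player across all four quarters.
totals <- rowSums(points_scored_by_quarter)

# Logical vector: TRUE for players who scored more than 15 points.
high_scorers <- totals > 15
print(high_scorers)

# Use it to select rows; the empty column slot keeps every quarter.
print(points_scored_by_quarter[high_scorers, ])
```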

Dave's Thoughts

The drop = FALSE parameter was nice to learn about. That explained a few things for me from the prior lessons. Other than that, most everything here was as expected, based on what was learned with vectors.