Lesson 12 - Data Frames

September 13, 2018

Those of you with a strong background in databases will see some familiar concepts in this post. For lesson 12, we will consider datasets and structures in R that can accommodate them. Thinking back on what's been covered so far, we know vectors and matrices can't mix data of different classes. A list can contain different classes--it can contain just about anything, including another list. But it's not practical for working with datasets.
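As a quick refresher on that limitation, here is a small sketch (the values are only for illustration) of how c() coerces mixed input to a single class, while list() lets each element keep its own:

```r
# A vector holds a single class, so mixed input is coerced to character.
mixed_vector <- c(33L, "Larry Bird", TRUE)
class(mixed_vector)

# A list keeps each element's own class.
mixed_list <- list(33L, "Larry Bird", TRUE)
sapply(mixed_list, class)
```

A data frame gives us the best of both: each column is a vector of one class, but different columns can have different classes.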

Conceptually, a dataset is a grid or table of data elements. It consists of rows, which we specifically call "observations", and of columns, which are called "variables". (Observations may also be referred to as "instances". Variables may also be referred to as "properties".) The data frame in R is designed for datasets. As the R documentation tells us, data frames are "used as the fundamental data structure by most of R's modeling software".

The function we'll be working with primarily in this post is the data.frame() function. I have read that in R programming, creating data frames with this function is rather uncommon. Most of the time, data frames are created by invoking other functions that read data from an external data source (like a file or a database table) with a data frame as the return type. But for simplicity, data.frame() will serve our purposes.
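For comparison, here is a minimal sketch of the more common route, using read.csv(), which returns a data frame. To keep the example self-contained, the CSV content is passed inline through the text argument instead of coming from a file. The names csv_text and roster are my own, chosen only for illustration.

```r
# Inline CSV text stands in for an external file (illustrative data only).
csv_text <- "name,position,jersey
Larry Bird,PF,33
Robert Parish,C,0
Dennis Johnson,SG,3"

# read.csv() parses the text and returns a data frame.
roster <- read.csv(text = csv_text)
class(roster)
roster
```

With a real file, the call would typically take a file path as its first argument instead of text.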

To begin, we need some data for our dataset. Here are four vectors of data for a randomly selected basketball team:

> name <- c("Larry Bird","Robert Parish","Dennis Johnson","Cedric Maxwell","Gerald Henderson","Kevin McHale","Danny Ainge","M.L. Carr")
> position <- c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF")
> starter <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,FALSE)
> jersey <- c(33L, 00L, 3L, 31L, 43L, 32L, 44L, 30L)

Now let's pass those vectors as arguments to data.frame() to create our first data frame:

> #Parameters for data.frame are vectors.
> Celtics <- data.frame(name, position, starter, jersey)
> #Names of the columns are taken from the variable names of the vectors.
> Celtics
              name position starter jersey
1       Larry Bird       PF    TRUE     33
2    Robert Parish        C    TRUE      0
3   Dennis Johnson       SG    TRUE      3
4   Cedric Maxwell       SF    TRUE     31
5 Gerald Henderson       PG    TRUE     43
6     Kevin McHale       PF   FALSE     32
7      Danny Ainge       SG   FALSE     44
8        M.L. Carr       SF   FALSE     30

The output of the Celtics data frame shows our observations (rows) and variables (columns) of data. By default, the variable names are taken from the names of the vectors used in the data.frame() function. If desired, the variable names of an existing data frame can be changed with the names() function. (This should be familiar by now.) Variable names can also be specified inline with the data.frame() function.

> #Use names() function, or...
> names(Celtics) <- c("Name", "Position", "Starter", "Jersey Number")
> #...specify column names inline.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey)
> Celtics
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30
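Notice that the column came out as Jersey.Number, not "Jersey Number". The data.frame() function has a check.names argument, TRUE by default, that converts names into syntactically valid ones. Here is a small sketch of both behaviors, using a shortened copy of the vectors above (the names checked and unchecked are just for illustration):

```r
name <- c("Larry Bird", "Robert Parish", "Dennis Johnson")
jersey <- c(33L, 0L, 3L)

# Default: check.names = TRUE turns the space into a dot.
checked <- data.frame(Name = name, "Jersey Number" = jersey)
names(checked)

# check.names = FALSE keeps the name exactly as given.
unchecked <- data.frame(Name = name, "Jersey Number" = jersey,
                        check.names = FALSE)
names(unchecked)

# A column whose name contains a space needs backticks (or [[ ]]) to access.
unchecked$`Jersey Number`
```

Names with spaces are awkward to type, so the default conversion is probably the friendlier choice for most work.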

The structure of a data frame looks similar to that of a list, with the addition of the number of observations and variables. Take a close look and you'll see that Name and Position are not character vectors. They're factors.

> #Output of str() looks similar to that of a list.
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : Factor w/ 8 levels "Cedric Maxwell",..: 6 8 3 1 4 5 2 7
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30
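The resemblance to a list is no accident: a data frame is stored as a list of equal-length vectors, one per variable. A brief sketch with a shortened roster (the name roster is just for illustration):

```r
name <- c("Larry Bird", "Robert Parish", "Dennis Johnson")
jersey <- c(33L, 0L, 3L)
roster <- data.frame(Name = name, Jersey = jersey)

is.list(roster)      # a data frame is a list under the hood
length(roster)       # the list has one element per variable
roster$Jersey        # so list-style access with $ works
roster[["Jersey"]]   # as does [[ ]]
```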

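In versions of R before 4.0.0, the data.frame() function converts strings to factors by default. (Starting with R 4.0.0, the default is stringsAsFactors = FALSE, so strings stay as character vectors.) Either way, the behavior can be set explicitly with the stringsAsFactors parameter. Here it is set to FALSE: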

> #Force string data to a character vector.
> Celtics <- data.frame(Name = name, Position = position, 
+     Starter = starter, "Jersey Number" = jersey,
+     stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : chr  "PF" "C" "SG" "SF" ...
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30

Now Name and Position are both character vectors. However, I think I'd actually like Position to be a factor--there are only five possible values. Let's fix that by changing position to a factor and recreating the data frame:

> #Change position from a character vector to a factor.
> position <- factor(c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF"))
> #Keep the remaining string data as character vectors.
> Celtics <- data.frame(Name = name, Position = position, 
+                       Starter = starter, "Jersey Number" = jersey,
+                       stringsAsFactors = FALSE)
> str(Celtics)
'data.frame': 8 obs. of  4 variables:
 $ Name         : chr  "Larry Bird" "Robert Parish" "Dennis Johnson" "Cedric Maxwell" ...
 $ Position     : Factor w/ 5 levels "C","PF","PG",..: 2 1 5 4 3 2 5 4
 $ Starter      : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Jersey.Number: int  33 0 3 31 43 32 44 30
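One thing to be aware of: factor() only creates levels for the values it actually sees, and it sorts them alphabetically. If a particular set or order of levels is wanted, it can be spelled out with the levels argument. A sketch (the level order here is my own choice, not anything R requires):

```r
position <- factor(c("PF", "C", "SG", "SF", "PG", "PF", "SG", "SF"),
                   levels = c("PG", "SG", "SF", "PF", "C"))
levels(position)   # levels appear in the order given
table(position)    # count of players at each position
```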

Our sample dataset is awfully small, so outputting the entire data frame is of no concern. For large datasets, the head() and tail() functions come in handy. They output the first or last rows of a data frame (six by default):

> #head() and tail() functions return the first 
> #or last parts of the data frame.
> head(Celtics)
              Name Position Starter Jersey.Number
1       Larry Bird       PF    TRUE            33
2    Robert Parish        C    TRUE             0
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32

> tail(Celtics)
              Name Position Starter Jersey.Number
3   Dennis Johnson       SG    TRUE             3
4   Cedric Maxwell       SF    TRUE            31
5 Gerald Henderson       PG    TRUE            43
6     Kevin McHale       PF   FALSE            32
7      Danny Ainge       SG   FALSE            44
8        M.L. Carr       SF   FALSE            30
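Both functions also accept an n argument to control how many rows come back. A sketch with a shortened roster (the name roster is just for illustration):

```r
roster <- data.frame(
  Name = c("Larry Bird", "Robert Parish", "Dennis Johnson",
           "Cedric Maxwell", "Gerald Henderson"),
  Jersey = c(33L, 0L, 3L, 31L, 43L)
)

head(roster, n = 2)   # first two rows
tail(roster, n = 2)   # last two rows
```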

We'll close with the dim() function. It retrieves the dimensions of the data frame.

> #Retrieve the dimensions of the data frame.
> dim(Celtics)
[1] 8 4
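The first value is the number of observations (rows) and the second is the number of variables (columns). The nrow() and ncol() functions return each one individually. A sketch with a shortened roster (the name roster is just for illustration):

```r
roster <- data.frame(
  Name = c("Larry Bird", "Robert Parish", "Dennis Johnson"),
  Jersey = c(33L, 0L, 3L)
)

dim(roster)    # rows, then columns
nrow(roster)   # number of observations
ncol(roster)   # number of variables
```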

Dave's Thoughts

It feels like everything we've learned so far has led to this--the data frame. Thus far, it's the closest thing to a database table that we've seen.
