### Lesson 15 - Beginning Graphics

Everything covered so far has been textual in nature, either code written or output displayed in the console. For this post I'll take a look at some functions for visualizing data in R. I've fretted about this post for a while. Data visualization in R is surely a huge topic, far too big for a single post. With that in mind, I've chosen to keep things as simple as possible. For each function, there will be two examples: one with the bare minimum of arguments, and another with some optional parameters that I found interesting. Examples will use the following data frame:

    > Playoffs
        League TotalTeams PlayoffTeams PercentPlayoffQualifiers
    1      MLB         30           10                33.333333
    2      NBA         30           16                53.333333
    3 NCAA-FBS        129            4                 3.100775
    4 NCAA-FCS        124           24                19.354839
    5    NCAAM        351           64                18.233618
    6    NCAAW        349           64                18.338109
    7      NHL         31           16                51.612903
    8      NFL         32           12                37.500000
    9     WNBA         12            8                66.666667
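The post doesn't show how `Playoffs` was created. As a rough sketch, and assuming the percentage column is simply `PlayoffTeams / TotalTeams * 100` (which matches the values printed above), the data frame could be built like this:

```r
# Sketch: rebuild the Playoffs data frame from the values printed above.
# Assumption: the percentage column is derived from the two count columns.
Playoffs <- data.frame(
  League       = c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
                   "NCAAW", "NHL", "NFL", "WNBA"),
  TotalTeams   = c(30, 30, 129, 124, 351, 349, 31, 32, 12),
  PlayoffTeams = c(10, 16, 4, 24, 64, 64, 16, 12, 8),
  stringsAsFactors = FALSE
)

# Percentage of teams in each league that reach the playoffs.
Playoffs$PercentPlayoffQualifiers <-
  Playoffs$PlayoffTeams / Playoffs$TotalTeams * 100

print(Playoffs)
```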

### Plots

A few words about plot() from the documentation:

"Generic function for plotting of R objects... For simple scatter plots, plot.default will be used. However, there are plot methods for many R objects, including functions, data.frames, density objects, etc."

Here is a plot showing the percentage of teams that qualify for the playoffs in each league:

```
> #Plot the percentage of teams in each
> #league that qualify for the playoffs.
> plot(Playoffs$PercentPlayoffQualifiers)
```

The "Index" on the horizontal/x-axis corresponds to the row numbers in the data frame, while the vertical/y-axis corresponds to the percentage data in the PercentPlayoffQualifiers column of the data frame. Here is the plot of the same data, with the ylab and main parameters:

```
> plot(Playoffs$PercentPlayoffQualifiers,
+ ylab = "Percentage",
+ main = "Sports Leagues\n% of teams qualifying for playoffs")
```
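One thing the default plot leaves out is which point belongs to which league. A possible way to show that, sketched here with base graphics only, is to suppress the default x-axis with `xaxt = "n"` and draw a custom one with `axis()`. The snippet rebuilds just the two columns it needs so it can run on its own.

```r
# Rebuild only the columns used here so the snippet is self-contained.
Playoffs <- data.frame(
  League = c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
             "NCAAW", "NHL", "NFL", "WNBA"),
  PercentPlayoffQualifiers = c(10, 16, 4, 24, 64, 64, 16, 12, 8) /
    c(30, 30, 129, 124, 351, 349, 31, 32, 12) * 100,
  stringsAsFactors = FALSE
)

# One x position per row, matching the default "Index" axis.
positions <- seq_along(Playoffs$PercentPlayoffQualifiers)

plot(Playoffs$PercentPlayoffQualifiers,
     xaxt = "n",   # suppress the default numeric axis
     xlab = "",
     ylab = "Percentage",
     main = "Sports Leagues\n% of teams qualifying for playoffs")

# Draw league names at each row position; las = 2 turns the labels vertical.
axis(side = 1, at = positions, labels = Playoffs$League, las = 2)
```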

### Histograms

The R documentation for the hist() function is succinct:

"The generic function hist computes a histogram of the given data values."

To add to that, here is a definition of a histogram from Wikipedia:

"A histogram is an accurate representation of the distribution of numerical data."

Here is a histogram showing the distribution of data for the number of teams that qualify for the playoffs in each league:

```
> #Histogram for the number of playoff teams.
> hist(Playoffs$PlayoffTeams)
```

The first bar tells us there are three leagues with 0 to 10 teams that qualify for the playoffs. The second bar tells us there are three leagues with 11 to 20 teams, etc. Here is a histogram of the same data, with the main, xlab, col, and breaks parameters:

```
> hist(Playoffs$PlayoffTeams,
+ main = "Sports League Playoffs",
+ xlab = "Number of Qualifying Teams",
+ col = "gray",
+ breaks = 9)
```
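To check which values land in which bar, `hist()` can be called with `plot = FALSE`, in which case it returns the bin boundaries and counts instead of drawing. The sketch below does that for the default histogram and for `breaks = 9`. Per the `hist()` documentation, a single number passed to `breaks` is treated as a suggestion only, so the number of bars drawn may differ from 9.

```r
# The PlayoffTeams column from the data frame, as a plain vector.
playoff_teams <- c(10, 16, 4, 24, 64, 64, 16, 12, 8)

# Default binning: inspect the boundaries and the count in each bin.
default_hist <- hist(playoff_teams, plot = FALSE)
print(default_hist$breaks)
print(default_hist$counts)

# Same data with breaks = 9; R picks "pretty" boundaries near that number.
nine_hist <- hist(playoff_teams, breaks = 9, plot = FALSE)
print(nine_hist$breaks)
print(nine_hist$counts)
```

By default each bin includes its right edge, so a league with exactly 10 playoff teams is counted in the first bar rather than the second.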

### Bar Plots

The R documentation is also brief for the barplot() function:

"Creates a bar plot with vertical or horizontal bars."

Bar plots would seem to be the same as another common data visualization: the "bar chart". Here is a bar plot showing the percentage of teams that qualify for the playoffs, by league:

```
> #Percentage of playoff qualifiers, by league.
> barplot(Playoffs$PercentPlayoffQualifiers)
```

From left to right, each bar corresponds to a row from first to last in the source data frame. Here is the same bar plot with the ylab, main, and col parameters:

```
> #Percentage of playoff qualifiers, by league.
> barplot(Playoffs$PercentPlayoffQualifiers,
+ ylab = "Percentage",
+ main = "Sports Leagues\n% of teams in playoffs",
+ col = "blue")
```
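The bars above are unlabeled, so matching a bar to a league means counting rows. A sketch of one way around that is the `names.arg` parameter of `barplot()`, which takes a vector of labels to print under the bars; `las = 2` rotates them to reduce overlap. The snippet rebuilds the two columns it uses so it can run on its own.

```r
# Rebuild only the columns used here so the snippet is self-contained.
Playoffs <- data.frame(
  League = c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
             "NCAAW", "NHL", "NFL", "WNBA"),
  PercentPlayoffQualifiers = c(10, 16, 4, 24, 64, 64, 16, 12, 8) /
    c(30, 30, 129, 124, 351, 349, 31, 32, 12) * 100,
  stringsAsFactors = FALSE
)

# barplot() returns the x positions of the bar midpoints.
bar_midpoints <- barplot(Playoffs$PercentPlayoffQualifiers,
                         names.arg = Playoffs$League,  # label each bar
                         las = 2,                      # vertical labels
                         ylab = "Percentage",
                         main = "Sports Leagues\n% of teams in playoffs",
                         col = "blue")
```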

### Box Plots

One final trip to the R documentation, this time for boxplot():

"Produce box-and-whisker plot(s) of the given (grouped) values."

To add to that, Wikipedia tells us a box plot is...

"a method for graphically depicting groups of numerical data through their quartiles."

Here is the boxplot() function, with a vector of integers:

```
> #Number of playoff teams, across all leagues.
> boxplot(Playoffs$PlayoffTeams)
```

That visualization doesn't tell us much. Let's try it again with the formula Playoffs$PercentPlayoffQualifiers ~ Playoffs$League. The numeric values of Playoffs$PercentPlayoffQualifiers get split into groups according to the grouping values of Playoffs$League:

```
> #Percentage of playoff qualifiers, by league.
> boxplot(Playoffs$PercentPlayoffQualifiers ~ Playoffs$League,
+ border = c("Red", "Blue"))
```

This time, there's some sense to be made of my boxplot example, including League names along the horizontal/x-axis. But the boxplot still remains a question mark to me. Perhaps the data doesn't lend itself to a good demonstration. Perhaps there's far more to the boxplot than a few lines of code can convey. So be it.
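One possible reason the grouped boxplot is hard to read: each league contributes a single value, so every "box" has nothing to summarize. As an illustration only, the sketch below adds a hypothetical `Level` column (not part of the original data frame) that labels each league as college or professional, giving each group several values for `boxplot()` to summarize.

```r
# Rebuild only the columns used here so the snippet is self-contained.
Playoffs <- data.frame(
  League = c("MLB", "NBA", "NCAA-FBS", "NCAA-FCS", "NCAAM",
             "NCAAW", "NHL", "NFL", "WNBA"),
  PercentPlayoffQualifiers = c(10, 16, 4, 24, 64, 64, 16, 12, 8) /
    c(30, 30, 129, 124, 351, 349, 31, 32, 12) * 100,
  stringsAsFactors = FALSE
)

# Hypothetical grouping column, added for illustration only:
# leagues whose name starts with "NCAA" are treated as college.
Playoffs$Level <- ifelse(grepl("^NCAA", Playoffs$League), "College", "Pro")

# One box per Level; boxplot() also returns the statistics it drew.
box_stats <- boxplot(PercentPlayoffQualifiers ~ Level,
                     data = Playoffs,
                     ylab = "Percentage",
                     main = "% of teams qualifying for playoffs",
                     border = c("red", "blue"))

# Number of leagues that went into each box.
print(box_stats$n)
```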

### Dave's Thoughts

Professionally, I'm accustomed to interpreting and analyzing data non-visually in the SQL Server Management Studio query results grid. I've dabbled with a few different visualizations, including Excel charts, Google Charts, and even Tableau. I can now add R to that short list.

I've seen the topic of ethics in data science a number of times. I'm not sure if visualizations apply to that topic, but I do find it interesting that the same set of data can look significantly different depending on the parameters passed to a visualization function. If you hold the keys to the visualization tools and functions, can you have a data set paint any picture you want?