In 2020, I set a goal of reading 30 books. Aided by a last-minute charge, I managed to hit this number: I finished my 30th book on December 31st.

As I was finishing up my year of reading, I started thinking about some of the statistics of my year in books:

- On average, how many pages did I read per day?
- Did I have any slumps during the year? If so, could the slumps be explained?
- What would be a reasonable reading goal for 2021?

I tracked all of my books using Goodreads, so I started poking around on the Goodreads website to see if I could access my library.

I used the **tidyverse**, `lubridate`, and `scales` packages for this analysis.

Getting Goodreads data isn’t too difficult. They have a great export tool, and if you follow this link, you can export your library. If you have a lot of books in your library, the export can take a long time. The data export comes with 31 columns.

For this analysis, the columns I’m interested in are **Date Read**, **My Rating** (what I rated the book, 0-5 stars), **Average Rating**, **Number of Pages**, and **Original Publication Year**. I added my data to a GitHub repository.

One thing that’s missing from the Goodreads data export is the description of the book. I wrote a Python script that uses BeautifulSoup to scrape Goodreads for this information. I don’t use it in *this post*, but I could see using it in a different post down the road.

The data from Goodreads is mostly good to go, but there are a few tweaks to make before getting started.

```
gh_link1 <- "https://github.com/bgstieber/files_for_blog/raw/master/"
gh_link2 <- "goodreads-data-analysis/goodreads_library_export.csv"

goodreads_data <- read_csv(paste0(gh_link1, gh_link2)) %>%
  # fix issue with data export for a book
  mutate(`Number of Pages` = ifelse(grepl("Be a Player", Title),
                                    256, `Number of Pages`))

books_2020 <- goodreads_data %>%
  # only 2020 books
  filter(year(`Date Read`) == 2020) %>%
  # create rating_diff and publish_year columns
  mutate(rating_diff = `My Rating` - `Average Rating`,
         publish_year = coalesce(`Original Publication Year`,
                                 `Year Published`)) %>%
  # clean some column names
  rename(date_read = `Date Read`,
         page_count = `Number of Pages`,
         avg_rating = `Average Rating`,
         my_rating = `My Rating`) %>%
  # add when the previous book was finished, sort then lag
  arrange(date_read) %>%
  mutate(previous_book_date = lag(date_read))
```

For this analysis, I make the assumption that I read only one book at a time (not always true), and that I start reading a book immediately after I finish the previous one (not always true either).
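Under these assumptions, the read time for each book is just the gap between consecutive finish dates. A quick sketch continuing the pipeline above (it relies on the `date_read` and `previous_book_date` columns created earlier):

```r
# days spent on each book; NA for the first book of the year,
# which has no predecessor
books_2020 <- books_2020 %>%
  mutate(days_to_read = as.numeric(difftime(date_read,
                                            previous_book_date,
                                            units = "days")))
```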

Here is the timeline of my year in books:

**Sometimes the most basic data visualizations present the most compelling information.**

Here are a few things that stood out to me:

- My sprint at the end of the year to hit my reading goal
- A few books with longer read times: *The Remains of the Day*, *Never Let Me Go*, and *Be a Player: How to Become a Better Golfer Every Time You Play* (to a lesser extent). These will come up again in calculating my 2021 goal.

- Apart from the few books mentioned above, I had pretty consistent read times for my 2020 books. What might be driving this?

In the code below, I create a `data.frame` with cumulative pages and books read by date.

```
summary_by_date <- books_2020 %>%
  group_by(date_read, Title) %>%
  summarise(pages = sum(page_count),
            books = n()) %>%
  ungroup() %>%
  # add dummy data for beginning of year
  bind_rows(tibble(date_read = as.Date("2020-01-01"),
                   Title = NA_character_,
                   pages = 0,
                   books = 0)) %>%
  arrange(date_read) %>%
  mutate(previous_date = lag(date_read)) %>%
  mutate(days_since_last_book = as.numeric(difftime(
    date_read, previous_date, units = "days"
  ))) %>%
  mutate(cumu_pages_read = cumsum(pages),
         cumu_books_read = cumsum(books))
```

Using this data, I can look at my progress toward 30 books through the year.

My reading certainly slowed down during the summer months. Most of this is due to doing other things during a beautiful Wisconsin summer, like playing golf and riding my bike. Between January and May, I read an average of 39.6 pages per day; from June through September, about 14.8 pages per day; and from October through the end of the year, 31.3 pages per day.
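A sketch of how those per-period paces could be computed from the `summary_by_date` data created above (the month boundaries mirror the text; the exact figures depend on how the period endpoints are counted):

```r
summary_by_date %>%
  mutate(period = case_when(month(date_read) <= 5 ~ "Jan-May",
                            month(date_read) <= 9 ~ "Jun-Sep",
                            TRUE                  ~ "Oct-Dec")) %>%
  group_by(period) %>%
  summarise(pages_per_day = sum(pages) /
              as.numeric(max(date_read) - min(date_read) + 1))
```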

For most of the year, I had a fairly consistent book-finishing pace. I think a lot of this can be explained by choosing shorter books in 2020. 70% of the books I read this year were less than 400 pages long.

Another interesting aspect of the books I read in 2020 was that they were mostly modern. 80% of the books I read in 2020 were published in 1990 or later.

```
books_2020 %>%
  ggplot(aes(publish_year)) +
  geom_bar() +
  xlab("Year Published") +
  ylab("Books") +
  ggtitle("When were my 2020 reads published?",
          subtitle = paste0(percent(mean(books_2020$publish_year >= 1990)),
                            " of books I read in 2020 were published ",
                            "in 1990 or later."))
```

The oldest book I read was *The House of Mirth* by Edith Wharton, published in 1905. The most recent book I read was *The Art of Solitude* by Stephen Batchelor, published in 2020.

Finally, let’s take a look at how my rating of a book compared to the average rating from other Goodreads users.

```
books_2020 %>%
  mutate(title_abbrev =
           ifelse(nchar(Title) > 60,
                  paste0(substr(Title, 1, 60), "..."),
                  Title)) %>%
  ggplot(aes(reorder(title_abbrev, rating_diff),
             rating_diff,
             fill = factor(my_rating))) +
  geom_col(colour = "black") +
  coord_flip() +
  scale_fill_viridis_d("My Rating", option = "cividis") +
  xlab("") +
  ylab("My Rating - Goodreads Avg") +
  theme(legend.position = "top",
        axis.text.y = element_text(size = 8)) +
  ggtitle("My Rating Versus the Goodreads Average")
```

My average rating in 2020 was 4 stars, while the average Goodreads rating of the books I read was 4.1. I gave 9 books 3 stars, 11 books 4 stars, and 10 books 5 stars.

This post has mostly been an exploratory analysis of my Goodreads data. To make it actionable, let’s focus on **setting a data-driven reading goal for 2021**.

To start, let’s look at the average number of pages I was reading throughout the year.

I was reading at a pretty consistent pace at the beginning of the year; my pace declined sharply during the warm summer months, then picked back up at the end of the year.

On average, it took me about 12.2 days to finish a book in 2020. I read at a pace of about 28.9 pages per day.

There were a few clear outliers with respect to reading pace throughout the year. I read two novels (*The Remains of the Day* and *Never Let Me Go*, both by Kazuo Ishiguro) very slowly, taking 43 and 28 days to finish those books, respectively. I also read two books at a very fast pace (*Red Queen* and *The Art of Solitude*), where I was reading at a pace of 76.6 and 66.7 pages per day, respectively.

If we eliminate those four books, we’re left with a set of books that more closely reflects my typical or baseline reading pace. Looking at the remaining 26 books, I was reading at a pace of about 32.9 pages per day, taking about 11 days to finish a book.

Using the pace of 11 days to finish a book, I could create a goal of reading 365/11 = 33.2 books in 2021. Rounding up, I’ll set a goal of 34 books in 2021.
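The goal arithmetic can be sanity-checked in a couple of lines of R:

```r
days_per_book <- 11                         # baseline pace, outliers removed
books_goal <- ceiling(365 / days_per_book)  # 365 / 11 = 33.2, rounded up
books_goal                                  # 34
```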

This represents an increase of 13% over my goal last year, which seems pretty reasonable based on this analysis.

In 2020, I set a goal to finish 30 books. On December 31st, I finished *The Art of Solitude* and completed my reading goal. I explored my Goodreads data to summarize my year in books:

- I read a total of 10,536 pages in 2020; the average length of a book I read was 351.2 pages
- I read at a pace of 28.9 pages per day
- On average it took me about 12.2 days to complete each book
- The longest it took me to finish a book was 43 days (*The Remains of the Day*); my shortest read time was 3 days (*The Art of Solitude*)
- My average rating was 4 stars; the average Goodreads rating of the books I read was 4.1 stars

I also used the Goodreads data to set a data-driven reading goal for 2021. I hope to increase my reading by 13% in 2021 by finishing 34 books.

This was a fun way to look back on my year in books for 2020. There are a few aspects of this data that I could look into like the distribution of genres, the text summary of the book, and text reviews from other Goodreads users. That analysis will have to wait for another day!

Happy reading!

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a short introduction to the **data.table** package.

Although the primary data wrangling package I use is the **tidyverse**, it’s worthwhile to explore other packages that do similar data manipulations. The closest “competitor” to the `tidyverse` is `data.table`.

Three of the main selling points for using **data.table** are that it’s

- Fast
- Concise
- Efficient (works well with large data)

These are three qualities we look for in data manipulation. We’ll go through a few examples using the `data.table` syntax.

If you’re frustrated by how verbose data manipulation chains can get using **tidyverse** packages, `data.table` may be a welcome change.

Here are the packages we’ll need for this post.

```
library(data.table)
library(nycflights13)
library(microbenchmark)
library(tidyverse)
```

The data set we’ll be working with in this post comes from the **nycflights13** package. It shows on-time data for all flights that departed NYC in 2013.

`(flights_dt <- as.data.table(flights)) # convert to a data.table`

```
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## ---
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## ---
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
## distance hour minute time_hour
## 1: 1400 5 15 2013-01-01 05:00:00
## 2: 1416 5 29 2013-01-01 05:00:00
## ---
## 336775: 419 11 59 2013-09-30 11:00:00
## 336776: 431 8 40 2013-09-30 08:00:00
```

`paste(dim(flights_dt), c("rows", "columns"))`

`## [1] "336776 rows" "19 columns"`

If you’re familiar with `SQL`, `data.table` syntax should make a good amount of sense. The syntax allows you to do a lot more than the common operations we expect with a base `data.frame`. Here is the general form of `data.table` syntax:

`DT[i, j, by]`

- `i`: where (subset) / order by (sort)
- `j`: select (grab certain columns) / update (add/modify columns)
- `by`: group by

Image source: Blazing Fast Data Wrangling With R data.table

To demonstrate, let’s take a look at each of these components.

**`i`**

The first three examples look at using `i` to filter/subset your data.

```
# flights departing in January
flights_dt[month == 1]
```

```
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013 1 1 517 515 2 830 819
## 2: 2013 1 1 533 529 4 850 830
## ---
## 27003: 2013 1 31 NA 1446 NA NA 1757
## 27004: 2013 1 31 NA 625 NA NA 934
## arr_delay carrier flight tailnum origin dest air_time distance hour
## 1: 11 UA 1545 N14228 EWR IAH 227 1400 5
## 2: 20 UA 1714 N24211 LGA IAH 227 1416 5
## ---
## 27003: NA UA 337 <NA> LGA IAH NA 1416 14
## 27004: NA UA 1497 <NA> LGA IAH NA 1416 6
## minute time_hour
## 1: 15 2013-01-01 05:00:00
## 2: 29 2013-01-01 05:00:00
## ---
## 27003: 46 2013-01-31 14:00:00
## 27004: 25 2013-01-31 06:00:00
```

```
# flights departing on March 10th
flights_dt[month == 3 & day == 10]
```

```
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013 3 10 6 2359 7 336 338
## 2: 2013 3 10 41 2100 221 230 2257
## ---
## 907: 2013 3 10 NA 2000 NA NA 2335
## 908: 2013 3 10 NA 1730 NA NA 1923
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1: -2 B6 727 N547JB JFK BQN 186 1576 23 59
## 2: 213 EV 4368 N14116 EWR DAY 82 533 21 0
## ---
## 907: NA UA 424 <NA> EWR SAT NA 1569 20 0
## 908: NA US 449 <NA> EWR CLT NA 529 17 30
## time_hour
## 1: 2013-03-10 23:00:00
## 2: 2013-03-10 21:00:00
## ---
## 907: 2013-03-10 20:00:00
## 908: 2013-03-10 17:00:00
```

```
# flights where the total delay (dep_delay + arr_delay) is 10 minutes or more,
# the destination was Dallas (DFW) and was in January, February, or March
flights_dt[(dep_delay + arr_delay) >= 10 & dest == "DFW" & month <= 3]
```

```
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013 1 1 559 600 -1 941 910
## 2: 2013 1 1 635 635 0 1028 940
## ---
## 574: 2013 3 31 1308 1300 8 1622 1605
## 575: 2013 3 31 1923 1825 58 2202 2127
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1: 31 AA 707 N3DUAA LGA DFW 257 1389 6 0
## 2: 48 AA 711 N3GKAA LGA DFW 248 1389 6 35
## ---
## 574: 17 AA 745 N3CAAA LGA DFW 218 1389 13 0
## 575: 35 UA 1221 N35271 EWR DFW 196 1372 18 25
## time_hour
## 1: 2013-01-01 06:00:00
## 2: 2013-01-01 06:00:00
## ---
## 574: 2013-03-31 13:00:00
## 575: 2013-03-31 18:00:00
```

We can also sort using `i`:

```
# sort by total delay
flights_dt[order(dep_delay + arr_delay, decreasing = TRUE)]
```

```
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 9 641 900 1301 1242
## 2: 2013 6 15 1432 1935 1137 1607
## ---
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 1530 1272 HA 51 N384HA JFK HNL 640
## 2: 2120 1127 MQ 3535 N504MQ JFK CMH 74
## ---
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
## distance hour minute time_hour
## 1: 4983 9 0 2013-01-09 09:00:00
## 2: 483 19 35 2013-06-15 19:00:00
## ---
## 336775: 419 11 59 2013-09-30 11:00:00
## 336776: 431 8 40 2013-09-30 08:00:00
```

**`j`**

This first example shows how to select a column. It looks very similar to what we’d do in base R.

```
# get flight destination
destination <- flights_dt[, dest]
head(destination)
```

`## [1] "IAH" "IAH" "MIA" "BQN" "ATL" "ORD"`

You’ll notice that the result of the previous operation was a vector. Sometimes this is what we want, other times it’s not. So, how can we select a column and have a `data.table` returned instead of a vector?

We can use `.(column_I_want, another_column_I_want)` or `list(this_column_too, and_this_one_also)`.

```
# use .(columns_to_select) or list(columns_to_select)
# .(columns_to_select) acts as "shorthand" for list(columns_to_select)
flights_dt[,.(dest)]
```

```
## dest
## 1: IAH
## 2: IAH
## ---
## 336775: CLE
## 336776: RDU
```

```
# we can select multiple columns using .()
flights_dt[,.(year, month, day, origin, dest)]
```

```
## year month day origin dest
## 1: 2013 1 1 EWR IAH
## 2: 2013 1 1 LGA IAH
## ---
## 336775: 2013 9 30 LGA CLE
## 336776: 2013 9 30 LGA RDU
```

```
# rename columns
flights_dt[,.(Origin = origin, Destination = dest)]
```

```
## Origin Destination
## 1: EWR IAH
## 2: LGA IAH
## ---
## 336775: LGA CLE
## 336776: LGA RDU
```

We can create columns using `:=`:

```
# create total delay column
flights_dt[,total_delay := arr_delay + dep_delay]
```

One major difference between “standard” operations in R and some operations in `data.table` is that `data.table` will make **modifications in place**, meaning we don’t have to use the assignment operator (`<-` or `=`).
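One consequence of in-place modification (a minimal sketch, using a toy table rather than `flights_dt`): assigning a `data.table` to a second name does not create an independent copy, so when you need one, use `copy()`.

```r
library(data.table)

dt <- data.table(x = 1:3)

alias    <- dt        # both names point at the same underlying table
snapshot <- copy(dt)  # a true, independent copy

dt[, y := x * 2]      # adds a column in place, no assignment needed

"y" %in% names(alias)    # TRUE: the alias saw the change
"y" %in% names(snapshot) # FALSE: the copy did not
```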

If we inspect `flights_dt`, we can confirm that the `total_delay` column was added.

`flights_dt`

```
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## ---
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## ---
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
## distance hour minute time_hour total_delay
## 1: 1400 5 15 2013-01-01 05:00:00 13
## 2: 1416 5 29 2013-01-01 05:00:00 24
## ---
## 336775: 419 11 59 2013-09-30 11:00:00 NA
## 336776: 431 8 40 2013-09-30 08:00:00 NA
```

We can remove a column by setting it to `NULL` with `:=`.

```
# remove that column
flights_dt[, total_delay := NULL]
flights_dt
```

```
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## ---
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## ---
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
## distance hour minute time_hour
## 1: 1400 5 15 2013-01-01 05:00:00
## 2: 1416 5 29 2013-01-01 05:00:00
## ---
## 336775: 419 11 59 2013-09-30 11:00:00
## 336776: 431 8 40 2013-09-30 08:00:00
```

```
# add multiple columns
flights_dt[, `:=`(date = lubridate::ymd(paste(year, month, day, sep = "-")),
                  log_distance = log(distance),
                  air_time_in_hours = air_time / 60)]
flights_dt
```

```
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## ---
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: 819 11 UA 1545 N14228 EWR IAH 227
## 2: 830 20 UA 1714 N24211 LGA IAH 227
## ---
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
## distance hour minute time_hour date log_distance
## 1: 1400 5 15 2013-01-01 05:00:00 2013-01-01 7.244228
## 2: 1416 5 29 2013-01-01 05:00:00 2013-01-01 7.255591
## ---
## 336775: 419 11 59 2013-09-30 11:00:00 2013-09-30 6.037871
## 336776: 431 8 40 2013-09-30 08:00:00 2013-09-30 6.066108
## air_time_in_hours
## 1: 3.783333
## 2: 3.783333
## ---
## 336775: NA
## 336776: NA
```

**`by`**

In the first two examples, we use `.N`, a special symbol that allows us to count rows in our data. `.SD`, which is used later on in this post, is also a special symbol in **data.table**.

A simple example: counting flights by origin.

`flights_dt[,.N,origin]`

```
## origin N
## 1: EWR 120835
## 2: LGA 104662
## 3: JFK 111279
```

A little more complicated: counting by origin and destination, sorting to show the most frequent combinations, then slicing the top 10 rows.

`flights_dt[,.N, .(origin, dest)][order(-N)][1:10]`

```
## origin dest N
## 1: JFK LAX 11262
## 2: LGA ATL 10263
## 3: LGA ORD 8857
## 4: JFK SFO 8204
## 5: LGA CLT 6168
## 6: EWR ORD 6100
## 7: JFK BOS 5898
## 8: LGA MIA 5781
## 9: JFK MCO 5464
## 10: EWR BOS 5327
```

To wrap up this section, let’s show the median and average total delay by origin and destination airport, and then sort by average total delay. We also add in `.N`, because it’s always good to show the sample size.

```
flights_dt[!is.na(arr_delay) & !is.na(dep_delay),
           .(avg_delay = mean(arr_delay + dep_delay),
             median_delay = median(arr_delay + dep_delay),
             .N),
           .(origin, dest)
           ][order(-avg_delay)]
```

```
## origin dest avg_delay median_delay N
## 1: EWR TYS 82.80511 18 313
## 2: EWR CAE 78.94681 40 94
## ---
## 222: JFK PSP -15.66667 -14 18
## 223: LGA LEX -31.00000 -31 1
```

We often want to perform multiple operations on a single `data.frame`. If we keep all of the code to perform these operations on a single line, our scripts can become illegible and unwieldy. Similar to how **tidyverse** pipes might span multiple lines:

```
data %>%
  mutate(new_columns) %>%
  group_by(grouping_columns) %>%
  summarise(other_columns) %>%
  arrange(desc(some_column))
```

We can “chain” **data.table** expressions:

```
DT[ ...
][ ...
][ ...
]
```

This example gets the cumulative total delay over the course of a year by origin airport. It utilizes filtering, sorting, and grouping.

```
# get cumulative delay by origin airport
# uses "chaining"
cumulative_delay_by_origin <-
  flights_dt[!is.na(dep_delay) & !is.na(arr_delay)  # keep valid flights
             ][order(time_hour),                    # sort by date
               .(time_hour,                         # select date and cumsum delay
                 cumu_delay = cumsum(arr_delay + dep_delay)),
               origin]                              # group by origin airport

ggplot(cumulative_delay_by_origin,
       aes(time_hour, cumu_delay / 60, colour = origin)) +
  geom_line() +
  theme_bw() +
  xlab("Date") + ylab("Cumulative total delay (hours)")
```

Let’s get even crazier with chaining.

The next example finds the “biggest loser” on each day (i.e. which flight had the worst total delay). We then count up (using the `ones` column) which origin airport the biggest loser was departing from. We calculate this cumulatively over the course of the year.

```
top_delay <- flights_dt[!is.na(arr_delay) & !is.na(dep_delay)
                        ][, `:=`(total_delay = arr_delay + dep_delay, ones = 1)
                        ][, .SD[which.max(total_delay)], date
                        ][order(date)
                        ][, .(cumu_obs = cumsum(ones), date), .(origin)]

ggplot(top_delay, aes(date, cumu_obs, colour = origin)) +
  geom_line() +
  theme_bw() +
  xlab("Date") +
  ylab("Cumulative # of Days with Worst Delay")
```

Now let’s compare the performance of base `R`, the `tidyverse`, and `data.table`.

Let’s demonstrate a typical calculation you might do in R: an aggregation of two columns based on grouping by three columns. In this specific example, we’re calculating the average departure delay and average arrival delay by origin airport, destination airport, and month of flight.

We use the **microbenchmark** package to time how long it takes to perform the different operations. We can then take the results and visualize them.

```
set.seed(1848)
benchmark_data <- microbenchmark(
  # base R solution
  base_R = aggregate(list(flights$dep_delay, flights$arr_delay),
                     list(flights$origin, flights$dest, flights$month),
                     mean, na.rm = TRUE),
  # tidyverse solution
  tidy_verse = flights %>%
    group_by(origin, dest, month) %>%
    summarise_at(c("dep_delay", "arr_delay"), mean, na.rm = TRUE),
  # data.table solution
  data_table = flights_dt[, lapply(.SD, mean, na.rm = TRUE),
                          .(origin, dest, month),
                          .SDcols = c("dep_delay", "arr_delay")],
  times = 100)
```

```
library(ggridges)
benchmark_data_dt <- as.data.table(benchmark_data)
benchmark_data_dt[, time_in_ms := time / 1000000]

ggplot(benchmark_data_dt, aes(x = time_in_ms, y = expr)) +
  geom_density_ridges2(rel_min_height = 0.01, scale = 1.5, fill = "#c5050c") +
  geom_boxplot(width = 0.25) +
  theme_bw() +
  scale_x_log10("Time (milliseconds, log10)") +
  scale_y_discrete("Operation",
                   labels = c("base_R" = "Base R",
                              "tidy_verse" = "tidyverse",
                              "data_table" = "data.table")) +
  ggtitle("Grouped Aggregation Timing Summary")
```

Here is how much slower the median operation is compared to `data.table`. Since this *is* a tutorial about **data.table**, we use its `dcast` function to convert long data to wide.

```
dcast(benchmark_data_dt,
      . ~ expr,
      value.var = "time_in_ms",
      fun = median)[, round(.SD / data_table, 1),
                    .SDcols = c("base_R", "tidy_verse", "data_table")]
```

```
## base_R tidy_verse data_table
## 1: 17.2 5.5 1
```

The short demo above shows just how much more performant `data.table` can be.

Here’s a more complete reference on benchmarking, with comparisons across R and Python.

I hope this was a gentle introduction to the **data.table** package. I think the key to getting off on the right foot with this package is understanding the syntax.

The syntax allows you to do a lot more than the common operations we expect with a base `data.frame`. Here is the general form of **data.table** syntax:

`DT[i, j, by]`

- `i`: where (subset) / order by (sort)
- `j`: select (grab certain columns) / update (add/modify columns)
- `by`: group by

Image source: Blazing Fast Data Wrangling With R data.table

By no means did I intend for this introduction to be an exhaustive guide to all things **data.table**. If you’re interested in exploring the package further, take a look at these resources:

Fun `R` fact: `<-` and `=` are actually functions, and can be called like so:

```
`<-`(x, 1:5)
x
```

`## [1] 1 2 3 4 5`

```
`=`(x, 5:1)
x
```

`## [1] 5 4 3 2 1`

```
`<-`(x, c(rev(x), x))
x
```

`## [1] 1 2 3 4 5 5 4 3 2 1`

```
`<-`("ill-advised variable name", 1:3)
`ill-advised variable name`
```

`## [1] 1 2 3`

```
`=`("Christopher Guest Movies", "awesome")
`Christopher Guest Movies`
```

`## [1] "awesome"`

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a tutorial on reading data into R.

There are a lot of different ways to get data into R, and this post highlights a few of the common ways of doing that. This post assumes you have some flat file of data (e.g. csv, txt, excel) you’re trying to read into R. Maybe I’ll write a follow-up post where the data is in a less common format.

Today, we’re going to be taking a quick ride through a few ways to get data from flat files (txt, csv, excel) into R.

Here are links to the documentation for each of the functions discussed.

You can find all the code and data on my GitHub. If you clone that repository you should be able to run all of this on your own machine.

The first function you probably used to read data into R was `read.csv`.

Let’s suppose you get a basic flat file

```
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"setosa"
```

Using `read.csv` results in:

```
dat <- read.csv("data/iris.txt")
tail(dat)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

But wait, something weird happened with `Species`:

`dat$Species`

```
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa
```

`read.csv` has an argument called `stringsAsFactors`, and its default is `TRUE`. This means that any string/character type columns you have in your data will be converted to factors (further reading: stringsAsFactors: An unauthorized biography). This is generally not what we want.

So, in the event that I use `read.csv` (I typically prefer `readr::read_csv` or `data.table::fread`, discussed below), I set `stringsAsFactors = FALSE`.

```
dat2 <- read.csv("data/iris.txt", stringsAsFactors = FALSE)
dat2$Species
```

`## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"`

`read.csv` works really well without specifying many arguments when your data is nice. What happens if your data is a little messier?

Here’s the data:

```
"Sepal.Length"|"Sepal.Width"|"Petal.Length"|"Petal.Width"|"Species"
-----------------------------------------------------------------
-----------------------------------------------------------------
5.1|3.5|1.4|0.2|"setosa"
4.9|3|1.4|0.2|"setosa"
4.7|3.2|1.3|0.2|"setosa"
4.6|3.1|1.5|0.2|"setosa"
5|3.6|1.4|0.2|"setosa"
5.4|3.9|1.7|0.4|"setosa"
```

Two things are weird with this data:

- Column headers are separated from the data by two rows of `---` (use the `skip` argument)
- A pipe (`|`) delimiter is used (use the `sep = "|"` argument)

These two oddities require that we use `read.table` instead of `read.csv`.

```
# first, get column names from first row of data
column_names <- read.table("data/iris_pipe_delim_edit.txt",
                           sep = "|",   # pipe delim
                           nrows = 1,   # only read first row
                           stringsAsFactors = FALSE,
                           header = FALSE)  # no headers
# convert to a character vector
(column_names <- as.character(column_names))
```

```
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
```

```
# then, read in remaining rows, using `skip` argument
full_data <- read.table("data/iris_pipe_delim_edit.txt",
                        sep = "|",
                        skip = 3,  # skip first 3 rows
                        stringsAsFactors = FALSE,
                        col.names = column_names)  # specify column names
tail(full_data)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

What if our data has weird missing value codes, maybe due to human input?

```
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,NULL,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,MISSING,1.4,0.2,"setosa"
5.4,3.9,9999,0.4,"setosa"
```

By default, when you’re reading in data with the `read.table`/`read.csv` family of functions, R treats any values matching the string `"NA"` as `NA`. Sometimes we have missing values that take other forms, like `999`, `""`, and `NULL`. Using the `na.strings` argument can help us with this.

```
dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt",
                           stringsAsFactors = FALSE)
dat_default_NA$Sepal.Width
```

`## [1] "3.5" "3" "3.2" "3.1" "MISSING" "3.9"`

`dat_default_NA$Petal.Length`

`## [1] "1.4" "1.4" "NULL" "1.5" "1.4" "9999"`

Since `R` found character values in the `Sepal.Width` and `Petal.Length` columns, it treats those as characters. We know this is wrong, and can fix it using `na.strings`.

```
dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt",
                           stringsAsFactors = FALSE,
                           na.strings = c("NULL", "9999", "MISSING"))
dat_default_NA$Sepal.Width
```

`## [1] 3.5 3.0 3.2 3.1 NA 3.9`

`dat_default_NA$Petal.Length`

`## [1] 1.4 1.4 NA 1.5 1.4 NA`

`library(readr)`

The next function I want to talk about is `read_csv` from the **readr** package.

This function is really helpful, **and it’s pretty much my go-to function to read flat files into R**. It has good and well-reasoned defaults (no `stringsAsFactors = FALSE`!), and reads in the data as a `tibble` as opposed to a `data.frame`. This makes printing the data to your console a lot better.

Rather than looking at the boring iris data, we’ll instead read some data from the internet. Yes, if you give one of the `read` functions (even `read.table`!) a url with a csv/txt file, it will be able to read that into R (conditional on you having a connection to the internet).

```
u1 <- "https://raw.githubusercontent.com/fivethirtyeight/data"
u2 <- "/master/college-majors/all-ages.csv"
(u <- paste0(u1, u2))
```

`## [1] "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"`

`college_data <- read_csv(u) # informative parsing printing`

```
## Parsed with column specification:
## cols(
## Major_code = col_double(),
## Major = col_character(),
## Major_category = col_character(),
## Total = col_double(),
## Employed = col_double(),
## Employed_full_time_year_round = col_double(),
## Unemployed = col_double(),
## Unemployment_rate = col_double(),
## Median = col_double(),
## P25th = col_double(),
## P75th = col_double()
## )
```

`college_data # nice printing of data, don't need head() or tail()`

```
## # A tibble: 173 x 11
## Major_code Major Major_category Total Employed Employed_full_t~
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1100 GENE~ Agriculture &~ 128148 90245 74078
## 2 1101 AGRI~ Agriculture &~ 95326 76865 64240
## 3 1102 AGRI~ Agriculture &~ 33955 26321 22810
## 4 1103 ANIM~ Agriculture &~ 103549 81177 64937
## 5 1104 FOOD~ Agriculture &~ 24280 17281 12722
## 6 1105 PLAN~ Agriculture &~ 79409 63043 51077
## 7 1106 SOIL~ Agriculture &~ 6586 4926 4042
## 8 1199 MISC~ Agriculture &~ 8549 6392 5074
## 9 1301 ENVI~ Biology & Lif~ 106106 87602 65238
## 10 1302 FORE~ Agriculture &~ 69447 48228 39613
## # ... with 163 more rows, and 5 more variables: Unemployed <dbl>,
## # Unemployment_rate <dbl>, Median <dbl>, P25th <dbl>, P75th <dbl>
```

`read_csv` has a few arguments I should highlight:

- `na`: Character vector of strings to interpret as missing values. (this is like `na.strings` in `read.csv`)
- `skip`: Number of lines to skip before reading data.
- `n_max`: Maximum number of records to read.
- `col_types`: allows you to specify the column types for your data. **I typically leave this as is, and let the parser do its job**, but it can be helpful if you’re trying to coerce a certain column to a certain data type.
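To make those arguments concrete, here’s a small sketch (the messy file and its quirks are invented for illustration) that uses `skip`, `na`, `n_max`, and `col_types` together:

```r
library(readr)

# a made-up messy export: two comment lines up top, "." used for missing values
messy_file <- tempfile(fileext = ".csv")
writeLines(c("# export created 2020-01-01",
             "# source: example system",
             "pages,title",
             "256,Book A",
             ".,Book B",
             "312,Book C"), messy_file)

dat <- read_csv(messy_file,
                skip = 2,               # jump past the two comment lines
                na = c("", "NA", "."),  # also treat "." as missing
                n_max = 3,              # read at most 3 records
                col_types = cols(pages = col_double(),
                                 title = col_character()))
dat
```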

`library(data.table)`

Now, sometimes you might be dealing with some really nasty data that is large and unwieldy. `read_csv` is good for maybe 80-90% of data files, but sometimes we need something more powerful.

This is where the `fread` function from the **data.table** package comes in handy (further reading: Convenience features of fread).

By some measures, `fread` can be about 6 times faster than `read.csv` and about 2.5 times faster than `read_csv`.

One of the best parts of `fread` is that you do not necessarily have to specify the delimiter in your data.

For example, the pipe-delimited data from above is read in easily. In the code below I set `verbose = TRUE` to show the internal output from `fread`. In general, I’d recommend you leave this as `FALSE`, unless you’re in serious debug mode.

```
pipe_dat <- fread("data/iris_pipe_delim_edit.txt",
                  skip = 3, # homework: what happens if you don't specify skip?
                  verbose = TRUE, # default is FALSE, which I recommend
                  col.names = names(iris))
```

```
## omp_get_num_procs()==8
## R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
## R_DATATABLE_NUM_THREADS==""
## omp_get_thread_limit()==2147483647
## omp_get_max_threads()==8
## OMP_THREAD_LIMIT==""
## OMP_NUM_THREADS==""
## data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
## RestoreAfterFork==true
## Input contains no \n. Taking this to be a filename to open
## [01] Check arguments
## Using 4 threads (omp_get_max_threads()=8, nth=4)
## NAstrings = [<<NA>>]
## None of the NAstrings look like numbers.
## skip num lines = 3
## show progress = 0
## 0/1 column will be read as integer
## [02] Opening the file
## Opening file data/iris_pipe_delim_edit.txt
## File opened, size = 356 bytes.
## Memory mapped ok
## [03] Detect and skip BOM
## [04] Arrange mmap to be \0 terminated
## \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
## [05] Skipping initial rows if needed
## Skipped to line 4 in the file Positioned on line 4 starting: <<5.1|3.5|1.4|0.2|"setosa">>
## [06] Detect separator, quoting rule, and ncolumns
## Detecting sep automatically ...
## sep='|' with 6 lines of 5 fields using quote rule 0
## Detected 5 columns on line 4. This line is either column names or first data row. Line starts as: <<5.1|3.5|1.4|0.2|"setosa">>
## Quote rule picked = 0
## fill=false and the most number of columns found is 5
## [07] Detect column types, good nrow estimate and whether first row is column names
## Number of sampling jump points = 1 because (150 bytes from row 1 to eof) / (2 * 150 jump0size) == 0
## Type codes (jump 000) : 7777A Quote rule 0
## 'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
## All rows were sampled since file is small so we know nrow=6 exactly
## [08] Assign column names
## [09] Apply user overrides on column types
## After 0 type and 0 drop user overrides : 7777A
## [10] Allocate memory for the datatable
## Allocating 5 column slots (5 - 0 dropped) with 6 rows
## [11] Read the data
## jumps=[0..1), chunk_size=1048576, total_size=150
## Read 6 rows x 5 columns from 356 bytes file in 00:00.000 wall clock time
## [12] Finalizing the datatable
## Type counts:
## 4 : float64 '7'
## 1 : string 'A'
## =============================
## 0.000s ( 0%) Memory map 0.000GB file
## 0.000s ( 0%) sep='|' ncol=5 and header detection
## 0.000s ( 0%) Column type detection using 6 sample rows
## 0.000s ( 0%) Allocation of 6 rows x 5 cols (0.000GB) of which 6 (100%) rows used
## 0.000s ( 0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 6 rows) using 1 threads
## + 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
## + 0.000s ( 0%) Transpose
## + 0.000s ( 0%) Waiting
## 0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
## 0.000s Total
```

`pipe_dat`

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## 6: 5.4 3.9 1.7 0.4 setosa
```

Here’s a demonstration of how much faster `fread` is than `read.csv` and `read_csv`, using a subset of the flights data set.

```
u1 <- "https://github.com/roberthryniewicz/datasets/"
u2 <- "blob/master/airline-dataset/flights/flights.csv?raw=true"
(uu <- paste0(u1, u2))
```

`## [1] "https://github.com/roberthryniewicz/datasets/blob/master/airline-dataset/flights/flights.csv?raw=true"`

`system.time(dat_base <- read.csv(uu)) # timing for read.csv`

```
## user system elapsed
## 15.45 0.15 17.36
```

`system.time(dat_readr <- read_csv(uu)) # timing for read_csv`

```
## Parsed with column specification:
## cols(
## .default = col_double(),
## UniqueCarrier = col_character(),
## TailNum = col_character(),
## Origin = col_character(),
## Dest = col_character(),
## CancellationCode = col_character()
## )
```

`## See spec(...) for full column specifications.`

```
## user system elapsed
## 0.67 0.18 3.29
```

`system.time(dat_fread <- fread(uu)) # timing for fread`

```
## user system elapsed
## 0.38 0.07 1.89
```

`dim(dat_fread) # rows by columns`

`## [1] 100000 29`

I would strongly encourage you to spend some time playing around with `fread`, and thoroughly investigate its arguments (it’s got a lot!). Considering all the benefits of `fread`, I’m actually surprised I don’t use it more.
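To give a flavor of those arguments: `fread` treats any input containing a newline as literal data, `select` reads only the columns you name, and `nrows` caps the number of records. A small sketch:

```r
library(data.table)

txt <- "id|score|label\n1|0.5|a\n2|0.7|b\n3|0.9|c\n"
dt <- fread(txt,                        # the "|" separator is detected automatically
            select = c("id", "score"),  # only materialize these columns
            nrows = 2)                  # stop after two records
dt
```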

`library(readxl)`

Maybe you’ve been unlucky enough to have to do some analysis using an Excel file. Getting that data into R used to be a tedious task. Now, we can use the `read_excel` function from the **readxl** package.

Excel files will typically have multiple sheets. The Excel example we’re looking at today has three separate sheets.
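If you don’t know a workbook’s sheet names up front, `excel_sheets` will list them. A quick sketch using the small example workbook that ships with readxl (so it runs without the Superstore file):

```r
library(readxl)

# readxl bundles example workbooks; excel_sheets() returns the sheet names
path <- readxl_example("datasets.xlsx")
excel_sheets(path)
```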

Reading these in is straightforward, using the `read_excel` function and the `sheet` argument.

```
orders <- read_excel("data/Superstore.xls",
                     sheet = "Orders")
```

```
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L2236 / R2236C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L5276 / R5276C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L8800 / R8800C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9148 / R9148C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9149 / R9149C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9150 / R9150C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9388 / R9388C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9389 / R9389C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9390 / R9390C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9391 / R9391C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9743 / R9743C12: '05408'
```

`tail(orders)`

```
## # A tibble: 6 x 21
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode`
## <dbl> <chr> <dttm> <dttm> <chr>
## 1 9989 CA-2017-1~ 2017-11-17 00:00:00 2017-11-21 00:00:00 Standard C~
## 2 9990 CA-2014-1~ 2014-01-21 00:00:00 2014-01-23 00:00:00 Second Cla~
## 3 9991 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 4 9992 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 5 9993 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 6 9994 CA-2017-1~ 2017-05-04 00:00:00 2017-05-09 00:00:00 Second Cla~
## # ... with 16 more variables: `Customer ID` <chr>, `Customer Name` <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, `Postal
## # Code` <dbl>, Region <chr>, `Product ID` <chr>, Category <chr>,
## # `Sub-Category` <chr>, `Product Name` <chr>, Sales <dbl>,
## # Quantity <dbl>, Discount <dbl>, Profit <dbl>
```

```
returns <- read_excel("data/Superstore.xls",
                      sheet = "Returns")
tail(returns)
```

```
## # A tibble: 6 x 2
## Returned `Order ID`
## <chr> <chr>
## 1 Yes US-2016-140172
## 2 Yes CA-2015-101910
## 3 Yes CA-2017-156958
## 4 Yes CA-2016-105585
## 5 Yes CA-2016-148796
## 6 Yes CA-2015-149636
```

```
people <- read_excel("data/Superstore.xls",
                     sheet = "People")
tail(people)
```

```
## # A tibble: 4 x 2
## Person Region
## <chr> <chr>
## 1 Anna Andreadi West
## 2 Chuck Magee East
## 3 Kelly Williams Central
## 4 Cassandra Brandow South
```

A few things to note about `read_excel`:

- **VERY IMPORTANT**: Sometimes the function fails if you have the file open. Make sure the Excel file is closed before trying to read it into R!
- The function will inform you of parsing issues/column type coercion.
- By default, the function will return a `tibble` and not a `data.frame`.

In this post, we’ve looked at a few different ways of getting data into R from flat files. For nice flat files, it’s pretty straightforward to get your data into R. If your data isn’t so nice, you can generally be successful using the `fread` or `read_csv` functions, but you’ll need to be very aware of the structure of your data, as well as the arguments for whatever function you decide to use.

Here are the links I’ve referenced.

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.3.1 data.table_1.12.2 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 knitr_1.23 magrittr_1.5 hms_0.4.2
## [5] R6_2.4.0 rlang_0.4.0 fansi_0.4.0 stringr_1.4.0
## [9] tools_3.6.0 xfun_0.8 utf8_1.1.4 cli_1.1.0
## [13] htmltools_0.3.6 assertthat_0.2.1 yaml_2.2.0 digest_0.6.19
## [17] tibble_2.1.3 crayon_1.3.4 bookdown_0.11 vctrs_0.1.0
## [21] zeallot_0.1.0 evaluate_0.14 rmarkdown_1.13 blogdown_0.13
## [25] stringi_1.4.3 compiler_3.6.0 pillar_1.4.1 cellranger_1.1.0
## [29] backports_1.1.4 pkgconfig_2.0.2
```

If you pay attention to NFL football, you’re probably used to hearing that home field advantage is worth about 3 points. I’ve always been interested in this number and how it was derived. So, using some data from FiveThirtyEight, along with some linear modeling in R, I attempted to quantify home field advantage. My analysis shows that home field advantage (how much we expect the home team to win by, if the teams are evenly matched) is about 2.59 points.

Here are the packages we’ll need:

```
library(tidyverse)
library(data.table)
library(ggridges)
library(scales)
```

You can find my code for this analysis on my GitHub.

FiveThirtyEight has a data set with game-by-game Elo ratings and forecasts dating back to 1920. Elo ratings are simple measures of strength based on game-by-game results. More details on Elo ratings can be found here.

It’s pretty easy to get this data.

```
data_link <- "https://projects.fivethirtyeight.com/nfl-api/nfl_elo.csv"
nfl_data <- fread(data_link, verbose = FALSE)
```

Here are the first few rows and columns of the data:

date | season | neutral | playoff | team1 | team2 | elo1_pre | elo2_pre
---|---|---|---|---|---|---|---
1920-09-26 | 1920 | 0 |  | RII | STP | 1503.947 | 1300.000
1920-10-03 | 1920 | 0 |  | DAY | COL | 1493.002 | 1504.908
1920-10-03 | 1920 | 0 |  | RII | MUN | 1516.108 | 1478.004
1920-10-03 | 1920 | 0 |  | CHI | MUT | 1368.333 | 1300.000
1920-10-03 | 1920 | 0 |  | CBD | PTQ | 1504.688 | 1300.000
1920-10-03 | 1920 | 0 |  | BFF | WBU | 1478.004 | 1300.000

The full description of the data can be found on FiveThirtyEight’s GitHub.

We’re interested in a few variables in this data:

variable | definition
---|---
elo1_pre | Home team’s Elo rating before the game
elo2_pre | Away team’s Elo rating before the game
qbelo1_pre | Home team’s quarterback-adjusted base rating before the game
qbelo2_pre | Away team’s quarterback-adjusted base rating before the game
score1 | Home team’s score
score2 | Away team’s score

To quantify home field advantage, we can look at the home vs away score differential for all games **not played at a neutral site**. We excluded playoff games from this analysis.
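As a sketch of that filter-and-differ step (on a toy stand-in for `nfl_data`; I’m assuming, per FiveThirtyEight’s data dictionary, that `neutral == 0` marks true home games and a blank `playoff` field marks regular-season games):

```r
library(data.table)

# toy stand-in with the columns the analysis needs
toy <- data.table(neutral = c(0, 0, 1, 0),
                  playoff = c("", "", "", "w"),
                  score1  = c(24, 17, 30, 21),
                  score2  = c(21, 20, 10, 28))

# keep regular-season games at a true home site, then take home minus away
score_diff <- toy[neutral == 0 & playoff == "", score1 - score2]
summary(score_diff)
```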

Here’s the summary of that score differential:

measure | value
---|---
Min. | -57.00000
1st Qu. | -7.00000
Median | 3.00000
Mean | 2.59056
3rd Qu. | 12.00000
Max. | 59.00000

The median margin of victory is 3. Since this number is positive, it implies that there *is* a noticeable home field advantage.

You might be wondering, has home field advantage been changing over time?

If you look at the most recent decades, you’ll notice that the distribution has been becoming bimodal, meaning there are two “peaks” in the distribution. The peaks belong to margins of victory of three points (a field goal) for the home and away teams:

Decade | Home - Away | Count of Games | % of Games
---|---|---|---
2010 | 3 | 203 | 8.02%
2010 | -3 | 165 | 6.52%
2000 | 3 | 225 | 8.87%
2000 | -3 | 173 | 6.82%
1990 | 3 | 196 | 8.419%
1990 | -3 | 181 | 7.775%
1980 | 3 | 159 | 7.472%
1980 | -3 | 139 | 6.532%
1970 | 3 | 102 | 5.280%
1970 | -3 | 98 | 5.072%
1960 | 3 | 84 | 5.214%
1960 | 0 | 72 | 4.469%
1950 | 3 | 37 | 5.096%
1950 | -4 | 35 | 4.821%

It’s not good enough to just take the average or median of all home vs away score differentials. Each NFL game is different, and by blindly taking a summary statistic, we are assuming that the teams playing in each game are evenly matched. In my opinion, this assumption is **invalid**.

We can use linear models to get closer to understanding home field advantage, by adjusting for the differences between the two teams. But before we get too deep into that, let’s take a closer look at linear regression.

A lot of people are familiar with linear models, having performed “line of best fit” calculations sometime in high school. Most people cringe when they see the \(y = mx + b\) formula, but statisticians and data scientists feel their hearts warm and get very excited after glancing at that formula.

Linear models are incredibly powerful tools of statistical analysis. Most of the time, we spend a lot of energy interpreting the \(m\) in the equation above. This gives us insight into how much we expect \(y\) to change (**on average**) when \(x\) changes by some amount.

To illustrate, let’s use the mtcars data set to predict a car’s miles per gallon using its weight in pounds.

The linear model’s formula is displayed in the upper right hand corner of the plot. The coefficient for \(wt\) is -5.3; this is the \(m\) in the \(y = mx + b\) equation. The coefficient is negative, meaning that as the weight of the car increases, we expect its fuel efficiency to get worse. But what about \(b\), the intercept of the line?

The intercept of the best fit line is our “predicted” value of \(y\) when \(x\) equals 0. So, when the weight of a car is 0 pounds, we expect it to get 37.29 miles per gallon. This doesn’t make any sense! A car that weighs 0 pounds doesn’t get any miles per gallon, it doesn’t even exist!
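The fit behind that plot is a one-line model; reproducing its coefficients:

```r
fit <- lm(mpg ~ wt, data = mtcars)
round(coef(fit), 2)
#> (Intercept)          wt
#>       37.29       -5.34
```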

In a lot of linear models, the intercept isn’t really worth interpreting. However, we can use the intercept to understand home field advantage using our NFL data.

We first fit a model trying to predict the home vs away score differential using the home vs away pre game Elo differential and the home vs away pre game QB Elo differential.

When the Elo differentials are equal to zero, it means the teams are effectively evenly matched (our best guess for the power rankings of the respective teams is basically equal). This gets us closer to understanding true home field advantage than taking a summary statistic would.

Our model will look like:

\[ \text{Home - Away Score} = \beta_{0} + \beta_{1} *\text{Elo Difference} + \beta_{2} * \text{QB Elo Difference}\]

While \(\beta_{1}\) and \(\beta_{2}\) above are interesting, we’re most interested in \(\beta_{0}\), since this quantifies the home field advantage for evenly matched teams.
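Before fitting, `model_data` needs those three differential columns. Here’s a sketch of how it could be assembled (shown on a toy frame; the column names come from FiveThirtyEight’s data dictionary):

```r
library(dplyr)

# toy stand-in for the full nfl_data
toy_nfl <- tibble(neutral = c(0, 0), playoff = c("", ""),
                  score1 = c(27, 14), score2 = c(24, 20),
                  elo1_pre = c(1600, 1450), elo2_pre = c(1550, 1500),
                  qbelo1_pre = c(1590, 1440), qbelo2_pre = c(1560, 1490))

model_data <- toy_nfl %>%
  filter(neutral == 0, playoff == "") %>%
  mutate(home_away_score_diff  = score1 - score2,
         home_away_elo_diff    = elo1_pre - elo2_pre,
         home_away_qb_elo_diff = qbelo1_pre - qbelo2_pre)
```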

Enough talking, let’s fit the model:

```
score_diff_model <- lm(home_away_score_diff ~ home_away_elo_diff + home_away_qb_elo_diff,
                       data = model_data)
```

term | estimate
---|---
(Intercept) | 2.586
home_away_elo_diff | 0.037
home_away_qb_elo_diff | 0.021

Touchdown!

If we look at the intercept, the value is about 2.59. This means that if the two teams are basically evenly matched (i.e. \(x=0\) in \(y = mx+b\)), we can expect the home team to win by about 2.59 points (**on average**).

Taking it one step further, here are the home field advantages for each decade.

We can learn some pretty cool things about our data if we pay close attention to the output of our linear models. I think a lot of people forget to pay attention to the intercepts in their linear models. This makes sense most of the time, because the intercept doesn’t really mean much in many of our models (e.g. if we predict a person’s height using their weight, the intercept is meaningless).

However, in some cases the intercept is really important. In this example using NFL data, we were able to use the intercept to quantify home field advantage for evenly matched teams. Hopefully this post will give a data point to bring up at a nice dinner party where you and your acquaintances are discussing what home field advantage *really* means.

Delivering a great data science presentation can seem daunting. By no means am I a communications expert, but I have presented my fair share of talks to a diverse group of audiences. Through my experience, I’ve developed a few easy-to-remember tips to hopefully make your next data science presentation your best yet. These are tips that have worked for me, and I hope they’re helpful!

Without further ado, here are **seven tips for delivering a great data science presentation**.

This is the most important tip I can give to anyone. Practicing your presentation (in front of colleagues, friends, family, or the mirror) is one of the best ways to make sure you’ll feel comfortable on the day of your talk. In my opinion, it’s the best way to “stress-test” your talk and make sure you’re prepared.

By practicing multiple times, you can find portions of your presentation that can be edited, enhanced, or eliminated. By presenting to others and taking their feedback seriously, you can prepare yourself for your real audience and the questions they might have.

A corollary of **practice, practice, practice**, is **perfect practice makes perfect**. In your practice talks, try to simulate the environment of your talk as close as possible. It may help to deliver your practice talk while standing, dress up for your practice talk, or even practice in a public space if you expect your room to be noisy.

Another way to feel more comfortable is to make sure you’re comfortable with the material you’re presenting. I’ve been to a few talks where it was clear that the presenter wasn’t entirely confident with the subject of their talk. Data science is a broad field, so it’s nearly impossible to be an expert in *everything*. Even though you *probably* can’t be an expert in *everything*, you can be knowledgeable about *something*. Talk about that something!

That being said, delivering a presentation *can* be an excellent way to familiarize yourself with a topic you haven’t explored in depth. Using the pressure of delivering a presentation may help you learn something you’ve been struggling to learn.

Imposter syndrome is real and it sucks.

One of my favorite posters growing up was this one:

We’re all at different levels of developing our skills. I’ve found that some of my biggest data science aha moments come from having conversations with people outside of the field. Sometimes, knowing too much *can* be a bad thing, especially when it introduces rigidity or bias to a problem-solving strategy.

No matter what level you’re at, **I guarantee that you can provide useful information to your audience**.

If you were attending your talk, what would you want to learn? How would you expect the presentation to be delivered? How much information do you think you can absorb? You should also remind yourself that most people in the audience are coming to your presentation to have a positive experience.

By taking an empathetic approach to understand the preferences of your audience, you can better prepare your talk and make it as effective as possible.

One of my favorite (and most critical) ways to evaluate my own presentation is to ask myself “so what?” or “why should I care?” for each slide in my deck. If the slide can’t pass that test, I either remove it or edit it until it does. It’s probably an extreme approach, but I like doing this because it ensures that I’ve taken every step I can to reduce any irrelevant sections from my talk.

One of the most notable data presentations is this one from Hans Rosling. He’s engaging, dynamic, and entertaining. His presentation style seems effortless. He makes data fun!

Although I’d like to give presentations like Hans Rosling, I know I can’t right now. The best I can do is to present in the style that best fits me. This doesn’t mean I’ve resigned to delivering flat, boring, and dry talks. Instead, I stay realistic and try to do the best that I can do, not the best Hans Rosling could do.

Speaking of being realistic…

Another way to say this is to **set yourself up for success, not disappointment**.

Here’s an example.

For my first presentation at a professional conference, the conference organizers provided ways to promote your session through social media and email. I thought this was really cool, but I had no interest in taking part in these promotional activities. I had already committed to delivering my first talk at a professional conference, which took a lot of time to prepare for. I set a goal of delivering the best talk I could, and promoting my session was not part of my plan to achieve this goal. I stayed focused on my primary goal, and avoided getting distracted by anything else.

Here are a few different ways of turning unrealistic goals into realistic and attainable ones.

Unrealistic | Realistic
---|---
Deliver the best damn talk anyone has ever seen | Deliver a strong talk, providing useful information to those in attendance
Make people fall out of their chairs laughing | Have a few punch lines that make at least one person laugh/chuckle/smirk
Make my twenty minute talk last exactly twenty minutes | Make my twenty minute talk last about twenty minutes, without sacrificing material or interfering with others’ presentations
BLOW MINDS | Provide a new and accessible way of framing or thinking about a problem
Be the next Hans Rosling | Deliver an engaging presentation, without seeming inauthentic

This tip doesn’t apply to delivering your presentation, it’s about your *next* presentation.

After you’ve finished your talk, do an analysis of your presentation.

- What went well?
- What could have been improved?
- How engaged was your audience?
  - An indicator I use to judge engagement is how many people laugh at one of my lame jokes and how many people ask questions.
- What questions were asked?
  - You can use these questions to understand how well you conveyed your points and how engaging your presentation was.
- Did you receive any feedback or constructive criticism on your content, delivery, and presentation style?

You can learn a lot just by paying attention to how your audience received you and your own perception of your performance.

Just like how most machine learning algorithms get better at minimizing a loss function through iteration after iteration, you can improve your presentation through actively learning and iterating on your presentation style.

Just to recap, here are my tips for a great data science presentation:

- Practice, practice, practice
- Talk about what you know
- Look yourself in the mirror and say “I’m not an impostor”
- Put yourself in the shoes of an audience member
- Play your game, not anyone else’s
- Set realistic expectations, strive to meet them, and hold yourself accountable
- Take audience feedback and questions seriously, and use them to learn and improve

Do you have any tips for giving a great talk? If so, go ahead and leave them in the comments below!

One of my favorite data science blogs comes from James McCaffrey, a software engineer and researcher at Microsoft. He recently wrote a blog post on a method for allocating turns in a multi-armed bandit problem.

I really liked his post, and decided to take a look at the algorithm he described and code up a function to do the simulation in R.

**Note:** this is strictly an implementation of Dr. McCaffrey’s ideas from his blog post, and should not be taken as my own.

You can find the .Rmd file for this post on my GitHub.

The basic idea of a multi-armed bandit is that you have a fixed number of resources (e.g. money at a casino) and you have a number of competing places where you can allocate those resources (e.g. four slot machines at the casino). These allocations occur sequentially, so in the casino example, we choose a slot machine, observe the success or failure from our play, and then make the next allocation decision. Since we’re data scientists at a casino, hopefully we’re using the information we’re gathering to make better gambling decisions (is that an oxymoron?).

We want to choose the best place to allocate our resources, and maximize our reward for each allocation. However, we should shy away from a greedy strategy (just play the winner), because it doesn’t allow us to explore our other options.

There are different strategies for choosing where to allocate your next resource. One of the more popular choices is Thompson sampling, which usually involves sampling from a Beta distribution, and using the results of that sampling to determine your next allocation (out of scope for this blog post!).

The following function, `roulette_wheel`, implements the roulette wheel allocation for a flexible number of slot machines.

The function starts by generating a warm start with the data. We need to gather information about our different slot machines, so we allocate a small number of resources to each one to collect information. After we do this, we start the real allocation. We pick a winner based on how its cumulative probability compares to a draw from a random uniform distribution.

So, if our observed success probabilities are

machine | observed_prob | cumulative_prob | selection_range
---|---|---|---
1 | 0.2 | 0.2 | 0.0-0.2
2 | 0.3 | 0.5 | 0.2-0.5
3 | 0.5 | 1.0 | 0.5-1.0

And our draw from the random uniform was 0.7, we’d pick the third arm (0.7 falls between 0.5 and 1). This selection criterion is the main point of Dr. McCaffrey’s algorithm. For a better and more thorough explanation, I’d suggest reading his blog post.
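That selection step is a one-liner in R with `findInterval`:

```r
cumulative_prob <- c(0.2, 0.5, 1.0)       # from the table above
findInterval(0.7, c(0, cumulative_prob))  # 3: 0.7 lands in (0.5, 1.0]
```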

We then continue this process (playing a slot machine, observing the outcome, recalculating observed probabilities, and picking the next slot machine) until we run out of coins.

**And here’s the code**

```
roulette_wheel <- function(coins = 40,
                           starts = 5,
                           true_prob = c(0.3, 0.5, 0.7)){
  # must have enough coins to generate initial empirical distribution
  if (coins < (length(true_prob) * starts)){
    stop("To generate a starting distribution, each machine must be",
         " played ",
         starts,
         " times - not enough coins to do so.")
  }
  # allocate first ("warm up")
  SS <- sapply(true_prob, FUN = function(x) sum(rbinom(starts, 1, x)))
  FF <- starts - SS
  # calculate metrics used for play allocation
  probs <- SS / (SS + FF)
  probs_normalized <- probs / sum(probs)
  cumu_probs_normalized <- cumsum(probs_normalized)
  # update number of coins
  coins <- coins - (length(true_prob) * starts)
  # create simulation data.frame
  sim_df <- data.frame(machine = seq_along(true_prob),
                       true_probabilities = true_prob,
                       observed_probs = probs,
                       successes = SS,
                       failures = FF,
                       plays = SS + FF,
                       machine_played = NA,
                       coins_left = coins)
  # initialize before while loop
  sim_list <- vector('list', length = coins)
  i <- 1
  # play until we run out of original coins
  while (coins > 0){
    # which machine to play?
    update_index <- findInterval(runif(1), c(0, cumu_probs_normalized))
    # play machine
    flip <- rbinom(1, 1, true_prob[update_index])
    # update successes and failures for machine that was played
    SS[update_index] <- SS[update_index] + flip
    FF[update_index] <- FF[update_index] + (1 - flip)
    # update metrics used for play allocation
    probs <- SS / (SS + FF)
    probs_normalized <- probs / sum(probs)
    cumu_probs_normalized <- cumsum(probs_normalized)
    # update number of coins
    coins <- coins - 1
    # update simulation data.frame (very inefficient)
    sim_list[[i]] <- data.frame(machine = seq_along(true_prob),
                                true_probabilities = true_prob,
                                observed_probs = probs,
                                successes = SS,
                                failures = FF,
                                plays = SS + FF,
                                machine_played = seq_along(true_prob) == update_index,
                                coins_left = coins)
    i <- i + 1
  }
  # show success:failure ratio
  message("Success to failure ratio was ",
          round(sum(SS) / sum(FF), 2),
          "\n",
          paste0("(",
                 paste0(SS, collapse = "+"),
                 ")/(",
                 paste0(FF, collapse = "+"), ")"))
  # return data frame of values from experiment
  rbind(sim_df, do.call('rbind', sim_list))
}
```

I’ll show a brief example of what we can do with the data generated from this function.

```
set.seed(123)
rw1 <- roulette_wheel(coins = 5000,
                      starts = 10,
                      true_prob = c(0.1, 0.25, 0.5, 0.65))
```

```
## Success to failure ratio was 1.06
## (15+228+835+1490)/(213+662+826+731)
```

machine | true_probabilities | observed_probs | successes | failures | plays | machine_played | coins_left
---|---|---|---|---|---|---|---
1 | 0.10 | 0.0658 | 15 | 213 | 228 | FALSE | 0
2 | 0.25 | 0.2562 | 228 | 662 | 890 | FALSE | 0
3 | 0.50 | 0.5027 | 835 | 826 | 1661 | FALSE | 0
4 | 0.65 | 0.6709 | 1490 | 731 | 2221 | TRUE | 0

Let’s look at how the observed probabilities changed over time:

And how did our plays for each machine accumulate through time?

Boring!

Maybe if we run a smaller number of simulations, we might get a better sense of variation in our number of plays.

```
set.seed(1)
rw2 <- roulette_wheel(coins = 100,
                      starts = 5,
                      true_prob = c(0.1, 0.3, 0.65))
```

```
## Success to failure ratio was 0.82
## (1+16+28)/(11+26+18)
```

machine | true_probabilities | observed_probs | successes | failures | plays | machine_played | coins_left
---|---|---|---|---|---|---|---
1 | 0.10 | 0.0833 | 1 | 11 | 12 | FALSE | 0
2 | 0.30 | 0.3810 | 16 | 26 | 42 | FALSE | 0
3 | 0.65 | 0.6087 | 28 | 18 | 46 | TRUE | 0

That shows our allocations a little bit better than the previous visualization.

This was a fun exercise for me, and it reminded me of a presentation I did in graduate school about a very similar topic. I also wrote a roulette wheel function in Python, and was moderately successful at that (it runs faster than my R function, but I’m less confident in how “pythonic” it is).

My biggest concern with this implementation is the potential situation in which our warm start results in all failures for a given slot machine. If a machine fails across the warm start, it will not be selected for the rest of the simulation. To offset this, you could add a little “jitter” (technical term: epsilon) to the observed probabilities at each iteration. Another option would be to generate a second random uniform variable, and if that value is very small, pull a random lever rather than the one determined by the simulation.
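As a hedged sketch of that second idea (the `choose_machine` helper and the `epsilon` default below are illustrative assumptions, not part of the original `roulette_wheel` function):

```r
# Hypothetical epsilon-greedy lever selection.
# observed_probs: empirical success rate for each machine so far
# epsilon: chance of exploring a random machine instead of the best one
choose_machine <- function(observed_probs, epsilon = 0.05) {
  if (runif(1) < epsilon) {
    # explore: pull a random lever
    sample(seq_along(observed_probs), 1)
  } else {
    # exploit: pull the lever with the best observed success rate
    which.max(observed_probs)
  }
}

set.seed(42)
choose_machine(c(0.10, 0.30, 0.65))
```

With a small positive `epsilon`, even a machine that failed throughout the warm start retains a chance of being played again.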

Finally, I’d be interested in comparing the statistical properties of this algorithm and others that are used in sequential allocation problems…if I have the time.

Recommendation engines have a huge impact on our online lives. The content we watch on Netflix, the products we purchase on Amazon, and even the homes we buy are all served up using these algorithms. In this post, I’ll run through one of the key metrics used in developing recommendation engines: **cosine similarity**.

First, I’ll give a brief overview of some vocabulary we’ll need to understand recommendation systems. Then, I’ll look at the math behind cosine similarity. Finally, I’m going to use cosine similarity to build a recommendation engine for songs in R.

There are a few different flavors of recommendation engines. One type is **collaborative filtering**, which relies on the behavior of users to understand and predict the similarity between items. There are two subtypes of collaborative filtering: **user-user** and **item-item**. In a nutshell, user-user engines will look for similar users to you, and suggest things that these users have liked (*users like you also bought X*). Item-item recommendation engines generate suggestions based on the similarity of items instead of the similarity of users (*you bought X and Y, maybe you’d like Z too*). Converting an engine from user-user to item-item can reduce the computational cost of generating recommendations.

Another type of recommendation engine is **content-based**. Rather than using the behavior of other users or the similarity between ratings, content-based systems employ information about the items themselves (e.g. genre, starring actors, or when the movie was released). Then, a user’s behavior is examined to generate a user profile, which tries to find content similar to what’s been consumed before based on the characteristics of the content.

Cosine similarity is helpful for building both types of recommender systems, as it provides a way of measuring how similar users, items, or content is. In this post, we’ll be using it to generate song recommendations based on how often users listen to different songs.

The only package we’ll need for this post is:

`library(tidyverse)`

Cosine similarity is built on the geometric definition of the **dot product** of two vectors:

\[\text{dot product}(a, b) =a \cdot b = a^{T}b = \sum_{i=1}^{n} a_i b_i \]

You may be wondering what \(a\) and \(b\) actually represent. If we’re trying to recommend certain products, \(a\) and \(b\) might be the collection of ratings for two products based on the input from \(n\) customers. For example, if \(a =[5, 0, 1]\) and \(b = [0, 1, 2]\), the first customer rated \(a\) a 5 and did not rate \(b\), the second customer did not rate \(a\) and gave \(b\) a 1, and the third customer rated \(a\) a 1 and \(b\) a 2.
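Plugging those example ratings into the formula is a one-liner:

```r
# example ratings for two products across three customers
a <- c(5, 0, 1)
b <- c(0, 1, 2)
# dot product: 5*0 + 0*1 + 1*2
sum(a * b)  # 2
```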

With that out of the way, we can layer in geometric information

\[a \cdot b = \Vert a \Vert \Vert b \Vert \text{cos}(\theta)\]

where \(\theta\) is the angle between \(a\) and \(b\) and \(\Vert x \Vert\) is the magnitude/length/norm of a vector \(x\). From the above expression, we can arrive at cosine similarity:

\[\text{cosine similarity} = \text{cos}(\theta) = \frac{a \cdot b}{\Vert a \Vert \Vert b \Vert}\]

In `R`, this is defined as:

`cosine_sim <- function(a, b) crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))`

OK, OK, OK, you’ve seen the formula, and I even wrote an `R` function, but where’s the intuition? What does it all mean?

What I like to focus on in cosine similarity is the angle \(\theta\). \(\theta\) tells us how far we’d have to move vector \(a\) so that it could rest on top of \(b\). This assumes we can only adjust the orientation of \(a\), and have no ability to influence its magnitude. The easier it is to get \(a\) on top of \(b\), the smaller this angle will be, and the more similar \(a\) and \(b\) will be. Furthermore, the smaller \(\theta\) is, the larger \(\text{cos}(\theta)\) will be. This blog post has a great image demonstrating cosine similarity for a few examples.

For the data we’ll be looking at in this post, \(\text{cos}(\theta)\) will be somewhere between 0 and 1, since user play data is all non-negative. A value of 1 will indicate perfect similarity, and 0 will indicate the two vectors are unrelated. In other applications, there may be data which is positive *and* negative. For these cases, \(\text{cos}(\theta)\) will be between -1 and 1, with -1 meaning \(a\) and \(b\) are perfectly dissimilar and 1 meaning \(a\) and \(b\) are perfectly similar.
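Those extremes are easy to verify with the `cosine_sim` function (repeated here so the chunk stands alone):

```r
cosine_sim <- function(a, b) crossprod(a, b) / sqrt(crossprod(a) * crossprod(b))
# parallel vectors: perfectly similar
cosine_sim(c(1, 2), c(2, 4))    # 1
# orthogonal vectors: unrelated
cosine_sim(c(1, 0), c(0, 1))    # 0
# opposite vectors: perfectly dissimilar
cosine_sim(c(1, 1), c(-1, -1))  # -1
```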

We use a subset of the data from the Million Song Dataset. The data only has 10K songs, but that should be enough for this exercise.

```
# read user play data and song data from the internet
play_data <- "https://static.turi.com/datasets/millionsong/10000.txt" %>%
  read_tsv(col_names = c('user', 'song_id', 'plays'))
song_data <- 'https://static.turi.com/datasets/millionsong/song_data.csv' %>%
  read_csv() %>%
  distinct(song_id, title, artist_name)
# join user and song data together
all_data <- play_data %>%
  group_by(user, song_id) %>%
  summarise(plays = sum(plays, na.rm = TRUE)) %>%
  inner_join(song_data)
```

Here are the first few rows of the data. The important variable is `plays`, which measures how many times a certain user has listened to a song. We’ll be using this variable to generate recommendations.

`knitr::kable(head(all_data, 4))`

user | song_id | plays | title | artist_name
---|---|---|---|---
00003a4459f33b92906be11abe0e93efc423c0ff | SOJJRVI12A6D4FBE49 | 1 | Only You (Illuminate Album Version) | David Crowder*Band
00003a4459f33b92906be11abe0e93efc423c0ff | SOKJWZB12A6D4F9487 | 4 | Do You Want To Know Love (Pray For Rain Album Version) | PFR
00003a4459f33b92906be11abe0e93efc423c0ff | SOMZHIH12A8AE45D00 | 3 | You’re A Wolf (Album) | Sea Wolf
00003a4459f33b92906be11abe0e93efc423c0ff | SONFEUF12AAF3B47E3 | 3 | Não É Proibido | Marisa Monte

There are 76,353 users in this data set, so combining the number of users with songs makes the data a little too unwieldy for this toy example. I’m going to filter our dataset so that it’s only based on the 1,000 most-played songs. We use the `spread` function to turn our data from being “tall” (one row per user per song) to being “wide” (one row per user, and one column per song).

```
top_1k_songs <- all_data %>%
  group_by(song_id, title, artist_name) %>%
  summarise(sum_plays = sum(plays)) %>%
  ungroup() %>%
  top_n(1000, sum_plays) %>%
  distinct(song_id)
all_data_top_1k <- all_data %>%
  inner_join(top_1k_songs)
top_1k_wide <- all_data_top_1k %>%
  ungroup() %>%
  distinct(user, song_id, plays) %>%
  spread(song_id, plays, fill = 0)
ratings <- as.matrix(top_1k_wide[, -1])
```

This results in having play data for 70,345 users and 994 songs. 1.05% of user-song combinations have `plays` greater than 0.

Here’s a sample of what the `ratings` matrix looks like:

`ratings[1:5, 1:3] # one row per user, one column per song`

```
## SOAAVUV12AB0186646 SOABHYV12A6D4F6D0F SOABJBU12A8C13F63F
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
## [5,] 0 0 0
```

I wrote a function called `calc_cos_sim`, which will calculate the similarity between a chosen song and the other songs, and recommend 5 new songs for a user to listen to. From start to finish, this only took about 20 lines of code, indicating how easy it can be to spin up a recommendation engine.

```
calc_cos_sim <- function(song_code,
                         rating_mat = ratings,
                         songs = song_data,
                         return_n = 5) {
  # find our song
  song_col_index <- which(colnames(rating_mat) == song_code)
  # calculate cosine similarity for each song based on
  # number of plays for users
  # apply(..., 2) iterates over the columns of a matrix
  cos_sims <- apply(rating_mat, 2,
                    FUN = function(y)
                      cosine_sim(rating_mat[, song_col_index], y))
  # return results
  data_frame(song_id = names(cos_sims), cos_sim = cos_sims) %>%
    filter(song_id != song_code) %>% # remove self reference
    inner_join(songs) %>%
    arrange(desc(cos_sim)) %>%
    top_n(return_n, cos_sim) %>%
    select(song_id, title, artist_name, cos_sim)
}
```

We can use the function above to calculate similarities and generate recommendations for a few songs.

Let’s look at the hip-hop classic “Forgot about Dre” first.

```
forgot_about_dre <- 'SOPJLFV12A6701C797'
knitr::kable(calc_cos_sim(forgot_about_dre))
```

song_id | title | artist_name | cos_sim
---|---|---|---
SOZCWQA12A6701C798 | The Next Episode | Dr. Dre / Snoop Dogg | 0.3561683
SOHEMBB12A6701E907 | Superman | Eminem / Dina Rae | 0.2507195
SOWGXOP12A6701E93A | Without Me | Eminem | 0.1596885
SOJTDUS12A6D4FBF0E | None Shall Pass (Main) | Aesop Rock | 0.1591929
SOSKDTM12A6701C795 | What’s The Difference | Dr. Dre / Eminem / Alvin Joiner | 0.1390476

Each song we recommended is a hip-hop song, which is a good start! Even on this reduced dataset, the engine is making *decent* recommendations.

The next song is “Come As You Are” by Nirvana. Users who like this song probably listen to other grunge/rock songs.

```
come_as_you_are <- 'SODEOCO12A6701E922'
knitr::kable(calc_cos_sim(come_as_you_are))
```

song_id | title | artist_name | cos_sim
---|---|---|---
SOCPMIK12A6701E96D | The Man Who Sold The World | Nirvana | 0.3903533
SONNNEH12AB01827DE | Lithium | Nirvana | 0.3568732
SOLOFYI12A8C145F8D | Heart Shaped Box | Nirvana | 0.1958162
SOVDLVT12A58A7B988 | Behind Blue Eyes | Limp Bizkit | 0.1186160
SOWBYZF12A6D4F9424 | Fakty | Horkyze Slyze | 0.0952245

Alright, 2 for 2. One thing to be mindful of when looking at these results is that we’re not incorporating *any* information about the songs themselves. Our engine isn’t built using any data about the artist, genre, or other musical characteristics. Additionally, we’re not considering any demographic information about the users, and it’s fairly easy to see how useful age, gender, and other user-level data could be in making recommendations. If we used this information in addition to our user play data, we’d have what is called a **hybrid recommendation system**.

Finally, we’ll recommend songs for our hard-partying friends that like the song “Shots” by LMFAO featuring Lil Jon (**that video is not for the faint of heart**).

```
shots <- 'SOJYBJZ12AB01801D0'
knitr::kable(calc_cos_sim(shots))
```

song_id | title | artist_name | cos_sim
---|---|---|---
SOWEHOM12A6BD4E09E | 16 Candles | The Crests | 0.2551851
SOLQXDJ12AB0182E47 | Yes | LMFAO | 0.1866648
SOSZJFV12AB01878CB | Teach Me How To Dougie | California Swag District | 0.1387647
SOYGKNI12AB0187E6E | All I Do Is Win (feat. T-Pain_ Ludacris_ Snoop Dogg & Rick Ross) | DJ Khaled | 0.1173063
SOUSMXX12AB0185C24 | OMG | Usher featuring will.i.am | 0.1012716

Well, the “16 Candles” result is a little surprising, but this might give us some insight into the demographics of users that like “Shots”. The other four recommendations seem pretty solid, I guess.

Cosine similarity is simple to calculate and is fairly intuitive once some basic geometric concepts are understood. The simplicity of this metric makes it a great first-pass option for recommendation systems, and can be treated as a baseline with which to compare more computationally intensive and/or difficult to understand methods.

I think that recommendation systems will continue to play a large role in our online lives. It can be helpful to understand the components underneath these systems, so that we treat them less as black-box oracles and more as the imperfect, data-driven prediction systems they are.

I hope you liked this brief excursion into the world of recommendation engines. Hopefully you can walk away knowing a little more about why Amazon, Netflix, and other platforms recommend the content they do.

Here are a few great resources if you want to dive deeper into recommendation systems and cosine similarity.

- Machine Learning :: Cosine Similarity for Vector Space Models (Part III)
- Series of blog posts about cosine similarity
- Wikipedia: Cosine Similarity
- Implementing and Understanding Cosine Similarity
- What are Product Recommendation Engines? And the various versions of them?
- Introduction to Collaborative Filtering

Have you ever wanted to define custom color palettes in Tableau, but didn’t know how? In this post, I’m going to walk through how we can use `R` to programmatically generate custom palettes in Tableau. Creating custom color palettes for Tableau has never been easier!

This is going to be a short post, with just a little bit of `R` code.

At the end of the post, you’ll see how to use `R` to generate custom color palettes to add to Tableau. We’ll show how to add palettes from the viridis color palette and ColorBrewer to Tableau.

Tableau already has a pretty good tutorial and this tutorial is pretty good too, but I thought I’d share some `R` code that helps to make it easier to define custom palettes.

The basic approach to defining custom palettes in Tableau is to modify the `Preferences.tps` file that comes with Tableau. This file can be found in your **My Tableau Repository**. It’s an `XML` file, which makes it pretty easy to hack around in the text editor of your choice (I prefer Sublime Text).

If we wanted to define a custom palette, based on the three color Set1 palette from ColorBrewer, we would just add this to our `Preferences.tps` file:

```
<color-palette name="Set1 3 Color Qual Palette" type="regular">
<color>#E41A1C</color>
<color>#377EB8</color>
<color>#4DAF4A</color>
</color-palette>
```

In it, we defined the `name`, the `type` (regular, ordered-diverging, or ordered-sequential), and the `color`s (HEX codes).

If you wanted to hand-edit this file, it might be tedious and you’d need to do a lot of copying-and-pasting.

So, why not write a quick function in `R` to generate this?

Or, maybe you’d just like a pre-filled `Preferences.tps` file with many useful palettes added already. If so, check out my GitHub repository, which has a fairly complete `Preferences.tps` file.

The `create_tableau_palette` function takes three arguments:

- `palette_name`: this is what you want the name to be in your file. In the example above, it was *Set1 3 Color Qual Palette*. Make sure you name it something descriptive enough to be found easily in Tableau.
- `palette_colors`: this is a character vector of colors which will be added to the palette. You should use HEX codes (e.g. `"#E41A1C"`, `"#377EB8"`).
- `palette_type`: this is one of the three palette types described above. In the previous example, it was *regular*.

The function will then print the resulting color palette to the console, so you can copy and paste the results. It uses the `cat` function, so it **only** prints stuff to the console; it isn’t necessary to store the result in a variable.

```
create_tableau_palette <- function(palette_name,
                                   palette_colors,
                                   palette_type) {
  # check palette type
  p_type <- match.arg(palette_type,
                      choices = c('ordered-diverging',
                                  'ordered-sequential',
                                  'regular'))
  # starting line
  line_start <- paste0('<color-palette name="',
                       palette_name,
                       '" type="',
                       p_type,
                       '">\n')
  # define colors
  colors <- paste0('<color>',
                   palette_colors,
                   '</color>\n')
  # ending line
  line_end <- "</color-palette>\n"
  # push together
  cat(paste0(c(line_start, colors, line_end)))
}
```

Here’s an example:

```
# character vector of first four Set2 color values
brewer_4 <- RColorBrewer::brewer.pal(4, 'Set2')
# use the function
create_tableau_palette(palette_name = "Color Brewer Set2 4",
palette_colors = brewer_4,
palette_type = 'regular')
```

```
## <color-palette name="Color Brewer Set2 4" type="regular">
## <color>#66C2A5</color>
## <color>#FC8D62</color>
## <color>#8DA0CB</color>
## <color>#E78AC3</color>
## </color-palette>
```

You could take the result above (remove the `##` that results from printing) and copy and paste it into the `Preferences.tps` file.

Of course, we could loop through different specifications to create many custom palettes rather quickly.

Let’s use this function to generate custom Tableau color palettes for the popular viridis palette.

We’re using the **purrr** package to do our “looping”.

```
# need to store result for better printing, result is just a list of NULL
x <- purrr::map(4:7,
                ~create_tableau_palette(palette_name = paste('Viridis', .x),
                                        palette_colors = viridis::viridis(.x),
                                        palette_type = 'ordered-sequential'))
```

```
## <color-palette name="Viridis 4" type="ordered-sequential">
## <color>#440154FF</color>
## <color>#31688EFF</color>
## <color>#35B779FF</color>
## <color>#FDE725FF</color>
## </color-palette>
## <color-palette name="Viridis 5" type="ordered-sequential">
## <color>#440154FF</color>
## <color>#3B528BFF</color>
## <color>#21908CFF</color>
## <color>#5DC863FF</color>
## <color>#FDE725FF</color>
## </color-palette>
## <color-palette name="Viridis 6" type="ordered-sequential">
## <color>#440154FF</color>
## <color>#414487FF</color>
## <color>#2A788EFF</color>
## <color>#22A884FF</color>
## <color>#7AD151FF</color>
## <color>#FDE725FF</color>
## </color-palette>
## <color-palette name="Viridis 7" type="ordered-sequential">
## <color>#440154FF</color>
## <color>#443A83FF</color>
## <color>#31688EFF</color>
## <color>#21908CFF</color>
## <color>#35B779FF</color>
## <color>#8FD744FF</color>
## <color>#FDE725FF</color>
## </color-palette>
```

And then you can copy and paste that right into your `Preferences.tps` file! You’ll need to remove those `##` symbols, but that shouldn’t be an issue if you’re using this function in your own `R` session. After you’ve added that to your file, restart Tableau, and then you should find the new palettes in your color choices.

This was a short post illustrating one way to use `R` to generate custom palettes for Tableau. I really like Tableau as a way to build interactive dashboards, but I have found the default color palettes to be somewhat lacking (or maybe I just have high color palette standards). Hopefully this post will show you how easy it is to add new palettes to Tableau without having to do too much tedious copying-and-pasting.

Jake Low wrote a really interesting piece that presented a few data visualizations that went beyond the typical 2016 election maps we’ve all gotten used to seeing.

I liked a lot of things about Jake’s post, here are three I was particularly fond of:

- His color palette choices: each palette that was used had solid perceptual properties and made sense for the data being visualized (i.e. diverging versus sequential)
- He made residuals from a model interesting by visualizing *and* interpreting them
- He explained the usage of a log-scale transformation in an intuitive way, putting it in terms of the data set being used for the analysis.

In this post, I’m going to replicate Jake’s analysis and then extend it a bit, by fitting a model which is a little more complicated than the one in his post.

In Jake’s post, he used d3.js to do most of the work. As usual, I’ll be using `R`.

Since this is `R`, here are the packages I’ll need:

```
library(tidyverse)
library(tidycensus)
library(ggmap)
library(scales)
library(maps)
```

For this analysis, we’ll need a few different data sets, all measured at the county level:

- 2016 election results
- Population density
- Educational information (how many people over 25 have some college or associate’s degree)
- Median income
- 2012 election results

I got election results from this GitHub repository. I plan on making a few heatmaps with this data, so I’m only including information from the contiguous United States. There’s also an issue I don’t fully understand with Alaska vote reporting, where it seems as though county-level reporting doesn’t exist.

```
github_raw <- "https://raw.githubusercontent.com/"
repo <- "tonmcg/County_Level_Election_Results_12-16/master/"
data_file <- "2016_US_County_Level_Presidential_Results.csv"
results_16 <- read_csv(paste0(github_raw, repo, data_file)) %>%
  filter(! state_abbr %in% c('AK', 'HI')) %>%
  select(-X1) %>%
  mutate(total_votes = votes_dem + votes_gop,
         trump_ratio_clinton =
           (votes_gop/total_votes) / (votes_dem/total_votes),
         two_party_ratio = (votes_dem) / (votes_dem + votes_gop)) %>%
  mutate(log_trump_ratio = log(trump_ratio_clinton))
```

To get some extra information about the counties I’ll use the **tidycensus** package. For each county, I’m pulling information about the population over 25, the number of people over 25 with some college or associate’s degree, and the median income.

```
census_data <- get_acs('county',
                       c(pop_25 = 'B15003_001',
                         edu = 'B16010_028',
                         inc = 'B21004_001')) %>%
  select(-moe) %>%
  spread(variable, estimate)
```

Finally, I pulled in information about population density from the Census American FactFinder. Once I downloaded that data, I threw it into a GitHub repo.

First, we’re going to make the 2016 election map.

I’ll be making all of the maps in this post using `ggplot2`. I’ve removed the code from this post, but if you look on my GitHub, you’ll notice some funky stuff going on with `scale_fill_gradientn`. To make the map more visually salient, I’ve played around a bit with how the colors are scaled.

Counties that are red voted more for Trump, and counties that are blue voted more for Clinton.

Then we’ll take a look at population density. Again, I’ve hidden the code to make this map, but I’ve used `ggplot2` to visualize the data, and I used one of the viridis color palettes (which is what Jake did as well).

If you look between the two maps, you might get a sense of the correlation between population density and the 2016 election results. It’s reasonable to expect that we might be able to do a decent job at predicting election results just using population density.

Finally, we’re going to try to **predict the 2016 election results using population density**.

First, let’s examine a scatter plot of the data.

```
results_and_pop_density <- results_16 %>%
  inner_join(pop_density, by = c('combined_fips' = 'FIPS')) %>%
  mutate(two_party_ratio = (votes_dem) / (votes_dem + votes_gop))
```

We’re going to fit a linear model which tries to predict the Two Party Vote Ratio \(\frac{N_{clinton}}{N_{clinton}+N_{trump}}\) using population density. Calculating the two party ratio this way means that a county with a value over 0.5 favored Clinton.
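As a quick worked example of the ratio, with made-up vote counts for a hypothetical county:

```r
# Hypothetical county: 600 votes for Clinton, 400 for Trump
votes_dem <- 600
votes_gop <- 400
votes_dem / (votes_dem + votes_gop)  # 0.6: above 0.5, so this county favored Clinton
```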

```
model1 <- lm(two_party_ratio ~ I(log(population_density)),
             data = results_and_pop_density)
# extract residuals
# negative values underestimate Trump (prediction is too high)
# positive values underestimate Clinton (prediction is too low)
results_and_pop_density$resid_model1 <- resid(model1)
```

I’m hiding the code used to generate the map, but what we’re visualizing isn’t actually what the model predicted. Instead, **we’re visualizing the residuals**, which tell us about how good or bad the model’s prediction is. Calculating residuals is straightforward (we just see how far off our prediction was from the actual observed value):

\[\text{residual} = \text{actual proportion} - \text{predicted proportion}\]

The larger the residual (in absolute value), the farther the model was from being right for that observation. Analyzing residuals and learning more about where your model is wrong can be one of the most fascinating parts of statistics and data science.
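In `R`, `resid()` returns exactly this actual-minus-predicted quantity for a fitted model; here is a quick check using the built-in `cars` data set:

```r
fit <- lm(dist ~ speed, data = cars)
# residual = actual - predicted
manual_resid <- cars$dist - fitted(fit)
all.equal(unname(resid(fit)), unname(manual_resid))  # TRUE
```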

In my opinion, visualizing the residuals from this simple model tells an even more interesting story than the election map above.

For example, look at the southwest corner of Texas, which had much higher rates of Clinton favorability than the model predicted. Additionally, this map also shows Trump’s appeal through the middle of the country. Much of the nation’s “breadbasket” is colored pink, indicating that these counties favored Trump much more than the model predicted.

My first thought after Jake’s post was that the model he built was pretty simple. There are a few other variables it would be good to adjust for.

What would the previous residual map look like if we controlled for factors like education, income, and age? How about if we add in the two party ratio from the 2012 election (which came from this GitHub repository)?

We’re also going to deviate from the linear model we fit before. I think it would make more sense to fit a logistic regression instead of a linear model, since the thing we want to predict is a proportion (it lives on the range \([0,1]\)), not necessarily a continuous value (which lives on the range \((-\infty, \infty)\)).

I would hope that building a more robust model would make the residual map even more interesting. Let’s see what happens!

```
# fit logistic regression
model2 <- glm(cbind(votes_dem, votes_gop) ~ I(log(population_density)) +
                I(log(inc)) + edu_pct + state_abbr + I(log1p(two_party_2012)),
              data = results_pop_census,
              family = 'binomial',
              na.action = na.exclude)
# extract residuals
results_pop_census$resid_model2 <- resid(model2, type = 'response')
# join up to geographic data
results_pop_census_map <- county_map_with_fips %>%
  inner_join(results_pop_census, by = c('fips' = 'combined_fips'))
```

A few things are **noticeably different** between the two maps:

- The upper Midwest is considerably pinker than the previous map. Incorporating the 2012 election results demonstrates how conventional wisdom resulted in poor predictions for 2016.
- Adding variables to the model substantially improved the fit. Notice how the range for the residuals is now much smaller.
- Parts of the Northeast are now much pinker as well. Trump’s vow to bring back coal jobs resonated with this area of the country.

Some things **stayed the same**:

- The southwestern corner (i.e. border towns) of Texas is still green (Clinton performed better than the model would have predicted).
- The “breadbasket” or “flyover” part of the country is still pink (Trump performed better than the model would have predicted).
- Most coastal population centers are green (Clinton performed better than the model would have predicted).

We can also look at counties where the model was most wrong, to see if there are any interesting patterns at the highest level.

```
top_n(results_pop_census, 10, resid_model2) %>%
  bind_rows(top_n(results_pop_census, 10, -resid_model2)) %>%
  ggplot(aes(reorder(paste0(county_name, ', ', state_abbr), resid_model2),
             resid_model2)) +
  geom_col(aes(fill = resid_model2), colour = 'black') +
  coord_flip() +
  scale_fill_gradientn(values = rescale(c(-.18, -.05, -.02, 0, .02, .05, .18)),
                       colours = brewer_pal(palette = 'PiYG')(7),
                       limits = c(-.18, .18),
                       name = 'Prediction Error (pink underestimates Trump, green underestimates Clinton)') +
  theme(legend.position = 'none') +
  xlab('') +
  ylab('Prediction Error (pink underestimated Trump, green underestimated Clinton)') +
  ggtitle('Counties with the Highest Prediction Errors',
          subtitle = 'Top 10 over- and under-predictions selected')
```

In the above plot, we can see that three of the highest ten errors in favor of Clinton (the county performed better for Clinton than the model predicted) were from Texas. More analysis showed that five of the highest twenty five errors in favor of Clinton were in Texas. For Trump, Virginia, Kentucky, and Tennessee each had two counties in the top 10.

Finally, we can look at how wrong our new-and-improved model was across states. There are a few ways to summarize residuals over states (we could even build another model that predicted state results), but I’ll opt for the simplest route, and calculate **the median error for each state**.
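The summarization itself is a short `dplyr` chain. Here is a minimal sketch using a toy data frame in place of the real county results (the residual values below are hypothetical):

```r
library(dplyr)

# toy stand-in for the county-level results (hypothetical values);
# recall residual = actual - predicted Clinton share, so negative
# medians mean Trump out-performed the model on net
toy_results <- data.frame(
  state_abbr   = c("WI", "WI", "TX", "TX"),
  resid_model2 = c(-0.04, -0.02, 0.05, 0.01)
)

state_medians <- toy_results %>%
  group_by(state_abbr) %>%
  summarise(median_error = median(resid_model2))

state_medians
# TX:  0.03 (Clinton out-performed the model, on net)
# WI: -0.03 (Trump out-performed the model, on net)
```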

Simply calculating the median prediction error tells an interesting story about the 2016 election. We see the Midwest lighting up in pink, indicating areas where Trump out-performed the expectations of the model. We see areas out west where Clinton out-performed the expectations of the model. Finally, we see areas which are colored faintly, indicating that the model’s median prediction error was fairly close to 0.

I wanted to build on the great work of Jake Low and demonstrate how going a bit deeper with our model fitting can allow us to refine the data stories we tell. I also wanted to take his analysis (done in d3.js) and demonstrate how it could be replicated in `R`.

I think the key takeaway from this post is that investigating the residuals from a model can result in some revelatory findings from a data analysis. Our models will always be wrong, but understanding and telling a story about *what* they got wrong can make us feel a little better about that fact.

Thanks for reading through this post. If you want to look at the code for this analysis, you can find it on my GitHub. Let me know what you thought about my post. Did any of the maps surprise you?

In this post, I’m going to make the claim that we can simplify some parts of the machine learning process by using the analogy of making soup. I think this analogy can improve how a data scientist explains machine learning to a broad audience, and it provides a helpful framework throughout the model building process.

Relying on some insight from the CRISP-DM framework, my own experience as an amateur chef, and the well-known iris data set, I’m going to explain why I think that the soup making and machine learning connection is a pretty decent first approximation you could use to understand the machine learning process.

This post is pretty light on code, with just a few code chunks for illustrative purposes. These are the packages we’ll need.

```
library(tidyverse)
library(glmnet)
library(caret) # caret or carrot? :)
```

The code for this post can be found on my GitHub.

**The CRISP-DM Framework (Kenneth Jensen)** CC BY-SA 3.0, Link

I recently gave a presentation on the CRISP-DM framework to the various teams that make up the IT Department at my organization. While I was discussing the Modeling phase of CRISP-DM, I got some questions that come up when you talk about data science with software-minded audiences.

*When you’re building a model, what are you doing? Where are you spending your time? How long does that take?*

The people in IT know that data, machine learning, and artificial intelligence are impacting their daily lives, from chat bots to spam email detection to the curation of news feeds. While they have awareness of the *impact* of data science, they may not have as much awareness of the *processes* of data science. I think the responsibility of demystifying machine learning rests on data scientists, and it’s imperative to have comprehensible mental models that can be employed to describe and de-clutter the machine learning process.

In my response to the questions, I thought I did a fairly good job of breaking down the three components of machine learning and the typical amount of iteration within each component by mirroring the CRISP-DM breakdown:

- Problem type and associated modeling technique (iteration level: low)
- Parameter tuning (iteration level: high)
- Feature engineering and selection (iteration level: high)
Of course, I was in a room full of very smart people, trying to extemporaneously explain parameter tuning and feature engineering in a coherent way, so I probably could have done a better job.

A few minutes after the meeting, I realized I could have used a simple analogy to explain the machine learning process.

**Machine learning is like making a soup. First, you pick the type of soup you want to make. Second, you figure out the ingredients that are going to be in the soup and how they should be prepared. Third, you determine how you’re going to cook the soup. Finally, you taste the soup and iterate to make it taste better.**

While I’m certainly not an expert chef, I think you can boil down making a soup into a few simple components.

In this step, we’re just trying to figure out what we want to make. In the machine learning world, this is where we need to think carefully about the problem we’re trying to solve, and which of the many machine learning algorithms can be used to attack it.

**Soup Making:** what type of soup are we trying to make? are there external characteristics (season, weather, mood) we should consider? what soup will we enjoy? how difficult is this soup to make?

**Machine Learning:** what type of question(s) are we trying to answer? what type of model will allow us to answer this question? how is this model implemented? what are its assumptions?

Now that we’ve decided what we’re going to make, we need to head to the grocery store and pick up the ingredients. After that we’ll need to prepare the ingredients for cooking. Similarly, we’ll need to understand the variables and context for our data set, and create/transform/aggregate our variables to get them into a useful form for modeling.

**Soup Making:** what vegetables are needed and how should they be prepped? what type of protein will we be using? do we need some type of stock for the soup?

**Machine Learning:** what variables are needed for this model? do we need to standardize any of the variables? are there non-numeric variables? if so, how should those be handled?
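As a tiny illustration of what that preparation can look like (using base R and the built-in `iris` data, not part of any particular model in this post), centering/scaling the numeric variables and one-hot encoding a categorical one might be sketched as:

```r
# sketch: center/scale numeric predictors and one-hot encode a categorical one
# (illustrative only, using the built-in iris data)
num_X <- scale(iris[, 1:4])  # each column gets mean = 0, sd = 1

# one-hot encode Species; -1 drops the intercept so each level gets an indicator
cat_X <- model.matrix(~ Species - 1, data = iris)

X <- cbind(num_X, cat_X)
dim(X)  # 150 rows, 4 scaled numeric columns + 3 species indicators
```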

Once we’ve decided upon the type of soup and the ingredients, it’s time to make the soup. This step involves selecting the best combination of cooking methods to make the optimal (i.e. most tasty) soup. When we perform parameter tuning, we’re trying to pick the best combination of values to make our machine learning algorithms reach optimal performance. These values are different from the variables we previously discussed, as the variables are the *inputs* for our ML algorithms, while the tuning parameters describe the algorithm itself.

**Soup Making:** what heat are we cooking at and for how long? is the pot covered or uncovered? will the pot be on the stove top for the entirety of cooking or will we move it to the oven? how long will we let the soup simmer? how big of a batch are we making?

**Machine Learning:** what is our loss function? is there a learning rate? what’s the k in our k-fold cross validation? how many trees in our random forest? how much should we penalize complexity? what is the training/validation split? does an ensemble model outperform a single model?
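One concrete way those choices get organized in practice is a tuning grid that cross-validation then sweeps over. Here's a minimal sketch with hypothetical parameter values (not the ones used later in this post):

```r
# sketch: enumerate candidate tuning-parameter combinations for an elastic net
# (hypothetical values, for illustration only)
tuning_grid <- expand.grid(
  lambda = 10 ^ seq(-4, 0, length.out = 5), # penalty strength
  alpha  = c(0, 0.5, 1)                     # L1/L2 mixing weight
)
nrow(tuning_grid) # 15 combinations to evaluate via cross-validation
```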

Let’s see this framework in action. I’m going to pick a straightforward classification task to demonstrate. I’m going to use the iris dataset and try to predict whether a flower is from the setosa species.

Here’s a quick look at the data:

Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 2.5 | 3.0 | 1.1 | versicolor |
6.0 | 2.2 | 5.0 | 1.5 | virginica |
5.1 | 3.5 | 1.4 | 0.2 | setosa |
6.6 | 3.0 | 4.4 | 1.4 | versicolor |
4.9 | 2.5 | 4.5 | 1.7 | virginica |
6.3 | 3.3 | 4.7 | 1.6 | versicolor |

Just looking at the scatterplot matrix above, we can see that trying to classify flowers into the setosa species is a bit of a toy problem (the red points are well-separated from the blue and green). In fact, just by examining if the petal length of a flower is smaller than 2.45, we can determine if the flower is from the setosa species.

`table('Petal Cut' = iris$Petal.Length >= 2.45, 'Setosa' = iris$Species == 'setosa')`

```
## Setosa
## Petal Cut FALSE TRUE
## FALSE 0 50
## TRUE 100 0
```

The iris data is used as a “hello, world” of data science. It has nice applications across a broad spectrum of tasks: clustering, regression, classification, and visualization. It’s worth getting excited about!

As discussed earlier, the problem we’re trying to tackle is predicting if a flower is from the setosa species. Since we already know that there is something we’re trying to predict, we only have to explore supervised machine learning algorithms. We also know that we’re not interested in building a model which generates predictions on a continuous range. We only want to answer a *binary* question: is this a setosa or not?

This narrows the set of algorithms even further, and we only need to explore models which will either generate a classification for a flower, or will generate a probability of the flower’s species being setosa.

For this post, I’m going to use elastic net logistic regression. Elastic net fits a logistic regression while also penalizing the complexity of the model. The excellent **glmnet** package allows us to build models in this fashion. I’m going to use ridge regression (elastic net with \(\alpha=0\)), which penalizes the \(L_2\) norm of the coefficients. Fitting an elastic net with \(\alpha=1\) generates a LASSO model, which can also be useful for variable selection, but I’m using a different technique to do variable selection in this post, so I’m sticking with ridge here.

Feature engineering and selection is one of the most time-consuming parts of the machine learning process. To keep this post brief, I’m going to go through just a few feature engineering steps.

First, we split the data into training and testing sets. After this, I extract only the numeric predictor variables for the model, and then add squared terms as well as interactions between each of the first order effects. These two steps transform the predictor matrix from four columns into fourteen. Finally, we use the `preProcess` function from the **caret** package to center and scale each variable so that it has mean = 0 and variance = 1.

```
library(tidyverse)
library(caret)

set.seed(123)
# create training and testing split (70% train)
training_index <- sample(nrow(iris), 0.7 * nrow(iris))
iris_train <- iris[training_index, ]
iris_test <- iris[-training_index, ]
# grab X data, add squared term for each column
iris_X_train <- iris_train[, -5] %>% mutate_all(funs('sq' = . ^ 2))
iris_X_test <- iris_test[, -5] %>% mutate_all(funs('sq' = . ^ 2))
# all two-way interactions with first order terms
iris_X_train <-
  model.matrix(~ -1 + . + (Sepal.Length + Sepal.Width + Petal.Length + Petal.Width) ^ 2,
               data = iris_X_train)
iris_X_test <-
  model.matrix(~ -1 + . + (Sepal.Length + Sepal.Width + Petal.Length + Petal.Width) ^ 2,
               data = iris_X_test)
# fit center and scale processor on training data
preProc <- preProcess(iris_X_train)
iris_X_train_cs <- predict(preProc, iris_X_train)
iris_X_test_cs <- predict(preProc, iris_X_test)
# labels, converted to 0/1 factors for glmnet (same coding for train and test)
iris_y_train <- as.factor(as.numeric(iris_train$Species == 'setosa'))
iris_y_test <- as.factor(as.numeric(iris_test$Species == 'setosa'))
```

The data augmentation steps I went through created numeric variables, but we could use decision trees or similar techniques to create categorical variables from numeric. We use those methods when it’s unnecessary to retain the continuous nature of a numeric variable (often the case with age or physiological measurements like height or weight).
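For instance, a numeric variable can be discretized with base R's `cut()`. A minimal sketch, with break points chosen arbitrarily for illustration:

```r
# sketch: bin a continuous variable into categories
# (break points are arbitrary, for illustration only)
sepal_bins <- cut(iris$Sepal.Length,
                  breaks = c(-Inf, 5.5, 6.5, Inf),
                  labels = c("short", "medium", "long"))
table(sepal_bins)
```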

After we’ve gone through the feature engineering step, we can think about which variables we’ll actually want to use in our model. There can be considerable back-and-forth between feature engineering and feature selection, just like iterating on a recipe may involve different ingredients and different ways of preparing those ingredients.

To do feature selection, I’m going to once again turn to the **caret** package and use the `rfe` function. We could also use the infrastructure of the `glmnet` package itself (fitting the LASSO with \(\alpha = 1\)) to do variable selection, as mentioned earlier.

```
set.seed(123)
rf_control <- rfeControl(rfFuncs, method = 'cv', number = 5)
iris_rfe <- rfe(x = iris_X_train_cs,
                y = iris_y_train,
                sizes = 2:14, # select at least two variables
                rfeControl = rf_control)
iris_rfe$optVariables
```

`## [1] "Petal.Length:Petal.Width" "Sepal.Length:Petal.Width"`

The procedure we used for variable selection suggested two interaction terms: petal length × petal width and sepal length × petal width. It’s usually a bad idea to include interaction terms without also including the corresponding lower-order terms, so our final model will have three variables: petal width, petal length, and the petal width × petal length interaction.

In elastic net regression, we have two parameters to select: \(\alpha\) and \(\lambda\). \(\alpha\) controls the weight we give to the \(L_1\) and \(L_2\) penalties (with \(\alpha\) being placed on the \(L_1\) norm, and \(1-\alpha\) being placed on the \(L_2\) norm). Putting all the weight on \(L_2\) norm is better known as Tikhonov regularization or ridge regression. I’ve already made my choice of \(\alpha\) for this post, so we don’t have to tune it.
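Written out, the elastic net objective makes this weighting concrete. For logistic regression, glmnet minimizes the penalized negative log-likelihood

\[-\tfrac{1}{n}\,\ell(\beta) + \lambda\left[\alpha\|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2\right]\]

so setting \(\alpha = 0\) leaves only the squared \(L_2\) penalty (ridge), and \(\alpha = 1\) leaves only the \(L_1\) penalty (LASSO).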

The other parameter we need to tune is \(\lambda\), which controls the strength of the penalty we’ll place on the \(L_2\) norm of the coefficients. A higher \(\lambda\) value will “shrink” the model’s coefficients, whereas a smaller \(\lambda\) value will result in a model more similar to the standard unregularized logistic regression fit.

The `cv.glmnet` function allows us to use cross-validation to tune \(\lambda\).

```
vv <- c("Petal.Width", "Petal.Length", "Petal.Length:Petal.Width")
iris_X_train_cs_sub <- iris_X_train_cs[, colnames(iris_X_train_cs) %in% vv]
iris_X_test_cs_sub <- iris_X_test_cs[, colnames(iris_X_test_cs) %in% vv]
```

```
cv_glm1 <- cv.glmnet(x = iris_X_train_cs_sub, y = iris_y_train,
                     alpha = 0, family = 'binomial',
                     nfolds = length(iris_y_train) - 1, # nearly LOOCV
                     standardize = FALSE, intercept = FALSE)
lambda_1se_1 <- cv_glm1$lambda.1se # store "best" lambda for now
plot(cv_glm1)
```

We can explore \(\lambda\) a bit more to improve the model fit.

```
cv_glm2 <- cv.glmnet(x = iris_X_train_cs_sub, y = iris_y_train,
                     alpha = 0,
                     lambda = exp(seq(-10, log(lambda_1se_1), length.out = 500)),
                     family = 'binomial',
                     nfolds = length(iris_y_train) - 1, # nearly LOOCV
                     standardize = FALSE, intercept = FALSE)
lambda_2 <- exp(-6.5)
plot(cv_glm2)
```

Cross-validation suggests a rather small value of \(\lambda\), indicating that the model’s fit doesn’t improve with a high degree of regularization. In the next section, we’ll use two values of \(\lambda\) to fit the model (\(\lambda_1\) = 0.0698, \(\lambda_2\) = 0.0015), and then investigate the results.

After we fit the model (or make the soup), we have to determine how good it is. For this example, I’m going to use a simple method for determining the classification accuracy: the percentage of the time the classifier got the species correct. Using this metric implies that we’re treating false positives and false negatives as equally bad errors. In most real world situations, this is not the case.

```
# naive classification accuracy
class_acc <- function(preds, labels, thresh = 0.5){
  tt <- table(preds > thresh, labels)
  sum(diag(tt)) / sum(tt)
}
```
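As a quick sanity check of this accuracy function, here it is applied to some made-up predictions and labels (repeating the definition so the snippet stands alone):

```r
# naive classification accuracy (same function as defined above)
class_acc <- function(preds, labels, thresh = 0.5){
  tt <- table(preds > thresh, labels)
  sum(diag(tt)) / sum(tt)
}

# made-up example: 3 of 4 predictions land on the correct side of 0.5
preds  <- c(0.9, 0.8, 0.3, 0.6)
labels <- c(1, 1, 0, 0)
class_acc(preds, labels) # 0.75
```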

In addition to the two models fit using regularization, I’m going to fit an unregularized model with the selected features along with an unregularized model with only the variables that were present in the original dataset.

```
# regularized models
ridge_1 <- glmnet(x = iris_X_train_cs_sub, y = iris_y_train,
                  family = 'binomial', lambda = lambda_1se_1, alpha = 0,
                  standardize = FALSE, intercept = FALSE)
ridge_2 <- glmnet(x = iris_X_train_cs_sub, y = iris_y_train,
                  family = 'binomial', lambda = lambda_2, alpha = 0,
                  standardize = FALSE, intercept = FALSE)
# unregularized
glm1 <- glm(iris_y_train ~ -1 + ., data = data.frame(iris_X_train_cs_sub),
            family = 'binomial')
glm2 <- glm(iris_y_train ~ -1 + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
            data = data.frame(iris_X_train_cs),
            family = 'binomial')
```

model | type | class accuracy |
---|---|---|
ridge_1 | regularized-lambda1 | 84.4% |
ridge_2 | regularized-lambda2 | 95.6% |
glm1 | full | 100.0% |
glm2 | original | 100.0% |

In the table above, we can see that the unregularized models resulted in better predictions on our test set. This shouldn’t be too surprising, as this classification example is somewhat contrived. In more realistic settings, we may see better predictive performance from models fit using regularization.

When we compare making a soup to machine learning, we get a simple and understandable lens through which we can look at machine learning. Just like making soup or cooking in general, iteration is a key component of machine learning. Anyone who develops recipes will tell you they rarely get it right the first time. If they do get it right the first time, maybe that’s because

- they got lucky
- they aren’t trying to make too difficult of a dish
- they’re an experienced cook

These situations have clear parallels to machine learning. Maybe you got lucky, maybe your prediction task is fairly easy and all you need is a simple model, or maybe you’re an experienced data scientist.

I feel like this analogy is a pretty straightforward way to explain machine learning to a broad audience of people that are interested in the topic. I found a similar article which discussed ideas that are related to the ones I talked about in my post. If you know of any other posts with similar sentiments, I hope you’ll share them with me!

Thanks for reading my post and leave a comment below if you have any thoughts or feedback!

I’m going to use this post to discuss some of the aspects of data science that interest me most (tidy data as well as using data to guide strategy). I’ll be discussing these topics through the lens of a data analysis of results from a few high school golf tournaments.

I’m going to take a little bit of time to talk about **tidy data**. When I scraped the data used for this analysis, it wasn’t really stored in a tidy format, and there’s a good reason for that. I’ll briefly discuss what makes the original data “untidy”, and what we can do to whip it into tidy shape.

After that, I’ll explore the data a little bit, with the goal of using the findings from this analysis to

- improve strategy on the golf course
- inspire ideas for looking deeper at the data

Finally, I’ll show how we can use linear models along with mixed-effects linear models to build statistical models which allow us to quantify differences between groups of interest.

Here are the questions I hope to answer using the data:

- What are the most difficult holes? Why are they difficult?
- What is the easiest hole? What makes it easier than the rest?
- Are there clear differences in scores between regional and sectional golf tournaments?
- What separates the better high school golfers (those that break 90) from the other golfers?
- What are some general strategy guidelines for playing this golf course?

All of the analysis in this post was performed using `R`. I’ve omitted some of the code used to generate plots to improve readability. The full .Rmd file can be found on my GitHub.

This analysis will involve looking at results from WIAA golf tournaments that were hosted at Pine Valley Golf Course (PV). Since 2011, either a regional or sectional tournament was played at PV for Division 3 high schools. Pine Valley is a par 71 golf course, with two par 3’s, six par 4’s and one par 5 on the front nine, and two par 3’s, five par 4’s and two par 5’s on the back nine. Here’s an image of the scorecard:

Regional tournaments come before sectionals, with teams and individuals that played well in the regional tournament advancing to sectionals. Teams and individuals that play well in the sectional tournament will advance to the state tournament. It is a fair (and testable!) hypothesis that scores will tend to be better in sectional tournaments.

First I’m going to load a few packages.

```
library(tidyverse)
library(scales)
library(lme4)
theme_set(theme_bw())
```

I’ve already scraped the data and cleaned it up a bit. I’ve thrown the code and data in a github repository.

```
gh <- "https://raw.githubusercontent.com/bgstieber/files_for_blog/master/golf-tidy-data/data"
# read in data and only keep scores lower than 121
tidy_scores <- read_csv(paste0(gh, '/tidy_golf_scores.csv')) %>%
  filter(tot <= 120)
untidy_scores <- read_csv(paste0(gh, '/untidy_golf_scores.csv')) %>%
  filter(tot <= 120)
```

The data is comprised of 472 golfers with scores for each of the 18 holes they played. In total, there are 8496 scores in this data set.

Let’s take a peek at the first few rows of each data set. The first data set we’ll look at is the untidy data, which is fairly close to what the WIAA provides on the webpages for each of the tournaments.

name | year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | out | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | in | tot | tourn_type | tourn_year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1 | 12 | 7 | 4 | 4 | 4 | 5 | 4 | 5 | 6 | 6 | 45 | 6 | 5 | 5 | 5 | 7 | 4 | 5 | 4 | 5 | 46 | 91 | sectionals | 2011 |
A2 | 12 | 5 | 5 | 5 | 5 | 4 | 3 | 5 | 5 | 4 | 41 | 5 | 6 | 5 | 5 | 5 | 7 | 5 | 3 | 5 | 46 | 87 | sectionals | 2011 |
A3 | 12 | 7 | 5 | 6 | 6 | 5 | 6 | 5 | 5 | 4 | 49 | 6 | 6 | 6 | 4 | 7 | 6 | 7 | 4 | 5 | 52 | 101 | sectionals | 2011 |
A4 | 11 | 8 | 5 | 5 | 5 | 7 | 7 | 5 | 5 | 4 | 51 | 6 | 7 | 5 | 4 | 5 | 4 | 5 | 4 | 5 | 45 | 96 | sectionals | 2011 |
A5 | 11 | 14 | 4 | 5 | 4 | 3 | 6 | 7 | 4 | 4 | 51 | 6 | 6 | 11 | 9 | 7 | 8 | 5 | 5 | 7 | 64 | 115 | sectionals | 2011 |
A6 | 11 | 6 | 4 | 4 | 6 | 5 | 4 | 5 | 4 | 3 | 41 | 4 | 5 | 7 | 4 | 6 | 5 | 6 | 3 | 4 | 44 | 85 | sectionals | 2011 |

This is an example of a “wide” data set.

Now let’s take a look at the first few rows of the tidy data.

name | year | out | in | tot | tourn_type | tourn_year | hole | score | par | ob | water | side | rel_to_par |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1 | 12 | 45 | 46 | 91 | sectionals | 2011 | 1 | 7 | 5 | TRUE | TRUE | front | 2 |
A2 | 12 | 41 | 46 | 87 | sectionals | 2011 | 1 | 5 | 5 | TRUE | TRUE | front | 0 |
A3 | 12 | 49 | 52 | 101 | sectionals | 2011 | 1 | 7 | 5 | TRUE | TRUE | front | 2 |
A4 | 11 | 51 | 45 | 96 | sectionals | 2011 | 1 | 8 | 5 | TRUE | TRUE | front | 3 |
A5 | 11 | 51 | 64 | 115 | sectionals | 2011 | 1 | 14 | 5 | TRUE | TRUE | front | 9 |
A6 | 11 | 41 | 44 | 85 | sectionals | 2011 | 1 | 6 | 5 | TRUE | TRUE | front | 1 |

On inspection, there are some clear differences between the tidy and untidy data sets.

In some ways, recognizing tidy/untidy data can be one of those *I know it when I see it* things. We can follow the fairly solid guidelines Hadley Wickham has proposed to make the distinction a bit more concrete (if you haven’t read that paper, I **highly** recommend it):

- Each **variable** is a column
- Each **observation** is a row
- Each type of observational unit is a table (this isn’t as important for this post)

It’s important to think about those two bolded words within the context of this analysis. We’re interested in making our way around a golf course, to hopefully play our best. Playing our best means minimizing mistakes *throughout* a round of golf.

Some people like to think of golf as a game composed of 18 “mini games” within it. For an analysis with the primary focus of identifying ways to get around the course as strategically as possible, I think the best way to look at the data is to have each observational unit be the score on **one hole** for **each competitor**. Our tidy data set is constructed this way, with **one row per competitor per hole**. The untidy data has a structure of **one row per competitor**. This structure may be useful if we’re only interested in looking at final scores for each competitor. It’s also useful for a concise summary of a competitor’s performance on a website (which was its original purpose).

Transforming the data from untidy to tidy is fairly simple using the `gather` function from the **tidyr** package.

```
untidy_scores %>%
  # only select a handful of columns to make printing easier
  select(-year, -out, -`in`, -tot, -tourn_type, -tourn_year) %>%
  gather(hole, score, -name)
```

```
## # A tibble: 8,496 x 3
## name hole score
## <chr> <chr> <dbl>
## 1 A1 1 7
## 2 A2 1 5
## 3 A3 1 7
## 4 A4 1 8
## 5 A5 1 14
## 6 A6 1 6
## 7 A7 1 7
## 8 A8 1 6
## 9 A9 1 7
## 10 A10 1 11
## # ... with 8,486 more rows
```

Now that we have the data in a way that should be straightforward to analyze, let’s start taking a look at some summaries of the results from these competitions.

We could either look at the scores on each hole or we could look at the score *relative to par* for each hole. Scores will generally be higher depending on the par of the hole. Par 3’s are shorter than par 4’s which are shorter than par 5’s, and the longer the hole the more strokes (usually) it will take to complete. Since this represents an inherent “bias” in the score data, I think it’s better to analyze the score relative to par.
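The adjustment itself is just score minus par. Here's a minimal base R sketch with made-up data (the real data has one row per competitor per hole):

```r
# sketch: convert raw scores to scores relative to par (made-up data)
scores <- data.frame(
  hole  = c(1, 2, 3),
  par   = c(5, 4, 3),
  score = c(6, 4, 5)
)
scores$rel_to_par <- scores$score - scores$par
scores$rel_to_par # 1 0 2: bogey, par, double bogey
```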

First, we’ll visualize the distribution of scores relative to par.

We’re able to see a few things pretty quickly from the heatmap above:

- Holes 4 and 12 are clearly the hardest on the course, with over 50% of the field making double bogey or worse
- Hole 11 is the easiest on the course, with the highest percentage of the field making a birdie or better, and the lowest percentage making double or worse.
- About 50% of the field made a bogey on 13, which is a par 3

Next, we’re going to take a look at the average score relative to par for each of the 18 holes played at PV. Each of the 18 holes played at least one stroke over par.
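The per-hole averages behind this kind of summary can be computed with a `group_by()`/`summarise()` pipeline. A sketch using a made-up miniature version of the tidy data (the `hole` and `rel_to_par` column names match the tidy data set shown earlier):

```r
library(dplyr)

# made-up miniature version of the tidy scores data, for illustration
tidy_demo <- data.frame(
  hole = c(1, 1, 2, 2, 3, 3),
  rel_to_par = c(2, 1, 0, 1, 1, 1)
)

avg_by_hole <- tidy_demo %>%
  group_by(hole) %>%
  summarise(avg_rel_to_par = mean(rel_to_par)) %>%
  arrange(desc(avg_rel_to_par)) # hardest holes first
avg_by_hole
```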

The four hardest holes are (in order of difficulty): 12, 4, 7, and 1.

PV was my home course growing up, so here’s a little *local knowledge* about each of those holes.

12 is the longest par 4 on the course, playing at about 430 yards from the back tees. The tee shot can be intimidating with out-of-bounds along the left, and some trees and mounds on the right. This hole seemed to play into the wind more often than downwind, making it even longer than the 430 yards on the card.

4 is rated as the hardest hole on the golf course according to the USGA handicapping system. The tee shot can be somewhat difficult, with out-of-bounds, trees, and water on the left, and more trees on the right. That being said, the hole is not overly long, so hitting something less than driver is not a bad route. The green is guarded by a stream in front of it, and hilly terrain surrounding it. Putting on this green can be difficult, as it has a lot of slope and undulation.

7 is a short dogleg left. The main difficulty on this hole is the tight dogleg golfers must navigate off the tee. There are trees which can block an approach shot if the tee shot veers too far right, and hills along with trees and water on the left side for those getting too aggressive off the tee. The approach into the green (provided one has a clear shot) is not that difficult, with no bunkers and a lot of room to miss. The major difficulty on this hole is the tee shot, but from there it’s fairly straightforward.

1 is a shorter par 5 with a daunting tee shot. There’s OB on the left side, and trees and fescue along the right side. On top of that, a golfer must steady their first tee jitters and focus on hitting a solid tee shot in front of the “gallery” of other players, coaches, and spectators. After hitting the tee shot, the fun doesn’t end. The approach into the green is challenging, as trees surround the green and the hole narrows all the way into the green.

Each of the four most difficult holes requires a well-struck tee shot. On the 12th, it’s important to hit a long and straight shot. The 4th and 1st holes require precision and steady nerves to avoid trouble and hit a narrow fairway. The 7th requires a controlled and well-executed tee shot that is shaped right to left.

The next visualization is a lollipop chart, which is a great way to avoid the “visually aggressive” moiré effect that bar charts can sometimes fall victim to, without losing the perceptually sound concept of position along a common scale. Julia Silge uses this chart type to great effect in some of her blog posts.

We took the same data that was displayed in the previous chart, but augmented it slightly to tell a different story. The orange lollipops on the chart show the holes that have higher scores relative to par than average. Three of the four hardest holes are on the front nine, occurring in the first seven holes of the course. We can also see how much more difficult holes 1, 4, 7, and 12 are than the rest (the “sticks” of the lollipops rise much higher than the other holes which played harder than average).
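Since the plotting code is omitted from this post, here is just one way such a lollipop chart could be built with **ggplot2**, using made-up per-hole averages:

```r
library(ggplot2)

# made-up per-hole averages, for illustration only
set.seed(42)
hole_avgs <- data.frame(hole = 1:18,
                        avg_rel_to_par = runif(18, 0.9, 1.9))
overall_avg <- mean(hole_avgs$avg_rel_to_par)

p <- ggplot(hole_avgs, aes(x = factor(hole), y = avg_rel_to_par)) +
  # the "stick": a segment from the overall average to each hole's average
  geom_segment(aes(xend = factor(hole), yend = overall_avg,
                   colour = avg_rel_to_par > overall_avg)) +
  # the "candy": a point at each hole's average
  geom_point(aes(colour = avg_rel_to_par > overall_avg), size = 3) +
  labs(x = 'Hole', y = 'Average Score Relative to Par',
       colour = 'Harder than average')
p
```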

The easiest holes on the course can be identified as well. Hole 11 stands out as being the easiest hole at PV. Hole 11 is a straightforward par 5 with almost no trouble off the tee and a fairly generous fairway all the way to the green. The hole typically plays downwind (it’s routed in the opposite direction of 12), allowing longer hitters the option of trying to reach it in two.

After a bit of exploratory analysis, we can move on to using statistical modeling to briefly investigate the differences between holes with a little more precision. Additionally, we can use a simple model to test the hypothesis that scores (in relation to par) will be lower at sectional tournaments as opposed to regional.

First, we’ll fit a linear mixed-effects model. I haven’t really worked with mixed-effects models too much since grad school, but this model shouldn’t be too hard to describe.

```
# create factor variables for tournament year and hole to make
# modeling a bit easier
hardest_holes <- c(12L, 4L, 7L, 1L, 18L,
                   15L, 10L, 3L, 13L,
                   17L, 6L, 2L, 16L,
                   8L, 14L, 9L, 5L, 11L)
tidy_scores2 <- tidy_scores %>%
  mutate(tourn_year_f = factor(tourn_year),
         hole_f = factor(hole, levels = hardest_holes))
## look at past performance to find most difficult holes
simple_mod <- lmer(rel_to_par ~ hole_f + tourn_type +
                     (1 | tourn_year_f) + (1 | tourn_year_f:name),
                   data = tidy_scores2)
```

In the model above, we’re fitting *fixed effects* for the hole variable and the tournament type variable. We’re using `(1|tourn_year_f)` to fit a random effect for the tournament year, and `(1|tourn_year_f:name)` to fit a random effect for the competitor *nested within* tournament year. A simple way to think about the fixed versus random effects divide: if we’re interested in understanding the impact of a variable on our target, we should probably fit it as a fixed effect; if we’re not that interested, a random effect is the way to go (this is a gross over-simplification; see this post if you’re looking for something more in-depth).

Alright, so we fit the model, now what?

Let’s see what a summary of the model looks like:

`summary(simple_mod)$coefficients`

```
## Estimate Std. Error t value
## (Intercept) 1.88291299 0.09070789 20.757985
## hole_f4 -0.06991525 0.06812175 -1.026328
## hole_f7 -0.18432203 0.06812175 -2.705774
## hole_f1 -0.22881356 0.06812175 -3.358891
## hole_f18 -0.40889831 0.06812175 -6.002463
## hole_f15 -0.44279661 0.06812175 -6.500077
## hole_f10 -0.44703390 0.06812175 -6.562278
## hole_f3 -0.48728814 0.06812175 -7.153194
## hole_f13 -0.53601695 0.06812175 -7.868514
## hole_f17 -0.53813559 0.06812175 -7.899615
## hole_f6 -0.54661017 0.06812175 -8.024018
## hole_f2 -0.55296610 0.06812175 -8.117321
## hole_f16 -0.61864407 0.06812175 -9.081447
## hole_f8 -0.64406780 0.06812175 -9.454657
## hole_f14 -0.67372881 0.06812175 -9.890069
## hole_f9 -0.68008475 0.06812175 -9.983371
## hole_f5 -0.68432203 0.06812175 -10.045573
## hole_f11 -0.80084746 0.06812175 -11.756119
## tourn_typesectionals -0.06353004 0.09519262 -0.667384
```

These results aren’t too interesting, but note that I set up the `hole` factor variable to compare the other holes to #12, the hardest on the course. The only hole that wasn’t significantly easier than 12 was #4.

We can also look at the estimate for the `tourn_type` variable. *Directionally*, we got the result we were expecting, i.e. sectionals have lower scores relative to par than regionals on average. However, the coefficient is not significant (rule of thumb: \(|t| < 2\)) and its effect size is small, so although the directional effect is negative, we’d have a hard time concluding that the effect is really that meaningful.

Oh well, let’s do something more interesting with the results from the model.

What I really want is to be able to visualize not only the difficulty of the holes (i.e. a point estimate), but also look at the uncertainty in that difficulty.

Using the solution from this CrossValidated post, we’ll use the bootstrap to generate predictions on a “dummy” data set.

```
# initial setup
dummy_table <- tidy_scores2 %>%
  select(hole_f, tourn_type) %>%
  unique()
# taken from https://stats.stackexchange.com/a/147837/99673
predFun <- function(fit) {
  predict(fit, dummy_table, re.form = NA)
}
# fit the bootstrap...takes a longish time
bb <- bootMer(simple_mod,
              nsim = 500,
              FUN = predFun,
              seed = 101)
```

After munging the data a bit, we’re left with a data set of predictions from the bootstrap.

```
# use the results from the bootstrapped model
dummy_table_boot <- cbind(dummy_table, t(bb$t)) %>%
  gather(iter, rel_to_par, -hole_f, -tourn_type) %>%
  mutate(hole_n = as.numeric(as.character(hole_f)))
dummy_table_boot %>% head
```

```
## hole_f tourn_type iter rel_to_par hole_n
## 1 1 sectionals 1 1.586058 1
## 2 1 regionals 1 1.637165 1
## 3 2 sectionals 1 1.293660 2
## 4 2 regionals 1 1.344768 2
## 5 3 sectionals 1 1.262038 3
## 6 3 regionals 1 1.313146 3
```

We can use these results to visualize the distributions of predictions.

We see that for both regional and sectional tournaments, holes 1, 4, 7, and 12 were much more difficult than the rest. We can also see that scores relative to par were slightly higher for regional tournaments, and that scores tended to vary a bit more in regional tournaments (the boxes for regionals tend to be a bit longer).
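The boxplot code itself is omitted from the post, but a sketch of this kind of visualization might look like the following, using made-up data shaped like the `dummy_table_boot` object built above:

```r
library(ggplot2)

# made-up data shaped like dummy_table_boot, for illustration only
set.seed(101)
boot_demo <- data.frame(
  hole_f = factor(rep(1:3, each = 40)),
  tourn_type = rep(c('regionals', 'sectionals'), 60),
  rel_to_par = rnorm(120, mean = 1.2, sd = 0.1)
)

p <- ggplot(boot_demo, aes(x = hole_f, y = rel_to_par, fill = tourn_type)) +
  geom_boxplot() +
  labs(x = 'Hole', y = 'Predicted Score Relative to Par', fill = '')
p
```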

As I’ll mention in the conclusion of this post, I think we could dive a bit deeper into the mixed-effects models and potentially investigate interaction effects, or build some more interesting models using feature engineering, but this is as far as I want to go for this post.

Finally, we’re going to take a look at what separates the golfers that broke 90 from the rest of the field.

To make the next visualization, we’ll use a little bit of `map_df` magic from the **purrr** package to make a plot that is a cousin of the q-q plot.

```
seq(0, 1, .025) %>%
  map_df(~ untidy_scores %>%
           group_by(tourn_type, 'p' = .x) %>%
           summarise('perc' = quantile(tot, .x))) %>%
  ggplot(aes(p, perc, colour = tourn_type)) +
  geom_point(size = 2) +
  scale_x_continuous('Percentile (lower is better)', labels = percent) +
  scale_y_continuous(breaks = seq(70, 160, 10),
                     name = 'Final Score') +
  scale_colour_brewer(palette = 'Set1',
                      name = '') +
  ggtitle('Final Score Percentile Plot by Tournament Type')
```

Breaking 90 is generally a sign of a fairly accomplished high school golfer. In the plot above, we can see that the 90 mark is right around the 30th percentile for both the regional and sectional tournaments. This means that about 30% of the competitors scored 90 or lower. One other interesting insight from this chart is that the performance between the regional and sectional tournament is fairly similar (the scores by percentile are pretty close) except for between the 50th and 75th percentiles (the red dots tend to rise a bit above the blue). It could be interesting to dig into this divergence a bit more in a follow-up analysis.

To explore the difference between those that broke 90 (30% of the field) from those that didn’t, we’ll fit a simple linear model. We’ll use the results of the model to make predictions on a dummy data set and then look at the differences between predicted scores relative to par for each hole for the top 30% and the bottom 70%. We use a linear model to give our analysis a bit more precision than using simple averages across the data. We could also return to this analysis and investigate the coefficients or add complexity to the model if appropriate.

```
tidy_scores3 <- tidy_scores2 %>%
  mutate(broke_90 = ifelse(tot < 90, 'Broke 90', 'Did not Break 90'))
# linear model with effects for hole, broke_90 indicator
# and interaction between hole and broke_90
fit2 <- lm(rel_to_par ~ hole_f * broke_90, data = tidy_scores3)
dummy_table2 <- tidy_scores3 %>%
  select(hole_f, broke_90) %>%
  unique()
cbind(dummy_table2,
      predict(fit2, newdata = dummy_table2, interval = 'prediction')) %>%
  select(hole_f, broke_90, fit) %>%
  spread(broke_90, fit) %>%
  mutate(diff_avg = `Did not Break 90` - `Broke 90`) %>%
  ggplot(aes(reorder(hole_f, -diff_avg), diff_avg)) +
  geom_col() +
  xlab('Hole (ordered by average difference)') +
  ylab('Average Stroke Improvement from Bottom 70% to Top 30%') +
  ggtitle('Where do the Better Golfers Shine?',
          subtitle = paste0('The golfers finishing in the top 30% tended to',
                            ' perform better on the more difficult holes.'))
```

The chart above demonstrates two clear reasons why the top 30% fared better than the rest:

- They played the hardest holes better
  - The top 30% were more than a stroke better than the bottom 70% on holes 1, 4, 7, and 12
- They took advantage of the easiest holes
  - The top 30% were also more than a stroke better than the bottom 70% on hole 11 (the easiest hole on the course)

The results from this quick analysis show that the better players **take advantage of the easiest holes** and **minimize their mistakes on the hardest holes**.

In this blog post I talked about tidy data and used data analysis to inform decisions on the golf course. Of course, data science can be used to do more than just guide golfers to lower scores, but I thought it was an interesting application.

The four toughest holes were 1, 4, 7, and 12. Each of these holes requires thought off the tee, and some have challenging approach shots to the green. We saw that players who broke 90 tended to outperform their higher-scoring counterparts on these holes by more than a stroke.

The easiest hole was number 11, a straightforward par 5 with little trouble, a generous fairway, and usually a helping wind. This hole had the highest percentage of birdies on the course, and the top golfers took advantage of it: the top 30% played it more than a stroke better than the bottom 70%.

It was difficult to identify clear differences between the results from regional and sectional tournaments, but we did see some divergence in final scores between the 50th through 75th percentiles.

We can use the results from this analysis to guide golfers’ strategy a bit. First, they should take advantage of hole 11, as there is almost no risk to being aggressive there. Second, they should think carefully and formulate a game plan for the tee shots on 1, 4, 7, and 12. These holes play as the most difficult, and much of the challenge comes from the tee shot. Finally, it’s always a good idea to remember that you’re playing golf: it’s a game, and it’s supposed to be fun!

In this blog post I used some data science techniques to explore an interesting data set. I think this analysis could be expanded to include more statistical modeling aided by some feature engineering (are there certain characteristics of holes we should investigate? what about player characteristics?). We could also dig deeper into the top 30% and try to determine differences *within* that group, to find commonalities among the best of the best. Finally, it might be interesting to use data from a weather service to identify which years had difficult conditions, and estimate a rain or wind effect on final scores.

Thanks for reading this post and feel free to leave a comment below if you have any thoughts or feedback!

This post will provide an `R` code-heavy, math-light introduction to selecting the \(k\) in kmeans. It presents the main idea of kmeans, demonstrates how to fit a kmeans model in `R`, describes some components of the kmeans fit, and displays some methods for selecting `k`. In addition, the post provides some helpful functions which may make fitting kmeans a bit easier.

kmeans clustering is an example of unsupervised learning, where we do not have an output we’re explicitly trying to predict. We may have reasons to believe that there are latent groups within a dataset, so a clustering method can be a useful way to explore and describe pockets of similar observations within a dataset.

Here’s basically what kmeans (the algorithm) does (taken from K-means Clustering):

1. Select K centroids (K rows chosen at random)
2. Assign each data point to its closest centroid
3. Recalculate the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
4. Reassign data points to their closest centroids
5. Continue steps 3 and 4 until the observations are not reassigned or the **maximum number of iterations** (`R` uses 10 as a default) is reached.
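The steps above can be sketched in a few lines of `R`. This is a minimal, illustrative implementation only (no multiple starts, no empty-cluster handling) — use the built-in `kmeans()` for real work:

```r
simple_kmeans <- function(x, k, max_iter = 10){
  x <- as.matrix(x)
  # step 1: choose k rows at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assign_clusters <- function(cents){
    # squared euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(x) - cents[j, ])^2))
    max.col(-d) # index of the closest centroid
  }
  # step 2: first assignment
  cluster <- assign_clusters(centroids)
  for(i in seq_len(max_iter)){
    # step 3: recalculate centroids as p-length mean vectors
    centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # step 4: reassign points to their closest centroid
    new_cluster <- assign_clusters(centroids)
    if(identical(new_cluster, cluster)) break # converged: no reassignments
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centroids)
}
set.seed(42)
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
```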

Here it is in gif form (taken from k-means clustering in a GIF):

As a statistician, I have a hard time avoiding balls and urns when describing statistical concepts.

Suppose we have \(n\) balls, and each ball has \(p\) characteristics, like \(shape\), \(size\), \(density\), \(\ldots\), and we want to put those \(n\) balls into \(k\) urns (clusters) according to the \(p\) characteristics.

First, we randomly select \(k\) balls (\(balls_{init}\)), and assign the rest of the balls (\(n-k\)) to whichever \(balls_{init}\) it is closest to. After this first assignment, we calculate the centroid of each (\(k\)) collection of balls. The centroids are the averages of the \(p\) characteristics of the balls in each cluster. So, for each cluster, there will be a vector of length \(p\) with the means of the characteristics of the balls *in that cluster*.

After the calculation of the centroid, we then calculate (for each ball) the distances between its \(p\) characteristics and the centroids for each cluster. We assign the ball to the cluster with the centroid it is closest to. Then, we recalculate the centroids and repeat the process. We leave the number of clusters (\(k\)) fixed, but we allow the balls to move between the clusters, depending on which cluster they are closest to.

Either the algorithm will “converge” and between time \(t\) and \(t+1\) no reassignments will occur, or we’ll reach the maximum number of iterations allowed by the algorithm.

## `kmeans` in `R`

Here’s how we use the `kmeans` function in `R`:

`kmeans(x, centers, iter.max, nstart)`

- `x` is our data
- `centers` is the **k** in kmeans
- `iter.max` controls the **maximum number of iterations**; if the algorithm has not converged, it’s good to bump this number up
- `nstart` controls the number of initial configurations (step 1 in the algorithm); bumping this number up is a good idea, since kmeans tends to be sensitive to initial conditions (which may remind you of sensitivity to initial conditions in chaos theory)
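Putting the arguments together, a minimal call might look like this (using the built-in `iris` measurements as a stand-in dataset, since the housing data is introduced later):

```r
set.seed(123)
# k = 3 clusters, 20 random starting configurations, at most 25 iterations each
km <- kmeans(x = iris[, 1:4], centers = 3, nstart = 20, iter.max = 25)
km$size # how many observations landed in each cluster
```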

### Values it returns

`kmeans` returns an object of class “kmeans” which has a `print` and a `fitted` method. It is a list with at least the following components:

**cluster** - A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

**centers** - A matrix of cluster centers (these are the centroids for each cluster).

**totss** - The total sum of squares.

**withinss** - Vector of within-cluster sum of squares, one component per cluster.

**tot.withinss** - Total within-cluster sum of squares, i.e. sum(withinss).

**betweenss** - The between-cluster sum of squares, i.e. totss-tot.withinss.

**size** - The number of points in each cluster.
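A quick sanity check of how these components fit together (on a small simulated matrix, purely for illustration):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2) # 100 observations, 2 variables
fit <- kmeans(x, centers = 3, nstart = 20)
# the sums of squares decompose: totss = tot.withinss + betweenss
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
sum(fit$size) == nrow(x) # every point is assigned to exactly one cluster
```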

To use `kmeans`, we first need to specify the `k`. How should we do this?

## Data

For this post, we’ll be using the Boston housing dataset, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, MA.

```
## crim zn indus nox rm age dis rad tax ptratio black lstat medv
## 1 0.00632 18 2.31 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
## 2 0.02731 0 7.07 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
## 3 0.02729 0 7.07 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
## 4 0.03237 0 2.18 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
## 5 0.06905 0 2.18 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
## 6 0.02985 0 2.18 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
```

Use a scree plot to visualize the reduction in within-cluster error:

```
ss_kmeans <- t(sapply(2:14,
                      FUN = function(k)
                        kmeans(x = b_housing,
                               centers = k,
                               nstart = 20,
                               iter.max = 25)[c('tot.withinss', 'betweenss')]))
plot(2:14, unlist(ss_kmeans[,1]), xlab = 'Clusters', ylab = 'Within Cluster SSE')
```

When we look at the scree plot, we’re looking for the “elbow”. We can see the SSE dropping, but at some point the rapid decline levels off. At which number of clusters does the drop stop being abrupt?

Stated more verbosely from Wikipedia:

> The elbow method looks at the percentage of variance explained as a function of the number of clusters: One should choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”. This “elbow” cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within group variance.

We can get the percentage of variance explained by typing:

```
tot.ss <- sum(apply(b_housing, 2, var)) * (nrow(b_housing) - 1)
var_explained <- unlist(ss_kmeans[,2]) / tot.ss
plot(2:14, var_explained, xlab = 'Clusters', ylab = '% of Total Variation Explained')
```

Where does the elbow occur in the above plot? That’s pretty subjective (a common theme in unsupervised learning), but for our task we would prefer to have \(\leq 10\) clusters, probably.

We could also opt for the AIC, which balances how well the clusters fit the data against how many clusters are in the final fit. The general rule with AIC is that lower values are better.

First, we define a function which calculates the AIC from the output of `kmeans`.

```
kmeansAIC <- function(fit){
  m = ncol(fit$centers) # number of variables
  k = nrow(fit$centers) # number of clusters
  D = fit$tot.withinss  # total within-cluster sum of squares
  return(D + 2*m*k)     # penalize by the total number of estimated centroid values
}
```

```
aic_k <- sapply(2:14, FUN =
                  function(k)
                    kmeansAIC(kmeans(b_housing, centers = k, nstart = 20, iter.max = 25)))
plot(2:14, aic_k, xlab = 'Clusters', ylab = 'AIC from kmeans')
```

Look familiar? It is remarkably similar to looking at the SSE. This is because the main component in calculating AIC is the within-cluster sum of squared errors. Once again, we’re looking for an elbow in the plot, indicating that the decrease in AIC is not happening so rapidly.

BIC is related to AIC in that BIC is AIC’s conservative cousin. When we evaluate models using BIC rather than AIC as our metric, we tend to select smaller models. The calculation of BIC is rather similar to that of AIC (we replace the 2 in the AIC calculation with `log(n)`):

```
kmeansBIC <- function(fit){
  m = ncol(fit$centers)
  n = length(fit$cluster)
  k = nrow(fit$centers)
  D = fit$tot.withinss
  return(D + log(n) * m * k) # using log(n) instead of 2, penalize model complexity
}
```

```
bic_k <- sapply(2:14, FUN =
                  function(k)
                    kmeansBIC(kmeans(b_housing, centers = k, nstart = 20, iter.max = 25)))
plot(2:14, bic_k, xlab = 'Clusters', ylab = 'BIC from kmeans')
```

Once again, the plots are rather similar for this toy example.

## `kmeans2`

We can wrap all the previous parts together in a function to get a broad look at the fit of `kmeans`.

We’ll fit `kmeans` across a range of centers (`center_range`). Using the results from these fits, we’ll look at AIC, BIC, within-cluster variation, and the % of total variation explained. We can choose to spit out a table to the user (`plot = FALSE`) or plot each of the four metrics by the number of clusters (`plot = TRUE`).

```
kmeans2 <- function(data, center_range, iter.max, nstart, plot = TRUE){
  # fit kmeans for each center
  all_kmeans <- lapply(center_range,
                       FUN = function(k)
                         kmeans(data, centers = k, iter.max = iter.max, nstart = nstart))
  # extract AIC from each
  all_aic <- sapply(all_kmeans, kmeansAIC)
  # extract BIC from each
  all_bic <- sapply(all_kmeans, kmeansBIC)
  # extract tot.withinss
  all_wss <- sapply(all_kmeans, FUN = function(fit) fit$tot.withinss)
  # extract between ss
  btwn_ss <- sapply(all_kmeans, FUN = function(fit) fit$betweenss)
  # extract total sum of squares
  tot_ss <- all_kmeans[[1]]$totss
  # put in data.frame
  clust_res <-
    data.frame('Clusters' = center_range,
               'AIC' = all_aic,
               'BIC' = all_bic,
               'WSS' = all_wss,
               'BSS' = btwn_ss,
               'TSS' = tot_ss)
  # plot or no plot?
  if(plot){
    par(mfrow = c(2,2))
    with(clust_res, {
      plot(Clusters, AIC)
      plot(Clusters, BIC)
      plot(Clusters, WSS, ylab = 'Within Cluster SSE')
      plot(Clusters, BSS / TSS, ylab = 'Prop of Var. Explained')
    })
  }else{
    return(clust_res)
  }
}
kmeans2(data = b_housing, center_range = 2:15, iter.max = 20, nstart = 25)
```

## A package for selecting `k`

This is `R` after all, so surely there must be at least one package to help in determining the “best” number of clusters. **NbClust** is a viable option.

```
library(NbClust)
best.clust <- NbClust(data = b_housing,
min.nc = 2,
max.nc = 15,
method = 'kmeans')
```

```
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
```

```
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 12 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
```

**NbClust** returns a big object with some information that may or may not be useful for this use case (I stored the rest of the output in `best.clust`, but the package still spit out a bunch of stuff). But it does tell you the best number of clusters as selected by a slew of indices. The function must iterate through all the possible clusters from `min.nc` to `max.nc`, so you may want to find a *reasonable* range for `min.nc` and `max.nc` before resorting to the `NbClust` function. If you know that 3 clusters won’t be enough, don’t make `NbClust` even consider it as an option.

There’s also an argument called `index` in the `NbClust` function. This value controls which indices are used to determine the best number of clusters. The calculation methods differ between indices, and if your data isn’t so nice (e.g. variables with few unique values), the function may fail. The default value is `all`, which is a collection of 30 (!) indices all used to help determine the best number of clusters.

It may be helpful to try different indices such as `tracew`, `kl`, `dindex`, or `duda`. Unfortunately, you’ll need to specify only one index for each `NbClust` call (unless you use `index = 'all'` or `index = 'alllong'`).

For more details look at the function’s documentation.

This function, `kmeans_viz`, helps to visualize the centroids for each cluster, which can aid in interpreting the clusters.

The arguments for this function are `fit`, `levels`, and `show_N`:

- `fit`: object returned from a call to `kmeans`
- `levels`: a character vector representing the levels of the variables in the data set used to fit `kmeans`; this vector allows a user to control the order in which variables are plotted
- `show_N`: a logical value; if TRUE, the plot will contain information about the size of each cluster, if FALSE, a table of counts will be printed prior to the plot

To use the `levels` argument, the character vector you supply must have the same number of elements as the number of unique variables in the data set used to fit the `kmeans`. If you specify `levels = c('a','b','c')`, the plotting device will display (from top to bottom) `'c','b','a'`. If you are not satisfied with the plotting order, the `rev` function may come in handy.
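For example (with hypothetical variable names, and the top-to-bottom behavior as described above):

```r
# the order you want to see, top of the plot first
desired_top_to_bottom <- c('medv', 'rm', 'crim')
# the plot draws the supplied levels bottom-up, so pass the reversed vector
plot_levels <- rev(desired_top_to_bottom)
plot_levels
```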

```
kmeans_viz <- function(fit, levels = NULL, show_N = TRUE){
  require(ggplot2)
  require(dplyr)
  # extract number of clusters
  clusts <- length(unique(fit$cluster))
  # centroids
  kmeans.table <- as.data.frame(t(fit$centers), stringsAsFactors = FALSE)
  # variable names
  kmeans.table$Variable <- row.names(kmeans.table)
  # name clusters
  names(kmeans.table)[1:clusts] <- paste0('cluster', 1:clusts)
  # reshape from wide table to long (makes plotting easier)
  kmeans.table <- reshape(kmeans.table, direction = 'long',
                          idvar = 'Variable',
                          varying = paste0('cluster', 1:clusts),
                          v.names = 'cluster')
  # number of observations in each cluster
  # should we show N in the graph or just print it?
  if(show_N){
    # show it in the graph
    kmeans.table$time <- paste0(kmeans.table$time,
                                ' (N = ',
                                fit$size[kmeans.table$time],
                                ')')
  }else{
    # just print it
    print(rbind('Cluster' = 1:clusts,
                'N' = fit$size))
  }
  # standardize the cluster means to make a nice plot
  kmeans.table %>%
    group_by(Variable) %>%
    mutate(cluster_stdzd = (cluster - mean(cluster)) / sd(cluster)) -> kmeans.table
  # did user specify a variable levels vector?
  if(length(levels) == length(unique(kmeans.table$Variable))){
    kmeans.table$Variable <- factor(kmeans.table$Variable, levels = levels)
  }
  # make the plot
  ggplot(kmeans.table, aes(x = Variable, y = time))+
    geom_tile(colour = 'black', aes(fill = cluster_stdzd))+
    geom_text(aes(label = round(cluster, 2)))+
    coord_flip()+
    xlab('')+ylab('Cluster')+
    scale_fill_gradient(low = 'white', high = 'grey60')+
    theme_bw()+
    theme(legend.position = 'none',
          axis.title.y = element_blank(),
          axis.title.x = element_text(size = 16),
          panel.grid = element_blank(),
          axis.text = element_text(size = 14),
          axis.ticks = element_blank())
}
opt.kmeans <- kmeans(b_housing, centers = 4, nstart = 50, iter.max = 50)
kmeans_viz(opt.kmeans)
```

## Using `kmeans` to predict

We can predict cluster membership using a few techniques. For the simple plug-and-play method, we can use the `cl_predict` function from the **clue** package. For those interested in a more manual approach, we can calculate the centroid distances for the new data and select whichever cluster is the shortest distance away.

I will demonstrate both techniques.

First, we’re going to select a subset of the Boston dataset to fit a `kmeans` on. Using the result of the `kmeans` fit on `b_hous.train`, we’ll try to predict the clusters for a “new” dataset, `b_hous.test`.

```
# rows to select
set.seed(123)
train_samps <- sample(nrow(b_housing), .7 * nrow(b_housing), replace = F)
# create training and testing set
b_hous.train <- b_housing[train_samps, ]
b_hous.test <- b_housing[-train_samps, ]
# fit our new kmeans
train.kmeans <- kmeans(b_hous.train, centers = 4, nstart = 50, iter.max = 50)
```

### `cl_predict`

The interface for getting predicted values is fairly simple.

We’re going to use `system.time` to time how long it takes `R` to do what we want it to.

```
library(clue)
system.time(
test_clusters.clue <- cl_predict(object = train.kmeans, newdata = b_hous.test)
)
```

```
## user system elapsed
## 0.02 0.00 0.02
```

`table(test_clusters.clue)`

```
## test_clusters.clue
## 1 2 3 4
## 84 15 27 26
```

The manual approach below is taken from this nice CrossValidated solution.

```
clusters <- function(x, centers) {
  # compute squared euclidean distance from each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)),
                function(i) apply(centers, 1,
                                  function(v) sum((x[i, ] - v)^2)))
  max.col(-t(tmp)) # find index of min distance
}
system.time(
  test_clusters.hand <- clusters(x = b_hous.test, centers = train.kmeans$centers)
)
```

```
## user system elapsed
## 1.81 0.00 1.81
```

`table(test_clusters.hand)`

```
## test_clusters.hand
## 1 2 3 4
## 84 15 27 26
```

`all(test_clusters.hand == test_clusters.clue) #TRUE`

`## [1] TRUE`

We see that `clusters` is slower than `cl_predict`, but they return the same result. It would be prudent to use `cl_predict`.

In this post I walked through the kmeans algorithm and its implementation in `R`. Additionally, I discussed some of the ways to select the `k` in kmeans. The process of selecting and evaluating choices of `k` will vary from project to project and depend strongly on the goals of the analysis.

It is worth noting that one of the drawbacks of kmeans clustering is that it must put *every* observation into a cluster. There may be anomalies or outliers present in a dataset, so it may not always make sense to enforce the condition that each observation is assigned to a cluster. A different unsupervised learning technique, such as dbscan (density-based spatial clustering of applications with noise) may be more appropriate for tasks in which anomaly detection is necessary. I hope to explore this technique in a future post. In the meantime, here’s an example of Netflix applying dbscan for anomaly detection.
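As a small taste of the difference, here’s a sketch using the **dbscan** package (an assumption on my part: the package must be installed, and the `eps`/`minPts` values below are chosen for this simulated data, not a general recommendation). Observations that fit no cluster get the label 0, for noise:

```r
library(dbscan)
set.seed(7)
# two tight blobs plus five far-flung outliers
blob1 <- matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 3, sd = 0.2), ncol = 2)
outliers <- cbind(c(10, 20, 30, 40, 50), c(10, 20, 30, 40, 50))
x <- rbind(blob1, blob2, outliers)
db <- dbscan(x, eps = 1, minPts = 5)
# cluster 0 is "noise" -- kmeans has no equivalent, since it assigns every point
table(db$cluster)
```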

I plan to use this website to present data explorations and analyses in a way that’s understandable to a broad audience. I hope to demonstrate the utility of applying ideas like machine learning, data visualization, and exploratory data analysis to day-to-day life to improve decision-making processes.

I was inspired to create a blog after reading this post by David Robinson.

New blog post: "Advice to aspiring data scientists: start a blog" https://t.co/yMDHqviiBN #rstats pic.twitter.com/9AdPUdbjtE

— David Robinson (@drob) November 14, 2017

I’m a big believer in treating data analysis as an iterative process, and I hope this blog will reinforce the idea that nearly anyone can learn the skills to do data analysis. I’m going to try as hard as I can to avoid the buzzwords and esoteric language that unnecessarily obfuscate data science discussions, so that this blog is accessible to an audience with varying levels of mathematical and statistical sophistication. That being said, I’ll still try to sneak a few data science nuggets for the hardcore data nerds out there!

I’ll mostly be using the **R** programming language to extract and manipulate data (you’ll find out I’m a huge `tidyverse` fan), though I may reach for `Python` from time to time.
Thanks for checking my blog out!

**Work:** I currently work as a Senior Analyst in the healthcare industry. Prior to that, I was a Senior Data Analyst in the non-profit/fundraising sector. I got my start in data through working in analytics and reporting for a private financing company. On the side, I’ve been a volunteer data scientist for a few different organizations.

**Education:** I graduated with an MS in Statistics from the University of Wisconsin in 2017 (go Badgers! 👐). Prior to that, I graduated with a BS in Mathematical Statistics and Mathematical Economics from St. Cloud State University in 2015 (go Huskies! 🐺).

**Personal:** I live in Madison, WI. I like to spend my spare time reading, playing guitar, playing golf, biking, and watching whatever Wisconsin sports team is on TV.

My email is bgstieber (at) gmail (dot) com

R-bloggers - Blog aggregator of content contributed by bloggers who write about R (in English).

Revolutions - Blog dedicated to news and information of interest to members of the R community.

Statistical Modeling, Causal Inference, and Social Science

eagereyes - Robert Kosara’s website, mostly about data visualization

Julia Silge’s blog - Great blog about a wide array of data science topics

Variance Explained - David Robinson’s (the data scientist, not the basketball player) blog on data science.

FlowingData - Nathan Yau’s blog, mostly about data visualization

Analytics Vidhya - Good blog with data science tutorials

James D. McCaffrey’s blog - Great DS/AI/ML blog. Dr. McCaffrey posts nearly everyday, and his content covers a wide breadth of material. A great blog to read for generating new ideas.