My Year in Books: Goodreads Data Analysis in R

In 2020, I set a goal of reading 30 books. Aided by a last-minute charge, I hit that number, finishing my 30th book on December 31st.

As I was finishing up my year of reading, I started thinking about some of the statistics of my year in books:

  • On average, how many pages did I read per day?
  • Did I have any slumps during the year? If so, could the slumps be explained?
  • What would be a reasonable reading goal for 2021?

I tracked all of my books using Goodreads, so I started poking around on the Goodreads website to see if I could access my library.

I used the tidyverse, lubridate, and scales packages in this analysis. You can find the code for this post on my GitHub.

Getting the Data

Getting Goodreads data isn’t too difficult. They have a great export tool, and if you follow this link, you can export your library. If you have a lot of books in your library, the export can take a long time. The data export comes with 31 columns.

For this analysis, the columns I’m interested in are Date Read, My Rating (what I rated the book, 0-5 stars), Average Rating, Number of Pages, and Original Publication Year. I added my data to a GitHub repository.

One thing that’s missing from the Goodreads data export is the description of the book. I wrote a Python script that uses BeautifulSoup to scrape Goodreads for this information. I don’t use it in this post, but I could see using it in a different post down the road.

The data from Goodreads is mostly good to go, but there are a few tweaks to make before getting started.

# packages used in this post
library(tidyverse)
library(lubridate)
library(scales)

gh_link1 <- "https://github.com/bgstieber/files_for_blog/raw/master/"
gh_link2 <- "goodreads-data-analysis/goodreads_library_export.csv"

goodreads_data <- read_csv(paste0(gh_link1, gh_link2)) %>%
  # fix issue with data export for a book
  mutate(`Number of Pages` = ifelse(grepl("Be a Player", Title),
                                    256, `Number of Pages`))

books_2020 <- goodreads_data %>%
  # only 2020 books
  filter(year(`Date Read`) == 2020) %>%
  # create rating_diff and publish_year columns
  mutate(rating_diff = `My Rating` - `Average Rating`,
         publish_year = coalesce(`Original Publication Year`,
                                  `Year Published`)) %>%
  # clean some column names
  rename(date_read = `Date Read`,
         page_count = `Number of Pages`,
         avg_rating = `Average Rating`,
         my_rating = `My Rating`) %>%
  # add when the previous book was finished, sort then lag
  arrange(date_read) %>%
  mutate(previous_book_date = lag(date_read))

For this analysis, I assume that I read only one book at a time (not always true) and that I start reading a book immediately after I finish the previous one (not always true either).
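Under those two assumptions, the time spent on a book is just the gap between its finish date and the previous book’s finish date. Here is a minimal sketch of that calculation on a toy table. The titles, dates, and page counts are made up, the column names `start_date`, `days_to_read`, and `pages_per_day` are hypothetical, and starting the first book on January 1st is an extra assumption of this sketch rather than something taken from the real data.

```r
library(dplyr)

# Toy stand-in for books_2020; every value below is invented for illustration.
toy_books <- tibble(
  Title = c("Book A", "Book B", "Book C"),
  date_read = as.Date(c("2020-01-10", "2020-01-20", "2020-02-04")),
  page_count = c(300, 250, 450)
) %>%
  arrange(date_read) %>%
  mutate(previous_book_date = lag(date_read))

toy_pace <- toy_books %>%
  mutate(
    # the first book has no previous finish date; assume a January 1st start
    start_date = coalesce(previous_book_date, as.Date("2020-01-01")),
    # days between finishing the previous book and finishing this one
    days_to_read = as.numeric(difftime(date_read, start_date, units = "days")),
    pages_per_day = page_count / days_to_read
  )

toy_pace
```

With the toy values, the three books take 9, 10, and 15 days. Any time spent not reading between books gets counted against the next book, which is exactly the simplification described above.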

2021 Goal

This post has mostly been an exploratory analysis of my Goodreads data. To make it actionable, let’s focus on setting a data-driven reading goal for 2021.

To start, let’s look at the average number of pages I was reading throughout the year.

I was reading at a pretty consistent pace in the beginning of the year, declined sharply during the warm summer months, and then picked back up at the end of the year.
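One rough way to see this kind of seasonal pattern is to summarise pace by the month in which each book was finished. The sketch below uses invented numbers and a hypothetical `days_to_read` column. It also assigns all of a book’s reading days to the month it was finished in, which is a simplification.

```r
library(dplyr)
library(lubridate)

# Toy per-book data; every value here is invented for illustration.
toy_pace <- tibble(
  date_read = as.Date(c("2020-01-15", "2020-01-30", "2020-07-20", "2020-12-31")),
  page_count = c(300, 330, 280, 310),
  days_to_read = c(14, 15, 40, 10)
)

monthly_pace <- toy_pace %>%
  # month (1-12) in which each book was finished
  mutate(month_read = month(date_read)) %>%
  group_by(month_read) %>%
  summarise(books = n(),
            pages = sum(page_count),
            pages_per_day = sum(page_count) / sum(days_to_read))

monthly_pace
```

In the toy data, the single July book drags that month’s pace down to 7 pages per day, which is the kind of dip a summer slump would produce.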

On average, it took me about 12.2 days to finish a book in 2020. I read at a pace of about 28.9 pages per day.

There were a few clear outliers with respect to reading pace throughout the year. I read two novels (The Remains of the Day and Never Let Me Go, both by Kazuo Ishiguro) very slowly, taking 43 and 28 days to finish those books, respectively. I also read two books at a very fast pace (Red Queen and The Art of Solitude), where I was reading at a pace of 76.6 and 66.7 pages per day, respectively.

If we eliminate those four books, we’re left with a set of books that more closely reflects my typical or baseline reading pace. Looking at the remaining 26 books, I was reading at a pace of about 32.9 pages per day, taking about 11 days to finish a book.
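Dropping a handful of named books and recomputing the averages can be sketched like this. The data is a toy example with invented values, and `outlier_titles` is a hypothetical hand-picked vector; in the real analysis it would hold the four titles mentioned above.

```r
library(dplyr)

# Toy per-book data; titles, page counts, and day counts are invented.
toy_pace <- tibble(
  Title = c("Book A", "Book B", "Book C", "Book D", "Book E"),
  page_count = c(300, 320, 260, 450, 280),
  days_to_read = c(10, 12, 40, 6, 11)
) %>%
  mutate(pages_per_day = page_count / days_to_read)

# hypothetical hand-picked outliers: one very slow read, one very fast read
outlier_titles <- c("Book C", "Book D")

baseline <- toy_pace %>%
  filter(!Title %in% outlier_titles) %>%
  summarise(n_books = n(),
            avg_days = mean(days_to_read),
            avg_pages_per_day = mean(pages_per_day))

baseline
```

Picking outliers by name keeps the choice transparent. A rule based on quantiles of `days_to_read` or `pages_per_day` would be an alternative, but with only 30 books a manual pick is easy to justify.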

Using the pace of 11 days to finish a book, I could create a goal of reading 365/11 = 33.2 books in 2021. Rounding up, I’ll set a goal of 34 books in 2021.

This represents an increase of 13% over my goal last year, which seems pretty reasonable based on this analysis.
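The arithmetic behind the goal is short enough to check in a few lines of base R. The variable names are hypothetical; the inputs (11 days per book, a 2020 goal of 30) come from the text above.

```r
days_per_book <- 11   # baseline pace, in days per book
goal_2020 <- 30       # last year's goal

books_possible <- 365 / days_per_book                # about 33.2
goal_2021 <- ceiling(books_possible)                 # round up to a whole book: 34
pct_increase <- (goal_2021 - goal_2020) / goal_2020  # about 0.133, i.e. 13%

round(books_possible, 1)
goal_2021
round(100 * pct_increase)
```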

Wrapping Up

In 2020, I set a goal to finish 30 books. On December 31st, I finished The Art of Solitude and completed my reading goal. I explored my Goodreads data to summarize my year in books:

  • I read a total of 10,536 pages in 2020; the average length of a book I read was 351.2 pages
  • I read at a pace of 28.9 pages per day
  • On average it took me about 12.2 days to complete each book
  • The longest it took me to finish a book was 43 days (Never Let Me Go); my shortest read time was 3 days (The Art of Solitude)
  • My average rating was 4 stars; the average Goodreads rating of the books I read was 4.1 stars

I also used the Goodreads data to set a data-driven reading goal for 2021. I hope to increase my reading by 13% in 2021 by finishing 34 books.

This was a fun way to look back on my year in books for 2020. There are a few aspects of this data that I could still look into, like the distribution of genres, the text summary of each book, and text reviews from other Goodreads users. That analysis will have to wait for another day!

Happy reading!

