In 2020, I set a goal of reading 30 books. Aided by a last-minute charge, I managed to hit this number, finishing my 30th book on December 31st.
As I was finishing up my year of reading, I started thinking about some of the statistics of my year in books:
- On average, how many pages did I read per day?
- Did I have any slumps during the year? If so, could the slumps be explained?
- What would be a reasonable reading goal for 2021?
I tracked all of my books using Goodreads, so I started poking around on the Goodreads website to see if I could access my library.
I used the tidyverse, lubridate, and scales packages in this analysis. You can find the code for this post on my GitHub.
Getting the Data
Getting Goodreads data isn’t too difficult. They have a great export tool, and if you follow this link, you can export your library. If you have a lot of books in your library, the export can take a long time. The data export comes with 31 columns.
For this analysis, the columns I’m interested in are Date Read, My Rating (what I rated the book, 0-5 stars), Average Rating, Number of Pages, and Original Publication Year. I added my data to a GitHub repository.
One thing that’s missing from the Goodreads data export is the description of the book. I wrote a Python script that uses BeautifulSoup to scrape Goodreads for this information. I don’t use it in this post, but I could see using it in a different post down the road.
The data from Goodreads is mostly good to go, but there are a few tweaks to make before getting started.
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # year()
library(scales)     # percent()

gh_link1 <- "https://github.com/bgstieber/files_for_blog/raw/master/"
gh_link2 <- "goodreads-data-analysis/goodreads_library_export.csv"

goodreads_data <- read_csv(paste0(gh_link1, gh_link2)) %>%
  # fix issue with data export for a book
  mutate(`Number of Pages` = ifelse(grepl("Be a Player", Title),
                                    256, `Number of Pages`))

books_2020 <- goodreads_data %>%
  # only 2020 books
  filter(year(`Date Read`) == 2020) %>%
  # create rating_diff and publish_year columns
  mutate(rating_diff = `My Rating` - `Average Rating`,
         publish_year = coalesce(`Original Publication Year`,
                                 `Year Published`)) %>%
  # clean some column names
  rename(date_read = `Date Read`,
         page_count = `Number of Pages`,
         avg_rating = `Average Rating`,
         my_rating = `My Rating`) %>%
  # add when the previous book was finished, sort then lag
  arrange(date_read) %>%
  mutate(previous_book_date = lag(date_read))
For this analysis, I make the assumption that I read only one book at a time (not always true), and that I start reading a book immediately after I finish the previous one (not always true either).
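Under those two assumptions, the read time for a book is simply the gap between its finish date and the previous book's finish date. Here is a minimal base-R sketch of that idea. The `toy_books` data frame and its values are made up for illustration, and starting the first book on January 1st is an assumption as well.

```r
# Hypothetical toy data: three books with finish dates and page counts
toy_books <- data.frame(
  title      = c("Book A", "Book B", "Book C"),
  date_read  = as.Date(c("2020-01-11", "2020-01-25", "2020-02-04")),
  page_count = c(300, 280, 450)
)

# Sort by finish date, then use the previous finish date as the start date
toy_books <- toy_books[order(toy_books$date_read), ]
year_start <- as.Date("2020-01-01")  # assumption: first book started Jan 1
toy_books$previous_book_date <- c(year_start, head(toy_books$date_read, -1))

# Days spent on each book and the implied reading pace
toy_books$days_to_read <- as.numeric(toy_books$date_read -
                                       toy_books$previous_book_date)
toy_books$pages_per_day <- toy_books$page_count / toy_books$days_to_read

toy_books[, c("title", "days_to_read", "pages_per_day")]
```

If two books overlap in reality, this approach attributes all of the elapsed days to whichever book was finished second, which is the main weakness of the one-book-at-a-time assumption.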
Trends and Analysis
Here is the timeline of my year in books:
Sometimes the most basic data visualizations present the most compelling information.
Here are a few things that stood out to me:
- My sprint at the end of the year to hit my reading goal
- A few books with longer read times: The Remains of the Day, Never Let Me Go, and Be a Player: How to Become a Better Golfer Every Time You Play (to a lesser extent)
- These will come up again in calculating my 2021 goal
- Apart from the few books mentioned above, I had pretty consistent read times for my 2020 books. What might be driving this?
In the code below, I create a data.frame with cumulative pages and books read by date.
summary_by_date <- books_2020 %>%
  group_by(date_read, Title) %>%
  summarise(pages = sum(page_count),
            books = n()) %>%
  ungroup() %>%
  # add dummy data for beginning of year
  bind_rows(tibble(date_read = as.Date("2020-01-01"),
                   Title = NA_character_,
                   pages = 0,
                   books = 0)) %>%
  arrange(date_read) %>%
  mutate(previous_date = lag(date_read)) %>%
  mutate(days_since_last_book = as.numeric(difftime(
    date_read, previous_date, units = "days"
  ))) %>%
  mutate(cumu_pages_read = cumsum(pages),
         cumu_books_read = cumsum(books))
Using this data, I can look at my progress toward 30 books through the year.
My reading certainly slowed down during the summer months, mostly because I was doing other things during a beautiful Wisconsin summer, like playing golf and riding my bike. Between January and May, I read an average of 39.6 pages per day; between June and September, about 14.8 pages per day; and from October through the end of the year, 31.3 pages per day.
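One way to get per-period figures like these is to sum the pages of the books finished in each part of the year and divide by the number of calendar days in that part. The base-R sketch below uses made-up finish dates and page counts (the `toy` data frame is hypothetical), and it is only one possible method, not necessarily the one behind the numbers above.

```r
# Hypothetical finish dates and page counts (not real library data)
toy <- data.frame(
  date_read  = as.Date(c("2020-02-10", "2020-04-20",
                         "2020-07-15", "2020-11-30")),
  page_count = c(3000, 3080, 1830, 2760)
)

period_labels <- c("Jan-May", "Jun-Sep", "Oct-Dec")

# Label each book by the part of the year in which it was finished
toy$period <- cut(as.integer(format(toy$date_read, "%m")),
                  breaks = c(0, 5, 9, 12), labels = period_labels)

# Count the calendar days in each period of 2020 (a leap year)
all_days <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
day_period <- cut(as.integer(format(all_days, "%m")),
                  breaks = c(0, 5, 9, 12), labels = period_labels)
days_in_period <- table(day_period)

# Pages finished in each period divided by the days in that period
pages_in_period <- tapply(toy$page_count, toy$period, sum)
pages_per_day <- pages_in_period / as.numeric(days_in_period)
round(pages_per_day, 1)
```

Assigning a book's pages entirely to the period in which it was finished is a simplification; a book that straddles two periods would ideally be split between them.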
For most of the year, I had a fairly consistent book-finishing pace. I think a lot of this can be explained by choosing shorter books in 2020: 70% of the books I read this year were fewer than 400 pages long.
Another interesting aspect of the books I read in 2020 was that they were mostly modern. 80% of the books I read in 2020 were published in 1990 or later.
books_2020 %>%
  ggplot(aes(publish_year))+
  geom_bar()+
  xlab("Year Published")+
  ylab("Books")+
  ggtitle("When were my 2020 reads published?",
          subtitle = paste0(percent(mean(books_2020$publish_year >= 1990)),
                            " of books I read in 2020 were published ",
                            "in 1990 or later."))
The oldest book I read was The House of Mirth by Edith Wharton, published in 1905. The most recent book I read was The Art of Solitude by Stephen Batchelor, published in 2020.
Finally, let’s take a look at how my rating of a book compared to the average rating from other Goodreads users.
books_2020 %>%
  # shorten long titles so the axis labels stay readable
  mutate(title_abbrev =
           ifelse(nchar(Title) > 60,
                  paste0(substr(Title, 1, 60), "..."),
                  Title)) %>%
  ggplot(aes(reorder(title_abbrev, rating_diff),
             rating_diff,
             fill = factor(my_rating)))+
  geom_col(colour = "black")+
  coord_flip()+
  scale_fill_viridis_d("My Rating", option = "cividis")+
  xlab("")+
  ylab("My Rating - Goodreads Avg")+
  theme(legend.position = "top",
        axis.text.y = element_text(size = 8))+
  ggtitle("My Rating Versus the Goodreads Average")
My average rating in 2020 was 4, while the average Goodreads rating of the books I read in 2020 was 4.1. I gave 9 books 3 stars, 11 books 4 stars, and 10 books 5 stars.
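As a quick check, the rating counts above can be expanded back into a vector of 30 ratings and averaged. This small base-R sketch uses only the counts stated in this post; the variable name `my_ratings` is just for illustration.

```r
# Rating counts as stated above: 9 three-star, 11 four-star, 10 five-star
my_ratings <- rep(c(3, 4, 5), times = c(9, 11, 10))

length(my_ratings)          # 30 books in total
table(my_ratings)           # number of books at each star rating
round(mean(my_ratings), 1)  # average rating, which rounds to 4
```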
2021 Goal
This post has mostly been an exploratory analysis of my Goodreads data. To make it actionable, let’s focus on setting a data-driven reading goal for 2021.
To start, let’s look at the average number of pages I was reading throughout the year.
My pace was pretty consistent at the beginning of the year, declined sharply during the warm summer months, and then picked back up at the end of the year.
On average, it took me about 12.2 days to finish a book in 2020. I read at a pace of about 28.9 pages per day.
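These two averages follow directly from the yearly totals. The sketch below reproduces them from the 30 books and the 10,536 total pages reported in the summary at the end of this post. Using 365 days (January 1st through December 31st) as the time span is an assumption.

```r
total_pages  <- 10536  # total pages read in 2020, from the summary below
books_read   <- 30
days_in_span <- 365    # assumption: days from January 1st to December 31st

days_per_book  <- days_in_span / books_read   # about 12.2 days per book
pages_per_day  <- total_pages / days_in_span  # about 28.9 pages per day
pages_per_book <- total_pages / books_read    # 351.2 pages per book

round(c(days_per_book, pages_per_day, pages_per_book), 1)
```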
There were a few clear outliers with respect to reading pace throughout the year. I read two novels (The Remains of the Day and Never Let Me Go, both by Kazuo Ishiguro) very slowly, taking 43 and 28 days to finish those books, respectively. I also read two books at a very fast pace (Red Queen and The Art of Solitude), where I was reading at a pace of 76.6 and 66.7 pages per day, respectively.
If we eliminate those four books, we’re left with a set of books that more closely reflects my typical or baseline reading pace. Looking at the remaining 26 books, I was reading at a pace of about 32.9 pages per day, taking about 11 days to finish a book.
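The trimming step can be sketched in base R as follows. The `toy` data frame is entirely hypothetical, and the rules used here (drop the single slowest read by days and the single fastest read by pages per day, then divide total pages by total days) are assumptions for illustration rather than the exact method used on the real data.

```r
# Hypothetical read times (days) and page counts for six books
toy <- data.frame(
  title        = paste("Book", LETTERS[1:6]),
  days_to_read = c(10, 12, 40, 11, 3, 9),
  page_count   = c(300, 360, 260, 330, 210, 270)
)
toy$pages_per_day <- toy$page_count / toy$days_to_read

# Flag the slowest read (most days) and the fastest read (highest pace)
is_outlier <- toy$days_to_read == max(toy$days_to_read) |
  toy$pages_per_day == max(toy$pages_per_day)
baseline <- toy[!is_outlier, ]

# Baseline pace: total pages over total days, plus mean days per book
baseline_pace <- sum(baseline$page_count) / sum(baseline$days_to_read)
baseline_days <- mean(baseline$days_to_read)

c(pages_per_day = baseline_pace, days_per_book = baseline_days)
```

Averaging the per-book paces instead of dividing total pages by total days would give a different baseline, so it is worth deciding up front which definition of pace matters.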
Using the pace of 11 days to finish a book, I could create a goal of reading 365/11 = 33.2 books in 2021. Rounding up, I’ll set a goal of 34 books in 2021.
This represents an increase of 13% over my goal last year, which seems pretty reasonable based on this analysis.
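The goal arithmetic is short enough to write out in full. This sketch uses only the figures from this section: the 11-day baseline, a 365-day year, and the 2020 goal of 30 books.

```r
days_in_year  <- 365
days_per_book <- 11   # baseline pace from the 26 remaining books
goal_2020     <- 30

books_possible <- days_in_year / days_per_book  # about 33.2 books
goal_2021      <- ceiling(books_possible)       # round up to a whole book
pct_increase   <- (goal_2021 - goal_2020) / goal_2020

goal_2021                  # 34
round(100 * pct_increase)  # 13 (percent increase over the 2020 goal)
```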
Wrapping Up
In 2020, I set a goal to finish 30 books. On December 31st, I finished The Art of Solitude and completed my reading goal. I explored my Goodreads data to summarize my year in books:
- I read a total of 10,536 pages in 2020, and the average length of a book I read was 351.2 pages
- I read at a pace of 28.9 pages per day
- On average it took me about 12.2 days to complete each book
- The longest it took me to finish a book was 43 days (Never Let Me Go); my shortest read time was 3 days (The Art of Solitude)
- My average rating was 4 stars; the average Goodreads rating of the books I read was 4.1 stars
I also used the Goodreads data to set a data-driven reading goal for 2021. I hope to increase my reading by 13% in 2021 by finishing 34 books.
This was a fun way to look back on my year in books for 2020. There are a few aspects of this data that I could still look into, like the distribution of genres, the text summary of each book, and text reviews from other Goodreads users. That analysis will have to wait for another day!
Happy reading!