My Year in Books: Goodreads Data Analysis in R

In 2020, I set a goal of reading 30 books. Aided by a last-minute charge, I managed to hit this number, finishing my 30th book on December 31st. As I was finishing up my year of reading, I started thinking about some of the statistics of my year in books: On average, how many pages did I read per day? Did I have any slumps during the year? [Read More]

An Introduction to the data.table Package

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a short introduction to the data.table package. Although the primary data wrangling package I use is the tidyverse, it’s worthwhile to explore other packages that do similar data manipulations. The closest “competitor” to the tidyverse is the data. [Read More]

An Introduction to Reading Data into R

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a tutorial on reading data into R. There are a lot of different ways to get data into R, and this post highlights a few of the common ways of doing that. This post assumes you have some flat file of data (e. [Read More]

Quantifying Home Field Advantage in the NFL Using Linear Models in R

If you pay attention to NFL football, you’re probably used to hearing that home field advantage is worth about 3 points. I’ve always been interested in this number and how it was derived. So, using some data from FiveThirtyEight, along with some linear modeling in R, I attempted to quantify home field advantage. My analysis shows that home field advantage (how much we expect the home team to win by, if the teams are evenly matched) is about 2. [Read More]

7 Tips for Delivering a Great Data Science Presentation

Delivering a great data science presentation can seem daunting. By no means am I a communications expert, but I have presented my fair share of talks to a diverse range of audiences. Through my experience, I’ve developed a few easy-to-remember tips to hopefully make your next data science presentation your best yet. These are tips that have worked for me, and I hope they’re helpful! Without further ado, here are seven tips for delivering a great data science presentation. [Read More]

Roulette Wheels for Multi-Armed Bandits: A Simulation in R

One of my favorite data science blogs comes from James McCaffrey, a software engineer and researcher at Microsoft. He recently wrote a blog post on a method for allocating turns in a multi-armed bandit problem. I really liked his post, and decided to take a look at the algorithm he described and code up a function to do the simulation in R. Note: this is strictly an implementation of Dr. McCaffrey’s ideas from his blog post, and should not be taken as my own. [Read More]

Recommending Songs Using Cosine Similarity in R

Recommendation engines have a huge impact on our online lives. The content we watch on Netflix, the products we purchase on Amazon, and even the homes we buy are all served up using these algorithms. In this post, I’ll run through one of the key metrics used in developing recommendation engines: cosine similarity. First, I’ll give a brief overview of some vocabulary we’ll need to understand recommendation systems. Then, I’ll look at the math behind cosine similarity. [Read More]

Using R to Create Custom Color Palettes for Tableau

Have you ever wanted to define custom color palettes in Tableau, but didn’t know how? In this post, I’m going to walk through how we can use R to programmatically generate custom palettes in Tableau. Creating custom color palettes for Tableau has never been easier! This is going to be a short post, with just a little bit of R code. At the end of the post, you’ll see how to use R to generate custom color palettes to add to Tableau. [Read More]

Iterating on a 2016 Election Analysis

Jake Low wrote a really interesting piece that presented a few data visualizations that went beyond the typical 2016 election maps we’ve all gotten used to seeing. I liked a lot of things about Jake’s post; here are three I was particularly fond of: (1) His color palette choices: each color palette that was used had solid perceptual properties and made sense for the data being visualized (i.e. diverging versus sequential). (2) He made residuals from a model interesting by visualizing and interpreting them. (3) He explained the usage of a log-scale transformation in an intuitive way, putting it in terms of the data set being used for the analysis. [Read More]

Everything I Know About Machine Learning I Learned from Making Soup

In this post, I’m going to make the claim that we can simplify some parts of the machine learning process by using the analogy of making soup. I think this analogy can improve how a data scientist explains machine learning to a broad audience, and it provides a helpful framework throughout the model building process. Relying on some insight from the CRISP-DM framework, my own experience as an amateur chef, and the well-known iris data set, I’m going to explain why I think that the connection between soup making and machine learning is a pretty decent first approximation you could use to understand the machine learning process. [Read More]