An Introduction to the data.table Package

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a short introduction to the data.table package. Although the primary data wrangling package I use is tidyverse, it’s worthwhile to explore other packages that do similar data manipulations. The closest “competitor” to the tidyverse is the data. [Read More]

An Introduction to Reading Data into R

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a tutorial on reading data into R. There are a lot of different ways to get data into R, and this post highlights a few of the common ways of doing that. This post assumes you have some flat file of data (e. [Read More]

Roulette Wheels for Multi-Armed Bandits: A Simulation in R

One of my favorite data science blogs comes from James McCaffrey, a software engineer and researcher at Microsoft. He recently wrote a blog post on a method for allocating turns in a multi-armed bandit problem. I really liked his post, and decided to take a look at the algorithm he described and code up a function to do the simulation in R. Note: this is strictly an implementation of Dr. McCaffrey’s ideas from his blog post, and should not be taken as my own. [Read More]

Recommending Songs Using Cosine Similarity in R

Recommendation engines have a huge impact on our online lives. The content we watch on Netflix, the products we purchase on Amazon, and even the homes we buy are all served up using these algorithms. In this post, I’ll run through one of the key metrics used in developing recommendation engines: cosine similarity. First, I’ll give a brief overview of some vocabulary we’ll need to understand recommendation systems. Then, I’ll look at the math behind cosine similarity. [Read More]

Using R to Create Custom Color Palettes for Tableau

Have you ever wanted to define custom color palettes in Tableau, but didn’t know how? In this post, I’m going to walk through how we can use R to programmatically generate custom palettes in Tableau. Creating custom color palettes for Tableau has never been easier! This is going to be a short post, with just a little bit of R code. At the end of the post, you’ll see how to use R to generate custom color palettes to add to Tableau. [Read More]

Iterating on a 2016 Election Analysis

Jake Low wrote a really interesting piece that presented a few data visualizations that went beyond the typical 2016 election maps we’ve all gotten used to seeing. I liked a lot of things about Jake’s post; here are three I was particularly fond of: (1) his color palette choices: each palette he used had solid perceptual properties and made sense for the data being visualized (i.e. diverging versus sequential); (2) he made residuals from a model interesting by visualizing and interpreting them; (3) he explained the usage of a log-scale transformation in an intuitive way, putting it in terms of the data set being used for the analysis. [Read More]

Everything I Know About Machine Learning I Learned from Making Soup

In this post, I’m going to make the claim that we can simplify some parts of the machine learning process by using the analogy of making soup. I think this analogy can improve how a data scientist explains machine learning to a broad audience, and it provides a helpful framework throughout the model building process. Relying on some insight from the CRISP-DM framework, my own experience as an amateur chef, and the well-known iris data set, I’m going to explain why I think that the soup making and machine learning connection is a pretty decent first approximation you could use to understand the machine learning process. [Read More]

Golf, Tidy Data, and Using Data Analysis to Guide Strategy

I’m going to use this post to discuss some of the aspects of data science that interest me most (tidy data as well as using data to guide strategy). I’ll be discussing these topics through the lens of a data analysis of results from a few high school golf tournaments. I’m going to take a little bit of time to talk about tidy data. When I scraped the data used for this analysis, it wasn’t really stored in a tidy format, and there’s a good reason for that. [Read More]

An Introduction to the kmeans Algorithm

This post will provide an R code-heavy, math-light introduction to selecting the \(k\) in kmeans. It presents the main idea of kmeans, demonstrates how to fit a kmeans model in R, describes some components of the kmeans fit, and shows some methods for selecting \(k\). In addition, the post provides some helpful functions which may make fitting kmeans a bit easier. kmeans clustering is an example of unsupervised learning, where we do not have an output we’re explicitly trying to predict. [Read More]