An Introduction to Reading Data into R

This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a tutorial on reading data into R.

There are a lot of different ways to get data into R, and this post highlights a few of the common ways of doing that. This post assumes you have some flat file of data (e.g. csv, txt, excel) you’re trying to read into R. Maybe I’ll write a follow-up post where the data is in a less common format.

Introduction

Today, we’re going to be taking a quick ride through a few ways to get data from flat files (txt, csv, excel) into R.

Each of the functions discussed (read.csv, read.table, readr::read_csv, data.table::fread, and readxl::read_excel) has its own documentation worth keeping open as you follow along.

You can find all the code and data on my GitHub. If you clone that repository you should be able to run all of this on your own machine.

Examples

read.csv - basics

The first function you probably used to read data into R was read.csv.

Let’s suppose you get a basic flat file

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"setosa"

Using read.csv results in:

dat <- read.csv("data/iris.txt")
tail(dat)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

But wait, something weird happened with Species:

dat$Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa

read.csv has an argument called stringsAsFactors, and its default is TRUE (at least through R 3.6.x; R 4.0.0 changed the default to FALSE). This means that any string/character columns in your data will be converted to factors (further reading: stringsAsFactors: An unauthorized biography). This is generally not what we want.

So, in the event that I use read.csv (I typically prefer readr::read_csv or data.table::fread, discussed below), I set stringsAsFactors = FALSE.

dat2 <- read.csv("data/iris.txt", stringsAsFactors = FALSE)
dat2$Species
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
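If you've already read a file in with the old default and ended up with factors, you can also convert the offending column back after the fact. Here's a small self-contained sketch (using an in-memory copy of the data via the text argument, rather than the file on disk):

```r
# reproduce the factor behavior on an in-memory copy of the data
csv_text <- '"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"'
dat <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(dat$Species)
## [1] "factor"

# convert the offending column back to character after the fact
dat$Species <- as.character(dat$Species)
class(dat$Species)
## [1] "character"
```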

read.csv works really well without specifying many arguments when your data is nice. What happens if your data is a little messier?

read.table - Column headers are separated, weird delimiter

Here’s the data:

"Sepal.Length"|"Sepal.Width"|"Petal.Length"|"Petal.Width"|"Species"
-----------------------------------------------------------------
----------------------------------------------------------------- 
5.1|3.5|1.4|0.2|"setosa"
4.9|3|1.4|0.2|"setosa"
4.7|3.2|1.3|0.2|"setosa"
4.6|3.1|1.5|0.2|"setosa"
5|3.6|1.4|0.2|"setosa"
5.4|3.9|1.7|0.4|"setosa"

Two things are weird with this data:

  1. The column headers are separated from the data by two rows of dashes (---)
    • Use the skip argument
  2. A pipe (|) is used as the delimiter
    • Use the sep = "|" argument

These two oddities require that we use read.table instead of read.csv.

# first, get column names from first row of data
column_names <- read.table("data/iris_pipe_delim_edit.txt",
                           sep = "|", # pipe delim
                           nrows = 1, # only read first row
                           stringsAsFactors = FALSE,
                           header = FALSE) # no headers
# convert to a character vector
(column_names <- as.character(column_names))
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
# then, read in remaining rows, using `skip` argument
full_data <- read.table("data/iris_pipe_delim_edit.txt",
                        sep = "|",
                        skip = 3, # skip first 3 rows
                        stringsAsFactors = FALSE,
                        col.names = column_names) # specify column names

tail(full_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
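Another option is to filter out the dashed rows yourself and hand the cleaned lines to read.table via its text argument. The sketch below works on an in-memory copy of the file; in practice you'd build raw with readLines("data/iris_pipe_delim_edit.txt") instead:

```r
# in practice: raw <- readLines("data/iris_pipe_delim_edit.txt")
raw <- c('"Sepal.Length"|"Sepal.Width"|"Petal.Length"|"Petal.Width"|"Species"',
         strrep("-", 65),
         strrep("-", 65),
         '5.1|3.5|1.4|0.2|"setosa"',
         '4.9|3|1.4|0.2|"setosa"')
keep <- !grepl("^-+\\s*$", raw)  # drop lines made up of dashes only
full_data2 <- read.table(text = raw[keep], sep = "|",
                         header = TRUE, stringsAsFactors = FALSE)
```

This way you don't need a separate pass to pull out the column names, since the header row survives the filter and header = TRUE picks it up.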

read.csv - Missing Values are Coded Strangely

What if our data has weird missing value codes, perhaps due to human input?

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,NULL,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,MISSING,1.4,0.2,"setosa"
5.4,3.9,9999,0.4,"setosa"

By default, when you’re reading in data with the read.table/read.csv family of functions, R treats any field containing the string "NA" as a missing value. Sometimes, though, missing values are coded differently, like 9999, "MISSING", or NULL. The na.strings argument can help us with this.

dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt", 
                           stringsAsFactors = FALSE)

dat_default_NA$Sepal.Width
## [1] "3.5"     "3"       "3.2"     "3.1"     "MISSING" "3.9"
dat_default_NA$Petal.Length
## [1] "1.4"  "1.4"  "NULL" "1.5"  "1.4"  "9999"

Since R found character values in the Sepal.Width and Petal.Length columns, it parsed those columns as character. We know this is wrong, and can fix it using na.strings.

dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt",
                           stringsAsFactors = FALSE,
                           na.strings = c("NULL", "9999", "MISSING"))

dat_default_NA$Sepal.Width
## [1] 3.5 3.0 3.2 3.1  NA 3.9
dat_default_NA$Petal.Length
## [1] 1.4 1.4  NA 1.5 1.4  NA
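One caution: na.strings applies to every column of the file, so if 9999 were a legitimate value in some other column it would get blanked out there too. For per-column control, one option is to read the data as-is and clean up a single column afterwards. Here's a sketch on a small in-memory example:

```r
# read as-is, then treat the sentinels as missing in one column only
txt <- '"Sepal.Length","Petal.Length"
5.1,1.4
4.7,NULL
5.4,9999'
dat3 <- read.csv(text = txt, stringsAsFactors = FALSE)
dat3$Petal.Length[dat3$Petal.Length %in% c("NULL", "9999")] <- NA
dat3$Petal.Length <- as.numeric(dat3$Petal.Length)
dat3$Petal.Length
## [1] 1.4  NA  NA
```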

readr::read_csv

library(readr)

The next function I want to talk about is read_csv from the readr package.

This function is really helpful, and it’s pretty much my go-to function for reading flat files into R. It has good, well-reasoned defaults (no need for stringsAsFactors = FALSE!), and reads the data in as a tibble rather than a data.frame, which makes printing the data to your console a lot nicer.

Rather than looking at the boring iris data, we’ll instead read some data from the internet. Yes, if you give one of the read functions (even read.table!) a URL to a csv/txt file, it will read that into R (provided you have an internet connection).

u1 <- "https://raw.githubusercontent.com/fivethirtyeight/data"

u2 <- "/master/college-majors/all-ages.csv"

(u <- paste0(u1, u2))
## [1] "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
college_data <- read_csv(u) # informative parsing printing
## Parsed with column specification:
## cols(
##   Major_code = col_double(),
##   Major = col_character(),
##   Major_category = col_character(),
##   Total = col_double(),
##   Employed = col_double(),
##   Employed_full_time_year_round = col_double(),
##   Unemployed = col_double(),
##   Unemployment_rate = col_double(),
##   Median = col_double(),
##   P25th = col_double(),
##   P75th = col_double()
## )
college_data # nice printing of data, don't need head() or tail()
## # A tibble: 173 x 11
##    Major_code Major Major_category  Total Employed Employed_full_t~
##         <dbl> <chr> <chr>           <dbl>    <dbl>            <dbl>
##  1       1100 GENE~ Agriculture &~ 128148    90245            74078
##  2       1101 AGRI~ Agriculture &~  95326    76865            64240
##  3       1102 AGRI~ Agriculture &~  33955    26321            22810
##  4       1103 ANIM~ Agriculture &~ 103549    81177            64937
##  5       1104 FOOD~ Agriculture &~  24280    17281            12722
##  6       1105 PLAN~ Agriculture &~  79409    63043            51077
##  7       1106 SOIL~ Agriculture &~   6586     4926             4042
##  8       1199 MISC~ Agriculture &~   8549     6392             5074
##  9       1301 ENVI~ Biology & Lif~ 106106    87602            65238
## 10       1302 FORE~ Agriculture &~  69447    48228            39613
## # ... with 163 more rows, and 5 more variables: Unemployed <dbl>,
## #   Unemployment_rate <dbl>, Median <dbl>, P25th <dbl>, P75th <dbl>

read_csv has a few arguments I should highlight:

  • na: Character vector of strings to interpret as missing values. (this is like na.strings in read.csv)
  • skip: Number of lines to skip before reading data.
  • n_max: Maximum number of records to read.
  • col_types: allows you to specify the column types for your data. I typically leave this alone and let the parser do its job, but it can be helpful if you’re trying to coerce a certain column to a certain data type.
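As a small sketch of col_types on inline data (with made-up column names echoing the college data above, not the actual fivethirtyeight file): an ID-like code is often safer to read as character, so that leading zeros survive.

```r
library(readr)

# read_csv treats a string containing newlines as literal data
txt <- "Major_code,Major
01100,GENERAL AGRICULTURE
01101,AGRICULTURE PRODUCTION"
dat4 <- read_csv(txt, col_types = cols(Major_code = col_character(),
                                       .default = col_guess()))
dat4$Major_code
## [1] "01100" "01101"
```

Had we let the parser guess, Major_code would come back as a double and the leading zeros would be gone.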

data.table::fread

library(data.table)

Now, sometimes you might be dealing with some really nasty data that is large and unwieldy. read_csv is good for maybe 80-90% of data files, but sometimes we need something more powerful.

This is where the fread function from the data.table package comes in handy (further reading: Convenience features of fread).

By some measures, fread can be about 6 times faster than read.csv and about 2.5 times faster than read_csv.

One of the best parts of fread is that you don’t have to specify the delimiter in your data; it will usually detect it automatically.

For example, the pipe delimited data from above is read in easily. In the code below I set verbose = TRUE to show the internal output from fread. In general, I’d recommend you leave this as FALSE, unless you’re in serious debug mode.

pipe_dat <- fread("data/iris_pipe_delim_edit.txt",
                  skip = 3, # homework: what happens if you don't specify skip?
                  verbose = TRUE, # default is FALSE, which I recommend
                  col.names = names(iris))
## omp_get_num_procs()==8
## R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
## R_DATATABLE_NUM_THREADS==""
## omp_get_thread_limit()==2147483647
## omp_get_max_threads()==8
## OMP_THREAD_LIMIT==""
## OMP_NUM_THREADS==""
## data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
## RestoreAfterFork==true
## Input contains no \n. Taking this to be a filename to open
## [01] Check arguments
##   Using 4 threads (omp_get_max_threads()=8, nth=4)
##   NAstrings = [<<NA>>]
##   None of the NAstrings look like numbers.
##   skip num lines = 3
##   show progress = 0
##   0/1 column will be read as integer
## [02] Opening the file
##   Opening file data/iris_pipe_delim_edit.txt
##   File opened, size = 356 bytes.
##   Memory mapped ok
## [03] Detect and skip BOM
## [04] Arrange mmap to be \0 terminated
##   \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
## [05] Skipping initial rows if needed
##   Skipped to line 4 in the file  Positioned on line 4 starting: <<5.1|3.5|1.4|0.2|"setosa">>
## [06] Detect separator, quoting rule, and ncolumns
##   Detecting sep automatically ...
##   sep='|'  with 6 lines of 5 fields using quote rule 0
##   Detected 5 columns on line 4. This line is either column names or first data row. Line starts as: <<5.1|3.5|1.4|0.2|"setosa">>
##   Quote rule picked = 0
##   fill=false and the most number of columns found is 5
## [07] Detect column types, good nrow estimate and whether first row is column names
##   Number of sampling jump points = 1 because (150 bytes from row 1 to eof) / (2 * 150 jump0size) == 0
##   Type codes (jump 000)    : 7777A  Quote rule 0
##   'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
##   All rows were sampled since file is small so we know nrow=6 exactly
## [08] Assign column names
## [09] Apply user overrides on column types
##   After 0 type and 0 drop user overrides : 7777A
## [10] Allocate memory for the datatable
##   Allocating 5 column slots (5 - 0 dropped) with 6 rows
## [11] Read the data
##   jumps=[0..1), chunk_size=1048576, total_size=150
## Read 6 rows x 5 columns from 356 bytes file in 00:00.000 wall clock time
## [12] Finalizing the datatable
##   Type counts:
##          4 : float64   '7'
##          1 : string    'A'
## =============================
##    0.000s (  0%) Memory map 0.000GB file
##    0.000s (  0%) sep='|' ncol=5 and header detection
##    0.000s (  0%) Column type detection using 6 sample rows
##    0.000s (  0%) Allocation of 6 rows x 5 cols (0.000GB) of which 6 (100%) rows used
##    0.000s (  0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 6 rows) using 1 threads
##    +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
##    +    0.000s (  0%) Transpose
##    +    0.000s (  0%) Waiting
##    0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
##    0.000s        Total
pipe_dat
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          5.1         3.5          1.4         0.2  setosa
## 2:          4.9         3.0          1.4         0.2  setosa
## 3:          4.7         3.2          1.3         0.2  setosa
## 4:          4.6         3.1          1.5         0.2  setosa
## 5:          5.0         3.6          1.4         0.2  setosa
## 6:          5.4         3.9          1.7         0.4  setosa
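Two of fread's other arguments worth knowing are select (read only certain columns) and nrows (read only the first so many rows); both can save a lot of time and memory on big files. Here's a sketch on an in-memory copy of the pipe-delimited data, using fread's text argument:

```r
library(data.table)

txt <- paste('5.1|3.5|1.4|0.2|"setosa"',
             '4.9|3|1.4|0.2|"setosa"',
             '4.7|3.2|1.3|0.2|"setosa"', sep = "\n")
small <- fread(text = txt,
               select = c(1, 5),  # only the 1st and 5th columns
               nrows = 2)         # only the first 2 rows
```

On a file on disk you'd pass the path as the first argument instead of text.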

Here’s a demonstration of how much faster fread is than read.csv and read_csv using a subset of the flights data set.

u1 <- "https://github.com/roberthryniewicz/datasets/"
u2 <- "blob/master/airline-dataset/flights/flights.csv?raw=true"
(uu <- paste0(u1, u2))
## [1] "https://github.com/roberthryniewicz/datasets/blob/master/airline-dataset/flights/flights.csv?raw=true"
system.time(dat_base <- read.csv(uu)) # timing for read.csv
##    user  system elapsed 
##   15.45    0.15   17.36
system.time(dat_readr <- read_csv(uu)) # timing for read_csv
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   UniqueCarrier = col_character(),
##   TailNum = col_character(),
##   Origin = col_character(),
##   Dest = col_character(),
##   CancellationCode = col_character()
## )
## See spec(...) for full column specifications.
##    user  system elapsed 
##    0.67    0.18    3.29
system.time(dat_fread <- fread(uu)) # timing for fread
##    user  system elapsed 
##    0.38    0.07    1.89
dim(dat_fread) # rows by columns
## [1] 100000     29

I would strongly encourage you to spend some time playing around with fread, and thoroughly investigate its arguments (it’s got a lot!).

Considering all the benefits of fread, I’m actually surprised I don’t use it more.

readxl::read_excel

library(readxl)

Maybe you’ve been unlucky enough to have to do some analysis using an Excel file. Getting that data into R used to be a tedious task; now, we can use the read_excel function from the readxl package.

Excel files typically have multiple sheets. The Excel example we’re looking at today has three separate sheets.

Reading these in is straightforward, using the read_excel function and the sheet argument.

orders <- read_excel("data/Superstore.xls",
                     sheet = "Orders")
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L2236 / R2236C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L5276 / R5276C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L8800 / R8800C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9148 / R9148C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9149 / R9149C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9150 / R9150C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9388 / R9388C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9389 / R9389C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9390 / R9390C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9391 / R9391C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9743 / R9743C12: '05408'
tail(orders)
## # A tibble: 6 x 21
##   `Row ID` `Order ID` `Order Date`        `Ship Date`         `Ship Mode`
##      <dbl> <chr>      <dttm>              <dttm>              <chr>      
## 1     9989 CA-2017-1~ 2017-11-17 00:00:00 2017-11-21 00:00:00 Standard C~
## 2     9990 CA-2014-1~ 2014-01-21 00:00:00 2014-01-23 00:00:00 Second Cla~
## 3     9991 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 4     9992 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 5     9993 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 6     9994 CA-2017-1~ 2017-05-04 00:00:00 2017-05-09 00:00:00 Second Cla~
## # ... with 16 more variables: `Customer ID` <chr>, `Customer Name` <chr>,
## #   Segment <chr>, Country <chr>, City <chr>, State <chr>, `Postal
## #   Code` <dbl>, Region <chr>, `Product ID` <chr>, Category <chr>,
## #   `Sub-Category` <chr>, `Product Name` <chr>, Sales <dbl>,
## #   Quantity <dbl>, Discount <dbl>, Profit <dbl>
returns <- read_excel("data/Superstore.xls",
                      sheet = "Returns")

tail(returns)
## # A tibble: 6 x 2
##   Returned `Order ID`    
##   <chr>    <chr>         
## 1 Yes      US-2016-140172
## 2 Yes      CA-2015-101910
## 3 Yes      CA-2017-156958
## 4 Yes      CA-2016-105585
## 5 Yes      CA-2016-148796
## 6 Yes      CA-2015-149636
people <- read_excel("data/Superstore.xls",
                     sheet = "People")

tail(people)
## # A tibble: 4 x 2
##   Person            Region 
##   <chr>             <chr>  
## 1 Anna Andreadi     West   
## 2 Chuck Magee       East   
## 3 Kelly Williams    Central
## 4 Cassandra Brandow South

A few things to note about read_excel

  • VERY IMPORTANT: the function sometimes fails if you have the file open in Excel. Make sure the Excel file is closed before trying to read it into R!
  • The function will inform you of parsing issues/column type coercion.
  • By default, the function will return a tibble and not a data.frame.
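If you don't know the sheet names ahead of time, readxl's excel_sheets function will list them for you, and you can read every sheet into a named list. The sketch below uses the example workbook that ships with the readxl package (via readxl_example) rather than the Superstore file:

```r
library(readxl)

path <- readxl_example("datasets.xlsx")  # example file bundled with readxl
sheet_names <- excel_sheets(path)        # list the sheets before reading
all_sheets <- lapply(sheet_names, function(s) read_excel(path, sheet = s))
names(all_sheets) <- sheet_names
```

Each element of all_sheets is a tibble, so all_sheets[["Orders"]] (or whatever your sheet is called) gets you straight to the data you want.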

Wrapping Up

In this post, we’ve looked at a few different ways of getting data into R from flat files. For nice flat files, it’s pretty straightforward to get your data into R. If your data isn’t so nice, you can generally be successful using the fread or read_csv functions, but you’ll need to be very aware of the structure of your data, as well as the arguments for whatever function you decide to use.

Session Information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] readxl_1.3.1      data.table_1.12.2 readr_1.3.1      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       knitr_1.23       magrittr_1.5     hms_0.4.2       
##  [5] R6_2.4.0         rlang_0.4.0      fansi_0.4.0      stringr_1.4.0   
##  [9] tools_3.6.0      xfun_0.8         utf8_1.1.4       cli_1.1.0       
## [13] htmltools_0.3.6  assertthat_0.2.1 yaml_2.2.0       digest_0.6.19   
## [17] tibble_2.1.3     crayon_1.3.4     bookdown_0.11    vctrs_0.1.0     
## [21] zeallot_0.1.0    evaluate_0.14    rmarkdown_1.13   blogdown_0.13   
## [25] stringi_1.4.3    compiler_3.6.0   pillar_1.4.1     cellranger_1.1.0
## [29] backports_1.1.4  pkgconfig_2.0.2
