This post was originally meant for the R Users Group at my organization. I thought it would be worthwhile to have it on my blog as well, in case anyone out there is searching for a tutorial on reading data into R.
There are a lot of different ways to get data into R, and this post highlights a few of the common ways of doing that. This post assumes you have some flat file of data (e.g. csv, txt, excel) you’re trying to read into R. Maybe I’ll write a follow-up post where the data is in a less common format.
Introduction
Today, we’re going to be taking a quick ride through a few ways to get data from flat files (txt, csv, excel) into R.
The functions we’ll cover are read.csv and read.table from base R, readr::read_csv, data.table::fread, and readxl::read_excel.
You can find all the code and data on my GitHub. If you clone that repository you should be able to run all of this on your own machine.
Examples
read.csv - basics
The first function you probably used to read data into R was read.csv.
Let’s suppose you get a basic flat file:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"setosa"
Using read.csv results in:
dat <- read.csv("data/iris.txt")
tail(dat)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
But wait, something weird happened with Species:
dat$Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa
read.csv has an argument called stringsAsFactors, and its default is TRUE. This means that any string/character columns in your data will be converted to factors (further reading: stringsAsFactors: An unauthorized biography). This is generally not what we want.
So, in the event that I use read.csv (I typically prefer readr::read_csv or data.table::fread, both discussed below), I set stringsAsFactors = FALSE.
dat2 <- read.csv("data/iris.txt", stringsAsFactors = FALSE)
dat2$Species
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
read.csv works really well without specifying many arguments when your data is nice. What happens if your data is a little messier?
read.table - Column headers are separated, weird delimiter
Here’s the data:
"Sepal.Length"|"Sepal.Width"|"Petal.Length"|"Petal.Width"|"Species"
-----------------------------------------------------------------
-----------------------------------------------------------------
5.1|3.5|1.4|0.2|"setosa"
4.9|3|1.4|0.2|"setosa"
4.7|3.2|1.3|0.2|"setosa"
4.6|3.1|1.5|0.2|"setosa"
5|3.6|1.4|0.2|"setosa"
5.4|3.9|1.7|0.4|"setosa"
Two things are weird with this data:
- The column headers are separated from the data by two rows of dashes (---). Use the skip argument to jump past them.
- A pipe (|) delimiter is used. Use the sep = "|" argument.
These two oddities require that we use read.table instead of read.csv.
# first, get column names from first row of data
column_names <- read.table("data/iris_pipe_delim_edit.txt",
sep = "|", # pipe delim
nrows = 1, # only read first row
stringsAsFactors = FALSE,
header = FALSE) # no headers
# convert to a character vector
(column_names <- as.character(column_names))
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
# then, read in remaining rows, using `skip` argument
full_data <- read.table("data/iris_pipe_delim_edit.txt",
sep = "|",
skip = 3, # skip first 3 rows
stringsAsFactors = FALSE,
col.names = column_names) # specify column names
tail(full_data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
read.csv - Missing Values are Coded Strangely
What if our data has weird missing value codes, perhaps due to manual data entry?
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,NULL,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,MISSING,1.4,0.2,"setosa"
5.4,3.9,9999,0.4,"setosa"
By default, when you’re reading in data with the read.table/read.csv family of functions, R treats the string "NA" as a missing (NA) value. Sometimes, though, missing values are coded differently, like 9999, "", or NULL. The na.strings argument can help us with this.
dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt",
stringsAsFactors = FALSE)
dat_default_NA$Sepal.Width
## [1] "3.5" "3" "3.2" "3.1" "MISSING" "3.9"
dat_default_NA$Petal.Length
## [1] "1.4" "1.4" "NULL" "1.5" "1.4" "9999"
Since R found character values in the Sepal.Width and Petal.Length columns, it treats those columns as characters. We know this is wrong, and can fix it using na.strings.
dat_default_NA <- read.csv("data/iris_weird_NA_edit.txt",
stringsAsFactors = FALSE,
na.strings = c("NULL", "9999", "MISSING"))
dat_default_NA$Sepal.Width
## [1] 3.5 3.0 3.2 3.1 NA 3.9
dat_default_NA$Petal.Length
## [1] 1.4 1.4 NA 1.5 1.4 NA
readr::read_csv
library(readr)
The next function I want to talk about is read_csv from the readr package.
This function is really helpful, and it’s pretty much my go-to for reading flat files into R. It has good, well-reasoned defaults (no stringsAsFactors = FALSE needed!), and it reads the data in as a tibble as opposed to a data.frame, which makes printing the data to your console a lot better.
Rather than looking at the boring iris data, we’ll instead read some data from the internet. Yes, if you give one of the read functions (even read.table!) a url pointing to a csv/txt file, it will read that file into R (provided you have an internet connection).
u1 <- "https://raw.githubusercontent.com/fivethirtyeight/data"
u2 <- "/master/college-majors/all-ages.csv"
(u <- paste0(u1, u2))
## [1] "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv"
college_data <- read_csv(u) # informative parsing printing
## Parsed with column specification:
## cols(
## Major_code = col_double(),
## Major = col_character(),
## Major_category = col_character(),
## Total = col_double(),
## Employed = col_double(),
## Employed_full_time_year_round = col_double(),
## Unemployed = col_double(),
## Unemployment_rate = col_double(),
## Median = col_double(),
## P25th = col_double(),
## P75th = col_double()
## )
college_data # nice printing of data, don't need head() or tail()
## # A tibble: 173 x 11
## Major_code Major Major_category Total Employed Employed_full_t~
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1100 GENE~ Agriculture &~ 128148 90245 74078
## 2 1101 AGRI~ Agriculture &~ 95326 76865 64240
## 3 1102 AGRI~ Agriculture &~ 33955 26321 22810
## 4 1103 ANIM~ Agriculture &~ 103549 81177 64937
## 5 1104 FOOD~ Agriculture &~ 24280 17281 12722
## 6 1105 PLAN~ Agriculture &~ 79409 63043 51077
## 7 1106 SOIL~ Agriculture &~ 6586 4926 4042
## 8 1199 MISC~ Agriculture &~ 8549 6392 5074
## 9 1301 ENVI~ Biology & Lif~ 106106 87602 65238
## 10 1302 FORE~ Agriculture &~ 69447 48228 39613
## # ... with 163 more rows, and 5 more variables: Unemployed <dbl>,
## # Unemployment_rate <dbl>, Median <dbl>, P25th <dbl>, P75th <dbl>
read_csv has a few arguments I should highlight (a short sketch combining them follows this list):
- na: character vector of strings to interpret as missing values (this is like na.strings in read.csv).
- skip: number of lines to skip before reading data.
- n_max: maximum number of records to read.
- col_types: allows you to specify the column types for your data. I typically leave this as is and let the parser do its job, but it can be helpful if you’re trying to coerce a certain column to a certain data type.
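Here’s that sketch: a hedged example on a hypothetical file (messy.csv and the Species column are placeholders, but na, skip, n_max, and col_types are the real arguments listed above).
dat <- read_csv("messy.csv",
                na = c("NA", "NULL", "9999", "MISSING"), # like na.strings in read.csv
                skip = 2, # ignore two junk lines at the top of the file
                n_max = 1000, # read at most 1000 records
                col_types = cols(Species = col_character())) # force one column to character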
data.table::fread
library(data.table)
Now, sometimes you might be dealing with some really nasty data that is large and unwieldy. read_csv is good for maybe 80-90% of data files, but sometimes we need something more powerful.
That’s where the fread function from the data.table package comes in handy (further reading: Convenience features of fread).
By some measures, fread can be about 6 times faster than read.csv and about 2.5 times faster than read_csv.
One of the best parts of fread is that you do not necessarily have to specify the delimiter in your data; it detects the separator automatically.
For example, the pipe delimited data from above is read in easily. In the code below I set verbose = TRUE to show the internal output from fread. In general, I’d recommend you leave this as FALSE, unless you’re in serious debug mode.
pipe_dat <- fread("data/iris_pipe_delim_edit.txt",
skip = 3, # homework: what happens if you don't specify skip?
verbose = TRUE, # default is FALSE, which I recommend
col.names = names(iris))
## omp_get_num_procs()==8
## R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
## R_DATATABLE_NUM_THREADS==""
## omp_get_thread_limit()==2147483647
## omp_get_max_threads()==8
## OMP_THREAD_LIMIT==""
## OMP_NUM_THREADS==""
## data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
## RestoreAfterFork==true
## Input contains no \n. Taking this to be a filename to open
## [01] Check arguments
## Using 4 threads (omp_get_max_threads()=8, nth=4)
## NAstrings = [<<NA>>]
## None of the NAstrings look like numbers.
## skip num lines = 3
## show progress = 0
## 0/1 column will be read as integer
## [02] Opening the file
## Opening file data/iris_pipe_delim_edit.txt
## File opened, size = 356 bytes.
## Memory mapped ok
## [03] Detect and skip BOM
## [04] Arrange mmap to be \0 terminated
## \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
## [05] Skipping initial rows if needed
## Skipped to line 4 in the file Positioned on line 4 starting: <<5.1|3.5|1.4|0.2|"setosa">>
## [06] Detect separator, quoting rule, and ncolumns
## Detecting sep automatically ...
## sep='|' with 6 lines of 5 fields using quote rule 0
## Detected 5 columns on line 4. This line is either column names or first data row. Line starts as: <<5.1|3.5|1.4|0.2|"setosa">>
## Quote rule picked = 0
## fill=false and the most number of columns found is 5
## [07] Detect column types, good nrow estimate and whether first row is column names
## Number of sampling jump points = 1 because (150 bytes from row 1 to eof) / (2 * 150 jump0size) == 0
## Type codes (jump 000) : 7777A Quote rule 0
## 'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
## All rows were sampled since file is small so we know nrow=6 exactly
## [08] Assign column names
## [09] Apply user overrides on column types
## After 0 type and 0 drop user overrides : 7777A
## [10] Allocate memory for the datatable
## Allocating 5 column slots (5 - 0 dropped) with 6 rows
## [11] Read the data
## jumps=[0..1), chunk_size=1048576, total_size=150
## Read 6 rows x 5 columns from 356 bytes file in 00:00.000 wall clock time
## [12] Finalizing the datatable
## Type counts:
## 4 : float64 '7'
## 1 : string 'A'
## =============================
## 0.000s ( 0%) Memory map 0.000GB file
## 0.000s ( 0%) sep='|' ncol=5 and header detection
## 0.000s ( 0%) Column type detection using 6 sample rows
## 0.000s ( 0%) Allocation of 6 rows x 5 cols (0.000GB) of which 6 (100%) rows used
## 0.000s ( 0%) Reading 1 chunks (0 swept) of 1.000MB (each chunk 6 rows) using 1 threads
## + 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
## + 0.000s ( 0%) Transpose
## + 0.000s ( 0%) Waiting
## 0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
## 0.000s Total
pipe_dat
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 5.1 3.5 1.4 0.2 setosa
## 2: 4.9 3.0 1.4 0.2 setosa
## 3: 4.7 3.2 1.3 0.2 setosa
## 4: 4.6 3.1 1.5 0.2 setosa
## 5: 5.0 3.6 1.4 0.2 setosa
## 6: 5.4 3.9 1.7 0.4 setosa
Here’s a demonstration of how much faster fread is than read.csv and read_csv, using a subset of the flights data set.
u1 <- "https://github.com/roberthryniewicz/datasets/"
u2 <- "blob/master/airline-dataset/flights/flights.csv?raw=true"
(uu <- paste0(u1, u2))
## [1] "https://github.com/roberthryniewicz/datasets/blob/master/airline-dataset/flights/flights.csv?raw=true"
system.time(dat_base <- read.csv(uu)) # timing for read.csv
## user system elapsed
## 15.45 0.15 17.36
system.time(dat_readr <- read_csv(uu)) # timing for read_csv
## Parsed with column specification:
## cols(
## .default = col_double(),
## UniqueCarrier = col_character(),
## TailNum = col_character(),
## Origin = col_character(),
## Dest = col_character(),
## CancellationCode = col_character()
## )
## See spec(...) for full column specifications.
## user system elapsed
## 0.67 0.18 3.29
system.time(dat_fread <- fread(uu)) # timing for fread
## user system elapsed
## 0.38 0.07 1.89
dim(dat_fread) # rows by columns
## [1] 100000 29
I would strongly encourage you to spend some time playing around with fread, and thoroughly investigate its arguments (it’s got a lot of them!). Considering all the benefits of fread, I’m actually surprised I don’t use it more.
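As one quick, hedged example, here are a few fread arguments applied to the flights file from above (select, nrows, and na.strings are all real fread arguments; the column names are ones we saw in the readr parsing output):
flights_sub <- fread(uu,
                     select = c("UniqueCarrier", "Origin", "Dest"), # read only these columns
                     nrows = 1000, # cap the number of rows read
                     na.strings = c("", "NA")) # also treat blanks as missing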
readxl::read_excel
library(readxl)
Maybe you’ve been unlucky enough to have to do some analysis using an excel file. Getting that data into R used to be a tedious task. Now, we can use the read_excel function from the readxl package.
Excel files will typically have multiple sheets; the excel example we’re looking at today has three. Reading these in is straightforward, using the read_excel function and the sheet argument.
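If you don’t know the sheet names ahead of time, the excel_sheets function from readxl will list them:
excel_sheets("data/Superstore.xls") # the sheet names: "Orders", "Returns", "People"
With the sheet names in hand, we can read each sheet: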
orders <- read_excel("data/Superstore.xls",
sheet = "Orders")
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L2236 / R2236C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L5276 / R5276C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L8800 / R8800C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9148 / R9148C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9149 / R9149C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9150 / R9150C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9388 / R9388C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9389 / R9389C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9390 / R9390C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9391 / R9391C12: '05408'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Coercing text to numeric in L9743 / R9743C12: '05408'
tail(orders)
## # A tibble: 6 x 21
## `Row ID` `Order ID` `Order Date` `Ship Date` `Ship Mode`
## <dbl> <chr> <dttm> <dttm> <chr>
## 1 9989 CA-2017-1~ 2017-11-17 00:00:00 2017-11-21 00:00:00 Standard C~
## 2 9990 CA-2014-1~ 2014-01-21 00:00:00 2014-01-23 00:00:00 Second Cla~
## 3 9991 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 4 9992 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 5 9993 CA-2017-1~ 2017-02-26 00:00:00 2017-03-03 00:00:00 Standard C~
## 6 9994 CA-2017-1~ 2017-05-04 00:00:00 2017-05-09 00:00:00 Second Cla~
## # ... with 16 more variables: `Customer ID` <chr>, `Customer Name` <chr>,
## # Segment <chr>, Country <chr>, City <chr>, State <chr>, `Postal
## # Code` <dbl>, Region <chr>, `Product ID` <chr>, Category <chr>,
## # `Sub-Category` <chr>, `Product Name` <chr>, Sales <dbl>,
## # Quantity <dbl>, Discount <dbl>, Profit <dbl>
returns <- read_excel("data/Superstore.xls",
sheet = "Returns")
tail(returns)
## # A tibble: 6 x 2
## Returned `Order ID`
## <chr> <chr>
## 1 Yes US-2016-140172
## 2 Yes CA-2015-101910
## 3 Yes CA-2017-156958
## 4 Yes CA-2016-105585
## 5 Yes CA-2016-148796
## 6 Yes CA-2015-149636
people <- read_excel("data/Superstore.xls",
sheet = "People")
tail(people)
## # A tibble: 4 x 2
## Person Region
## <chr> <chr>
## 1 Anna Andreadi West
## 2 Chuck Magee East
## 3 Kelly Williams Central
## 4 Cassandra Brandow South
A few things to note about read_excel:
- VERY IMPORTANT: sometimes the function fails if you have the file open. Make sure the excel file is closed before trying to read it into R!
- The function will inform you of parsing issues/column type coercion.
- By default, the function will return a tibble and not a data.frame.
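read_excel also shares some conveniences with the readers above. As a hedged sketch on the same file (the cell range here is arbitrary), na and range are both supported arguments:
orders_sub <- read_excel("data/Superstore.xls",
                         sheet = "Orders",
                         range = "A1:E11", # header row plus ten rows of the first five columns
                         na = c("", "NA")) # treat blank cells as missing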
Wrapping Up
In this post, we’ve looked at a few different ways of getting data into R from flat files. For nice flat files, it’s pretty straightforward. If your data isn’t so nice, you can generally be successful using the fread or read_csv functions, but you’ll need to be very aware of the structure of your data, as well as the arguments of whichever function you decide to use.
Session Information
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] readxl_1.3.1 data.table_1.12.2 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 knitr_1.23 magrittr_1.5 hms_0.4.2
## [5] R6_2.4.0 rlang_0.4.0 fansi_0.4.0 stringr_1.4.0
## [9] tools_3.6.0 xfun_0.8 utf8_1.1.4 cli_1.1.0
## [13] htmltools_0.3.6 assertthat_0.2.1 yaml_2.2.0 digest_0.6.19
## [17] tibble_2.1.3 crayon_1.3.4 bookdown_0.11 vctrs_0.1.0
## [21] zeallot_0.1.0 evaluate_0.14 rmarkdown_1.13 blogdown_0.13
## [25] stringi_1.4.3 compiler_3.6.0 pillar_1.4.1 cellranger_1.1.0
## [29] backports_1.1.4 pkgconfig_2.0.2