Vroom Benchmarks

vroom is a new approach to reading delimited and fixed width data into R.

It stems from the observation that when parsing files reading data from disk and finding the delimiters is generally not the main bottle neck. Instead (re)-allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore you can obtain very rapid input by first performing a fast indexing step and then using the Altrep framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.

How it works

The initial reading of the file simply records the locations of each individual record, the actual values are not read into R. Altrep vectors are created for each column in the data which hold a pointer to the index and the memory mapped file. When these vectors are indexed the value is read from the memory mapping.

This means initial reading is extremely fast, in the real world dataset below it is ~ 1/4 the time of the multi-threaded data.table::fread(). Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method, calling head(), tail() x[sample(), ] etc. have very low overhead. Filtering also can be fast, only the columns included in the filter selection have to be fully read and only the data in the filtered rows needs to be read from the remaining columns. Grouped aggregations likewise only need to read the grouping variables and the variables aggregated.

Once a particular vector is fully materialized the speed for all subsequent operations should be identical to a normal R vector.

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading the benchmarks also do some manipulation of the data afterwards to try and provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others they are excluded from the plots, but are retained in the tables.

Taxi Trip Dataset

This real world dataset is from Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission 2013, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table trip_fare_1.csv is 1.55G in size.

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

All benchmarks were run on a Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume type.

The benchmarks labeled vroom_base uses vroom with base functions for manipulation. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate. data.table uses fread() to read the file and data.table functions to manipulate and readr uses readr to read the file and dplyr to manipulate. By default vroom only uses Altrep for character vectors, these are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types and vroom(altrep: none) disable Altrep entirely.

The following operations are performed.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment, this is 6434 rows (0.0435% of total).
  • Aggregation of mean fare amount per payment type.
reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 6.07GB 1m 6.7s 6ms 1ms 1ms 1ms 314ms 950ms 1m 8s
readr dplyr 5.82GB 29s 85ms 1ms 1ms 3ms 191ms 733ms 30s
vroom dplyr FALSE 6.18GB 15.6s 98ms 1ms 1ms 2ms 840ms 838ms 17.4s
data.table data.table 6.47GB 11.9s 12ms 1ms 1ms 1ms 133ms 218ms 12.2s
vroom base TRUE 7.8GB 1.1s 125ms 1ms 1ms 1ms 1.4s 8.7s 11.3s
vroom dplyr TRUE 8.5GB 1.2s 182ms 1ms 1ms 1.2s 1.5s 2.3s 6.5s

(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases have additional overhead when using full Altrep.).

All numeric data

All numeric data is really a worst case scenario for vroom. The index takes about as much memory as the parsed data. Also because parsing doubles can be done quickly in parallel and text representations of doubles are only ~25 characters at most there isn’t a great deal of savings for delayed parsing.

For these reasons (and because the data.table implementation is very fast) vroom is a bit slower than fread for pure numeric data.

However because vroom is multi-threaded it is a bit quicker than readr and read.delim for this type of data.

Long

code: bench/all_numeric-long

reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 4.78GB 1m 51.3s 1.5s 1ms 1ms 1ms 4.6s 39ms 1m 57.4s
readr dplyr 2.82GB 13.3s 85ms 1ms 1ms 4ms 12ms 32ms 13.4s
vroom base FALSE 2.67GB 1.1s 102ms 1ms 1ms 3ms 58ms 55ms 1.4s
vroom dplyr TRUE 3.45GB 261ms 171ms 1ms 1ms 830ms 12ms 38ms 1.3s
vroom dplyr FALSE 2.74GB 1.1s 107ms 1ms 1ms 4ms 11ms 31ms 1.2s
vroom base TRUE 3.21GB 297ms 165ms 1ms 1ms 3ms 55ms 269ms 788ms
data.table data.table 2.99GB 244ms 13ms 1ms 1ms 3ms 6ms 24ms 289ms

Wide

code: bench/all_numeric-wide

reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 14.4GB 8m 43s 167ms 6ms 6ms 9ms 76ms 5ms 8m 43.3s
readr dplyr 5.44GB 56.2s 122ms 13ms 2ms 5ms 5ms 21ms 56.4s
vroom base FALSE 5.45GB 5.3s 168ms 2ms 2ms 5ms 77ms 7ms 5.5s
vroom dplyr TRUE 7.97GB 1s 149ms 3ms 3ms 4.3s 7ms 19ms 5.5s
vroom dplyr FALSE 5.32GB 5.2s 181ms 2ms 2ms 5ms 5ms 18ms 5.4s
data.table data.table 5.8GB 1.3s 112ms 3ms 3ms 4ms 4ms 4ms 1.4s
vroom base TRUE 7.22GB 1s 184ms 4ms 4ms 5ms 89ms 46ms 1.3s

All character data

code: bench/all_character-long

All character data is a best case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 4.39GB 1m 39.3s 8ms 1ms 1ms 2ms 24ms 414ms 1m 39.8s
readr dplyr 4.32GB 1m 2.2s 83ms 1ms 1ms 4ms 15ms 401ms 1m 2.7s
vroom dplyr FALSE 4.29GB 48.7s 108ms 1ms 1ms 4ms 15ms 311ms 49.2s
data.table data.table 4.71GB 37.4s 14ms 1ms 1ms 3ms 12ms 240ms 37.7s
vroom base TRUE 3.2GB 278ms 172ms 1ms 1ms 3ms 169ms 2s 2.6s
vroom dplyr TRUE 3.2GB 275ms 160ms 1ms 1ms 5ms 161ms 1.2s 1.8s

Wide

code: bench/all_character-wide

reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 13.05GB 8m 33.5s 156ms 6ms 6ms 12ms 130ms 64ms 8m 33.9s
readr dplyr 12.21GB 5m 46.8s 127ms 3ms 3ms 10ms 79ms 53ms 5m 47.1s
vroom dplyr FALSE 12.11GB 3m 54s 157ms 3ms 3ms 12ms 64ms 36ms 3m 54.3s
data.table data.table 12.62GB 2m 39.3s 150ms 1ms 1ms 9ms 56ms 24ms 2m 39.6s
vroom dplyr TRUE 6.6GB 1s 138ms 7ms 3ms 30ms 864ms 136ms 2.2s
vroom base TRUE 6.58GB 947ms 174ms 4ms 4ms 5ms 133ms 300ms 1.6s

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns for a total file size of 18.4G.

reading package manipulating package altrep memory read print head tail sample filter aggregate total
readr dplyr 69.9GB 7m 26.2s 84ms 1ms 1ms 3ms 2.2s 11.1s 7m 39.6s
data.table data.table 60.2GB 3m 57.3s 6ms 1ms 1ms 1ms 2.8s 11s 4m 11.1s
vroom dplyr FALSE 54.9GB 3m 35.8s 142ms 1ms 1ms 2ms 2.2s 9.8s 3m 48s
vroom base TRUE 75.6GB 12.8s 135ms 1ms 1ms 1ms 20.5s 2m 5.7s 2m 39.1s
vroom dplyr TRUE 80.2GB 12.7s 161ms 1ms 1ms 11.4s 21.3s 27.5s 1m 13.1s

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns with a total file size of 677M.

Census data benchmarks

code: bench/fwf

reading package manipulating package altrep memory read print head tail sample filter aggregate total
read.delim base 6.1GB 18m 7.4s 16ms 1ms 1ms 3ms 225ms 123ms 18m 7.7s
readr dplyr 5.6GB 26.1s 143ms 1ms 1ms 19ms 99ms 161ms 26.5s
vroom dplyr FALSE 5.35GB 14.5s 116ms 1ms 1ms 7ms 172ms 525ms 15.3s
vroom dplyr TRUE 5.27GB 143ms 147ms 1ms 1ms 2.7s 446ms 461ms 3.9s
vroom base TRUE 3.98GB 144ms 168ms 1ms 1ms 6ms 482ms 1.5s 2.3s

Writing delimited files

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.

compression base data.table readr vroom
gzip 3m 21.3s 1m 6.1s 2m 2.5s 1m 14.5s
multithreaded_gzip 1m 46.1s 8.7s 53.9s 8.5s
zstandard 1m 43.2s NA 51.6s 13.4s
uncompressed 1m 42.2s 1.6s 51.7s 1.7s

Session and package information

package version date source
base 3.6.2 2019-12-12 local
data.table 1.12.8 2019-12-09 RSPM (R 3.6.1)
dplyr 0.8.3 2019-07-04 RSPM (R 3.6.2)
readr 1.3.1 2018-12-21 RSPM (R 3.6.2)
vroom 1.1.0 2020-01-10 local