How it works

The initial reading of the file simply records the locations of each individual record, the actual values are not read into R. Altrep vectors are created for each column in the data which hold a pointer to the index and the memory mapped file. When these vectors are indexed the value is read from the memory mapping.

This means initial reading is extremely fast, in the real world dataset below it is ~ 1/4 the time of the multi-threaded data.table::fread(). Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method, calling head(), tail() x[sample(), ] etc. have very low overhead. Filtering also can be fast, only the columns included in the filter selection have to be fully read and only the data in the filtered rows needs to be read from the remaining columns. Grouped aggregations likewise only need to read the grouping variables and the variables aggregated.

Once a particular vector is fully materialized the speed for all subsequent operations should be identical to a normal R vector.

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading the benchmarks also do some manipulation of the data afterwards to try and provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others they are excluded from the plots, but are retained in the tables.

Taxi Trip Dataset

This real world dataset is from Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission 2013, originally posted at http://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table trip_fare_1.csv is 1.55G in size.

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

All benchmarks were run on a Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume type.

The benchmarks labeled vroom_base uses vroom with base functions for manipulation. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate. data.table uses fread() to read the file and data.table functions to manipulate and readr uses readr to read the file and dplyr to manipulate. By default vroom only uses Altrep for character vectors, these are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types and vroom(altrep: none) disable Altrep entirely.

The following operations are performed.

The data is read
print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
head()
tail()
Sampling 100 random rows
Filtering for “UNK” payment, this is 6434 rows (0.0435% of total).
Aggregation of mean fare amount per payment type.

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		6.07GB	1m 6.7s	6ms	1ms	1ms	1ms	314ms	950ms	1m 8s
readr	dplyr		5.82GB	29s	85ms	1ms	1ms	3ms	191ms	733ms	30s
vroom	dplyr	FALSE	6.18GB	15.6s	98ms	1ms	1ms	2ms	840ms	838ms	17.4s
data.table	data.table		6.47GB	11.9s	12ms	1ms	1ms	1ms	133ms	218ms	12.2s
vroom	base	TRUE	7.8GB	1.1s	125ms	1ms	1ms	1ms	1.4s	8.7s	11.3s
vroom	dplyr	TRUE	8.5GB	1.2s	182ms	1ms	1ms	1.2s	1.5s	2.3s	6.5s

(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases have additional overhead when using full Altrep.).

All numeric data

All numeric data is really a worst case scenario for vroom. The index takes about as much memory as the parsed data. Also because parsing doubles can be done quickly in parallel and text representations of doubles are only ~25 characters at most there isn’t a great deal of savings for delayed parsing.

For these reasons (and because the data.table implementation is very fast) vroom is a bit slower than fread for pure numeric data.

However because vroom is multi-threaded it is a bit quicker than readr and read.delim for this type of data.

Long

code: bench/all_numeric-long

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		4.78GB	1m 51.3s	1.5s	1ms	1ms	1ms	4.6s	39ms	1m 57.4s
readr	dplyr		2.82GB	13.3s	85ms	1ms	1ms	4ms	12ms	32ms	13.4s
vroom	base	FALSE	2.67GB	1.1s	102ms	1ms	1ms	3ms	58ms	55ms	1.4s
vroom	dplyr	TRUE	3.45GB	261ms	171ms	1ms	1ms	830ms	12ms	38ms	1.3s
vroom	dplyr	FALSE	2.74GB	1.1s	107ms	1ms	1ms	4ms	11ms	31ms	1.2s
vroom	base	TRUE	3.21GB	297ms	165ms	1ms	1ms	3ms	55ms	269ms	788ms
data.table	data.table		2.99GB	244ms	13ms	1ms	1ms	3ms	6ms	24ms	289ms

Wide

code: bench/all_numeric-wide

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		14.4GB	8m 43s	167ms	6ms	6ms	9ms	76ms	5ms	8m 43.3s
readr	dplyr		5.44GB	56.2s	122ms	13ms	2ms	5ms	5ms	21ms	56.4s
vroom	base	FALSE	5.45GB	5.3s	168ms	2ms	2ms	5ms	77ms	7ms	5.5s
vroom	dplyr	TRUE	7.97GB	1s	149ms	3ms	3ms	4.3s	7ms	19ms	5.5s
vroom	dplyr	FALSE	5.32GB	5.2s	181ms	2ms	2ms	5ms	5ms	18ms	5.4s
data.table	data.table		5.8GB	1.3s	112ms	3ms	3ms	4ms	4ms	4ms	1.4s
vroom	base	TRUE	7.22GB	1s	184ms	4ms	4ms	5ms	89ms	46ms	1.3s

All character data

code: bench/all_character-long

All character data is a best case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		4.39GB	1m 39.3s	8ms	1ms	1ms	2ms	24ms	414ms	1m 39.8s
readr	dplyr		4.32GB	1m 2.2s	83ms	1ms	1ms	4ms	15ms	401ms	1m 2.7s
vroom	dplyr	FALSE	4.29GB	48.7s	108ms	1ms	1ms	4ms	15ms	311ms	49.2s
data.table	data.table		4.71GB	37.4s	14ms	1ms	1ms	3ms	12ms	240ms	37.7s
vroom	base	TRUE	3.2GB	278ms	172ms	1ms	1ms	3ms	169ms	2s	2.6s
vroom	dplyr	TRUE	3.2GB	275ms	160ms	1ms	1ms	5ms	161ms	1.2s	1.8s

Wide

code: bench/all_character-wide

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		13.05GB	8m 33.5s	156ms	6ms	6ms	12ms	130ms	64ms	8m 33.9s
readr	dplyr		12.21GB	5m 46.8s	127ms	3ms	3ms	10ms	79ms	53ms	5m 47.1s
vroom	dplyr	FALSE	12.11GB	3m 54s	157ms	3ms	3ms	12ms	64ms	36ms	3m 54.3s
data.table	data.table		12.62GB	2m 39.3s	150ms	1ms	1ms	9ms	56ms	24ms	2m 39.6s
vroom	dplyr	TRUE	6.6GB	1s	138ms	7ms	3ms	30ms	864ms	136ms	2.2s
vroom	base	TRUE	6.58GB	947ms	174ms	4ms	4ms	5ms	133ms	300ms	1.6s

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns for a total file size of 18.4G.

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
readr	dplyr		69.9GB	7m 26.2s	84ms	1ms	1ms	3ms	2.2s	11.1s	7m 39.6s
data.table	data.table		60.2GB	3m 57.3s	6ms	1ms	1ms	1ms	2.8s	11s	4m 11.1s
vroom	dplyr	FALSE	54.9GB	3m 35.8s	142ms	1ms	1ms	2ms	2.2s	9.8s	3m 48s
vroom	base	TRUE	75.6GB	12.8s	135ms	1ms	1ms	1ms	20.5s	2m 5.7s	2m 39.1s
vroom	dplyr	TRUE	80.2GB	12.7s	161ms	1ms	1ms	11.4s	21.3s	27.5s	1m 13.1s

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns with a total file size of 677M.

Census data benchmarks

code: bench/fwf

reading package	manipulating package	altrep	memory	read	print	head	tail	sample	filter	aggregate	total
read.delim	base		6.1GB	18m 7.4s	16ms	1ms	1ms	3ms	225ms	123ms	18m 7.7s
readr	dplyr		5.6GB	26.1s	143ms	1ms	1ms	19ms	99ms	161ms	26.5s
vroom	dplyr	FALSE	5.35GB	14.5s	116ms	1ms	1ms	7ms	172ms	525ms	15.3s
vroom	dplyr	TRUE	5.27GB	143ms	147ms	1ms	1ms	2.7s	446ms	461ms	3.9s
vroom	base	TRUE	3.98GB	144ms	168ms	1ms	1ms	6ms	482ms	1.5s	2.3s

Writing delimited files

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.

An uncompressed file
A gzip compressed file using gzfile() (readr and vroom do this automatically for files ending in .gz)
A gzip compressed file compressed with multiple threads (natively for data.table and using a pipe() connection to pigz for the rest).
A Zstandard compressed file (data.table does not support this format).

compression	base	data.table	readr	vroom
gzip	3m 21.3s	1m 6.1s	2m 2.5s	1m 14.5s
multithreaded_gzip	1m 46.1s	8.7s	53.9s	8.5s
zstandard	1m 43.2s	NA	51.6s	13.4s
uncompressed	1m 42.2s	1.6s	51.7s	1.7s

Session and package information

package	version	date	source
base	3.6.2	2019-12-12	local
data.table	1.12.8	2019-12-09	RSPM (R 3.6.1)
dplyr	0.8.3	2019-07-04	RSPM (R 3.6.2)
readr	1.3.1	2018-12-21	RSPM (R 3.6.2)
vroom	1.1.0	2020-01-10	local

Vroom Benchmarks

How it works

Reading delimited files

Taxi Trip Dataset

Taxi Benchmarks

All numeric data

Long

Wide

All character data

Long

Wide

Reading multiple delimited files

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

Census data benchmarks

Writing delimited files

Session and package information