RcppSimdJson: Rcpp Bindings for the simdjson Header Library


Motivation

simdjson by Daniel Lemire (with contributions by Geoff Langdale, John Keiser and many others) is an engineering marvel. Through very clever use of SIMD instructions, it manages to parse JSON files faster than disk access. Wut? Yes, you read that right: parallel processing with so little overhead that the net throughput is limited only by disk speed.

Moreover, it is implemented in neat modern C++ and can be accessed as a header-only library. (Well, one library in two files, really.) Which makes R packaging easy and convenient and compelling. So here we are.

For further introduction, see the arXiv paper by Langdale and Lemire (out/to appear in VLDB Journal 28(6) as well) and/or the video of the recent talk by Daniel Lemire at QCon (voted best talk).

Example

library(RcppSimdJson)
# locate the example file shipped with the package, then validate it
jsonfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")
validateJSON(jsonfile)

Comparison

A simple benchmark against four other R-accessible JSON parsers:

R> print(res, order="median")
Unit: microseconds
     expr       min         lq       mean    median         uq        max neval   cld
 simdjson   279.246    332.577    390.815    362.11    427.638    648.652   100 a    
  jsonify  2820.079   2930.945   3064.773   3027.28   3153.427   3986.948   100  b   
 jsonlite  8899.379   9085.685   9273.974   9226.56   9349.513  10820.562   100   c  
  RJSONIO  9685.246   9899.634  10185.272  10105.96  10296.579  11766.177   100    d
   ndjson 99460.979 100381.388 101758.682 100971.75 102613.041 111553.986   100     e
R> print(res, order="median", unit="relative")
Unit: relative
     expr      min        lq      mean    median        uq       max neval   cld
 simdjson   1.0000   1.00000   1.00000   1.00000   1.00000   1.00000   100 a    
  jsonify  10.0989   8.81284   7.84201   8.36011   7.37406   6.14651   100  b   
 jsonlite  31.8693  27.31908  23.72986  25.48003  21.86315  16.68161   100   c  
  RJSONIO  34.6836  29.76649  26.06165  27.90857  24.07779  18.13943   100    d
   ndjson 356.1769 301.82947 260.37585 278.84314 239.95305 171.97817   100     e
R>

Or in chart form:

Note that these timings came from the very beginnings of the package. Admission to CRAN meant turning off one particular optimisation (‘computed GOTOs’) by default, resulting in slightly slower performance. You can get the behaviour back locally by removing the -DSIMDJSON_NO_COMPUTED_GOTO term from src/Makevars.in.
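
One way to make that local change is sketched below, assuming a POSIX shell with sed available. The helper name strip_computed_goto_flag is hypothetical and not part of the package; it only removes the flag from whatever file it is given and leaves the rest of the line alone.

```shell
# Hypothetical helper (not part of RcppSimdJson): remove the
# -DSIMDJSON_NO_COMPUTED_GOTO term, plus any whitespace directly
# before it, from the file named by the first argument.
strip_computed_goto_flag() {
    file="$1"
    # Write to a temporary file first so the original is only
    # replaced if sed succeeds (avoids non-portable 'sed -i').
    sed -e 's/[[:space:]]*-DSIMDJSON_NO_COMPUTED_GOTO//g' "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
}
```

From the top of a source checkout, this would be called as strip_computed_goto_flag src/Makevars.in, followed by a reinstall from source (for example with R CMD INSTALL .) so that the change takes effect.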

Status

Minimally viable. Right now it builds, wraps the validation test, and checks cleanly as an R package. As of version 0.0.4, basic parsing is supported; see parseExample(). Requires a C++17 compiler. Expect changes. But please feel free to contribute.

Contributing

Any problems, bug reports, or feature requests for the package can be submitted and handled most conveniently as GitHub issues in the repository.

Before submitting pull requests, it is frequently preferable to first discuss need and scope in such an issue ticket. See the file Contributing.md (in the Rcpp repo) for a brief discussion.

See Also

For standard JSON work on R, as well as for other nicely done C++ libraries, consider these:

Author

For the R package wrapper, Dirk Eddelbuettel.

For everything pertaining to simdjson, Daniel Lemire (and many contributors).