simdjson by Daniel Lemire (with contributions by Geoff Langdale, John Keiser and many others) is an engineering marvel. Through very clever use of SIMD instructions, it manages to parse JSON files faster than disk access. Wut? Yes, you read that right: parallel processing with so little overhead that the net throughput is limited only by disk speed.
Moreover, it is implemented in neat modern C++ and can be accessed as a header-only library. (Well, one library in two files, really.) Which makes R packaging easy and convenient and compelling. So here we are.
For further introduction, see the arXiv paper by Langdale and Lemire (out/to appear in VLDB Journal 28(6) as well) and/or the video of the recent talk by Daniel Lemire at QCon (voted best talk).
library(RcppSimdJson)
jsonfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")
validateJSON(jsonfile)
A simple benchmark against four other R-accessible JSON parsers:
R> print(res, order="median")
Unit: microseconds
expr min lq mean median uq max neval cld
simdjson 279.246 332.577 390.815 362.11 427.638 648.652 100 a
jsonify 2820.079 2930.945 3064.773 3027.28 3153.427 3986.948 100 b
jsonlite 8899.379 9085.685 9273.974 9226.56 9349.513 10820.562 100 c
RJSONIO 9685.246 9899.634 10185.272 10105.96 10296.579 11766.177 100 d
ndjson 99460.979 100381.388 101758.682 100971.75 102613.041 111553.986 100 e
R> print(res, order="median", unit="relative")
Unit: relative
expr min lq mean median uq max neval cld
simdjson 1.0000 1.00000 1.00000 1.00000 1.00000 1.00000 100 a
jsonify 10.0989 8.81284 7.84201 8.36011 7.37406 6.14651 100 b
jsonlite 31.8693 27.31908 23.72986 25.48003 21.86315 16.68161 100 c
RJSONIO 34.6836 29.76649 26.06165 27.90857 24.07779 18.13943 100 d
ndjson 356.1769 301.82947 260.37585 278.84314 239.95305 171.97817 100 e
R>
Or in chart form:
Note that these timings came from the very beginnings of the package. Admission to CRAN meant turning off one particular optimisation (‘computed GOTOs’) by default, resulting in slightly slower performance. You can get the behaviour back locally by removing the -DSIMDJSON_NO_COMPUTED_GOTO term from src/Makevars.in.
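The script behind the timings above is not reproduced here. As a rough sketch of how such a comparison could be set up, the following assumes the microbenchmark and jsonlite packages are installed and compares only those two validators. The choice of `jsonlite::validate()` as the counterpart, and the omission of the other three parsers, are assumptions made for illustration and not necessarily what the original benchmark did.

```r
library(microbenchmark)   # assumption: timing harness, chosen because the tables above resemble its print output
library(RcppSimdJson)

jsonfile <- system.file("jsonexamples", "twitter.json", package="RcppSimdJson")

# jsonlite::validate() expects the JSON text itself rather than a file name,
# so the file is read once up front and collapsed into a single string
json <- paste(readLines(jsonfile, warn=FALSE), collapse="\n")

res <- microbenchmark(simdjson = RcppSimdJson::validateJSON(jsonfile),
                      jsonlite = jsonlite::validate(json),
                      times = 100)

print(res, order="median")                   # absolute timings
print(res, order="median", unit="relative")  # timings relative to the fastest entry
```

Note that in this sketch the simdjson call also reads the file from disk on every iteration while the jsonlite call works on an in-memory string, so the two are not measured on identical terms.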
Minimally viable. Right now it builds, wraps the validation test, and checks cleanly as an R package. As of version 0.0.4, basic parsing is supported; see parseExample(). Requires a C++17 compiler. Expect changes. But please feel free to contribute.
Any problems, bug reports, or feature requests for the package can be submitted and handled most conveniently as GitHub issues in the repository.
Before submitting pull requests, it is frequently preferable to first discuss need and scope in such an issue ticket. See the file Contributing.md (in the Rcpp repo) for a brief discussion.
For standard JSON work on R, as well as for other nicely done C++ libraries, consider these:
For the R package wrapper, Dirk Eddelbuettel.
For everything pertaining to simdjson, Daniel Lemire (and many contributors).