hdd provides a class of data, hard drive data, that makes it easy to import and manipulate out-of-memory data sets. The data sets are stored on disk but behave like in-memory data, and the syntax for manipulating them is similar to that of data.table. Operations are performed “chunk-wise” behind the scenes. Here is a brief presentation of the main features.
Throughout this document, we will use the example of the Microsoft Academic Graph data (https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/). This large and freely available data set contains meta information on scientific publications (authors, titles, institutions, etc.) as collected by Microsoft and used in Microsoft Academic.
The data comes as a relational database made of text files, most of them large (well over 10GB). We’ll see how to deal with them in R with hdd.
First we have to import the data into R through a hdd data set. We’re interested in importing the information on authors. We’ll use the txt2hdd function. But first let’s have a look at the data.
library(hdd)
peek("_path/authors.txt")
#> Delimiter: TSV
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
---|---|---|---|---|---|---|---|
584 | 20207 | gozde ozdikmenlidemir | Gözde Özdikmenli-Demir | 79946792 | 2 | 0 | 2016-06-24 |
859 | 20146 | gy tolmar | Gy. Tolmár | NA | 2 | 1 | 2016-06-24 |
978 | 17783 | ximena faundez | Ximena Faúndez | 162148367 | 18 | 42 | 2016-06-24 |
1139 | 19190 | jennifer putzi | Jennifer Putzi | NA | 4 | 16 | 2016-06-24 |
1611 | 20207 | ç´”å è²åŽŸ | ç´”å è²åŽŸ | NA | 2 | 0 | 2016-06-24 |
1799 | 18615 | hossein gholaman | Hossein Gholaman | 62318514 | 2 | 179 | 2016-06-24 |
We can see that the data set contains 8 variables; some are text, others are numeric. By default, the function peek also displays the delimiter of the data; here it is a tab-delimited text file.
Now let’s import the data. From the documentation, we can retrieve the variable names, and we will use them:
col_names = c("AuthorId", "Rank", "NormalizedName", "DisplayName",
              "LastKnownAffiliationId",
              "PaperCount", "CitationCount", "CreatedDate")
txt2hdd("_path/authors.txt", # The text file
        # dirDest: The destination of the HDD data => must be a directory
        dirDest = "_path/hdd_authors",
        chunkMB = 500, col_names = col_names)
By default, the types of the variables are set automatically, based on the guess of the function fread from package data.table. Note that 64-bit integer variables are imported as doubles. You can set the column types yourself using the function cols or cols_only from package readr. In any case, if problems occur during the import, a specific repository reporting all of them is created; it is itself a hdd file (in the example, if there were problems, it would be located at "_path/hdd_authors/problems").
The function txt2hdd creates a folder on disk containing the full data set divided into several files named slice_XX.fst, with XX the file number. The .fst files are in a format for fast reading/writing on disk (see the fst homepage). Every time a hdd data set is created, the associated folder on disk includes a file _hdd.txt containing: i) a summary of the data (number of rows/columns, first five observations) and ii) the log of the commands which created it.
Now that the data is imported into a hdd file, let’s have a look at it:
authors = hdd("_path/hdd_authors")
summary(authors)
#> Hard drive data of 13,031 MB. Made of 69 files.
#> Location: C:/Users/laurent.berge/DATA/MAG/HDD/authors/
#> 217,832,703 lines, 8 variables.
The summary provides some general information: the location of the data on the hard drive (note that this is the location on my hard drive!), the size of the data set on disk (here 13GB, which is lower than what it would be in memory, thanks to compression), the number of chunks (here 69), the number of lines (here about 217 million) and the number of variables.
Now let’s have a quick look at the first lines of the data:
head(authors)
#> AuthorId Rank NormalizedName DisplayName
#> 1: 584 20207 gozde ozdikmenlidemir Gözde Özdikmenli-Demir
#> 2: 859 20146 gy tolmar Gy. Tolmár
#> 3: 978 17783 ximena faundez Ximena Faúndez
#> 4: 1139 19190 jennifer putzi Jennifer Putzi
#> 5: 1611 20207 ç´”å\220 è²\235原 <U+7D14><U+5B50> <U+8C9D><U+539F>
#> 6: 1799 18615 hossein gholaman Hossein Gholaman
#> LastKnownAffiliationId PaperCount CitationCount CreatedDate
#> 1: 79946792 2 0 2016-06-24
#> 2: <NA> 2 1 2016-06-24
#> 3: 162148367 18 42 2016-06-24
#> 4: <NA> 4 16 2016-06-24
#> 5: <NA> 2 0 2016-06-24
#> 6: 62318514 2 179 2016-06-24
It’s indeed the same data as in the text file. You can see that you can access it as a regular data frame.
Now assume that you do not want to import the full text file because some information might be unnecessary to you, or that you want to generate new information straight away. You can apply a preprocessing function while importing. Assume we want to import only the first three columns, and only the rows for which the name, in variable NormalizedName, contains only ASCII characters. We could do as follows:
fun_ascii = function(x){
  # selection of the first 3 columns
  res = x[, 1:3]
  # selection of only ascii names
  res[!is.na(iconv(NormalizedName, to = "ASCII"))]
}
col_names = c("AuthorId", "Rank", "NormalizedName", "DisplayName",
              "LastKnownAffiliationId",
              "PaperCount", "CitationCount", "CreatedDate")
txt2hdd("_path/authors.txt", dirDest = "_path/hdd_authors_ascii",
        chunkMB = 500, col_names = col_names,
        preprocessfun = fun_ascii)
Let’s look at the new data set:
authors_ascii = hdd("_path/hdd_authors_ascii")
head(authors_ascii)
#> AuthorId Rank NormalizedName
#> 1: 584 20207 gozde ozdikmenlidemir
#> 2: 859 20146 gy tolmar
#> 3: 978 17783 ximena faundez
#> 4: 1139 19190 jennifer putzi
#> 5: 1799 18615 hossein gholaman
#> 6: 1968 19514 maria isabel lorca martin de villodres
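The filter in fun_ascii relies on iconv returning NA for strings that cannot be converted to the target encoding. The sketch below illustrates this on a small in-memory character vector, using base R only. The vector nm is made up for the illustration, and the source encoding is spelled out with from = "UTF-8", which the import code above does not do. The exact behaviour of iconv can depend on the platform, so take it as an illustration rather than a guarantee.

```r
# Hypothetical names, written for this illustration (not read from the MAG files)
nm = c("gy tolmar", "G\u00f6zde \u00d6zdikmenli-Demir", "jennifer putzi")

# iconv() yields NA when a string cannot be represented in the target encoding
converted = iconv(nm, from = "UTF-8", to = "ASCII")

# keep only the names that survived the conversion
is_ascii = !is.na(converted)
nm_ascii = nm[is_ascii]
print(nm_ascii)
```

The same logical vector is what selects the rows inside the preprocessing function, chunk after chunk.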
You can manipulate the data as with any other data.table, but the extraction method for hdd objects ([.hdd) includes a few extra arguments.
By default, the results are put into memory. Using the previous authors data, let’s find all the author names containing the word “Einstein”:
names_einstein = authors[grepl("\\beinstein\\b", NormalizedName),
NormalizedName]
length(names_einstein)
head(names_einstein)
#> [1] 1700
#> [1] "einstein arulraj" "einstein yehosua"
#> [3] "evans einstein william tulungen" "a einstein"
#> [5] "gilles o einstein" "z einstein"
That’s it, the algorithm has gone through the 217 million rows and found 1700 author names containing “Einstein”. You can see that the command is the same as for a regular data.table.
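The \\b in the pattern stands for a word boundary: it restricts the match to “einstein” as a whole word, so that a name merely containing these letters is not selected. A minimal base R sketch, with made-up names, shows the difference:

```r
# Hypothetical names, written for this illustration
nm = c("a einstein", "b weinstein", "einstein arulraj", "c einsteinium")

# \\b matches the empty string at either edge of a word
is_match = grepl("\\beinstein\\b", nm)
nm[is_match]
```

Without the two \\b, all four names would be selected.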
But what if the result of the query does not fit into memory? You can still perform the query by adding the argument newfile. The result will then be a hdd data set located at the path provided in the argument newfile. As in the "Importing with preprocessing" section, let’s create the data set containing the first three columns and dropping all names with non-ASCII characters. We can do as follows:
authors[!is.na(iconv(NormalizedName, to = "ASCII")), 1:3,
newfile = "_path/hdd_authors_ascii"]
The result is a new hdd data set located in "_path/hdd_authors_ascii", which can be of any size.
A hdd data set is made of several chunks, or files. You can explore each of them individually using the argument file. Further, you can use the special variable .N to refer to the total number of files making up the data set. For example, let’s select the first name of each chunk (or file):
names_first = authors[1, NormalizedName, file = 1:.N]
head(names_first)
#> [1] "gozde ozdikmenlidemir" "louise leiris" "sonja trifunov"
#> [4] "gia avalishvili" "a k fazlullah khan" "jin hwan hong"
When you use the argument file, you can also use the special variable .N in the row index. Here we select the last line of each file:
names_last = authors[.N, NormalizedName, file = 1:.N]
head(names_last)
#> [1] "t de mello cintra lavagnolli" "marcus cieleback"
#> [3] "inmaculada melchor" "wayne tunnicliffe"
#> [5] "erich neu" "s itzchaky"
Of course you can extract a full variable with $, but the algorithm will proceed only if the expected size is not too large. For example, the following code raises an error because the expected size of the variable (about 2.3GB, as reported in the error message) is deemed too large:
author_id = authors$AuthorId
#> Error in `$.hdd`(authors, AuthorId): Cannot extract variable AuthorId because its approximated size (2,341 MB) is greater than the cap of 1,000 MB. You can change the cap using setHdd_extract.cap(new_cap).
By default the cap at which this error is raised is 1GB. To remove the cap, call setHdd_extract.cap(Inf), but then beware of memory issues!
Use the function readfst to read hdd files located on disk into memory. Of course the hdd file should be small enough to fit in memory. An error will be raised if the expected size of the data exceeds the value of getHdd_extract.cap() (default is 1GB), which you can set with setHdd_extract.cap(new_cap). For example:
# to read the full data set into memory:
base_authors = readfst("_path/hdd_authors")
# Alternative way
authors_hdd = hdd("_path/hdd_authors")
base_authors = authors_hdd[]
Imagine you have an in-memory data set to which you want to apply some function, say for instance a cartesian merge, but the result of this function does not fit in memory. The function hdd_slice deals with this situation: it applies the function to slices of the original data and saves the results in a hdd data set. You’ll then be able to deal with the result with hdd.
Let’s have an example with a cartesian merge:
# x: the original data set
# y: the data set you want to merge to x
cartesian_merge = function(x){
  merge(x, y, allow.cartesian = TRUE)
}
hdd_slice(x, fun = cartesian_merge,
          dir = "_path/result_merge", chunkMB = 100)
Here the data x will be split into 100MB chunks and the function cartesian_merge will be applied to each of these chunks. The results will be saved in a hdd data set located in _path/result_merge. You’ll then be able to manipulate the data in _path/result_merge as a regular hdd data set.
This example involved a merging operation only, but you can apply any kind of function (for example, x can be a vector of text and the function can create n-grams, etc.).
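To make the idea concrete, here is a conceptual sketch of the slice, apply and save pattern in base R. It is not how hdd_slice is implemented: the helper slice_apply, the chunk size expressed in rows and the .rds files are all assumptions made for the illustration (hdd_slice sizes its chunks in MB and writes a hdd data set).

```r
# Hypothetical helper: apply `fun` to slices of `x` and save each result on disk
slice_apply = function(x, fun, dir, chunk_rows) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  starts = seq(1, nrow(x), by = chunk_rows)
  for (i in seq_along(starts)) {
    rows = starts[i]:min(starts[i] + chunk_rows - 1, nrow(x))
    res = fun(x[rows, , drop = FALSE])
    # one file per slice, so that the full result never sits in memory
    saveRDS(res, file.path(dir, sprintf("slice_%02d.rds", i)))
  }
  invisible(length(starts))
}

# Toy data, made up for the illustration
x = data.frame(id = 1:6)
y = data.frame(letter = c("a", "b", "c"))

# in base R, merging with by = NULL gives the cartesian product
cartesian_merge = function(x) merge(x, y, by = NULL)

out_dir = file.path(tempdir(), "result_merge")
n_slices = slice_apply(x, cartesian_merge, out_dir, chunk_rows = 2)

# total number of rows across all saved slices
files = list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
n_rows = sum(vapply(files, function(f) nrow(readRDS(f)), integer(1)))
```

Only one slice and its result are held in memory at any time, which is what allows the full result to be larger than the available RAM.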
Manipulating in-memory data will always be orders of magnitude faster than manipulating on-disk data. This comes from the simple fact that read/write operations on disk are about 100 times slower than read/write operations in RAM; in addition, reading and writing on disk also involves compression/decompression, which increases CPU use. It is however the only way to deal with very large data sets (except of course if you have very deep pockets allowing you to buy computers with a lot of RAM!).
This means that as soon as your final data set reaches a memory-workable size, you should stop using hdd and start using regular R. Package hdd exists to make the transition from too-large-a-data-set to a memory-workable-data-set and is not intended to be a tool for regular data manipulation.
Since hdd data sets are split into multiple files, the user cannot perform aggregate operations on some variable (i.e. using the by clause in data.table language) and obtain “valid” results. Indeed, the aggregate operations will be performed chunk by chunk and not on the entirety of the data set (which is not possible because of its size).
To circumvent this issue, the data set must be sorted by the variable(s) on which the aggregation is done, in which case the chunk-by-chunk operations will be valid. To sort hdd data sets, the function hdd_setkey has been created; in particular, it ensures that the keys do not spill across multiple files (so that the chunk-by-chunk aggregation stays consistent). But beware, it is extremely slow (it involves copying the full data set on disk multiple times).
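The issue can be reproduced with a small in-memory example. The sketch below is base R only and does not use hdd: the data, the two-chunk split and the helper count_by_chunk are made up for the illustration. It counts the rows of each group chunk by chunk, first on unsorted data, then on data sorted by the key and cut so that no key spans two chunks.

```r
# Toy data, made up for the illustration
dat = data.frame(g = c("a", "b", "a", "b", "a", "c"), v = 1:6,
                 stringsAsFactors = FALSE)

# Hypothetical helper: aggregate each chunk separately, then stack the results
count_by_chunk = function(chunks) {
  res = lapply(chunks, function(d) {
    as.data.frame(table(g = d$g), responseName = "n")
  })
  do.call(rbind, res)
}

# Unsorted data cut into two chunks of three rows: groups spill across chunks
chunks_raw = split(dat, rep(1:2, each = 3))
agg_raw = count_by_chunk(chunks_raw)
has_split_groups = any(duplicated(agg_raw$g))

# Sorted by key and cut at a key boundary: each group lives in a single chunk
dat_sorted = dat[order(dat$g), ]
chunk_id = ifelse(dat_sorted$g == "a", 1, 2)
chunks_sorted = split(dat_sorted, chunk_id)
agg_sorted = count_by_chunk(chunks_sorted)
```

In agg_raw the same group shows up once per chunk with a partial count, whereas in agg_sorted each group appears exactly once with its full count.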
On the Microsoft Academic Graph data:
Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. DOI: http://dx.doi.org/10.1145/2740908.2742839