
Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Get chunks of XML articles

Package API

The main workhorse function is pub_chunks(). It lets you pull out sections of articles from many different publishers (see the next section) without having to know how to parse or navigate XML. XML has a steep learning curve, and it can take quite a bit of searching to work out how to reach different parts of an XML document.

The other main function is pub_tabularize(), which takes the output of pub_chunks() and coerces it into a data.frame for easier downstream processing.

Supported publishers/sources

If you know of other publishers or sources that provide XML, let us know by opening an issue.

We’ll continue adding publishers.


Stable version
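
A minimal sketch, assuming the package is published on CRAN under the name pubchunks (the name used in the system.file() calls in this README):

```r
# Install the released version from CRAN
install.packages("pubchunks")
```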


Development version from GitHub
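
A sketch of installing the development version. The repository path ropensci/pubchunks and the use of the remotes package are assumptions; adjust them if the source lives elsewhere.

```r
# Assumption: the source repository is ropensci/pubchunks on GitHub
# install.packages("remotes")  # run first if remotes is not installed
remotes::install_github("ropensci/pubchunks")
```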


Load library
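
Assuming the package is installed, attach it so that pub_chunks() and pub_tabularize() can be called without a prefix:

```r
# Attach the package for the examples that follow
library("pubchunks")
```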


Working with files

x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", 
  package = "pubchunks")
pub_chunks(x, "abstract")
#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: abstract
#>   showing up to first 5: 
#>    abstract (n=1): Abstract
#>                   This pa ...
pub_chunks(x, "title")
#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...
pub_chunks(x, "authors")
#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: authors
#>   showing up to first 5: 
#>    authors (n=1): Chetaev, D.N
pub_chunks(x, c("title", "refs"))
#> <pub chunks>
#>   from: file
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title, refs
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...
#>    refs (n=6): 1.G.N.WatsonTeoriia besselevykh funktsiiTheory of

The output of pub_chunks() is a list with an S3 class pub_chunks to make internal work in the package easier. You can easily see the list structure by using unclass().
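
As a rough illustration of the point above, the following sketch strips the class and inspects the underlying list. It reuses the example file shipped with the package; the choice of sections and the use of str() are only for demonstration.

```r
library("pubchunks")

# Example file shipped with the package (same file as in the examples above)
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
res <- pub_chunks(x, c("title", "authors"))

# The result is a list carrying the S3 class pub_chunks
inherits(res, "pub_chunks")

# Dropping the class bypasses the custom print method;
# str() gives a compact view of the top-level list structure
str(unclass(res), max.level = 1)
```

Because the result is an ordinary list underneath, the usual list tools (names(), `[[`, lapply()) should work on the unclassed object.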

Working with XML already in a string

xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
#> <pub chunks>
#>   from: character
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...

Working with an xml2 class object

xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
#> <pub chunks>
#>   from: xml_document
#>   publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#>   sections: title
#>   showing up to first 5: 
#>    title (n=1): On the driving of a piston with a rigid collar int ...

Working with output of fulltext::ft_get()

x <- fulltext::ft_get('10.1371/journal.pone.0086169', from='plos')
pub_chunks(fulltext::ft_collect(x), sections="authors")
#> $plos
#> $plos$`10.1371/journal.pone.0086169`
#> <pub chunks>
#>   from: xml_document
#>   publisher/journal: plos/PLoS ONE
#>   sections: authors
#>   showing up to first 5: 
#>    authors (n=4): nested list
#> attr(,"ft_data")
#> [1] TRUE

Coerce pub_chunks output into data.frames

x <- system.file("examples/elife_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("doi", "title", "keywords"))
pub_tabularize(res)
#>                   doi                                          title
#> 1 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 2 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 3 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 4 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 5 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 6 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#>                       keywords .publisher
#> 1                     microRNA      elife
#> 2            nonsense mutation      elife
#> 3 nonsense-mediated mRNA decay      elife
#> 4                          APC      elife
#> 5             intron retention      elife
#> 6  premature termination codon      elife

Get a random XML article


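library(rcrossref)
library(dplyr)
# Search Crossref for works that expose a full-text XML link
res <- cr_works(filter = list(
    full_text_type = "application/xml"))
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")
download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)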
#> <pub chunks>
#>   from: file
#>   publisher/journal: unknown/NA
#>   sections: all
#>   showing up to first 5: 
#>    front (n=0): 
#>    body (n=0): 
#>    back (n=0): 
#>    title (n=0): 
#>    doi (n=0):
download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
#> <pub chunks>
#>   from: file
#>   publisher/journal: pensoft/ZooKeys
#>   sections: all
#>   showing up to first 5: 
#>    front (n=3): nested list
#>    body (n=31): The thermal spring Khakusy is located one kilomete
#>    back (n=2): nested list
#>    title (n=1): Description of a new species Gyraulus (Pulmonata:  ...
#>    doi (n=1): 10.3897/zookeys.762.23661
download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)
#> <pub chunks>
#>   from: file
#>   publisher/journal: pensoft/ZooKeys
#>   sections: all
#>   showing up to first 5: 
#>    front (n=3): nested list
#>    body (n=34): Approximately 2300 species of scorpions have been 
#>    back (n=2): nested list
#>    title (n=1): A new Sky Island species of Vaejovis C. L. Koch, 1 ...
#>    doi (n=1): 10.3897/zookeys.760.22714

