Authors: David Robinson
License: GPL-2
Download and process public domain works from the Project Gutenberg collection. Includes:
gutenberg_download(), which downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
gutenberg_metadata, which contains information about each work, pairing Gutenberg ID with title, author, language, etc.
gutenberg_authors, which contains information about each author, such as aliases and birth/death year.
gutenberg_subjects, which contains pairings of works with Library of Congress subjects and topics.
Install the package with:
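A minimal sketch of that step, assuming the released version on CRAN is the one wanted:

```r
# Install the released version of gutenbergr from CRAN
install.packages("gutenbergr")
```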
Or install the development version using devtools with:
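A sketch of the development install, assuming devtools is not yet installed and that the source lives at the GitHub path "ropensci/gutenbergr" (the path is an assumption; check the project page for the current location):

```r
# devtools provides install_github(); install it first if it is missing
install.packages("devtools")

# Assumption: the development repository is ropensci/gutenbergr
devtools::install_github("ropensci/gutenbergr")
```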
The gutenberg_works() function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered.)
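To make the default filtering concrete, the sketch below compares the filtered table with the full dataset. It assumes the installed version of gutenbergr accepts a languages argument in gutenberg_works(). Row counts are not shown because they depend on the version of the bundled metadata.

```r
library(gutenbergr)

# Default call: unique English-language works that have text
english_works <- gutenberg_works()

# Assumption: the `languages` argument selects another language (here French)
french_works <- gutenberg_works(languages = "fr")

# Compare the unfiltered dataset with the two filtered tables
nrow(gutenberg_metadata)
nrow(english_works)
nrow(french_works)
```

Comparing the three counts gives a rough sense of how much the default filters remove.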
Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We could find the book’s ID by filtering:
library(dplyr)
library(gutenbergr)
gutenberg_works() %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <chr>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <chr> <lgl>
#> 1 Gothic Fiction/Movie Books/Best Books Ever Listings Public domain in the USA. TRUE
# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 x 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <chr>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <chr> <lgl>
#> 1 Gothic Fiction/Movie Books/Best Books Ever Listings Public domain in the USA. TRUE
Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:
wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,085 x 2
#> gutenberg_id text
#> <int> <chr>
#> 1 768 WUTHERING HEIGHTS
#> 2 768 ""
#> 3 768 ""
#> 4 768 CHAPTER I
#> 5 768 ""
#> 6 768 ""
#> 7 768 1801.--I have just returned from a visit to my landlord--the solitary
#> 8 768 neighbour that I shall be troubled with. This is certainly a beautiful
#> 9 768 country! In all England, I do not believe that I could have fixed on a
#> 10 768 situation so completely removed from the stir of society. A perfect
#> # … with 12,075 more rows
gutenberg_download() can download multiple books when given multiple IDs. It also takes a meta_fields argument that adds variables from the metadata.
# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 32,744 x 3
#> gutenberg_id text
#> <int> <chr>
#> 1 768 WUTHERING HEIGHTS
#> 2 768 ""
#> 3 768 ""
#> 4 768 CHAPTER I
#> 5 768 ""
#> 6 768 ""
#> 7 768 1801.--I have just returned from a visit to my landlord--the solitary
#> 8 768 neighbour that I shall be troubled with. This is certainly a beautiful
#> 9 768 country! In all England, I do not believe that I could have fixed on a
#> 10 768 situation so completely removed from the stir of society. A perfect
#> title
#> <chr>
#> 1 Wuthering Heights
#> 2 Wuthering Heights
#> 3 Wuthering Heights
#> 4 Wuthering Heights
#> 5 Wuthering Heights
#> 6 Wuthering Heights
#> 7 Wuthering Heights
#> 8 Wuthering Heights
#> 9 Wuthering Heights
#> 10 Wuthering Heights
#> # … with 32,734 more rows
books %>%
  count(title)
#> # A tibble: 2 x 2
#> title n
#> <chr> <int>
#> 1 Jane Eyre: An Autobiography 20659
#> 2 Wuthering Heights 12085
gutenberg_download() can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle’s works, each annotated with both gutenberg_id and title, using:
aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")
aristotle_books
#> # A tibble: 39,950 x 3
#> gutenberg_id text
#> <int> <chr>
#> 1 1974 THE POETICS OF ARISTOTLE
#> 2 1974 ""
#> 3 1974 By Aristotle
#> 4 1974 ""
#> 5 1974 A Translation By S. H. Butcher
#> 6 1974 ""
#> 7 1974 ""
#> 8 1974 [Transcriber's Annotations and Conventions: the translator left
#> 9 1974 intact some Greek words to illustrate a specific point of the original
#> 10 1974 discourse. In this transcription, in order to retain the accuracy of
#> title
#> <chr>
#> 1 The Poetics of Aristotle
#> 2 The Poetics of Aristotle
#> 3 The Poetics of Aristotle
#> 4 The Poetics of Aristotle
#> 5 The Poetics of Aristotle
#> 6 The Poetics of Aristotle
#> 7 The Poetics of Aristotle
#> 8 The Poetics of Aristotle
#> 9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # … with 39,940 more rows
You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
(Note the format_reverse function for reversing “Last, First” names.)
See the data-raw directory for the scripts that generate these datasets. As of now, these were generated from the Project Gutenberg catalog on 05 May 2016.
The package respects Project Gutenberg's rules regarding robot access and complies to the best of our ability. Namely:
http://www.gutenberg.lib.md.us/8/84/84.zip.
Still, this package is not the right way to download the entire Project Gutenberg corpus (or all works in a particular language). For that, follow Project Gutenberg's recommendation to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic.
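The example URL for book 84 suggests how a work's ID maps to a path on a mirror. The following is a rough sketch of that mapping, not the package's actual code. The helper name gutenberg_zip_url is hypothetical, and the directory rule (one directory per digit of the ID except the last, with "0" used for single-digit IDs) is an assumption generalized from that single example.

```r
# Hypothetical helper: build a mirror URL for the .zip of one work.
# Assumption: the path is one directory per leading digit of the ID,
# then a directory named after the ID, then "<id>.zip".
gutenberg_zip_url <- function(id, mirror = "http://www.gutenberg.lib.md.us") {
  id_chr <- as.character(as.integer(id))   # avoid scientific notation
  digits <- strsplit(id_chr, "")[[1]]
  # Single-digit IDs are assumed to live under a "0" directory
  dirs <- if (length(digits) == 1) "0" else digits[-length(digits)]
  paste(c(mirror, dirs, id_chr, paste0(id_chr, ".zip")), collapse = "/")
}

gutenberg_zip_url(84)
#> [1] "http://www.gutenberg.lib.md.us/8/84/84.zip"
```

For everyday use, gutenberg_download() handles mirror selection and retrieval itself, so a helper like this is only useful for understanding what is being requested.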
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.