The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It provides both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. The package includes:
gutenberg_download(), a function that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
gutenberg_metadata, which contains information about each work, pairing Gutenberg ID with title, author, language, etc.
gutenberg_authors, which contains information about each author, such as aliases and birth/death year.
gutenberg_subjects, which contains pairings of works with Library of Congress subjects and topics.
This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.:
library(gutenbergr)
gutenberg_metadata
## # A tibble: 51,997 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 0 <NA> <NA> NA en <NA> Publi…
## 2 1 "The … Jeffer… 1638 en United States L… Publi…
## 3 2 "The … United… 1 en American Revolu… Publi…
## 4 3 "John… Kenned… 1666 en <NA> Publi…
## 5 4 "Linc… Lincol… 3 en US Civil War Publi…
## 6 5 "The … United… 1 en American Revolu… Publi…
## 7 6 "Give… Henry,… 4 en American Revolu… Publi…
## 8 7 "The … <NA> NA en <NA> Publi…
## 9 8 "Abra… Lincol… 3 en US Civil War Publi…
## 10 9 "Abra… Lincol… 3 en US Civil War Publi…
## # … with 51,987 more rows, and 1 more variable: has_text <lgl>
For example, you could find the Gutenberg ID of Wuthering Heights by doing:
library(dplyr)
gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## # A tibble: 1 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_booksh… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 768 Wuthe… Brontë… 405 en Gothic Fiction/M… Publi…
## # … with 1 more variable: has_text <lgl>
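Exact matching on title can miss editions whose titles carry a subtitle or different capitalization. As a hedged sketch (the case-insensitive partial match is an illustration, not something the package prescribes), one alternative might look like this:

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# Case-insensitive partial match on the title.
# Rows with a missing title are dropped first so str_detect() never sees NA.
wuthering <- gutenberg_metadata %>%
  filter(!is.na(title)) %>%
  filter(str_detect(title, regex("wuthering heights", ignore_case = TRUE))) %>%
  select(gutenberg_id, title, author, has_text)

wuthering
```

A partial match may return several rows (other editions or related works), so it is worth inspecting the result before picking an ID.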
In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:
gutenberg_works()
## # A tibble: 40,737 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 0 <NA> <NA> NA en <NA> Publi…
## 2 1 "The … Jeffer… 1638 en United States L… Publi…
## 3 2 "The … United… 1 en American Revolu… Publi…
## 4 3 "John… Kenned… 1666 en <NA> Publi…
## 5 4 "Linc… Lincol… 3 en US Civil War Publi…
## 6 5 "The … United… 1 en American Revolu… Publi…
## 7 6 "Give… Henry,… 4 en American Revolu… Publi…
## 8 7 "The … <NA> NA en <NA> Publi…
## 9 8 "Abra… Lincol… 3 en US Civil War Publi…
## 10 9 "Abra… Lincol… 3 en US Civil War Publi…
## # … with 40,727 more rows, and 1 more variable: has_text <lgl>
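To make the pre-filtering concrete, the sketch below approximates it by hand with dplyr. It reflects an assumption about what the defaults amount to (English only, has_text is TRUE, one row per title and author); it is not a copy of the package's implementation, so its row count may differ from that of gutenberg_works().

```r
library(gutenbergr)
library(dplyr)

# Hand-rolled approximation of the default pre-filtering (illustrative only)
manual_works <- gutenberg_metadata %>%
  # English works whose text can be downloaded
  filter(language == "en", has_text) %>%
  # Keep one row per title/author pair
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

nrow(manual_works)
```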
It also allows you to perform filtering as an argument:
gutenberg_works(author == "Austen, Jane")
## # A tibble: 10 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 105 "Persu… Auste… 68 en <NA> Publi…
## 2 121 "North… Auste… 68 en Gothic Fiction Publi…
## 3 141 "Mansf… Auste… 68 en <NA> Publi…
## 4 158 "Emma" Auste… 68 en <NA> Publi…
## 5 161 "Sense… Auste… 68 en <NA> Publi…
## 6 946 "Lady … Auste… 68 en <NA> Publi…
## 7 1212 "Love … Auste… 68 en <NA> Publi…
## 8 1342 "Pride… Auste… 68 en Best Books Ever… Publi…
## 9 31100 "The C… Auste… 68 en <NA> Publi…
## 10 42078 "The L… Auste… 68 en <NA> Publi…
## # … with 1 more variable: has_text <lgl>
# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
## # A tibble: 13 x 8
## gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
## <int> <chr> <chr> <int> <chr> <chr> <chr>
## 1 105 "Persu… Auste… 68 en <NA> Publi…
## 2 121 "North… Auste… 68 en Gothic Fiction Publi…
## 3 141 "Mansf… Auste… 68 en <NA> Publi…
## 4 158 "Emma" Auste… 68 en <NA> Publi…
## 5 161 "Sense… Auste… 68 en <NA> Publi…
## 6 946 "Lady … Auste… 68 en <NA> Publi…
## 7 1212 "Love … Auste… 68 en <NA> Publi…
## 8 1342 "Pride… Auste… 68 en Best Books Ever… Publi…
## 9 17797 "Memoi… Auste… 7603 en <NA> Publi…
## 10 31100 "The C… Auste… 68 en <NA> Publi…
## 11 33513 "The F… Auste… 36446 en <NA> Publi…
## 12 39897 "Disco… Layar… 40288 en <NA> Publi…
## 13 42078 "The L… Auste… 68 en <NA> Publi…
## # … with 1 more variable: has_text <lgl>
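Note that the unanchored pattern also matches other authors whose names contain "Austen": rows 9, 11 and 12 above carry different author IDs. One way to narrow the match, assuming author names follow the "Last, First" convention seen in the output, is to anchor the pattern to the start of the string. This is a sketch rather than a recommended recipe:

```r
library(gutenbergr)
library(stringr)

# "^Austen," only matches authors whose surname is exactly "Austen"
austen_surname <- gutenberg_works(str_detect(author, "^Austen,"))

# Inspect which distinct authors were matched
unique(austen_surname$author)
```

Filtering on gutenberg_author_id instead of the name would be stricter still, since the ID identifies a single author.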
The metadata currently in the package was last updated on 05 May 2016.
The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we earlier saw that “Wuthering Heights” has ID 768 (see the URL here), so gutenberg_download(768) downloads this text.
# Read from the copy bundled with the package instead of the website
f768 <- system.file("extdata", "768.zip", package = "gutenbergr")
wuthering_heights <- gutenberg_download(768,
                                        files = f768,
                                        mirror = "http://aleph.gutenberg.org")
# Or download directly from Project Gutenberg
wuthering_heights <- gutenberg_download(768)
wuthering_heights
## # A tibble: 12,085 x 2
## gutenberg_id text
## <int> <chr>
## 1 768 "WUTHERING HEIGHTS"
## 2 768 ""
## 3 768 ""
## 4 768 "CHAPTER I"
## 5 768 ""
## 6 768 ""
## 7 768 "1801.--I have just returned from a visit to my landlord--the s…
## 8 768 "neighbour that I shall be troubled with. This is certainly a …
## 9 768 "country! In all England, I do not believe that I could have f…
## 10 768 "situation so completely removed from the stir of society. A p…
## # … with 12,075 more rows
Notice that the result is a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line. Notice also that the header and footer added by Project Gutenberg (visible here) have been stripped away.
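Because the text comes back one row per line, structural markers such as chapter headings can be turned into columns. The sketch below uses a small hand-made tibble as a stand-in for the download result (so it runs without network access) and assumes chapter headings start with "CHAPTER", as in the output above. The column names linenumber and chapter are choices made for this illustration.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the output of gutenberg_download(): one row per line of text
book <- tibble(
  gutenberg_id = 768L,
  text = c("WUTHERING HEIGHTS", "", "CHAPTER I", "",
           "1801.--I have just returned from a visit to my landlord",
           "CHAPTER II", "", "Yesterday afternoon set in misty and cold.")
)

# Add a running line number and a chapter counter that increments
# each time a line starts with "CHAPTER "
annotated <- book %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, "^CHAPTER ")))

annotated
```

Lines before the first heading get chapter 0, which makes front matter easy to filter out.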
Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:
# Read both books from the copies bundled with the package
f1260 <- system.file("extdata", "1260.zip", package = "gutenbergr")
books <- gutenberg_download(c(768, 1260),
                            meta_fields = "title",
                            files = c(f768, f1260),
                            mirror = "http://aleph.gutenberg.org")
# Or download both directly from Project Gutenberg
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
## # A tibble: 32,744 x 3
## gutenberg_id text title
## <int> <chr> <chr>
## 1 768 "WUTHERING HEIGHTS" Wuthering H…
## 2 768 "" Wuthering H…
## 3 768 "" Wuthering H…
## 4 768 "CHAPTER I" Wuthering H…
## 5 768 "" Wuthering H…
## 6 768 "" Wuthering H…
## 7 768 "1801.--I have just returned from a visit to my la… Wuthering H…
## 8 768 "neighbour that I shall be troubled with. This is… Wuthering H…
## 9 768 "country! In all England, I do not believe that I… Wuthering H…
## 10 768 "situation so completely removed from the stir of … Wuthering H…
## # … with 32,734 more rows
Notice that the meta_fields argument allows us to add one or more additional fields from gutenberg_metadata to the downloaded text, such as title or author.
books %>%
  count(title)
## # A tibble: 2 x 2
## title n
## <chr> <int>
## 1 Jane Eyre: An Autobiography 20659
## 2 Wuthering Heights 12085
You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:
gutenberg_subjects
## # A tibble: 140,173 x 3
## gutenberg_id subject_type subject
## <int> <chr> <chr>
## 1 1 lcc E201
## 2 1 lcsh United States. Declaration of Independence
## 3 1 lcsh United States -- History -- Revolution, 1775-1783 …
## 4 1 lcc JK
## 5 2 lcc KF
## 6 2 lcsh Civil rights -- United States -- Sources
## 7 2 lcsh United States. Constitution. 1st-10th Amendments
## 8 2 lcc JK
## 9 3 lcsh Presidents -- United States -- Inaugural addresses
## 10 3 lcsh United States -- Foreign relations -- 1961-1963
## # … with 140,163 more rows
This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.
gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
## # A tibble: 521 x 3
## gutenberg_id subject_type subject
## <int> <chr> <chr>
## 1 170 lcsh Detective and mystery stories
## 2 173 lcsh Detective and mystery stories
## 3 244 lcsh Detective and mystery stories
## 4 305 lcsh Detective and mystery stories
## 5 330 lcsh Detective and mystery stories
## 6 481 lcsh Detective and mystery stories
## 7 547 lcsh Detective and mystery stories
## 8 863 lcsh Detective and mystery stories
## 9 905 lcsh Detective and mystery stories
## 10 1155 lcsh Detective and mystery stories
## # … with 511 more rows
gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## # A tibble: 47 x 3
## gutenberg_id subject_type subject
## <int> <chr> <chr>
## 1 108 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 2 221 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 3 244 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 4 834 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 5 1661 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 6 2097 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 7 2343 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 8 2344 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 9 2345 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## 10 2346 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
## # … with 37 more rows
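Linking subjects back to the other metadata can be sketched with a join on gutenberg_id. The example below is one possible approach, not the package's prescribed workflow; the object names holmes_ids and holmes_works are made up for illustration.

```r
library(gutenbergr)
library(dplyr)
library(stringr)

# IDs of works tagged with a Sherlock Holmes subject heading
holmes_ids <- gutenberg_subjects %>%
  filter(str_detect(subject, fixed("Holmes, Sherlock"))) %>%
  distinct(gutenberg_id)

# Keep only those works from the pre-filtered metadata
holmes_works <- gutenberg_works() %>%
  semi_join(holmes_ids, by = "gutenberg_id") %>%
  select(gutenberg_id, title, author)

holmes_works
```

The resulting holmes_works$gutenberg_id vector could then be passed to gutenberg_download().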
gutenberg_authors contains information about each author, such as aliases and birth/death year:
gutenberg_authors
## # A tibble: 16,236 x 7
## gutenberg_author… author alias birthdate deathdate wikipedia aliases
## <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 1 United S… <NA> NA NA <NA> <NA>
## 2 3 Lincoln,… <NA> 1809 1865 http://en.… United St…
## 3 4 Henry, P… <NA> 1736 1799 http://en.… <NA>
## 4 5 Adam, Pa… <NA> NA NA <NA> <NA>
## 5 7 Carroll,… Dodgs… 1832 1898 http://en.… <NA>
## 6 8 United S… <NA> NA NA <NA> Agency, U…
## 7 9 Melville… Melvi… 1819 1891 http://en.… <NA>
## 8 10 Barrie, … Barri… 1860 1937 http://en.… <NA>
## 9 12 Smith, J… Smith… 1805 1844 http://en.… <NA>
## 10 14 Madison,… Unite… 1751 1836 http://en.… <NA>
## # … with 16,226 more rows
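The birth and death years make it possible to select works by period. As a rough sketch, assuming birthdate holds a numeric year as shown above, the following keeps works by authors born in the first half of the 19th century. The year range and object names are arbitrary choices for illustration.

```r
library(gutenbergr)
library(dplyr)

# Authors with a recorded birth year from 1800 up to (not including) 1850
authors_1800s <- gutenberg_authors %>%
  filter(!is.na(birthdate), birthdate >= 1800, birthdate < 1850) %>%
  select(gutenberg_author_id, birthdate, deathdate)

# Attach birth/death years to the works via the shared author ID
works_1800s <- gutenberg_works() %>%
  inner_join(authors_1800s, by = "gutenberg_author_id")

works_1800s %>%
  select(gutenberg_id, title, author, birthdate, deathdate)
```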
What’s next after retrieving a book’s text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.
library(tidytext)
words <- books %>%
  unnest_tokens(word, text)
words
## # A tibble: 305,532 x 3
## gutenberg_id title word
## <int> <chr> <chr>
## 1 768 Wuthering Heights wuthering
## 2 768 Wuthering Heights heights
## 3 768 Wuthering Heights chapter
## 4 768 Wuthering Heights i
## 5 768 Wuthering Heights 1801
## 6 768 Wuthering Heights i
## 7 768 Wuthering Heights have
## 8 768 Wuthering Heights just
## 9 768 Wuthering Heights returned
## 10 768 Wuthering Heights from
## # … with 305,522 more rows
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
word_counts
## # A tibble: 21,200 x 3
## title word n
## <chr> <chr> <int>
## 1 Wuthering Heights heathcliff 421
## 2 Wuthering Heights linton 346
## 3 Jane Eyre: An Autobiography jane 342
## 4 Wuthering Heights catherine 336
## 5 Jane Eyre: An Autobiography rochester 317
## 6 Jane Eyre: An Autobiography sir 315
## 7 Jane Eyre: An Autobiography miss 310
## 8 Jane Eyre: An Autobiography time 244
## 9 Jane Eyre: An Autobiography day 232
## 10 Jane Eyre: An Autobiography looked 221
## # … with 21,190 more rows
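A common next step is to look at the most frequent words within each book rather than across both. The sketch below re-enters the ten rows printed above as a small tibble (so it runs on its own) and picks the top three words per title with dplyr's slice_max(). The object names are illustrative.

```r
library(dplyr)

# The ten rows printed above, re-entered by hand as a small example
top_counts <- tibble(
  title = c("Wuthering Heights", "Wuthering Heights",
            "Jane Eyre: An Autobiography", "Wuthering Heights",
            rep("Jane Eyre: An Autobiography", 6)),
  word = c("heathcliff", "linton", "jane", "catherine", "rochester",
           "sir", "miss", "time", "day", "looked"),
  n = c(421L, 346L, 342L, 336L, 317L, 315L, 310L, 244L, 232L, 221L)
)

# Three most frequent words within each title
top3 <- top_counts %>%
  group_by(title) %>%
  slice_max(order_by = n, n = 3) %>%
  ungroup()

top3
```

The same pipeline should work on the full word_counts table, since it has the same three columns.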
You may also find these resources useful:
You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
The humaniformat package has a format_reverse function for reversing “Last, First” names.
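Reversing “Last, First” author names can also be sketched in base R. The helper below, reverse_name(), is hypothetical and written only for illustration; it is not the format_reverse function mentioned above, and it handles only simple names with a single comma.

```r
# Reverse "Last, First" into "First Last".
# reverse_name() is a hypothetical helper written for this sketch;
# names without a comma are returned unchanged.
reverse_name <- function(x) {
  sub("^([^,]+),[[:space:]]*(.+)$", "\\2 \\1", x)
}

reverse_name(c("Austen, Jane", "Lincoln, Abraham", "Various"))
```

Names with more than one comma (for example, those carrying a title or suffix) would need more careful handling than this pattern provides.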