This document shows the updated version 3 of the package, now available on CRAN
We start by using the MediaWiki API to grab page data from Wikipedia. We will wrap this up into a small function for re-use later, and start by looking at the English page for oenguins. The code converts the JSON data into XML data and takes only text within the body of the article.
<- function(lang, page) {
grab_wiki <- sprintf(
url "https://%s.wikipedia.org/w/api.php?action=parse&format=json&page=%s",
lang,
page)<- jsonlite::fromJSON(url)$parse$text$"*"
page_json <- xml2::read_xml(page_json, asText=TRUE)
page_xml <- xml_text(xml_find_all(page_xml, "//div/p"))
page_text
<- stri_replace_all(page_text, "", regex="\\[[0-9]+\\]")
page_text <- stri_replace_all(page_text, " ", regex="\n")
page_text <- stri_replace_all(page_text, " ", regex="[ ]+")
page_text <- page_text[stri_length(page_text) > 10]
page_text
return(page_text)
}
<- grab_wiki("en", "penguin")
penguin 1:10] # just show the first 10 paragraphs penguin[
[1] "Penguins (order Sphenisciformes, family Spheniscidae) are a group of aquatic flightless birds. They live almost exclusively in the Southern Hemisphere, with only one species, the Galápagos penguin, found north of the equator. Highly adapted for life in the water, penguins have countershaded dark and white plumage and flippers for swimming. Most penguins feed on krill, fish, squid and other forms of sea life which they catch while swimming underwater. They spend roughly half of their lives on land and the other half in the sea. "
[2] "Although almost all penguin species are native to the Southern Hemisphere, they are not found only in cold climates, such as Antarctica. In fact, only a few species of penguin live so far south. Several species are found in the temperate zone, but one species, the Galápagos penguin, lives near the equator. "
[3] "The largest living species is the emperor penguin (Aptenodytes forsteri): on average, adults are about 1.1 m (3 ft 7 in) tall and weigh 35 kg (77 lb). The smallest penguin species is the little blue penguin (Eudyptula minor), also known as the fairy penguin, which stands around 40 cm (16 in) tall and weighs 1 kg (2.2 lb). Among extant penguins, larger penguins inhabit colder regions, while smaller penguins are generally found in temperate or even tropical climates. Some prehistoric species attained enormous sizes, becoming as tall or as heavy as an adult human. These were not restricted to Antarctic regions; on the contrary, subantarctic regions harboured high diversity, and at least one giant penguin occurred in a region around 2,000 km south of the equator 35 mya, in a climate decidedly warmer than today.[which?] "
[4] "The word penguin first appears in the 16th century as a synonym for great auk. When European explorers discovered what are today known as penguins in the Southern Hemisphere, they noticed their similar appearance to the great auk of the Northern Hemisphere, and named them after this bird, although they are not closely related. "
[5] "The etymology of the word penguin is still debated. The English word is not apparently of French, Breton or Spanish origin (the latter two are attributed to the French word pingouin \"auk\"), but first appears in English or Dutch. "
[6] "Some dictionaries suggest a derivation from Welsh pen, \"head\" and gwyn, \"white\", including the Oxford English Dictionary, the American Heritage Dictionary, the Century Dictionary and Merriam-Webster, on the basis that the name was originally applied to the great auk, either because it was found on White Head Island (Welsh: Pen Gwyn) in Newfoundland, or because it had white circles around its eyes (though the head was black). "
[7] "An alternative etymology links the word to Latin pinguis, which means \"fat\" or \"oil\". Support for this etymology can be found in the alternative Germanic word for penguin, Fettgans or \"fat-goose\", and the related Dutch word vetgans. "
[8] "Adult male penguins are called cocks, females hens; a group of penguins on land is a waddle, and a similar group in the water is a raft. "
[9] "Since 1871, the Latin word Pinguinus has been used in scientific classification to name the genus of the great auk (Pinguinus impennis, meaning \"penguin without flight feathers\"), which became extinct in the mid-19th century. As confirmed by a 2004 genetic study, the genus Pinguinus belongs in the family of the auks (Alcidae), within the order of the Charadriiformes. "
[10] "The birds currently known as penguins were discovered later and were so named by sailors because of their physical resemblance to the great auk. Despite this resemblance, however, they are not auks and they are not closely related to the great auk. They do not belong in the genus Pinguinus, and are not classified in the same family and order as the great auks. They were classified in 1831 by Bonaparte in several distinct genera within the family Spheniscidae and order Sphenisciformes. "
Next, we run the udpipe annotation backend over the dataset using cleanNLP. Because of the way the data are structured, each paragraph will be treated as its own document.
cnlp_init_udpipe()
<- cnlp_annotate(penguin, verbose=FALSE)
anno $token anno
# A tibble: 5,519 x 11
doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
* <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 1 Peng… "Penguins " Peng… NOUN NNS Numb… 11
2 1 1 2 ( "(" ( PUNCT -LRB- NA 1
3 1 1 3 order "order " order NOUN NN Numb… 4
4 1 1 4 Sphe… "Spheniscifo… Sphe… NOUN NNS Numb… 1
5 1 1 5 , ", " , PUNCT , NA 7
6 1 1 6 fami… "family " fami… NOUN NN Numb… 7
7 1 1 7 Sphe… "Spheniscida… Sphe… NOUN NN Numb… 4
8 1 1 8 ) ") " ) PUNCT -RRB- NA 1
9 1 1 9 are "are " be AUX VBP Mood… 11
10 1 1 10 a "a " a DET DT Defi… 11
# … with 5,509 more rows, and 1 more variable: relation <chr>
Here, we will show how we can recreate the original text, possibly with additional markings. This can be useful when building text-based visualization pipelines. For example, let’s start by replacing all of the proper nouns with an all caps version of each word. This is easy because udpipe (and spacy as well) provides a column called token_with_ws
:
<- anno$token
token $new_token <- token$token_with_ws
token<- which(token$xpos %in% c("NNP", "NNPS"))
change_these $new_token[change_these] <- stri_trans_toupper(token$new_token[change_these]) token
Then, push all of the text back together by paragraph (we use the stri_wrap
function to print out the text in a nice format for this document):
<- tapply(token$new_token, token$doc_id, paste, collapse="")[1:10]
paragraphs <- stri_wrap(paragraphs, simplify=FALSE, exdent = 1)
paragraphs cat(unlist(lapply(paragraphs, function(v) c(v, ""))), sep="\n")
Penguins (order Sphenisciformes, family Spheniscidae) are a group of
aquatic flightless birds. They live almost exclusively in the Southern
Hemisphere, with only one species, the GALÁPAGOS PENGUIN, found north
of the equator. Highly adapted for life in the water, penguins have
countershaded dark and white plumage and flippers for swimming. Most
penguins feed on krill, fish, squid and other forms of sea life which
they catch while swimming underwater. They spend roughly half of their
lives on land and the other half in the sea.
Although almost all penguin species are native to the Southern
Hemisphere, they are not found only in cold climates, such as
ANTARCTICA. In fact, only a few species of penguin live so far south.
Several species are found in the temperate zone, but one species, the
GALÁPAGOS penguin, lives near the equator.
The largest living species is the emperor penguin (Aptenodytes
forsteri): on average, adults are about 1.1 m (3 ft 7 in) tall and weigh
35 kg (77 lb). The smallest penguin species is the little blue penguin
(EUDYPTULA MINOR), also known as the fairy penguin, which stands around
40 cm (16 in) tall and weighs 1 kg (2.2 lb). Among extant penguins,
larger penguins inhabit colder regions, while smaller penguins are
generally found in temperate or even tropical climates. Some prehistoric
species attained enormous sizes, becoming as tall or as heavy as an
adult human. These were not restricted to Antarctic regions; on the
contrary, subantarctic regions harboured high diversity, and at least
one giant penguin occurred in a region around 2,000 km south of the
equator 35 mya, in a climate decidedly warmer than today.[which?]
The word penguin first appears in the 16th century as a synonym for
great auk. When European explorers discovered what are today known
as penguins in the SOUTHERN Hemisphere, they noticed their similar
appearance to the great auk of the Northern Hemisphere, and named them
after this bird, although they are not closely related.
The etymology of the word penguin is still debated. The English word
is not apparently of FRENCH, BRETON or Spanish origin (the latter two
are attributed to the French word pingouin "auk"), but first appears in
ENGLISH or DUTCH.
Some dictionaries suggest a derivation from WELSH PEN, "head" and gwyn,
"white", including the OXFORD ENGLISH DICTIONARY, the American Heritage
DICTIONARY, the CENTURY DICTIONARY and MERRIAM-WEBSTER, on the basis
that the name was originally applied to the great auk, either because
it was found on WHITE HEAD ISLAND (Welsh: PEN GWYN) in NEWFOUNDLAND,
or because it had white circles around its eyes (though the head was
black).
An alternative etymology links the word to LATIN PINGUIS, which
means "fat" or "oil". Support for this etymology can be found in the
alternative Germanic word for penguin, Fettgans or "fat-goose", and the
related Dutch word vetgans.
Adult male penguins are called cocks, females hens; a group of penguins
on land is a waddle, and a similar group in the water is a raft.
Since 1871, the Latin word PINGUINUS has been used in scientific
classification to name the genus of the great auk (Pinguinus impennis,
meaning "penguin without flight feathers"), which became extinct in
the mid-19th century. As confirmed by a 2004 genetic study, the genus
Pinguinus belongs in the family of the auks (ALCIDAE), within the order
of the Charadriiformes.
The birds currently known as penguins were discovered later and were
so named by sailors because of their physical resemblance to the
great auk. Despite this resemblance, however, they are not auks and
they are not closely related to the great auk. They do not belong in
the genus Pinguinus, and are not classified in the same family and
order as the great auks. They were classified in 1831 by BONAPARTE
in several distinct genera within the family Spheniscidae and order
Sphenisciformes.
By outputting the text as HTML or XML, there are a lot of interesting visualization and metadata work that can be done with this approach. If you have an interesting use case that might be useful to others, please feel free to make a pull-request to include your work in the package repository.