Introducing ‘cwbtools’

Andreas Blätte (andreas.blaette@uni-due.de)

2022-08-29

Using the ‘cwbtools’ package

The cwbtools package offers a toolset to create, modify and manage corpora to be used with the Corpus Workbench (CWB) from within R. It supports the transition from data formats established by well-known R packages such as tm, quanteda or tidytext to a CWB corpus, so that the efficiency of a query system using indexed corpora (i.e. the CWB) can be exploited.

Moving to CWB indexed corpora is worth considering when working with large corpora. The cwbtools package is designed to be able to work with large data. The core tool of the cwbtools package is the CorpusData class. It offers a standard workflow to import and process data, and to generate an indexed corpus. It is implemented as an R6 class. The advantage of using the reference semantics of the R6 class system is that there is minimal copying of the data. Handling memory parsimoniously is necessary when working with large corpora.
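
To illustrate what reference semantics means in practice, here is a minimal base-R sketch. It uses a plain environment as a stand-in for an R6 object (R6 objects are built on environments). The names store, add_flag_copy and add_flag_ref are hypothetical and not part of cwbtools.

```r
# Ordinary R objects follow copy-on-modify semantics: changes made inside a
# function never reach the caller's object.
tokens <- data.frame(word = c("peace", "and", "security"))
add_flag_copy <- function(df) {
  df$seen <- TRUE  # modifies a local copy only
  invisible(df)
}
add_flag_copy(tokens)
ncol(tokens)  # the caller's data frame is unchanged

# Environments are passed by reference: the function modifies the very
# object the caller holds.
store <- new.env()
store$tokens <- tokens
add_flag_ref <- function(env) {
  env$tokens$seen <- TRUE
  invisible(env)
}
add_flag_ref(store)
ncol(store$tokens)  # the new column is visible to the caller
```

This is why methods of a CorpusData object can fill its fields without the object having to be reassigned.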

Further functions of the package support the creation, modification and management of structural and positional attributes of a corpus, and of registry files. See the manual to learn more about these functions. To learn more about the data structure of the CWB and the CWB jargon, the IMS Open Corpus Workbench (CWB) Corpus Encoding Tutorial is recommended as a point of entry.

The focus of this vignette is the creation of CWB indexed corpora. We present three scenarios: Creating a corpus (a) from XML, (b) from a tibble and (c) from the VCorpus used by the tm package.

A few preliminary preparations are necessary, however.

Getting Started

The CWB stores each indexed corpus in an individual data directory. Registry files in a ‘registry directory’ (or ‘registry’ for short) describe the indexed corpora and include the path to the data directory.

For the following examples, we create a temporary directory structure for registry files and indexed corpora, making sure that these directories exist and are empty.

registry_tmp <- fs::path(tempdir(), "registry")
data_dir_tmp <- fs::path(tempdir(), "data_dir")

if (!file.exists(registry_tmp)){
  dir.create(registry_tmp)
} else {
  file.remove(list.files(registry_tmp, full.names = TRUE))
}
if (!file.exists(data_dir_tmp)) dir.create(data_dir_tmp)

regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_tmp)
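
The "exists and is empty" pattern recurs for every data directory in this vignette. As a sketch of how it could be wrapped, the following base-R helper removes a directory and recreates it. fresh_dir is a hypothetical name and not a cwbtools function.

```r
# Hypothetical helper: return a directory that is guaranteed to exist and to
# be empty. Any previous content is deleted, so use it for temporary data only.
fresh_dir <- function(path) {
  if (dir.exists(path)) unlink(path, recursive = TRUE)
  dir.create(path, recursive = TRUE)
  path
}

demo_dir <- fresh_dir(file.path(tempdir(), "cwb_demo", "registry"))
file.create(file.path(demo_dir, "stale_file"))  # simulate a leftover file
demo_dir <- fresh_dir(demo_dir)
list.files(demo_dir)  # nothing is left
```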

In addition to cwbtools, we rely extensively on the data.table package. It is a good companion to cwbtools because it handles large data efficiently. Apart from its processing speed, it supports in-place modification of data. When corpora grow large, it is advisable to avoid copying the data in memory unless absolutely necessary.

library(cwbtools)
library(data.table)
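
As a brief illustration of in-place modification, consider the following sketch on a toy table. The data and the helper add_lowercase are made up for this example.

```r
library(data.table)

# A toy token stream; the column names mirror those used in this vignette.
toks <- data.table(word = c("We", "the", "peoples"))

# `:=` adds the column by reference: no assignment with `<-` is needed.
toks[, cpos := seq_len(.N) - 1L]

# A function that uses `:=` therefore changes the caller's table as well.
add_lowercase <- function(x) x[, lower := tolower(word)]  # hypothetical helper
add_lowercase(toks)
names(toks)

# Use copy() when an independent table is really wanted.
toks2 <- copy(toks)
toks2[, lower := NULL]
names(toks)  # the original still has the column
```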

Core operations on structural and positional attributes are implemented in “pure R”, but the CWB can be downloaded and stored within the package using the cwb_install() function. In the following examples, we will rely on the “pure R” approach.

Scenario 1: From XML to CWB - UN General Assembly Corpus

The first example is to turn a small sample corpus of debates in the UN General Assembly into an indexed corpus. The package includes some XML files that follow a standard suggested by the Text Encoding Initiative (TEI).

teidir <- system.file(package = "cwbtools", "xml", "UNGA")
teifiles <- list.files(teidir, full.names = TRUE)
list.files(teidir)
## [1] "N9986497.xml" "N9986515.xml" "N9986521.xml" "N9986551.xml" "N9986557.xml"
## [6] "N9986569.xml" "N9986587.xml"

For our example, we need an (empty) data directory for the UNGA corpus.

unga_data_dir_tmp <- fs::path(data_dir_tmp, "unga")
if (!file.exists(unga_data_dir_tmp)) dir.create(unga_data_dir_tmp)
file.remove(list.files(unga_data_dir_tmp, full.names = TRUE))

The point of departure then is to create a CorpusData object that will serve as a processor for these files and storage facility for the corpus data. The central fields of the class are named chunktable, tokenstream and metadata. When we inspect the new object at the outset, we will see that these fields are not filled initially.

UNGA <- CorpusData$new()
UNGA
## chunktable: NULL
## tokenstream: NULL
## metadata: NULL

To turn the XML files into the CorpusData object, we use the $import_xml() method. To be able to add metadata from the header to the metadata table, the method requires a named vector of XPath expressions used to find the metadata within the XML document.

metadata <- c(
  lp = "//legislativePeriod", session = "//titleStmt/sessionNo",
  date = "//publicationStmt/date", url = "//sourceDesc/url",
  src = "//sourceDesc/filetype"
)
UNGA$import_xml(filenames = teifiles, meta = metadata)
UNGA
## chunktable: 2 columns / 1148 rows
## tokenstream: NULL
## metadata: 20 columns / 1148 rows

The input XML files are TEI files. Speaker information is included in the attributes of a tag named ‘sp’. To maintain the original content, there is an element ‘speaker’ in the document that contains the unparsed speaker call. It is not text spoken in the debate, so we treat it as noise to be removed.

to_keep <- which(is.na(UNGA$metadata[["speaker"]]))
UNGA$chunktable <- UNGA$chunktable[to_keep]
UNGA$metadata <- UNGA$metadata[to_keep][, speaker := NULL]

We assign more telling column names …

setnames(UNGA$metadata, old = c("sp_who", "sp_state", "sp_role"), new = c("who", "state", "role"))

The CWB requires a tokenstream as input to generate positional attributes. NLP tools will offer lemmatization, part-of-speech-recognition and more. To keep things simple, we perform a very simple tokenization that relies on the tokenizers package.

UNGA$tokenize(lowercase = FALSE, strip_punct = FALSE)
UNGA

Let us see what it looks like now …

UNGA$tokenstream
##           id   word   cpos
##      1:    2      I      0
##      2:    2 should      1
##      3:    2   like      2
##      4:    2     to      3
##      5:    2 inform      4
##     ---                   
## 127078: 1148   rose 127077
## 127079: 1148     at 127078
## 127080: 1148      1 127079
## 127081: 1148    p.m 127080
## 127082: 1148      . 127081
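
For intuition about what such a token stream table contains, the following base-R sketch builds one from two toy chunks. The regular expression is a deliberately crude stand-in and not the rule set of the tokenizers package. The names chunks and tokenize_chunk are hypothetical.

```r
# Two toy chunks of text, identified by an id (made-up data).
chunks <- data.frame(
  id = c(2L, 3L),
  text = c("I should like to inform members.", "The meeting rose at 1 p.m."),
  stringsAsFactors = FALSE
)

# Split a string into runs of alphanumeric characters or single punctuation marks.
tokenize_chunk <- function(x) {
  regmatches(x, gregexpr("[[:alnum:]]+|[[:punct:]]", x))[[1]]
}

token_list <- lapply(chunks$text, tokenize_chunk)
tokenstream <- data.frame(
  id = rep(chunks$id, times = lengths(token_list)),
  word = unlist(token_list),
  stringsAsFactors = FALSE
)

# Corpus positions are zero-based and run across all chunks.
tokenstream$cpos <- seq_len(nrow(tokenstream)) - 1L
tokenstream
```

Unlike the output above, this simple rule splits "p.m" into three tokens, which shows why a dedicated tokenizer is preferable for real data.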

The CorpusData object now includes a table with the token stream, and we are ready to import the corpus into the CWB. We use the $encode() method for this purpose. Note that the workers are the p_attribute_encode function (to encode the token stream), and the s_attribute_encode function (to encode the structural attributes / the metadata).

s_attrs <- c("id", "who", "state", "role", "lp", "session", "date")
UNGA$encode(
  registry_dir = registry_tmp, data_dir = unga_data_dir_tmp,
  corpus = "UNGA", encoding = "utf8", method = "R",
  p_attributes = "word", s_attributes = character(),
  compress = FALSE
  )
## === Makeall: processing corpus UNGA ===
## Registry directory: /var/folders/fw/qwt11pjx1qs83dl2jwltcvmr0000gn/T/RtmpCig61s/registry
## ATTRIBUTE word
##  + creating LEXSRT ... OK
##  - lexicon      OK
##  + creating FREQS ... OK
##  - frequencies  OK
##  - token stream OK
##  + creating REVCIDX ... OK
##  + creating REVCORP ... OK
##  ? validating REVCORP ... OK
##  - index        OK
## ========================================

In this example, the logical parameter compress is FALSE. Setting it to TRUE triggers a so-called “huffcode” compression, which will reduce corpus size and speed up queries. For big corpora, compression certainly makes sense, but it can be time-consuming. If you want to create a corpus experimentally and do not yet need optimization, performing the compression can be postponed or omitted.

The indexed corpus has now been prepared and can be used with CQP, CQPweb, or – if you like R – with a package such as polmineR. To see quickly whether everything has worked out as intended (and to keep installation requirements at a minimum), we use the low-level functionality of the RcppCWB package.

RcppCWB::cqp_initialize()
## [1] TRUE
id_peace <- RcppCWB::cl_str2id(
  corpus = "UNGA", p_attribute = "word",
  str = "peace", registry = registry_tmp
)
cpos_peace <- RcppCWB::cl_id2cpos(
  corpus = "UNGA", p_attribute = "word",
  id = id_peace, registry = registry_tmp
)

tab <- data.frame(
  i = unlist(lapply(seq_along(cpos_peace), function(x) rep(x, times = 11))),
  cpos = unlist(lapply(cpos_peace, function(x) (x - 5):(x + 5)))
  )
tab[["id"]] <- RcppCWB::cl_cpos2id(
  corpus = "UNGA", p_attribute = "word",
  cpos = tab[["cpos"]], registry = registry_tmp
)
tab[["str"]] <- RcppCWB::cl_id2str(
  corpus = "UNGA", p_attribute = "word",
  id = tab[["id"]], registry = registry_tmp
)

peace_context <- split(tab[["str"]], as.factor(tab[["i"]]))
peace_context <- unname(sapply(peace_context, function(x) paste(x, collapse = " ")))
head(peace_context)
## [1] "the breast of a tranquil peace for all of the world"                         
## [2] "earth in the rapture of peace , justice and safety ."                        
## [3] "community are committed to achieving peace , security and prosperity for"    
## [4] "components of post - conflict peace - building , given the"                  
## [5] "removal and the consolidation of peace and mutual trust between neighbouring"
## [6] "and the promotion of durable peace and sustainable development in Africa"
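
The chain of lookups used above (string to id, id to corpus positions, corpus positions back to ids and strings) can be mimicked in base R on a toy token vector. This is a conceptual sketch only: it says nothing about the binary files the CWB actually writes, and all function names are hypothetical stand-ins for their RcppCWB counterparts. The sketch additionally clips context windows at the corpus boundaries.

```r
# Toy token stream (made-up data).
tokens <- c("peace", "and", "security", "require", "peace", "and", "trust")

# Lexicon: each distinct string gets a zero-based integer id, assigned here
# in order of first appearance.
lexicon <- unique(tokens)
str2id <- function(str) match(str, lexicon) - 1L
id2str <- function(id) lexicon[id + 1L]

# The corpus is kept as a sequence of ids ...
corpus_ids <- str2id(tokens)
# ... and an index maps every id to the zero-based positions where it occurs.
index <- split(seq_along(corpus_ids) - 1L, corpus_ids)
id2cpos <- function(id) index[[as.character(id)]]
cpos2id <- function(cpos) corpus_ids[cpos + 1L]

hits <- id2cpos(str2id("peace"))

# Context of +/- n tokens around a hit, clipped to valid corpus positions.
context_span <- function(cpos, n = 2L) {
  span <- (cpos - n):(cpos + n)
  span[span >= 0L & span < length(corpus_ids)]
}
contexts <- vapply(
  hits,
  function(x) paste(id2str(cpos2id(context_span(x))), collapse = " "),
  character(1)
)
contexts
```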

Scenario 2: Tidytext to CWB

In a second scenario, we create a corpus of Jane Austen’s books (included in the package janeaustenr). Again, we start by preparing an (empty) data directory …

austen_data_dir_tmp <- fs::path(data_dir_tmp, "austen")
if (!file.exists(austen_data_dir_tmp)) dir.create(austen_data_dir_tmp)
file.remove(list.files(austen_data_dir_tmp, full.names = TRUE))

… and instantiating a CorpusData object.

Austen <- CorpusData$new()

The tidytext package offers an efficient workflow to create the token stream we can assign to the CorpusData object.

books <- janeaustenr::austen_books()
tbl <- tidytext::unnest_tokens(books, word, text, to_lower = FALSE)
Austen$tokenstream <- as.data.table(tbl)

To demonstrate that we can use an additional positional attribute, we stem the tokens using the SnowballC package.

Austen$tokenstream[, stem := SnowballC::wordStem(tbl[["word"]], language = "english")]

In this case, (zero-based!) corpus positions need to be assigned explicitly.

Austen$tokenstream[, cpos := 0L:(nrow(tbl) - 1L)]

We can now create the table with the structural attributes.

cpos_max_min <- function(x) list(cpos_left = min(x[["cpos"]]), cpos_right = max(x[["cpos"]]))
Austen$metadata <- Austen$tokenstream[, cpos_max_min(.SD), by = book]
Austen$metadata[, book := as.character(book)]
setcolorder(Austen$metadata, c("cpos_left", "cpos_right", "book"))
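
The same idea, deriving one region per metadata value from the corpus positions, can be sketched in base R on toy data. The data are made up, and the approach assumes that the tokens of each group are contiguous in the token stream.

```r
# Toy token stream with a metadata column (made-up data).
tokenstream <- data.frame(
  cpos = 0:6,
  book = c("A", "A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Group corpus positions by book and take the first and last position.
by_book <- split(tokenstream$cpos, tokenstream$book)
regions <- data.frame(
  cpos_left = vapply(by_book, min, integer(1)),
  cpos_right = vapply(by_book, max, integer(1)),
  book = names(by_book),
  row.names = NULL,
  stringsAsFactors = FALSE
)
regions

# Sanity check: regions must not overlap, i.e. each region starts after the
# previous one ends.
all(regions$cpos_left[-1] > regions$cpos_right[-nrow(regions)])
```

If a group were scattered across the token stream, its minimum and maximum would span tokens of other groups, so a check like the last line is worth running before encoding.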

A few finishing touches and we have the token stream ready to be encoded: We remove the metadata from the table in the tokenstream field and order the columns nicely.

Austen$tokenstream[, book := NULL]
setcolorder(Austen$tokenstream, c("cpos", "word", "stem"))
Austen$tokenstream
##           cpos        word        stem
##      1:      0       SENSE       SENSE
##      2:      1         AND         AND
##      3:      2 SENSIBILITY SENSIBILITi
##      4:      3          by          by
##      5:      4        Jane        Jane
##     ---                               
## 725051: 725050          in          in
## 725052: 725051         its          it
## 725053: 725052    national      nation
## 725054: 725053  importance      import
## 725055: 725054       Finis        Fini

Ready to encode the corpus!

Austen$encode(
   corpus = "AUSTEN", encoding = "utf8",
   p_attributes = c("word", "stem"), s_attributes = "book",
   registry_dir = registry_tmp, data_dir = austen_data_dir_tmp,
   method = "R", compress = FALSE
)
## === Makeall: processing corpus AUSTEN ===
## Registry directory: /var/folders/fw/qwt11pjx1qs83dl2jwltcvmr0000gn/T/RtmpCig61s/registry
## ATTRIBUTE word
##  + creating LEXSRT ... OK
##  - lexicon      OK
##  + creating FREQS ... OK
##  - frequencies  OK
##  - token stream OK
##  + creating REVCIDX ... OK
##  + creating REVCORP ... OK
##  ? validating REVCORP ... OK
##  - index        OK
## ========================================
## === Makeall: processing corpus AUSTEN ===
## Registry directory: /var/folders/fw/qwt11pjx1qs83dl2jwltcvmr0000gn/T/RtmpCig61s/registry
## ATTRIBUTE stem
##  + creating LEXSRT ... OK
##  - lexicon      OK
##  + creating FREQS ... OK
##  - frequencies  OK
##  - token stream OK
##  + creating REVCIDX ... OK
##  + creating REVCORP ... OK
##  ? validating REVCORP ... OK
##  - index        OK
## ========================================

The new corpus will not be available until the files in the registry directory have been re-read. We trigger this explicitly.

RcppCWB::cqp_reset_registry(registry = registry_tmp)
## [1] TRUE

This is a rudimentary check (using low-level RcppCWB functions) whether the corpus can be used. How often does the token “pride” occur?

corpus <- "AUSTEN"
token <- "pride"
p_attr <- "word"
id <- RcppCWB::cl_str2id(corpus = corpus, p_attribute = p_attr, str = token, registry = registry_tmp)
cpos <- RcppCWB::cl_id2cpos(corpus = corpus, p_attribute = p_attr, id = id, registry = registry_tmp)
count <- length(cpos)
count
## [1] 113

Scenario 3: From tm-package VCorpus to CWB

In the third scenario, we will make the transition from a VCorpus (tm package) to a CWB indexed corpus. We use the Reuters corpus that is included as sample data in the tm package.

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
reuters.tm <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

A quick way to attain the data format required for the CorpusData class is to coerce the VCorpus to a tibble, using respective functionality of the tidytext package.

library(tidytext)
reuters.tbl <- tidy(reuters.tm)
reuters.tbl
## # A tibble: 20 × 17
##    author   datetimestamp       description heading id    language origin topics
##    <chr>    <dttm>              <chr>       <chr>   <chr> <chr>    <chr>  <chr> 
##  1 <NA>     1987-02-26 17:00:56 ""          DIAMON… 127   en       Reute… YES   
##  2 BY TED … 1987-02-26 17:34:11 ""          OPEC M… 144   en       Reute… YES   
##  3 <NA>     1987-02-26 18:18:00 ""          TEXACO… 191   en       Reute… YES   
##  4 <NA>     1987-02-26 18:21:01 ""          MARATH… 194   en       Reute… YES   
##  5 <NA>     1987-02-26 19:00:57 ""          HOUSTO… 211   en       Reute… YES   
##  6 <NA>     NA                  ""          KUWAIT… 236   en       Reute… YES   
##  7 By Jere… NA                  ""          INDONE… 237   en       Reute… YES   
##  8 <NA>     NA                  ""          SAUDI … 242   en       Reute… YES   
##  9 <NA>     NA                  ""          QATAR … 246   en       Reute… YES   
## 10 <NA>     NA                  ""          SAUDI … 248   en       Reute… YES   
## 11 <NA>     NA                  ""          SAUDI … 273   en       Reute… YES   
## 12 <NA>     NA                  ""          GULF A… 349   en       Reute… YES   
## 13 <NA>     NA                  ""          SAUDI … 352   en       Reute… YES   
## 14 <NA>     NA                  ""          KUWAIT… 353   en       Reute… YES   
## 15 <NA>     NA                  ""          PHILAD… 368   en       Reute… YES   
## 16 <NA>     NA                  ""          STUDY … 489   en       Reute… YES   
## 17 <NA>     NA                  ""          STUDY … 502   en       Reute… YES   
## 18 <NA>     NA                  ""          UNOCAL… 543   en       Reute… YES   
## 19 By BERN… NA                  ""          NYMEX … 704   en       Reute… YES   
## 20 <NA>     NA                  ""          ARGENT… 708   en       Reute… YES   
## # … with 9 more variables: lewissplit <chr>, cgisplit <chr>, oldid <chr>,
## #   topics_cat <named list>, places <named list>, people <chr>, orgs <chr>,
## #   exchanges <chr>, text <chr>

Some columns with the metadata we may want to encode are still list columns, holding character vectors that can have more than one element. We turn the columns with the topic categorizations and the places into length-one character vectors.

reuters.tbl[["topics_cat"]] <- sapply(
  reuters.tbl[["topics_cat"]],
  function(x) paste(x, collapse = "|")
  )
reuters.tbl[["places"]] <- sapply(
  reuters.tbl[["places"]],
  function(x) paste(x, collapse = "|")
  )
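
What the collapsing step does can be seen on a small made-up list. The sketch uses vapply() instead of sapply() so that the result is guaranteed to be a character vector, even for empty input.

```r
# A list column as it might come out of tidy(): one character vector per
# document, possibly with several elements or none (made-up data).
places <- list(c("usa", "canada"), "kuwait", character(0))

# Collapse each vector into a single string, using "|" as separator.
collapsed <- vapply(places, paste, character(1), collapse = "|")
collapsed

# The individual values can be recovered later by splitting on the separator.
strsplit(collapsed[1], "|", fixed = TRUE)[[1]]
```

The separator should be a character that does not occur in the values themselves.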

This is the input we need. We instantiate a CorpusData object and make sure that a new data directory for the REUTERS corpus exists (and is empty) …

Reuters <- CorpusData$new()
reuters_data_dir_tmp <- fs::path(data_dir_tmp, "reuters")
if (!file.exists(reuters_data_dir_tmp)) dir.create(reuters_data_dir_tmp)
file.remove(list.files(reuters_data_dir_tmp, full.names = TRUE))

We assign the chunktable and the metadata, coerced to a data.table, to the object …

Reuters$chunktable <- data.table(reuters.tbl[, c("id", "text")])
Reuters$metadata <- data.table(reuters.tbl[,c("id", "topics_cat", "places")])
Reuters
## chunktable: 2 columns / 20 rows
## tokenstream: NULL
## metadata: 3 columns / 20 rows

… we tokenize the text to gain the token stream table …

Reuters$tokenize()

… we have a look at the token stream …

Reuters$tokenstream
##        id         word cpos
##    1: 127      diamond    0
##    2: 127     shamrock    1
##    3: 127         corp    2
##    4: 127         said    3
##    5: 127         that    4
##   ---                      
## 4046: 708  yacimientos 4045
## 4047: 708 petroliferos 4046
## 4048: 708     fiscales 4047
## 4049: 708        added 4048
## 4050: 708       reuter 4049

It seems that we can encode the corpus.

Reuters$encode(
   corpus = "REUTERS", encoding = "utf8",
   p_attributes = "word", s_attributes = c("topics_cat", "places"),
   registry_dir = registry_tmp,
   data_dir = reuters_data_dir_tmp,
   method = "R", compress = FALSE
)
## === Makeall: processing corpus REUTERS ===
## Registry directory: /var/folders/fw/qwt11pjx1qs83dl2jwltcvmr0000gn/T/RtmpCig61s/registry
## ATTRIBUTE word
##  + creating LEXSRT ... OK
##  - lexicon      OK
##  + creating FREQS ... OK
##  - frequencies  OK
##  - token stream OK
##  + creating REVCIDX ... OK
##  + creating REVCORP ... OK
##  ? validating REVCORP ... OK
##  - index        OK
## ========================================

The corpus will not be available unless we call cqp_reset_registry() so that files in the registry directory are reloaded.

RcppCWB::cqp_reset_registry(registry = registry_tmp)

The Reuters corpus is about oil production in the Middle East. To check quickly whether it works, we count the number of occurrences of the word “oil”.

id <- RcppCWB::cl_str2id(corpus = "REUTERS", p_attribute = "word", str = "oil", registry = registry_tmp)
cpos <- RcppCWB::cl_id2cpos(corpus = "REUTERS", p_attribute = "word", id = id, registry = registry_tmp)
count <- length(cpos)
count
## [1] 86

Final Deeds and Words

As a matter of housekeeping, we remove the temporary directories and restore the previous value of the CORPUS_REGISTRY environment variable.

unlink(registry_tmp, recursive = TRUE)
unlink(data_dir_tmp, recursive = TRUE)

Sys.setenv(CORPUS_REGISTRY = regdir_envvar)
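
An alternative to restoring the environment variable by hand is to scope the change to a function call. The following is a minimal base-R sketch. with_registry is a hypothetical helper and not part of cwbtools or RcppCWB.

```r
# Hypothetical helper: evaluate `code` while CORPUS_REGISTRY points to
# `registry`, then restore the previous state (including "not set at all").
with_registry <- function(registry, code) {
  old <- Sys.getenv("CORPUS_REGISTRY", unset = NA)
  Sys.setenv(CORPUS_REGISTRY = registry)
  on.exit({
    if (is.na(old)) {
      Sys.unsetenv("CORPUS_REGISTRY")
    } else {
      Sys.setenv(CORPUS_REGISTRY = old)
    }
  }, add = TRUE)
  force(code)  # lazy evaluation: `code` runs only now, with the new setting
}

before <- Sys.getenv("CORPUS_REGISTRY", unset = NA)
inside <- with_registry(tempdir(), Sys.getenv("CORPUS_REGISTRY"))
after <- Sys.getenv("CORPUS_REGISTRY", unset = NA)
identical(before, after)
```

Because the restore step is registered with on.exit(), it also runs when the code fails with an error.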

Hope everything works! Enjoy! Feedback is welcome!