The data preparation part of any Natural Language Processing flow consists of a number of important steps: Tokenization (1), Parts of Speech tagging (2), Lemmatization (3) and Dependency Parsing (4). This package lets you perform these 4 annotation steps out of the box and also lets you train your own annotator models directly from R.
It does this by providing an Rcpp wrapper around the UDPipe C++ library which is described at https://ufal.mff.cuni.cz/udpipe and is available at https://github.com/ufal/udpipe.
The udpipe R package was designed with the following things in mind when building the Rcpp wrapper around the UDPipe C++ library:
Before you can start annotating, you need a model. Pre-trained models built on Universal Dependencies treebanks are made available for more than 65 languages based on 101 treebanks, namely:
afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, classical_chinese-kyoto, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, estonian-ewt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, italian-vit, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-alksnis, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, old_russian-torot, persian-seraji, polish-lfg, polish-pdb, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb, wolof-wtb.
To use these open-source models provided by the UDPipe community and start tagging, download the model for the language of your choice as follows.
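The download step could look like the sketch below. It assumes the udpipe package is installed and that you have network access; the model name dutch-alpino is taken from the list above, and the str() call prints a summary like the one that follows.

```r
library(udpipe)

# Download the pre-trained Dutch (Alpino treebank) model.
# By default the model file is saved in the current working directory.
dl <- udpipe_download_model(language = "dutch-alpino")

# Inspect the result: a one-row data.frame with the path to the model file
# and the status of the download
str(dl)
```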
'data.frame': 1 obs. of 5 variables:
$ language : chr "dutch-alpino"
$ file_model : chr "/tmp/RtmpTMVRiO/Rbuild2b026e5efba1e4/udpipe/vignettes/dutch-alpino-ud-2.5-191206.udpipe"
$ url : chr "https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/dutch-alpino-"| __truncated__
$ download_failed : logi FALSE
$ download_message: chr "OK"
The udpipe R package also allows you to easily train your own models, based on data in CONLL-U format, so that you can use these for your own commercial or non-commercial purposes. This is described in the other vignette of this package, which you can view with the command vignette("udpipe-train", package = "udpipe").
Currently the package allows you to do tokenization, tagging, lemmatization and dependency parsing with one convenient function called udpipe_annotate. This goes as follows.
First load the model which you have downloaded or which you have stored somewhere on disk.
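A minimal sketch of loading the model, assuming the Dutch model file from the download step sits in your working directory. The file name is taken from the download output above; the variable name model_file is introduced here for illustration only, and udmodel_dutch is the object name that the annotation examples rely on.

```r
library(udpipe)

# Path to a previously downloaded model file (assumed location: the working
# directory). Adjust this to wherever your model is stored on disk.
model_file <- "dutch-alpino-ud-2.5-191206.udpipe"

# Load the model into memory so that it can be used for annotation
udmodel_dutch <- udpipe_load_model(file = model_file)
```

If you still have the result of udpipe_download_model in your session, passing dl$file_model as the file argument should work equally well.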
Once you have this model, you can start annotating. Provide a vector of text and use udpipe_annotate. The resulting tagged output is in CONLL-U format as described at https://universaldependencies.org/format.html. You can convert it to a data.frame with as.data.frame.
txt <- c("Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt? Jazeker meneer",
"Het gaat vooruit, het gaat verbazend goed vooruit")
x <- udpipe_annotate(udmodel_dutch, x = txt)
x <- as.data.frame(x)
str(x)
'data.frame': 27 obs. of 14 variables:
$ doc_id : chr "doc1" "doc1" "doc1" "doc1" ...
$ paragraph_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ sentence : chr "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" "Ik ben de weg kwijt, kunt u me zeggen waar de Lange Wapper ligt?" ...
$ token_id : chr "1" "2" "3" "4" ...
$ token : chr "Ik" "ben" "de" "weg" ...
$ lemma : chr "ik" "zijn" "de" "weg" ...
$ upos : chr "PRON" "AUX" "DET" "NOUN" ...
$ xpos : chr "VNW|pers|pron|nomin|vol|1|ev" "WW|pv|tgw|ev" "LID|bep|stan|rest" "N|soort|ev|basis|zijd|stan" ...
$ feats : chr "Case=Nom|Person=1|PronType=Prs" "Number=Sing|Tense=Pres|VerbForm=Fin" "Definite=Def" "Gender=Com|Number=Sing" ...
$ head_token_id: chr "5" "5" "4" "5" ...
$ dep_rel : chr "nsubj" "cop" "det" "obj" ...
$ deps : chr NA NA NA NA ...
$ misc : chr NA NA NA NA ...
table(x$upos)
  ADJ   ADV   AUX   DET  NOUN  PRON PUNCT  VERB
    4     3     2     2     4     5     3     4
Note that by default udpipe_annotate does tokenization, parts of speech tagging, lemmatization and dependency parsing. If you want to save time because you require only a part of the annotation, you can specify which parts to leave out. This is done as follows.
## Tokenization + finds sentences, does not execute POS tagging, nor lemmatization or dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)
## Tokenization + finds sentences, does POS tagging and lemmatization but does not execute dependency parsing
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "default", parser = "none")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)
## Tokenization + finds sentences and executes dependency parsing but does not do POS tagging nor lemmatization
x <- udpipe_annotate(udmodel_dutch, x = txt, tagger = "none", parser = "default")
x <- as.data.frame(x)
table(x$upos)
table(x$dep_rel)
If your data is already tokenized according to your needs, whether with other tools like the tidytext / tokenizers / text2vec R packages, with any other external software or just by manual work, you can still use udpipe to do parts of speech annotation and dependency parsing and skip the tokenization. This is done as follows.
## Either put every token on a new line and use tokenizer: vertical
input <- list(doc1 = c("Ik", "ben", "de", "weg", "kwijt", ",", "kunt", "u", "me", "zeggen",
"waar", "de", "Lange Wapper", "ligt", "?", "Jazeker", "meneer"),
doc2 = c("Het", "gaat", "vooruit", ",", "het", "gaat", "verbazend", "goed", "vooruit"))
txt <- sapply(input, FUN=function(x) paste(x, collapse = "\n"))
x <- udpipe_annotate(udmodel_dutch, x = txt, tokenizer = "vertical")
x <- as.data.frame(x)
## Or put every token of each document in 1 string separated by a space and use tokenizer: horizontal
## Note that if a token contains a space, you need to replace the space
## with the 'NO-BREAK SPACE' (U+00A0) character to make sure it is still considered as one token
txt <- sapply(input, FUN=function(x){
x <- gsub(" ", intToUtf8(160), x) ## replace space with no-break-space
paste(x, collapse = " ")
})
x <- udpipe_annotate(udmodel_dutch, x = as.character(txt), tokenizer = "horizontal")
x <- as.data.frame(x)
Some remarks:
- Pass a doc_id to udpipe_annotate so that you can link your document to the tagged terms later on.
- Make sure your text is in UTF-8 encoding when you pass it to udpipe_annotate. If it is not, use standard R facilities like iconv to convert it to UTF-8. The results you get back are also in UTF-8 encoding.
dl <- udpipe_download_model(language = "sanskrit", udpipe_model_repo = "jwijffels/udpipe.models.ud.2.0")
udmodel_sanskrit <- udpipe_load_model(file = dl$file_model)
txt <- "ततः असौ प्राह क्षत्रियस्य तिस्रः भार्या धर्मम् भवन्ति तत् एषा कदाचिद् वैश्या सुता भविष्यति तत् अनुरागः ममास्याम् ततः रथकारः तस्य निश्चयम् विज्ञायावदत् वयस्य किम् अ धुना कर्तव्यम् कौलिकः आह किम् अहम् जानामि त्वयि मित्रे यत् अभिहितं मया ततः"
x <- udpipe_annotate(udmodel_sanskrit, x = txt)
Encoding(x$conllu)
x <- as.data.frame(x)
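The two remarks above, on doc_id and on UTF-8 conversion, can be sketched as follows. The texts, the document identifiers and the variable names are made up for illustration, and the model file is assumed to be the Dutch model downloaded earlier, stored in the working directory.

```r
library(udpipe)

# Hypothetical documents with their own identifiers
doc_ids <- c("review_001", "review_002")

# Simulate text that arrives in Latin-1 encoding (illustration only)
txt_latin1 <- iconv(c("Het gaat vooruit", "E\u00e9n caf\u00e9 graag"),
                    from = "UTF-8", to = "latin1")

# Convert to UTF-8 with iconv before handing the text to udpipe_annotate
txt <- iconv(txt_latin1, from = "latin1", to = "UTF-8")

# Assumed location of the Dutch model file; adjust the path as needed
udmodel_dutch <- udpipe_load_model(file = "dutch-alpino-ud-2.5-191206.udpipe")

# Pass doc_id so that every token can be traced back to its document
x <- udpipe_annotate(udmodel_dutch, x = txt, doc_id = doc_ids)
x <- as.data.frame(x)

# Number of tokens per document identifier
table(x$doc_id)
```

Keeping your own identifiers in doc_id makes it straightforward to merge the token-level output back onto document-level metadata afterwards.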
Need support in text mining? Contact BNOSAC: http://www.bnosac.be