Abstract
The R package koRpus aims to be a versatile tool for text analysis, with an emphasis on scientific research on that topic. It implements dozens of formulae to measure readability and lexical diversity. On a more basic level, koRpus can be used as an R wrapper for third-party products, like the tokenizer and POS tagger TreeTagger or language corpora of the Leipzig Corpora Collection. This vignette takes a brief tour around its core components, shows how they can be used, and gives some insight into design decisions.
Work on koRpus started in February 2011, primarily with the goal of examining how similar different texts are. Since then, it has quickly grown into an R package that implements dozens of formulae for readability and lexical diversity, as well as wrappers for language corpus databases and a tokenizer/POS tagger.
At the very beginning of almost every analysis with this package, the text you want to examine has to be sliced into its components, and those components must be identified and named. That is, the text has to be split into its tokens: words, numbers, punctuation marks and the like. After that, each token is tagged with its part of speech (POS). For both of these steps, koRpus can use the third-party software TreeTagger (Schmid, 1994).
Especially for Windows users, the installation of TreeTagger might be a little more complex – e.g., it depends on Perl1, and you need a tool to extract .tar.gz archives.2 Detailed installation instructions are beyond the scope of this vignette.
If you don’t want to use TreeTagger, koRpus provides a simple tokenizer of its own called tokenize(). While the tokenizing itself works quite well, tokenize() is not as elaborate as TreeTagger when it comes to POS tagging, as it can merely tell words from numbers, punctuation and abbreviations. Although this is sufficient for most readability formulae, you can’t evaluate word classes in detail. If that’s what you want, a TreeTagger installation is needed.
Some of the readability formulae depend on special word lists (like Bormuth, 1968; Dale & Chall, 1948; Spache, 1953). For copyright reasons these lists are not included in the package as of now. This means that as long as you don’t have copies of these lists, you can’t calculate these particular measures, but you can of course calculate all the others. The expected format for such a list is a simple text file with one word per line, preferably in UTF-8 encoding.
The frequency analysis functions in this package can look up how often each word in a text is used in its language, given that a corpus database is provided. Databases in Celex format are supported, as is the Leipzig Corpora Collection (Quasthoff, Richter, & Biemann, 2006) file format. To use such a database with this package, you simply need to download one of the .zip/.tar files.
If you want to estimate the language of a text, reference texts in known languages are needed. In koRpus, the Universal Declaration of Human Rights with its more than 350 translations is used.
From now on it is assumed that the above requirements are correctly installed and working. If an optional component is used it will be noted. Further, we’ll need a sample text to analyze. We’ll use the section on defense mechanisms of Phasmatodea from Wikipedia for this purpose.
In order to do some analysis, you need to load a language support package for each language you would like to work with. For instance, in this vignette we’re analyzing an English sample text. Language support packages for koRpus are named koRpus.lang.**, where ** is a two-character ID for the respective language, like en for English.3
# install the language support package
install.koRpus.lang("en")
# load the package
library(koRpus.lang.en)
When koRpus itself is loaded, it will list all language packages found on your system. To get a list of all installable packages, call available.koRpus.lang().
As explained earlier, splitting the text up into its basic components can be done by TreeTagger. To achieve this and have the results available in R, the function treetag() is used.
treetag()
At the very least you must provide it with the text, of course, and name the language it is written in. In addition to that you must specify where you installed TreeTagger. If you look at the package documentation you’ll see that treetag() understands a number of options to configure TreeTagger, but in most cases using one of the built-in presets should suffice. TreeTagger comes with batch/shell scripts for installed languages, and the presets of treetag() are basically just R implementations of these scripts.
tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample"
)
The first argument (the file name) and lang should explain themselves. The treetagger option can either take the full path to one of the original TreeTagger scripts mentioned above, or the keyword “manual”, which causes treetag() to use the configuration defined in TT.options. To use a preset, just put the path to your local TreeTagger installation and a valid preset name here.4 The document ID is optional and can be omitted.
The resulting S4 object is of a class called kRp.text. If you call the object directly, you get a shortened view of its main content:
## doc_id token tag lemma lttr wclass desc stop stem idx sntc
## 1 sample Defense NN defense 7 noun <NA> <NA> <NA> 1 1
## 2 sample mechanisms NNS mechanism 10 noun <NA> <NA> <NA> 2 1
## 3 sample Phasmatodea NP <unknown> 11 name <NA> <NA> <NA> 3 1
## 4 sample species NN species 7 noun <NA> <NA> <NA> 4 1
## 5 sample exhibit NN exhibit 7 noun <NA> <NA> <NA> 5 1
## 6 sample mechanisms NNS mechanism 10 noun <NA> <NA> <NA> 6 1
## [...]
## 612 sample considered VBN consider 10 verb <NA> <NA> <NA> 612 18
## 613 sample inedible JJ inedible 8 adjective <NA> <NA> <NA> 613 18
## 614 sample by IN by 2 preposition <NA> <NA> <NA> 614 18
## 615 sample some DT some 4 determiner <NA> <NA> <NA> 615 18
## 616 sample predators NNS predator 9 noun <NA> <NA> <NA> 616 18
## 617 sample . SENT . 1 fullstop <NA> <NA> <NA> 617 18
Once you’ve come this far, i.e., once you have a valid object of class kRp.text, all following analyses should run smoothly.
If treetag() fails, you should first re-run it with the extra option debug=TRUE. Most interestingly, that will print the contents of sys.tt.call, which is the TreeTagger command given to your operating system for execution. With that it should be possible to examine where exactly the erroneous behavior starts.
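For example, a failing call could simply be repeated with debugging enabled (a sketch reusing the options from above; adjust the paths to your system):
tagged.text <- treetag(
  "sample_text.txt",
  treetagger="manual",
  lang="en",
  TT.options=list(
    path="~/bin/treetagger/",
    preset="en"
  ),
  doc_id="sample",
  debug=TRUE
)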
tokenize()
If you don’t need detailed word class analysis, you should be fine using koRpus’ own function tokenize(). As the output below shows, tokenize() comes to the same results regarding the tokens, but is rather limited in recognizing word classes:
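A minimal call could look like the following (a sketch; as with treetag(), the doc_id argument is optional):
tkn.text <- tokenize(
  "sample_text.txt",
  lang="en",
  doc_id="sample"
)
tkn.text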
## doc_id token tag lemma lttr wclass desc stop stem idx sntc
## 1 sample Defense word.kRp 7 word <NA> <NA> <NA> 1 1
## 2 sample mechanisms word.kRp 10 word <NA> <NA> <NA> 2 1
## 3 sample Phasmatodea word.kRp 11 word <NA> <NA> <NA> 3 1
## 4 sample species word.kRp 7 word <NA> <NA> <NA> 4 1
## 5 sample exhibit word.kRp 7 word <NA> <NA> <NA> 5 1
## 6 sample mechanisms word.kRp 10 word <NA> <NA> <NA> 6 1
## [...]
## 620 sample considered word.kRp 10 word <NA> <NA> <NA> 620 20
## 621 sample inedible word.kRp 8 word <NA> <NA> <NA> 621 20
## 622 sample by word.kRp 2 word <NA> <NA> <NA> 622 20
## 623 sample some word.kRp 4 word <NA> <NA> <NA> 623 20
## 624 sample predators word.kRp 9 word <NA> <NA> <NA> 624 20
## 625 sample . .kRp 1 fullstop <NA> <NA> <NA> 625 20
koRpus objects
For this class of objects, koRpus provides some comfortable methods to extract the portions you’re interested in. For example, the main results are to be found in the slot tokens. In addition to TreeTagger’s original output (token, tag and lemma), treetag() also automatically counts letters and assigns tokens to global word classes. To get these results as a data.frame, use the getter method taggedText():
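For example (a sketch; the row range is only chosen to match the excerpt printed below, while taggedText() itself returns the full data.frame):
taggedText(tagged.text)[26:34, ]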
## doc_id token tag lemma lttr wclass desc stop stem idx sntc
## 26 sample and CC and 3 conjunction NA NA NA 26 1
## 27 sample are VBP be 3 verb NA NA NA 27 1
## 28 sample deployed VBN deploy 8 verb NA NA NA 28 1
## 29 sample after IN after 5 preposition NA NA NA 29 1
## 30 sample an DT an 2 determiner NA NA NA 30 1
## 31 sample attack NN attack 6 noun NA NA NA 31 1
## 32 sample has VBZ have 3 verb NA NA NA 32 1
## 33 sample been VBN be 4 verb NA NA NA 33 1
## 34 sample initiated VBN initiate 9 verb NA NA NA 34 1
In case you want to access a subset of the data in the resulting object, e.g., only the column with the number of letters or the first five rows of tokens, you’ll be happy to know that there are special [ and [[ methods for these kinds of objects:
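For instance (a sketch; head() merely trims the letter-count vector for display):
head(tagged.text[["lttr"]], 50)
tagged.text[1:5, ]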
## [1] 7 10 11 7 7 10 3 7 4 9 4 4 7 2 6 4 9 2 3 5 5 1 7 7 1 3 3
## [28] 8 5 2 6 3 4 9 1 9 8 1 3 7 9 4 7 12 4 11 2 10 1 4
## doc_id token tag lemma lttr wclass desc stop stem idx sntc
## 1 sample Defense NN defense 7 noun NA NA NA 1 1
## 2 sample mechanisms NNS mechanism 10 noun NA NA NA 2 1
## 3 sample Phasmatodea NP <unknown> 11 name NA NA NA 3 1
## 4 sample species NN species 7 noun NA NA NA 4 1
## 5 sample exhibit NN exhibit 7 noun NA NA NA 5 1
The [ and [[ methods are basically useful shortcut replacements for taggedText().
All results of both treetag() and tokenize() also provide various descriptive statistics calculated from the analyzed text. You can get them by calling describe() on the object:
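For the tagged text object created above:
describe(tagged.text)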
## $all.chars
## [1] 3554
##
## $lines
## [1] 10
##
## $normalized.space
## [1] 3549
##
## $chars.no.space
## [1] 2996
##
## $punct
## [1] 78
##
## $digits
## [1] 4
##
## $letters
## all l1 l2 l3 l4 l5 l6 l7 l8 l9 l10 l11 l12 l13 l14 l15 l16
## 2918 19 92 74 80 51 49 65 43 35 22 15 6 3 0 1 1
##
## $letters.only
## [1] 2914
##
## $char.distrib
## 1 2 3 4 5 6 7
## num 80.00000 92.00000 74.00000 80.00000 51.000000 49.000000 65.00000
## cum.sum 80.00000 172.00000 246.00000 326.00000 377.000000 426.000000 491.00000
## cum.inv 537.00000 445.00000 371.00000 291.00000 240.000000 191.000000 126.00000
## pct 12.96596 14.91086 11.99352 12.96596 8.265802 7.941653 10.53485
## cum.pct 12.96596 27.87682 39.87034 52.83630 61.102107 69.043760 79.57861
## pct.inv 87.03404 72.12318 60.12966 47.16370 38.897893 30.956240 20.42139
## 8 9 10 11 12 13
## num 43.000000 35.000000 22.000000 15.000000 6.0000000 3.0000000
## cum.sum 534.000000 569.000000 591.000000 606.000000 612.0000000 615.0000000
## cum.inv 83.000000 48.000000 26.000000 11.000000 5.0000000 2.0000000
## pct 6.969206 5.672609 3.565640 2.431118 0.9724473 0.4862237
## cum.pct 86.547812 92.220421 95.786062 98.217180 99.1896272 99.6758509
## pct.inv 13.452188 7.779579 4.213938 1.782820 0.8103728 0.3241491
## 14 15 16
## num 0.0000000 1.0000000 1.0000000
## cum.sum 615.0000000 616.0000000 617.0000000
## cum.inv 2.0000000 1.0000000 0.0000000
## pct 0.0000000 0.1620746 0.1620746
## cum.pct 99.6758509 99.8379254 100.0000000
## pct.inv 0.3241491 0.1620746 0.0000000
##
## $lttr.distrib
## 1 2 3 4 5 6 7
## num 19.000000 92.00000 74.00000 80.00000 51.000000 49.00000 65.00000
## cum.sum 19.000000 111.00000 185.00000 265.00000 316.000000 365.00000 430.00000
## cum.inv 537.000000 445.00000 371.00000 291.00000 240.000000 191.00000 126.00000
## pct 3.417266 16.54676 13.30935 14.38849 9.172662 8.81295 11.69065
## cum.pct 3.417266 19.96403 33.27338 47.66187 56.834532 65.64748 77.33813
## pct.inv 96.582734 80.03597 66.72662 52.33813 43.165468 34.35252 22.66187
## 8 9 10 11 12 13
## num 43.000000 35.000000 22.000000 15.000000 6.0000000 3.0000000
## cum.sum 473.000000 508.000000 530.000000 545.000000 551.0000000 554.0000000
## cum.inv 83.000000 48.000000 26.000000 11.000000 5.0000000 2.0000000
## pct 7.733813 6.294964 3.956835 2.697842 1.0791367 0.5395683
## cum.pct 85.071942 91.366906 95.323741 98.021583 99.1007194 99.6402878
## pct.inv 14.928058 8.633094 4.676259 1.978417 0.8992806 0.3597122
## 14 15 16
## num 0.0000000 1.0000000 1.0000000
## cum.sum 554.0000000 555.0000000 556.0000000
## cum.inv 2.0000000 1.0000000 0.0000000
## pct 0.0000000 0.1798561 0.1798561
## cum.pct 99.6402878 99.8201439 100.0000000
## pct.inv 0.3597122 0.1798561 0.0000000
##
## $words
## [1] 556
##
## $sentences
## [1] 18
##
## $avg.sentc.length
## [1] 30.88889
##
## $avg.word.length
## [1] 5.248201
##
## $doc_id
## [1] "sample"
Amongst others, you will find several indices describing the number of characters:
- all.chars: Counts each character, including all space characters
- normalized.space: Like all.chars, but clusters of space characters (incl. line breaks) are counted only as one character
- chars.no.space: Counts all characters except any space characters
- letters.only: Counts only letters, excluding(!) digits (which are counted separately as digits)
You’ll also find the number of words and sentences, as well as average word and sentence lengths, and tables describing how word length is distributed throughout the text (lttr.distrib). For instance, we see that the text has 74 words with three letters, 185 with three or fewer, and 371 with more than three. The last three rows of the table show the respective percentages.
To analyze the lexical diversity of our text, we can now simply hand over the tagged text object to the lex.div() method. You can call it on the object with no further arguments (like lex.div(tagged.text)), but in this example we’ll limit the analysis to a few measures:5
lex.div(
  tagged.text,
  measure=c("TTR", "MSTTR", "MATTR", "HD-D", "MTLD", "MTLD-MA"),
  char=c("TTR", "MATTR", "HD-D", "MTLD", "MTLD-MA")
)
##
## Total number of tokens: 556
## Total number of types: 294
## Total number of lemmas: 283
##
## Type-Token Ratio
## TTR: 0.53
##
## TTR characteristics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5297 0.5466 0.5930 0.6188 0.6491 1.0000
## SD
## 0.0907
##
##
## Mean Segmental Type-Token Ratio
## MSTTR: 0.72
## SD of TTRs: 0.03
## Segment size: 100
## Tokens dropped: 56
##
## Hint: A segment size of 92 would reduce the drop rate to 4.
## Maybe try ?segment.optimizer()
##
##
## Moving-Average Type-Token Ratio
## MATTR: 0.74
## SD of TTRs: 0.03
## Window size: 100
##
## MATTR characteristics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7138 0.7239 0.7308 0.7290 0.7341 0.7368
## SD
## 0.0066
##
##
## HD-D
## HD-D: 35.54
## ATTR: 0.85
## Sample size: 42
##
## HD-D characteristics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 35.46 35.62 34.31 35.82 36.16
## SD
## 5.0648
##
##
## Measure of Textual Lexical Diversity
## MTLD: 97.5
## Number of factors: NA
## Factor size: 0.72
## SD tokens/factor: 36.08 (all factors)
## 26.06 (complete factors only)
##
## MTLD characteristics:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.00 79.07 88.37 83.21 92.92 104.40 1
## SD
## 15.9015
##
##
## Moving-Average Measure of Textual Lexical Diversity
## MTLD-MA: 102.73
## SD tokens/factor: 26.74
## Step size: 1
## Factor size: 0.72
## Min. tokens: 9
##
## MTLD-MA characteristics:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 63.00 92.99 96.57 95.89 102.76 107.99 12
## SD
## 9.6766
##
## Note: Analysis was conducted case insensitive.
Let’s look at some particular parts: first we are informed of the total number of types, tokens and lemmas (if available). After that the actual results are printed, using the package’s show() method for this particular kind of object. As you can see, it prints the actual value of each measure before a summary of its characteristics.6
Some measures return more information than just their actual index value. For instance, when the Mean Segmental Type-Token Ratio is calculated, you’ll be informed how much of your text was dropped and hence not examined. A small feature tool of koRpus, segment.optimizer(), automatically recommends a different segment size if this could decrease the number of lost tokens.
By default, lex.div() calculates every measure of lexical diversity that is implemented. Of course this is fully configurable, e.g., to completely skip the calculation of characteristics just add the option char=NULL. If you’re only interested in one particular measure, it might be more convenient to call the corresponding wrapper function instead of lex.div(). For example, to calculate only the measures proposed by Maas (1972):
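A sketch of such a call, assuming the wrapper is simply named maas():
maas(tagged.text)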
## Language: "en"
##
## Total number of tokens: 556
## Total number of types: 294
## Total number of lemmas: 283
##
## Maas' Indices
## a: 0.19
## lgV0: 5.64
## lgeV0: 12.99
##
## Relative vocabulary growth (first half to full text)
## a: 0.81
## lgV0: 6.75
## V': 0.43 (43 new types every 100 tokens)
##
## Note: Analysis was conducted case insensitive.
All wrapper functions have characteristics turned off by default. The following example demonstrates how to calculate and plot the classic type-token ratio with characteristics. The resulting plot shows the typical degradation of TTR values with increasing text length:
ttr.res <- TTR(tagged.text, char=TRUE)
plot(ttr.res@TTR.char, type="l", main="TTR degradation over text length")
Since this package is intended for research, it is possible to directly influence all relevant values of each measure and examine the effects. For example, as mentioned before, segment.optimizer() recommended a change of segment size for MSTTR to drop fewer words, which is easily done:
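A sketch, assuming the MSTTR() wrapper accepts the segment size directly:
MSTTR(tagged.text, segment=92)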
## Language: "en"
##
## Total number of tokens: 556
## Total number of types: 294
## Total number of lemmas: 283
##
## Mean Segmental Type-Token Ratio
## MSTTR: 0.75
## SD of TTRs: 0.04
## Segment size: 92
## Tokens dropped: 4
##
## Note: Analysis was conducted case insensitive.
Please refer to the documentation for more detailed information on the available measures and their references.
This package has rudimentary support for importing corpus databases.7 That is, it can read frequency data for words into an R object and use this object for further analysis. In addition to the Celex database format (read.corp.celex()), it can read the LCC flatfile format8 (read.corp.LCC()). The latter might be of special interest, because the needed database archives can be freely downloaded. Once you’ve downloaded one of these archives, it can be comfortably imported:
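For example (a sketch; the archive name is hypothetical and depends on which corpus you downloaded):
LCC.en <- read.corp.LCC("~/downloads/eng_news_2020_1M.tar")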
read.corp.LCC() will automatically extract the files it needs from the archive. Alternatively, you can specify the path to the unpacked archive as well. To work with the imported data directly, the tool query() was added to the package. It helps you to comfortably look up certain words, or ranges of interesting values:
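Two example lookups matching the output below (a sketch; LCC.en is the object imported above, and the range query assumes query() accepts a two-value vector as lower and upper bounds):
# look up a single word
query(LCC.en, var="word", query="what")
# all tokens occurring roughly 780 to 790 times per million tokens
query(LCC.en, var="pmio", query=c(780, 790))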
## num word freq pct pmio log10 rank.avg rank.min rank.rel.avg
## 160 210 what 16396 0.000780145 780 2.892095 260759 260759 99.95362
## rank.rel.min
## 160 99.95362
## num word freq pct pmio log10 rank.avg rank.min rank.rel.avg
## 156 206 many 16588 0.0007892806 789 2.897077 260763 260763 99.95515
## 157 207 per 16492 0.0007847128 784 2.894316 260762 260762 99.95477
## 158 208 down 16468 0.0007835708 783 2.893762 260761 260761 99.95439
## 159 209 since 16431 0.0007818103 781 2.892651 260760 260760 99.95400
## 160 210 what 16396 0.0007801450 780 2.892095 260759 260759 99.95362
## rank.rel.min
## 156 99.95515
## 157 99.95477
## 158 99.95439
## 159 99.95400
## 160 99.95362
We can now conduct a full frequency analysis of our text:
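For example, using the LCC data imported above (the object name freq.analysis.res is the one also referenced in the footnotes):
freq.analysis.res <- freq.analysis(
  tagged.text,
  corp.freq=LCC.en
)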
The resulting object holds a lot of information, even if no corpus data was used (i.e., corp.freq=NULL). To begin with, it contains the two slots tokens and lang, which are copied from the analyzed tagged text object. In this way analysis results can always be converted back into kRp.text objects.9 However, if corpus data was provided, the tagging results gain three new columns:
## token tag lemma lttr [...] pmio rank.avg rank.min
[...]
## 30 an DT an 2 3817 99.98735 99.98735
## 31 attack NN attack 6 163 99.70370 99.70370
## 32 has VBZ have 3 4318 99.98888 99.98888
## 33 been VBN be 4 2488 99.98313 99.98313
## 34 initiated VBN initiate 9 11 97.32617 97.32137
## 35 ( ( ( 1 854 99.96013 99.96013
## 36 secondary JJ secondary 9 21 98.23846 98.23674
## 37 defense NN defense 7 210 99.77499 99.77499
## 38 ) ) ) 1 856 99.96052 99.96052
[...]
Perhaps most informatively, pmio shows how often the respective token appears in a million tokens, according to the corpus data. In addition, the previously introduced slot desc now contains some more descriptive statistics on our text, and if we provided a corpus database, the slot freq.analysis lists summaries of the various frequency information that was calculated.
If the corpus object also provided inverse document frequency (i.e., values in column idf) data, freq.analysis() will automatically compute tf-idf statistics and put them in a column called tfidf.
The desc slot
Amongst others, the descriptives now also give easy access to character vectors with all words ($all.words) and all lemmata ($all.lemmata), all tokens sorted10 into word classes (e.g., all verbs in $classes$verb), or the number of words in each sentence:
## [1] 34 10 37 16 44 31 14 31 34 23 17 43 40 47 22 19 65 29
As a practical example, the list $classes has proven to be very helpful for debugging the results of TreeTagger, which is remarkably accurate, but of course not free from making a mistake now and then. By looking through $classes, where all tokens are grouped according to the global word class TreeTagger attributed to them, at least obvious errors (like names mistakenly taken for a pronoun) are easily found:11
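For example, to inspect these groups (freq.analysis.res being the frequency analysis object from above):
describe(freq.analysis.res)$classes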
## $conjunction
## [1] "both" "and" "and" "and" "and" "or" "or" "and" "and" "or"
## [11] "and" "or" "and" "or" "and" "and" "and" "and"
##
## $number
## [1] "20" "one"
##
## $determiner
## [1] "an" "the" "an" "The" "the" "the" "some"
## [8] "that" "Some" "the" "a" "a" "a" "the"
## [15] "that" "the" "the" "Another" "which" "the" "a"
## [22] "that" "a" "The" "a" "the" "that" "a"
[...]
The package comes with implementations of several readability formulae. Some of them depend on the number of syllables in the text.12 To count these, the method hyphen() takes objects of class kRp.text and applies a hyphenation algorithm (Liang, 1983) to each word. This algorithm was originally developed for automatic word hyphenation in LaTeX, and is gracefully misused here to fulfill a slightly different service.13
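The object name hyph.txt.en is the one used in the remainder of this section; printing the result yields (an excerpt of) the output below:
hyph.txt.en <- hyphen(tagged.text)
hyph.txt.en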
## syll word
## 1 2 De-fense
## 2 3 mech-a-nisms
## 3 4 Phasm-a-to-dea
## 4 2 spe-cies
## 5 3 ex-hib-it
## 6 3 mech-a-nisms
## NA [...]
## 551 1 is
## 552 3 con-sid-ered
## 553 4 in-ed-i-ble
## 554 1 by
## 555 1 some
## 556 3 pred-a-tors
This separate hyphenation step can actually be skipped, as readability() will do it automatically if needed. But similar to TreeTagger, hyphen() will most likely not produce perfect results. As a rule of thumb, if in doubt it seems to behave rather conservatively, that is, it underestimates the real number of syllables in a text. This, of course, affects the results of several readability formulae.
So, the more accurate the end results need to be, the less you should rely on the automatic hyphenation alone. But it sure is a good starting point, for there is a method called correct.hyph() to help you clean these results of errors later on. The most straightforward way to do this is to call hyphenText(hyph.txt.en), which will get you a data frame with two columns, word (the hyphenated words) and syll (the number of syllables), which you can then edit, e.g., in a spreadsheet editor:14
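To just peek at the first rows from within R:
head(hyphenText(hyph.txt.en))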
## syll word
## 1 2 De-fense
## 2 3 mech-a-nisms
## 3 4 Phasm-a-to-dea
## 4 2 spe-cies
## 5 3 ex-hib-it
## 6 3 mech-a-nisms
You can then manually correct wrong hyphenations by removing or inserting “-” as hyphenation indicators, and call correct.hyph() without further arguments, which will cause it to recount all syllables:
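A sketch of that call, re-assigning the corrected object:
hyph.txt.en <- correct.hyph(hyph.txt.en)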
But the method can also be used to alter entries directly, which might be simpler and cleaner than manual changes:
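The following sketch shows what such a direct correction might look like; the word and hyphen argument names are assumptions here, so check ?correct.hyph before relying on them:
hyph.txt.en <- correct.hyph(
  hyph.txt.en,
  word="mechanisms",       # the word to look up
  hyphen="mech-a-ni-sms"   # the corrected hyphenation (assumed argument names)
)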
## Changed
##
## syll word
## 2 3 mech-a-nisms
## 6 3 mech-a-nisms
##
## into
##
## syll word
## 2 4 mech-a-ni-sms
## 6 4 mech-a-ni-sms
The hyphenated text object can now be given to readability() to calculate the measures of interest:15
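For example (readbl.txt is simply the name used for the result object in the following examples):
readbl.txt <- readability(tagged.text, hyphen=hyph.txt.en)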
Similar to lex.div(), by default readability() calculates almost16 all available measures:
##
## Automated Readability Index (ARI)
## Parameters: default
## Grade: 18.73
##
##
## Coleman Formulas
## Parameters: default
## Pronouns: 1.62 (per 100 words)
## Prepos.: 13.49 (per 100 words)
## Formula 1: 39% cloze completions
## Formula 2: 37% cloze completions
## Formula 3: 35% cloze completions
## Formula 4: 36% cloze completions
##
##
## Coleman-Liau
## Parameters: default
## ECP: 33% (estimted cloze percentage)
## Grade: 14.1
## Grade: 14.1 (short formula)
##
##
## Danielson-Bryan
## Parameters: default
## DB1: 9.86
## DB2: 26.39
## Grade: >= 13 (college)
##
##
## Dickes-Steiwer's Handformel
## Parameters: default
## TTR: 0.53
## Score: 32.21
##
##
## Easy Listening Formula
## Parameters: default
## Exsyls: 222
## Score: 12.33
##
##
## Farr-Jenkins-Paterson
## Parameters: default
## RE: 33.19
## Grade: >= 13 (college)
##
##
## Flesch Reading Ease
## Parameters: en (Flesch)
## RE: 33.98
## Grade: >= 13 (college)
##
##
## Flesch-Kincaid Grade Level
## Parameters: default
## Grade: 16.19
## Age: 21.19
##
##
## Gunning Frequency of Gobbledygook (FOG)
## Parameters: default
## Grade: 18.69
##
##
## FORCAST
## Parameters: default
## Grade: 10.99
## Age: 15.99
##
##
## Fucks' Stilcharakteristik
## Score: 51.67
## Grade: 7.19
##
##
## Gutiérrez Fórmula de Comprensibilidad
## Score: 33.48
##
##
## Linsear Write
## Parameters: default
## Easy words: 80.4
## Hard words: 19.6
## Grade: 21.5
##
##
## Läsbarhetsindex (LIX)
## Parameters: default
## Index: 65.24
## Rating: very difficult
## Grade: > 11
##
##
## Neue Wiener Sachtextformeln
## Parameters: default
## nWS 1: 10.57
## nWS 2: 11.07
## nWS 3: 10.58
## nWS 4: 11.89
##
##
## Readability Index (RIX)
## Parameters: default
## Index: 10.61
## Grade: > 12 (college)
##
##
## Simple Measure of Gobbledygook (SMOG)
## Parameters: default
## Grade: 17.19
## Age: 22.19
##
##
## Strain Index
## Parameters: default
## Index: 15.5
##
##
## Tränkle-Bailer Formulas
## Parameters: default
## Prepositions: 13%
## Conjunctions: 3%
## TB 1: 18.59
## TB 2: 27.15
##
##
## Kuntzsch's Text-Redundanz-Index
## Parameters: default
## Short words: 334
## Punctuation: 78
## Foreign: 0
## Score: -56.88
##
##
## Tuldava's Text Difficulty Formula
## Parameters: default
## Index: 5.74
##
##
## Wheeler-Smith
## Parameters: default
## Score: 123.33
## Grade: > 4
##
## Text language: en
To get a more condensed overview of the results, try the summary() method:
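Assuming the readability results were stored in readbl.txt as above:
summary(readbl.txt)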
## Text language: en
## index flavour raw grade age
## 1 ARI 18.73
## 2 Coleman C1 39
## 3 Coleman C2 37
## 4 Coleman C3 35
## 5 Coleman C4 36
## 6 Coleman-Liau 33 14.1
## 7 Danielson-Bryan DB1 9.86
## 8 Danielson-Bryan DB2 26.39 >= 13 (college)
## 9 Dickes-Steiwer 32.21
## 10 ELF 12.33
## 11 Farr-Jenkins-Paterson 33.19 >= 13 (college)
## 12 Flesch en (Flesch) 33.98 >= 13 (college)
## 13 Flesch-Kincaid 16.19 21.2
## 14 FOG 18.69
## 15 FORCAST 10.99 16
## 16 Fucks 51.67 7.19
## 17 Gutierrez 33.48
## 18 Linsear-Write 21.5
## 19 LIX 65.24 > 11
## 20 nWS1 10.57
## 21 nWS2 11.07
## 22 nWS3 10.58
## 23 nWS4 11.89
## 24 RIX 10.61 > 12 (college)
## 25 SMOG 17.19 22.2
## 26 Strain 15.5
## 27 Traenkle-Bailer TB1 18.59
## 28 Traenkle-Bailer TB2 27.15
## 29 TRI -56.88
## 30 Tuldava 5.74
## 31 Wheeler-Smith 123.33 > 4
The summary() method supports an additional flat format, which basically turns the table into a named numeric vector, using the raw values (because all indices have raw values, but only a few provide more than that). This format comes in very handy when you want to use the output in further calculations:
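A sketch, assuming the flat format is requested via a flat=TRUE argument:
summary(readbl.txt, flat=TRUE)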
## ARI Coleman.C1 Coleman.C2
## 18.73 39.00 37.00
## Coleman.C3 Coleman.C4 Coleman.Liau
## 35.00 36.00 33.00
## Danielson.Bryan.DB1 Danielson.Bryan.DB2 Dickes.Steiwer
## 9.86 26.39 32.21
## ELF Farr.Jenkins.Paterson Flesch
## 12.33 33.19 33.98
## Flesch.Kincaid FOG FORCAST
## 16.19 18.69 10.99
## Fucks Gutierrez Linsear.Write
## 51.67 33.48 21.50
## LIX nWS1 nWS2
## 65.24 10.57 11.07
## nWS3 nWS4 RIX
## 10.58 11.89 10.61
## SMOG Strain Traenkle.Bailer.TB1
## 17.19 15.50 18.59
## Traenkle.Bailer.TB2 TRI Tuldava
## 27.15 -56.88 5.74
## Wheeler.Smith
## 123.33
If you’re interested in a particular formula, again a wrapper function might be more convenient:
flesch.res <- flesch(tagged.text, hyphen=hyph.txt.en)
lix.res <- LIX(tagged.text) # LIX doesn't need syllable count
lix.res
##
## Läsbarhetsindex (LIX)
## Parameters: default
## Index: 65.24
## Rating: very difficult
## Grade: > 11
##
## Text language: en
It is possible to calculate the readability measures from the relevant key values directly, rather than analyzing an actual text, by using readability.num() instead of readability(). If you need to re-analyze a particular text, this can be considerably faster. To make this easy, all objects returned by readability() can directly be fed to readability.num(), since all relevant data is present in the desc slot.
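A hedged sketch of such a re-calculation from an existing result object (see ?readability.num for the exact accepted inputs):
readability.num(readbl.txt)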
Another feature of this package is the detection of the language a text was (most probably) written in. This is done by gzipping reference texts in known languages, gzipping them again with a small sample of the text in the unknown language added, and determining for which reference text the additional sample causes the smallest increase in file size (as described in Benedetto, Caglioti, & Loreto, 2002). By default, the compressed objects will be created in memory only.
To use the function guess.lang(), you first need to download the reference material. In this implementation, the Universal Declaration of Human Rights in Unicode formatting is used, because the document holds the world record of being the text translated into the most languages, and it is publicly available. Please get the zipped archive with all translations in .txt format. You can, but don’t have to, unzip the archive. The text whose language is to be guessed must also be in a Unicode .txt file:
guessed <- guess.lang(
  file.path(find.package("koRpus"), "tests", "testthat", "sample_text.txt"),
  udhr.path="~/downloads/udhr_txt.zip"
)
summary(guessed)
## Estimated language: English
## Identifier: eng
## Region: Europe
##
## 435 different languages were checked.
##
## Distribution of compression differences:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 136.0 168.0 176.0 190.7 184.0 280.0
##
## SD: 38.21
##
## Top 5 guesses:
## name iso639-3 bcp47 region diff diff.std
## 1 English eng en Europe 136 -1.430827
## 2 Scots sco sco Europe 136 -1.430827
## 3 Pidgin, Nigerian pcm pcm Africa 144 -1.221473
## 4 Catalan-Valencian-Balear cat ca Europe 152 -1.012119
## 5 French fra fr Europe 152 -1.012119
##
## Last 5 guesses:
## name iso639-3 bcp47 region diff diff.std
## 431 Burmese mya my Asia 280 2.337547
## 432 Shan shn shn Asia 280 2.337547
## 433 Tamil tam ta Asia 280 2.337547
## 434 Vietnamese (Han nom) vie vi-Hani Asia 280 2.337547
## 435 Chinese, Yue yue yue Asia 280 2.337547
Extending koRpus
The language support of this package has a modular design. There are some pre-built language packages in the l10n repository, and with a little effort you should be able to add new languages yourself. You need the package sources for this; then, basically, you have to add a new file and rebuild/reinstall the package. More details on this topic can be found in inst/README.languages. Once you’ve got a new language to work with koRpus, I’d be happy to include your module in the official distribution.
Despite its name, the scope of koRpus is single texts. If you would like to analyze a full corpus of texts, have a look at the plugin package tm.plugin.koRpus.
The APA style used in this vignette was kindly provided by the CSL project, licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.
Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.
Bormuth, J. R. (1968). Cloze test readability: Criterion reference scores. Journal of Educational Measurement, 5(3), 189–196.
Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 11–28.
Liang, F. M. (1983). Word hy-phen-a-tion by com-put-er (PhD thesis). Stanford University, Dept. Computer Science, Stanford.
Maas, H. D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift Für Literaturwissenschaft Und Linguistik, 2(8), 73–79.
McCarthy, P. M., & Jarvis, S. (2007). Vocd – a theoretical and empirical evaluation. Language Testing, 24(4), 459–488.
McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.
Quasthoff, U., Richter, M., & Biemann, C. (2006). Corpus portal for search in monolingual corpora. In Proceedings of the fifth international conference on language resources and evaluation (pp. 1799–1802). Genoa.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International conference on new methods in language processing (pp. 44–49). Manchester, UK.
Spache, G. (1953). A new readability formula for primary-grade reading materials. The Elementary School Journal, 53(7), 410–413.
Tweedie, F. J., & Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.
For a free implementation try https://strawberryperl.com↩
Like https://7-zip.org↩
Unfortunately, these language packages did not get the approval of the CRAN maintainers and are officially hosted at https://undocumeantit.github.io/repos/l10n/. For your convenience, the function install.koRpus.lang() can be used to easily install them anyway.↩
Presets are defined in the language support packages, usually named like their respective two-character language identifier. Refer to their documentation.↩
For information on the measures shown see Tweedie & Baayen (1998), McCarthy & Jarvis (2007), and McCarthy & Jarvis (2010).↩
Characteristics can be looked at to examine each measure’s dependency on text length. They are calculated by computing each measure repeatedly, beginning with only the first token, then adding the next, and progressing until the full text has been analyzed.↩
The package also has a function called read.corp.custom() which can be used to process language corpora yourself and store the results in an object of class kRp.corp.freq, which is the class returned by read.corp.LCC() and read.corp.celex() as well. That is, if you can’t get an already analyzed corpus database but have a huge language corpus at hand, you can create your own frequency database. But be warned that, depending on corpus size and your hardware, this might take ages. On the other hand, read.corp.custom() will provide inverse document frequency (idf) values for all types, which is necessary to compute tf-idf with freq.analysis().↩
Actually, it understands two different LCC formats, both the older .zip and the newer .tar archive format.↩
This can easily be done by calling as(freq.analysis.res, "kRp.text").↩
This sorting depends on proper POS-tagging, so this will only contain useful data if you used treetag() instead of tokenize().↩
These can then be corrected by using the function correct.tag().↩
Whether this is the case can be looked up in the documentation.↩
The hyphen() method was originally implemented as part of the koRpus package, but was later split off into its own package called sylly.
For example, this can be comfortably done with RKWard: https://rkward.kde.org↩
Please note that as of version 0.04-18, the correctness of some of these calculations has not been extensively validated yet. The package was released nonetheless, also to find outstanding bugs in the implemented measures. Any information on the validity of its results is very welcome!↩
Measures which rely on word lists will be skipped if no list is provided.↩