Lua word count filter

Frederik Aust

2020-11-25

Using the word count filter

The aim of the rmdfiltr word count filter is to provide a more accurate estimate of the number of words in a document than can be gleaned from the R Markdown source document. When the source document is analyzed, output from (inline) R chunks as well as formatted citations and references cannot enter the word count. Hence, the word count filter is applied after the document has been knitted and while it is being processed by pandoc. At this stage, the document is represented as an abstract syntax tree (AST), a semantic nested list, and can be manipulated by applying so-called filters.

One of the filters that is applied to R Markdown documents by default is citeproc (previously pandoc-citeproc), which formats citations and inserts references. To obtain an accurate estimate, the word count filter should therefore be applied after citeproc. To do so, it is necessary to disable the default application of citeproc, because it is always applied last, by adding the following to the document's YAML front matter:

citeproc: no

To manually apply citeproc and subsequently the rmdfiltr word count filter, add the pandoc arguments to the output format of your R Markdown document as pandoc_args. Each filter function returns a vector of command line arguments; it takes previous arguments as args and appends its own. Hence, the calls to add filters can be nested:

library("rmdfiltr")
add_citeproc_filter(args = NULL)
#> [1] "--citeproc"
add_wordcount_filter(add_citeproc_filter(args = NULL))
#> [1] "--citeproc"                                                                                                 
#> [2] "--lua-filter"                                                                                               
#> [3] "/private/var/folders/nv/mz4ffsbn045101ngdd_mx0th0000gn/T/Rtmp7tDXDt/Rinst21297336bb2/rmdfiltr/wordcount.lua"
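
The nesting pattern can be illustrated with a minimal base R sketch. The helpers add_flag() and add_lua_filter() and the path "wordcount.lua" are hypothetical stand-ins for illustration only; they are not part of rmdfiltr and merely mimic the "take args, append, return" behaviour described above.

```r
# Hypothetical helpers (not part of rmdfiltr): each takes the arguments
# collected so far and appends its own to the end of the vector.
add_flag <- function(args = NULL, flag) {
  c(args, flag)
}

add_lua_filter <- function(args = NULL, path) {
  c(args, "--lua-filter", path)
}

# The innermost call runs first, so its arguments come first on the
# command line. Here the citeproc flag precedes the Lua filter.
pandoc_args <- add_lua_filter(
  add_flag(args = NULL, flag = "--citeproc"),
  path = "wordcount.lua" # placeholder path, assumed for illustration
)

print(pandoc_args)
```

Because the arguments are appended in call order, the innermost function determines what pandoc applies first, which is why add_citeproc_filter() sits inside add_wordcount_filter() in the calls above.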

When adding the filters to pandoc_args, the R code needs to be preceded by !expr to mark it as an expression to be evaluated.

output:
  html_document:
    pandoc_args: !expr rmdfiltr::add_wordcount_filter(rmdfiltr::add_citeproc_filter(args = NULL))
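
As an alternative to the YAML front matter, the same arguments could presumably be passed when rendering from R. The following is only a sketch under assumptions: the wrapper name render_with_wordcount() and the file name "manuscript.Rmd" are hypothetical, and the document's front matter is still assumed to contain citeproc: no as described above.

```r
# Hypothetical wrapper: renders an R Markdown file to HTML with citeproc
# applied first and the rmdfiltr word count filter applied afterwards.
# Assumes the rmarkdown and rmdfiltr packages are installed.
render_with_wordcount <- function(input) {
  # Build the pandoc arguments in the same nested fashion as above
  args <- rmdfiltr::add_wordcount_filter(
    rmdfiltr::add_citeproc_filter(args = NULL)
  )

  # Pass the arguments to the output format instead of the YAML header
  rmarkdown::render(
    input,
    output_format = rmarkdown::html_document(pandoc_args = args)
  )
}

# Usage (file name is a placeholder):
# render_with_wordcount("manuscript.Rmd")
```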

The word count filter reports the word counts in the console or, when working in RStudio, in the R Markdown tab.

285 words in text body
23 words in reference section

Word count filter performance

The rmdfiltr filter is an adapted combination of two other Lua filters by John MacFarlane and contributors.

Although word counting appears to be a trivial matter, the counts of different methods often disagree. The magnitude of those disagreements depends on the complexity of the document.
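
To see why counts can diverge, consider a small base R sketch that applies two naive definitions of a "word" to the same sentence. Neither function reflects how the rmdfiltr filter or any of the tools compared below actually count; both, and the example sentence, are assumptions made purely for illustration.

```r
# Example sentence with hyphens, an apostrophe, a symbol, and a URL
text <- "The well-known p-value (p < .05) isn't reported at https://example.org/a-b."

# Definition 1: a word is anything separated by whitespace
count_whitespace <- function(x) {
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  length(tokens[nzchar(tokens)])
}

# Definition 2: a word is an uninterrupted run of letters or digits
count_alnum_runs <- function(x) {
  matches <- gregexpr("[[:alnum:]]+", x)[[1]]
  if (matches[1] == -1) 0L else length(matches)
}

count_whitespace(text)  # 10
count_alnum_runs(text)  # 16
```

Hyphenated words, contractions, symbols, and URLs are each split differently by the two definitions, so the gap between methods tends to widen as a document contains more of them.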

To get a feeling for the performance of the word count filter, I briefly compared the estimates for two documents across several common methods. The first document, a paper by Stahl & Aust (2018), is rather simple, consisting only of text with citations and a reference section. The second document is more complicated: it contains math, code, verbatim output, etc.

The word counts for the text body do not include tables or images (or their captions) or the reference section (excluding these required some manual labor in Word, Pages, and wordcounter.net).

Overall, all methods provide similar estimates for the text body of the simple document. Although the document contains a considerable number of citations, the wordcountaddin, which is applied to the R Markdown source file before citeproc, provides a good estimate. As expected, there is less agreement on the word count for the shorter and more complex document. In particular, the texcount word count is off: it displayed several errors related to the displayed R code and verbatim output. I think the errors may have caused texcount to ignore some parts of the text and are probably the reason for the low word count of the text body. Similarly, the wordcountaddin cannot count the verbatim output.

The patterns for the reference sections of the simple and complex documents are comparable. Pages and texcount count more words than Word, wordcounter.net, and the rmdfiltr word count filter. I suspect the difference is due to how the methods handle the URLs in the references. The wordcountaddin cannot provide a word count for reference sections.

Overall, I’m fairly happy with the performance of the rmdfiltr filter. The word counts are quite similar to those of the majority of the other methods. I’m sure the filter can be improved (and I’ll gladly take any suggestions), but I think in its current form it is a decent solution.

References

Stahl, C., & Aust, F. (2018). Evaluative conditioning as memory-based judgment. Social Psychological Bulletin, 13(3), Article e28589. https://doi.org/10.5964/spb.v13i3.28589