tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for
shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs,
characters, shingled characters, lines, tweets, Penn Treebank tokens, and
regular expressions, as well as functions for counting characters, words,
and sentences, and a function for splitting longer texts into separate
documents, each with the same number of words. The tokenizers have a
consistent interface, and the package is built on the 'stringi' and 'Rcpp'
packages for fast yet correct tokenization in 'UTF-8'.
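
A minimal sketch of that consistent interface, using the package's documented tokenizer functions (the sample text and the chunk size are illustrative): each tokenizer takes a character vector of documents and returns a list with one element per input document.

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog. The dog barks."

# Word tokens: lowercased, with punctuation stripped, by default
tokenize_words(text)

# Shingled n-grams (here bigrams) built from the word tokens
tokenize_ngrams(text, n = 2)

# Sentence tokens
tokenize_sentences(text)

# Word stems, via the SnowballC stemmer listed in Imports
tokenize_word_stems(text)

# Counting helpers return one integer per input document
count_words(text)      # expected: 12
count_sentences(text)  # expected: 2

# Split a longer text into documents of roughly equal word counts
chunk_text(text, chunk_size = 6)
```

Because every tokenizer returns the same list-of-character-vectors shape, downstream packages such as tidytext (listed under reverse imports below) can swap one tokenizer for another without changing how they consume the output.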
Version: 0.2.1
Depends: R (≥ 3.1.3)
Imports: stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1)
LinkingTo: Rcpp
Suggests: covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat
Published: 2018-03-29
Author: Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]
Maintainer: Lincoln Mullen <lincoln at lincolnmullen.com>
BugReports: https://github.com/ropensci/tokenizers/issues
License: MIT + file LICENSE
URL: https://lincolnmullen.com/software/tokenizers/
NeedsCompilation: yes
Citation: tokenizers citation info
Materials: README NEWS
In views: NaturalLanguageProcessing
CRAN checks: tokenizers results
Reverse dependencies:
Reverse imports: covfefe, deeplr, DeepPINCS, DramaAnalysis, epitweetr, pdfsearch, proustr, rslp, textfeatures, textrecipes, tidypmc, tidytext, ttgsea, wactor
Reverse suggests: cwbtools, edgarWebR, quanteda, torchdatasets
Linking: Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.