morphemepiece: Morpheme Tokenization

Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
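
The lookup-then-fallback flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not the package's R implementation: the toy `LOOKUP` and `VOCAB` contents, the `##` continuation marker, the `[UNK]` token, and both function names are assumptions made for the example. The fallback shown is plain greedy longest-match wordpiece, whereas the package uses a modified variant whose details are not given here.

```python
# Hypothetical toy data; the real tables ship in morphemepiece.data.
LOOKUP = {"unbreakable": ["un", "##break", "##able"]}
VOCAB = {"un", "##break", "##able", "walk", "##ed", "##s", "[UNK]"}


def wordpiece_fallback(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first wordpiece (standard, unmodified variant)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation marker.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize_word(word, lookup, vocab):
    """Use the lookup table when the word is listed, else fall back."""
    if word in lookup:
        return list(lookup[word])
    return wordpiece_fallback(word, vocab)
```

With these toy tables, `tokenize_word("unbreakable", LOOKUP, VOCAB)` is answered from the lookup table, while `tokenize_word("walked", LOOKUP, VOCAB)` is not listed and goes through the fallback.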

Version: 1.2.3
Imports: dlr (≥ 1.0.0), fastmatch, magrittr, memoise (≥ 2.0.0), morphemepiece.data, piecemaker (≥ 1.0.0), purrr (≥ 0.3.4), readr, rlang, stringr (≥ 1.4.0)
Suggests: dplyr, fs, ggplot2, here, knitr, remotes, rmarkdown, testthat (≥ 3.0.0), utils
Published: 2022-04-16
Author: Jonathan Bratt [aut, cre], Jon Harmon [aut], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph]
Maintainer: Jonathan Bratt <jonathan.bratt at macmillan.com>
BugReports: https://github.com/macmillancontentscience/morphemepiece/issues
License: Apache License (≥ 2)
URL: https://github.com/macmillancontentscience/morphemepiece
NeedsCompilation: no
Materials: README NEWS
CRAN checks: morphemepiece results

Documentation:

Reference manual: morphemepiece.pdf
Vignettes: Testing the fall-through algorithm, Generating a Vocabulary and Lookup

Downloads:

Package source: morphemepiece_1.2.3.tar.gz
Windows binaries: r-devel: morphemepiece_1.2.3.zip, r-release: morphemepiece_1.2.3.zip, r-oldrel: morphemepiece_1.2.3.zip
macOS binaries: r-release (arm64): morphemepiece_1.2.3.tgz, r-oldrel (arm64): morphemepiece_1.2.3.tgz, r-release (x86_64): morphemepiece_1.2.3.tgz, r-oldrel (x86_64): morphemepiece_1.2.3.tgz
Old sources: morphemepiece archive

Linking:

Please use the canonical form https://CRAN.R-project.org/package=morphemepiece to link to this page.