Why ‘pluralize’?

Bob Rudis

2020-06-05

There are times when one needs to do some lightweight text mining and need to normalize words without the baggage that comes along with dependencies such as tm or OpenNLP.

One of the steps in this normalization process that is often helpful is to take plural words and make them singular vs “stem” them (which often leaves interesting prefixes that work but aren’t exactly “pretty”.)

The English language makes this a bit more gnarly at times and there are also cases where one might want to bend the rules a bit (i.e. discipline-specific terminology that may not abide by the rules).

While many words can be made plural by just adding an “s” there are cases where one has to add “es”, “ies” depending on the ending consonant or vowel.

There are further cases, like “knife”, where there are special rules that force you to do things like make that “knives” to get the plural form.

Then, there are irregular plurals like “fish” => “fish” and “person” => “people” and others like “cactus” => “cacti” or “erratum” => “errata”.

However, you may need to override normal behavior. For instance, “paper” would normally turn into “papers” but you may need it to stay “paper”.

This package is based on a javascript library that has encoded the core pluralization (and singularization) rules and exposed various functions to aid in maniuplating strings, including the ability to add special case rules (like the “paper” example).

Here are some examples to show the functionality:

Basic pluralization

library(pluralize)

(plurals <- pluralize(c("snake", "window", "box", "boy", "lorry",
                        "potato", "knife")))
## [1] "snakes"   "windows"  "boxes"    "boys"     "lorries"  "potatoes" "knives"

and the reverse:

singularize(plurals)
## [1] "snake"  "window" "box"    "boy"    "lorry"  "potato" "knife"

ending “o”:

(plurals <- pluralize(c("echo", "embargo", "hero", "potato", "tomato",
                        "torpedo", "veto")))
## [1] "echoes"    "embargos"  "heroes"    "potatoes"  "tomatoes"  "torpedoes"
## [7] "vetos"
singularize(plurals)
## [1] "echo"    "embargo" "hero"    "potato"  "tomato"  "torpedo" "veto"

others ending “o”:

(plurals <- pluralize(c("auto", "kangaroo", "kilo", "memo", "photo", "piano",
                        "pimento", "pro", "solo", "soprano", "studio", "tattoo",
                        "video", "zoo")))
##  [1] "autos"     "kangaroos" "kilos"     "memos"     "photos"    "pianos"   
##  [7] "pimentos"  "pros"      "solos"     "sopranos"  "studios"   "tattoos"  
## [13] "videos"    "zoos"
singularize(plurals)
##  [1] "auto"     "kangaroo" "kilo"     "memo"     "photo"    "piano"   
##  [7] "pimento"  "pro"      "solo"     "soprano"  "studio"   "tattoo"  
## [13] "video"    "zoo"

Some special nouns:

(plurals <- pluralize(c("knife", "leaf", "hoof", "life", "self", "elf")))
## [1] "knives" "leaves" "hooves" "lives"  "selves" "elves"
singularize(plurals)
## [1] "knife" "leaf"  "hoof"  "life"  "self"  "elf"

Some irregular nouns:

(plurals <- pluralize(c("fish", "sheep", "foot", "tooth", "goose",
                        "child", "man", "woman", "person", "mouse")))
##  [1] "fish"     "sheep"    "feet"     "teeth"    "geese"    "children"
##  [7] "men"      "women"    "people"   "mice"
singularize(plurals)
##  [1] "fish"   "sheep"  "foot"   "tooth"  "goose"  "child"  "man"    "woman" 
##  [9] "person" "mouse"

Again, you may want to do something special for certain words:

add_plural_rule("corpus", "corpora")
add_plural_rule("formula", "formulae")
add_plural_rule("memoranda", "memorandums")

add_singular_rule("corpora", "corpus")
add_singular_rule("formulae", "formula")
add_singular_rule("memorandums", "memoranda")

(plurals <- pluralize(c("corpus", "formula", "memoranda")))
## [1] "corpora"     "formulae"    "memorandums"
singularize(plurals)
## [1] "corpus"    "formula"   "memoranda"

Since this package is based on a javascript library - pluralize.js by Blake Embrey - and uses the V8 package it won’t be as lightning fast as a pure C/C++-backed package (it takes ~10 seconds to pluralize/singularize 10,000 words which is kinda horrible but sufficient for this project I’m working on), but it wouldn’t take much to translate the rules built into pluralize.js over to C++.