Example of stringdist_inner_join: Correcting misspellings against a dictionary

David Robinson

2020-05-14

Often you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.

The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):

library(dplyr)
library(fuzzyjoin)
data(misspellings)

misspellings
## # A tibble: 4,505 x 2
##    misspelling correct   
##    <chr>       <chr>     
##  1 abandonned  abandoned 
##  2 aberation   aberration
##  3 abilties    abilities 
##  4 abilty      ability   
##  5 abondon     abandon   
##  6 abbout      about     
##  7 abotu       about     
##  8 abouta      about a   
##  9 aboutit     about it  
## 10 aboutthe    about the 
## # … with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
words
## # A tibble: 20,137 x 2
##    word  syllables
##    <chr>     <dbl>
##  1 hm            1
##  2 hmm           1
##  3 hmmm          1
##  4 hmph          1
##  5 mmhmm         2
##  6 mmhm          2
##  7 mm            1
##  8 mmm           1
##  9 mmmm          1
## 10 pff           1
## # … with 20,127 more rows

As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join to join them against our dictionary.

set.seed(2016)
sub_misspellings <- misspellings %>%
  sample_n(1000)
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)

By default, stringdist_inner_join uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling is close to (but not equal to) word:

joined
## # A tibble: 760 x 4
##    misspelling correct    word       syllables
##    <chr>       <chr>      <chr>          <dbl>
##  1 cyclinder   cylinder   cylinder           3
##  2 beastiality bestiality bestiality         5
##  3 affilate    affiliate  affiliate          4
##  4 supress     suppress   suppress           2
##  5 intevene    intervene  intervene          3
##  6 resaurant   restaurant restaurant         3
##  7 univesity   university university         5
##  8 allegedely  allegedly  allegedly          4
##  9 emiting     emitting   smiting            2
## 10 probaly     probably   probably           3
## # … with 750 more rows

Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?

joined %>%
  count(misspelling, correct)
## # A tibble: 462 x 3
##    misspelling correct         n
##    <chr>       <chr>       <int>
##  1 abilty      ability         1
##  2 accademic   academic        1
##  3 accademy    academy         1
##  4 accension   accession       2
##  5 acceptence  acceptance      1
##  6 acedemic    academic        1
##  7 achive      achieve         4
##  8 acommodate  accommodate     1
##  9 acuracy     accuracy        1
## 10 addmission  admission       1
## # … with 452 more rows

So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?

which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct
## # A tibble: 462 x 4
## # Groups:   misspelling [453]
##    misspelling correct     guesses one_correct
##    <chr>       <chr>         <int> <lgl>      
##  1 abilty      ability           1 TRUE       
##  2 accademic   academic          1 TRUE       
##  3 accademy    academy           1 TRUE       
##  4 accension   accession         2 TRUE       
##  5 acceptence  acceptance        1 TRUE       
##  6 acedemic    academic          1 TRUE       
##  7 achive      achieve           4 TRUE       
##  8 acommodate  accommodate       1 TRUE       
##  9 acuracy     accuracy          1 TRUE       
## 10 addmission  admission         1 TRUE       
## # … with 452 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8246753
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 290

Not bad.

Note that stringdist_inner_join is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join:

left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined
## # A tibble: 1,298 x 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     <NA>              NA
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     <NA>              NA
## 10 decompositing decomposing   <NA>              NA
## # … with 1,288 more rows
left_joined %>%
  filter(is.na(word))
## # A tibble: 538 x 4
##    misspelling   correct       word  syllables
##    <chr>         <chr>         <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>         NA
##  2 consicousness consciousness <NA>         NA
##  3 repubicans    republicans   <NA>         NA
##  4 comitted      committed     <NA>         NA
##  5 emmisions     emissions     <NA>         NA
##  6 acquited      acquitted     <NA>         NA
##  7 decompositing decomposing   <NA>         NA
##  8 decieved      deceived      <NA>         NA
##  9 asociated     associated    <NA>         NA
## 10 commonweath   commonwealth  <NA>         NA
## # … with 528 more rows

(To get just the ones without matches immediately, we could have used stringdist_anti_join). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:

left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2
## # A tibble: 8,721 x 4
##    misspelling   correct       word       syllables
##    <chr>         <chr>         <chr>          <dbl>
##  1 Sanhedrim     Sanhedrin     <NA>              NA
##  2 cyclinder     cylinder      cylinder           3
##  3 beastiality   bestiality    bestiality         5
##  4 consicousness consciousness <NA>              NA
##  5 affilate      affiliate     affiliate          4
##  6 repubicans    republicans   <NA>              NA
##  7 comitted      committed     committee          3
##  8 emmisions     emissions     <NA>              NA
##  9 acquited      acquitted     acquire            2
## 10 acquited      acquitted     acquit             2
## # … with 8,711 more rows
left_joined2 %>%
  filter(is.na(word))
## # A tibble: 286 x 4
##    misspelling   correct        word  syllables
##    <chr>         <chr>          <chr>     <dbl>
##  1 Sanhedrim     Sanhedrin      <NA>         NA
##  2 consicousness consciousness  <NA>         NA
##  3 repubicans    republicans    <NA>         NA
##  4 emmisions     emissions      <NA>         NA
##  5 commonweath   commonwealth   <NA>         NA
##  6 supressed     suppressed     <NA>         NA
##  7 aproximately  approximately  <NA>         NA
##  8 Missisippi    Mississippi    <NA>         NA
##  9 lazyness      laziness       <NA>         NA
## 10 constituional constitutional <NA>         NA
## # … with 276 more rows

Most of the missing words here simply aren’t in our dictionary.

You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.