stringdist_inner_join
: Correcting misspellings against a dictionaryOften you find yourself with a set of words that you want to combine with a “dictionary”- it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.
The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):
library(dplyr)
library(fuzzyjoin)
data(misspellings)
misspellings
## # A tibble: 4,505 x 2
## misspelling correct
## <chr> <chr>
## 1 abandonned abandoned
## 2 aberation aberration
## 3 abilties abilities
## 4 abilty ability
## 5 abondon abandon
## 6 abbout about
## 7 abotu about
## 8 abouta about a
## 9 aboutit about it
## 10 aboutthe about the
## # … with 4,495 more rows
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus.
library(qdapDictionaries)
words <- tbl_df(DICTIONARY)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
words
## # A tibble: 20,137 x 2
## word syllables
## <chr> <dbl>
## 1 hm 1
## 2 hmm 1
## 3 hmmm 1
## 4 hmph 1
## 5 mmhmm 2
## 6 mmhm 2
## 7 mm 1
## 8 mmm 1
## 9 mmmm 1
## 10 pff 1
## # … with 20,127 more rows
As an example, we’ll pick 1000 of these words (you could try it on all of them though), and use stringdist_inner_join
to join them against our dictionary.
set.seed(2016)
sub_misspellings <- misspellings %>%
sample_n(1000)
joined <- sub_misspellings %>%
stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)
By default, stringdist_inner_join
uses optimal string alignment (Damerau–Levenshtein distance), and we’re setting a maximum distance of 1 for a join. Notice that they’ve been joined in cases where misspelling
is close to (but not equal to) word
:
joined
## # A tibble: 760 x 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 cyclinder cylinder cylinder 3
## 2 beastiality bestiality bestiality 5
## 3 affilate affiliate affiliate 4
## 4 supress suppress suppress 2
## 5 intevene intervene intervene 3
## 6 resaurant restaurant restaurant 3
## 7 univesity university university 5
## 8 allegedely allegedly allegedly 4
## 9 emiting emitting smiting 2
## 10 probaly probably probably 3
## # … with 750 more rows
Note that there are some redundancies; words that could be multiple items in the dictionary. These end up with one row per “guess” in the output. How many words did we classify?
joined %>%
count(misspelling, correct)
## # A tibble: 462 x 3
## misspelling correct n
## <chr> <chr> <int>
## 1 abilty ability 1
## 2 accademic academic 1
## 3 accademy academy 1
## 4 accension accession 2
## 5 acceptence acceptance 1
## 6 acedemic academic 1
## 7 achive achieve 4
## 8 acommodate accommodate 1
## 9 acuracy accuracy 1
## 10 addmission admission 1
## # … with 452 more rows
So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?
which_correct <- joined %>%
group_by(misspelling, correct) %>%
summarize(guesses = n(), one_correct = any(correct == word))
which_correct
## # A tibble: 462 x 4
## # Groups: misspelling [453]
## misspelling correct guesses one_correct
## <chr> <chr> <int> <lgl>
## 1 abilty ability 1 TRUE
## 2 accademic academic 1 TRUE
## 3 accademy academy 1 TRUE
## 4 accension accession 2 TRUE
## 5 acceptence acceptance 1 TRUE
## 6 acedemic academic 1 TRUE
## 7 achive achieve 4 TRUE
## 8 acommodate accommodate 1 TRUE
## 9 acuracy accuracy 1 TRUE
## 10 addmission admission 1 TRUE
## # … with 452 more rows
# percentage of guesses getting at least one right
mean(which_correct$one_correct)
## [1] 0.8246753
# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
## [1] 290
Not bad.
Note that stringdist_inner_join
is not the only function we can use. If we’re interested in including the words that we couldn’t classify, we could have used stringdist_left_join
:
left_joined <- sub_misspellings %>%
stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)
left_joined
## # A tibble: 1,298 x 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 cyclinder cylinder cylinder 3
## 3 beastiality bestiality bestiality 5
## 4 consicousness consciousness <NA> NA
## 5 affilate affiliate affiliate 4
## 6 repubicans republicans <NA> NA
## 7 comitted committed <NA> NA
## 8 emmisions emissions <NA> NA
## 9 acquited acquitted <NA> NA
## 10 decompositing decomposing <NA> NA
## # … with 1,288 more rows
left_joined %>%
filter(is.na(word))
## # A tibble: 538 x 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 consicousness consciousness <NA> NA
## 3 repubicans republicans <NA> NA
## 4 comitted committed <NA> NA
## 5 emmisions emissions <NA> NA
## 6 acquited acquitted <NA> NA
## 7 decompositing decomposing <NA> NA
## 8 decieved deceived <NA> NA
## 9 asociated associated <NA> NA
## 10 commonweath commonwealth <NA> NA
## # … with 528 more rows
(To get just the ones without matches immediately, we could have used stringdist_anti_join
). If we increase our distance threshold, we’ll increase the fraction with a correct guess, but also get more false positive guesses:
left_joined2 <- sub_misspellings %>%
stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)
left_joined2
## # A tibble: 8,721 x 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 cyclinder cylinder cylinder 3
## 3 beastiality bestiality bestiality 5
## 4 consicousness consciousness <NA> NA
## 5 affilate affiliate affiliate 4
## 6 repubicans republicans <NA> NA
## 7 comitted committed committee 3
## 8 emmisions emissions <NA> NA
## 9 acquited acquitted acquire 2
## 10 acquited acquitted acquit 2
## # … with 8,711 more rows
left_joined2 %>%
filter(is.na(word))
## # A tibble: 286 x 4
## misspelling correct word syllables
## <chr> <chr> <chr> <dbl>
## 1 Sanhedrim Sanhedrin <NA> NA
## 2 consicousness consciousness <NA> NA
## 3 repubicans republicans <NA> NA
## 4 emmisions emissions <NA> NA
## 5 commonweath commonwealth <NA> NA
## 6 supressed suppressed <NA> NA
## 7 aproximately approximately <NA> NA
## 8 Missisippi Mississippi <NA> NA
## 9 lazyness laziness <NA> NA
## 10 constituional constitutional <NA> NA
## # … with 276 more rows
Most of the missing words here simply aren’t in our dictionary.
You can try other distance thresholds, other dictionaries, and other distance metrics (see stringdist-metrics for more). This function is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of responses.