fm.index is an R package providing a fast data structure (FM Index) for finding occurrences of string snippets in large libraries of strings (corpus).
Partial string matching can be ~50-fold faster than simple string scans for many real world string collections. A given corpus is converted into a compact in-memory FM index representation that can be efficiently queried for partial string matches.
fm.index wraps the C++ library SDSL v3 and uses a Compressed Suffix Array based on a Wavelet Tree of the Burrow-Wheeler Transform of the given string library.
You can install the development version from GitHub with:
In this example we search a library of state names for all states that contain the substring “new”.
library(fm.index)
data("state")
print(state.name)
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
index <- fm_index_create(state.name, case_sensitive = FALSE)
hits <- fm_index_locate("new", index)
print(hits)
#> pattern_index corpus_index position
#> 1 1 29 1
#> 2 1 30 1
#> 3 1 31 1
#> 4 1 32 1
The result is a data frame with three columns. pattern_index
is the index of the query pattern, corpus_index
is the index of the matching string in the string corpus, and position
is the starting position of the match in the indexed string. All indices are 1-based.
In order to extract the matching states we can use the corpus_index
to subset the vector of original state names.
The speedup achieved by using fm.index over simple string scans, for example using grepl()
or stringi::stri_locate()
, depends on the kind of strings in the library.
For this example, we can generate a million random strings of length 50 and search them for all occurrences of the letters “ab”.
library(stringi)
library(microbenchmark)
set.seed(42)
lib_random <- stri_rand_strings(1000000, 50)
idx_random <- fm_index_create(lib_random)
head(fm_index_locate("ab", idx_random))
#> pattern_index corpus_index position
#> 1 1 792988 36
#> 2 1 251840 23
#> 3 1 875759 1
#> 4 1 341798 47
#> 5 1 56031 50
#> 6 1 376499 40
random_benchmark <- microbenchmark(
fm.index = fm_index_locate("ab", idx_random),
stringi = stri_locate_all_coll(lib_random, "ab"),
times = 3
)
print(random_benchmark)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> fm.index 927.7187 937.2335 947.8389 946.7483 957.899 969.0497 3 a
#> stringi 3732.3779 3772.4135 3833.5300 3812.4491 3884.106 3955.7630 3 b
Random strings are hard to compress, so searching using an FM Index only yields a modest ~4-fold speedup.
Real-world text is usually repetitive and much easier to compress than random strings. In this example, we search each line of a book for all instances of the word “help”.
book_path <- tempfile()
download.file("http://aleph.gutenberg.org/1/2/3/7/12370/12370.zip", book_path)
book <- scan(unz(book_path, "12370.txt"), sep = "\n", what = "character")
Five random lines from the book.
book[sample(length(book), 5)]
#> [1] "bring her a true and particular account of that strange circumstance,"
#> [2] "it. For a long while this was my way, that whatever living beings"
#> [3] "reason his appearance is such; he is burning with the fire of love; how"
#> [4] "prosperous. What strange fancy has possessed the royal mind! If to this"
#> [5] "water of immortality, and that in consequence of having drunk thereof,"
idx_book <- fm_index_create(book)
help_locations <- fm_index_locate("water", idx_book)
nrow(help_locations)
#> [1] 73
head(help_locations)
#> pattern_index corpus_index position
#> 1 1 5262 43
#> 2 1 5533 16
#> 3 1 4097 23
#> 4 1 6969 18
#> 5 1 4130 13
#> 6 1 4141 46
The number of rows in the resulting data frame tells us that the word “water” occurs 73 times.
book_benchmark <- microbenchmark(
fm.index = fm_index_locate("water", idx_book),
stringi = stri_locate_all_coll(book, "water"),
times = 10
)
print(book_benchmark)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fm.index 482.183 522.51 585.4968 528.3555 725.481 772.178 10
#> stringi 26649.996 27683.24 29156.1119 28875.5665 30187.416 32227.820 10
#> cld
#> a
#> b
The text from this book is encoded very efficiently by the FM Index, resulting in a ~55-fold speedup.
This work was supported by NIH grants U54-HL127365, U24-DK116204, and U54-HL127624.
This package is provided under the MIT license. The bundled SDSL library is licensed under the BSD license.