This R package edlibR
provides bindings to the C/C++ library edlib, which computes the exact pairwise sequence alignment using the edit distance (Levenshtein distance). The functions within edlibR
are modeled after the API of the Python package edlib on PyPI
There are three functions within edlibR
:
The first function provided by edlibR
is align()
. The function align()
computes the pairwise alignment of the input query
against the input target
:
align(query, target, [mode], [task], [k], [cigarFormat], [additionalEqualities])
A list is returned with the following fields:
query
and target
.list(c(start, end))
. Note: if the start or end positions are NULL, this is encoded as NA
to work correctly with R vectors.cigarFormat
in the function align()
which is returned here for the function getNiceAlignment()
. (Note: the function getNiceAlignment()
only accepts cigarFormat="extended"
.)library(edlibR)
= align("ACTG", "CACTRT", mode="HW", task="path")
algn1 print(algn1)
## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
= align("elephant", "telephone")
algn2 print(algn2)
## $editDistance
## [1] 3
##
## $alphabetLength
## [1] 8
##
## $locations
## $locations[[1]]
## [1] NA 8
##
##
## $cigar
## NULL
##
## $cigarFormat
## [1] "extended"
= align("ACTG", "CACTRT", mode="HW", task="path")
algn3 print(algn3)
## $editDistance
## [1] 1
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 3
##
## $locations[[2]]
## [1] 1 4
##
##
## $cigar
## [1] "3=1I"
##
## $cigarFormat
## [1] "extended"
## the previous example with additionalEqualities
= align("ACTG", "CACTRT", mode="HW", task="path", additionalEqualities=list(c("R", "A"), c("R", "G")))
algn4 print(algn4)
## $editDistance
## [1] 0
##
## $alphabetLength
## [1] 5
##
## $locations
## $locations[[1]]
## [1] 1 4
##
##
## $cigar
## [1] "4="
##
## $cigarFormat
## [1] "extended"
edlibR
:
AACT
and target as AACTGGC
, the edit distance would be 0, because removing GGC
from the end of the second sequence is “free” and does not count into the total edit distance. This method is appropriate when you want to find out how well the first sequence fits at the beginning of the second sequence.ACT
and CGACTGAC
, the edit distance would be 0, because removing CG
from the start and GAC
from the end of the second sequence is “free” and does not count into the total edit distance. This method is appropriate when you want to find out how well the first sequence fits at any part of the second sequence. For example, if your second sequence was a long text and your first sequence was a sentence from that text, but slightly scrambled, you could use this method to discover how scrambled it is and where it fits in that text. In bioinformatics, this method is appropriate for aligning a read to a sequence.cigarFormat="extended"
):
The function getNiceAlignment() takes the output of align(), and represents this in a visually informative format for human inspection (“NICE format”). This will be an informative string showing the matches, mismatches, insertions, and deletions.
getNiceAlignment(alignResult, query, target, [gapSymbol])
Note: Users must use the argument task="path"
within align()
to output a CIGAR for getNiceAlignment()
; otherwise, there will be no CIGAR for getNiceAlignment()
to reconstruct the alignment in “NICE” format. Also, users must use the argument cigarFormat="extended"
within align(); otherwise, the CIGAR will be too ambiguous for getNiceAlignment() to correctly reconstruct the alignment() in “NICE” format.
library(edlibR)
= "elephant"
query = "telephone"
target = align(query, target, task = "path")
result = getNiceAlignment(result, query, target)
nice_algn print(nice_algn)
## $query_aligned
## [1] "-elephant"
##
## $matched_aligned
## [1] "-|||||.|."
##
## $target_aligned
## [1] "telephone"
align()
. As mentioned above, align()
must use the arguments task="path"
and cigarFormat="extended"
in order for the CIGAR to be informative enough for getNiceAlignment()
to work properly.alignResult
alignResult
query
and target
(default="-"
). This must be a single character, i.e. a string of length 1 (i.e. nchar(gapSymbol)
must equal 1).The function nice_print()
simply prints the output of getNiceAlignment()
to the console for quickly inspecting the alignment. Users can think of this function as a “pretty-print” function for visualization.
library(edlibR)
## example above from getNiceAlignment()
= "elephant"
query = "telephone"
target = align(query, target, task = "path")
result = getNiceAlignment(result, query, target)
nice_algn nice_print(nice_algn)
## [1] "query: -elephant"
## [1] "matched: -|||||.|."
## [1] "target: telephone"
For more information regarding edlib, please see the publication in Bioinformatics.