This vignette covers the old three-argument syntax used in the *_named functions; see the other vignette “recommended variable argument syntax” for documentation about the *_variable functions.
The str_match_named and str_match_all_named functions take three arguments as input:
- subject is the character vector from which we want to extract tabular data.
- pattern is the (character scalar) regular expression with named capture groups used for extraction. The named capture groups should be literally specified, e.g. (?P<group_name>subpattern).
- fun.list is an optional list with names that correspond to capture groups, and values that are functions used to convert the extracted character data to other (typically numeric) types.

For example, consider the character vector of subject strings below:
subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
With these genomic range subjects, the goal is to extract the chromosome name (before the colon), along with the numeric start and end locations (separated by a dash). That pattern is coded below:
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
  "-",
  "(?P<chromEnd>[0-9,]+)",
  ")?")
Note that it is often preferable (as above) to code the pattern using paste0, in order to put each part of the pattern on its own line. In particular, I recommend putting each named capture group on its own line; this is one of the ideas used in the variable argument syntax (see the other vignette for more information).
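As a quick check on the paste0 layout described above, the following base R sketch confirms that the multi-line form is the same string as a one-line pattern, and that the group names are visible to the regex engine. It uses only base R (regexpr with perl=TRUE), not namedCapture; the variable names pattern.pieces, pattern.oneline, same.pattern, and group.names are made up for this illustration.

```r
# Assemble the pattern one piece per line, as in the text above.
pattern.pieces <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
  "-",
  "(?P<chromEnd>[0-9,]+)",
  ")?")

# The same pattern written as one long (harder to read) string.
pattern.oneline <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
same.pattern <- identical(pattern.pieces, pattern.oneline)

# With perl=TRUE, base R regexpr records the names of the capture groups
# in the "capture.names" attribute of its result.
first.match <- regexpr(
  pattern.pieces, "chr10:213,054,000-213,055,000", perl = TRUE)
group.names <- as.vector(attr(first.match, "capture.names"))
print(same.pattern)
print(group.names)
```

The non-capturing group (?:...) does not appear among the names, which is why only three columns show up in the results.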
Using the pattern on the subjects above results in
(match.mat <- namedCapture::str_match_named(subject.vec, chr.pos.pattern))
#>      chrom   chromStart    chromEnd
#> [1,] "chr10" "213,054,000" "213,055,000"
#> [2,] "chrM"  "111,000"     ""
#> [3,] NA      NA            NA
#> [4,] NA      NA            NA
#> [5,] "chr1"  "110"         "111"
str(match.mat)
#> chr [1:5, 1:3] "chr10" "chrM" NA NA "chr1" "213,054,000" "111,000" NA NA ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:3] "chrom" "chromStart" "chromEnd"
Note that the third argument (list of conversion functions) is omitted in the code above. In that case, the return value is a character matrix, in which missing values indicate missing subjects or no match. The empty string is used for optional groups which are not used in the match (e.g. chromEnd group/column for second subject).
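To make the NA versus empty-string convention concrete, here is a minimal base R sketch that produces a character matrix of this kind from regexpr(perl=TRUE). The helper sketch_match_named is hypothetical and written only for this illustration; it is not claimed to be the package's implementation, although it follows the same conventions described above.

```r
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
  "-",
  "(?P<chromEnd>[0-9,]+)",
  ")?")
subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA)

# Hypothetical helper: first match per subject, as a character matrix.
sketch_match_named <- function(subject, pattern) {
  m <- regexpr(pattern, subject, perl = TRUE)
  first <- attr(m, "capture.start")
  last <- first + attr(m, "capture.length") - 1L
  # substring recycles subject down each column of first/last;
  # a group that did not take part in the match yields "".
  out <- matrix(
    substring(subject, first, last),
    nrow = length(subject),
    dimnames = list(NULL, attr(m, "capture.names")))
  # Rows for missing subjects or non-matching subjects become NA.
  no.match <- is.na(subject) | as.integer(m) == -1L
  out[no.match, ] <- NA
  out
}

sketch.mat <- sketch_match_named(subject.vec, chr.pos.pattern)
print(sketch.mat)
# Distinguish "no match" (NA) from "optional group not used" ("").
is.na(sketch.mat[, "chrom"])
sketch.mat[, "chromEnd"] == ""
```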
However, we often want to extract numeric data; in this case we want to convert chromStart and chromEnd to integers. You can do that by supplying a named list of conversion functions as the third argument. Each function should take exactly one argument, a character vector (the data in the matched column/group), and return a vector of the same size. The code below specifies the keep.digits function for both chromStart and chromEnd.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
conversion.list <- list(chromStart=keep.digits, chromEnd=keep.digits)
(match.df <- namedCapture::str_match_named(
  subject.vec, chr.pos.pattern, conversion.list))
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#> 2  chrM     111000        NA
#> 3  <NA>         NA        NA
#> 4  <NA>         NA        NA
#> 5  chr1        110       111
str(match.df)
#> 'data.frame': 5 obs. of 3 variables:
#> $ chrom : chr "chr10" "chrM" NA NA ...
#> $ chromStart: int 213054000 111000 NA NA 110
#> $ chromEnd : int 213055000 NA NA NA 111
Note that a data.frame is returned when the third argument is specified, in order to handle non-character data types returned by the conversion functions.
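The conversion step can be pictured as applying each function in the list to the character column of the same name. The base R sketch below illustrates that idea under stated assumptions: sketch_apply_conversions is a hypothetical helper (not the package's internal code), and the two-row input matrix is copied by hand from the first two rows of the character matrix shown earlier.

```r
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))

# Hand-copied first two rows of the character matrix shown earlier.
match.mat <- matrix(
  c("chr10", "chrM",
    "213,054,000", "111,000",
    "213,055,000", ""),
  nrow = 2,
  dimnames = list(NULL, c("chrom", "chromStart", "chromEnd")))

# Hypothetical helper: convert the named columns, keep the rest as character.
sketch_apply_conversions <- function(char.mat, fun.list) {
  df <- data.frame(char.mat, stringsAsFactors = FALSE)
  for (group.name in names(fun.list)) {
    convert <- fun.list[[group.name]]
    df[[group.name]] <- convert(char.mat[, group.name])
  }
  df
}

sketch.df <- sketch_apply_conversions(
  match.mat, list(chromStart = keep.digits, chromEnd = keep.digits))
str(sketch.df)
```

Because as.integer("") is NA, the empty string from an unused optional group becomes a missing value after conversion, which matches the NA shown for chromEnd in the second row above.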
Note in the examples above that the last subject has two possible matches, but only the first is returned by str_match_named. Use str_match_all_named to get ALL matches in each subject (not just the first match).
namedCapture::str_match_all_named(
  subject.vec, chr.pos.pattern, conversion.list)
#> [[1]]
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#>
#> [[2]]
#>   chrom chromStart chromEnd
#> 1  chrM     111000       NA
#>
#> [[3]]
#> data frame with 0 columns and 0 rows
#>
#> [[4]]
#> data frame with 0 columns and 0 rows
#>
#> [[5]]
#>   chrom chromStart chromEnd
#> 1  chr1        110      111
#> 2  chr2        220      222
As shown above, the result is a list with one element for each subject. Each list element is a data.frame with one row for each match.
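The difference between first-match and all-match extraction mirrors the difference between regexpr and gregexpr in base R. The sketch below, which uses only base R and made-up variable names (all.matches, all.mat), extracts every match from the last subject; it is an illustration of the idea rather than the package's code.

```r
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
  "-",
  "(?P<chromEnd>[0-9,]+)",
  ")?")
subject <- "chr1:110-111 chr2:220-222"

# gregexpr returns a list with one element per subject; with perl=TRUE,
# each element carries capture.start/capture.length matrices
# with one row per match and one column per capture group.
all.matches <- gregexpr(chr.pos.pattern, subject, perl = TRUE)[[1]]
first <- attr(all.matches, "capture.start")
last <- first + attr(all.matches, "capture.length") - 1L
all.mat <- matrix(
  substring(subject, first, last),
  nrow = nrow(first),
  dimnames = list(NULL, attr(all.matches, "capture.names")))
print(all.mat)
```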
If the pattern specifies the name group, then it will be used for the rownames of the output, and it will not be included as a column. For example the pattern below uses name for the first group:
name.pattern <- paste0(
  "(?P<name>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
  "-",
  "(?P<chromEnd>[0-9,]+)",
  ")?")
try(named.mat <- namedCapture::str_match_named(
  subject.vec, name.pattern, conversion.list))
#>   name chromStart chromEnd
#> 3 <NA>       <NA>     <NA>
#> 4 <NA>       <NA>     <NA>
#> Error in apply_type_funs(m, type.list) :
#>   the 'name' group should not be missing/NA
(named.mat <- namedCapture::str_match_named(
  subject.vec[-(3:4)], name.pattern, conversion.list))
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#> chrM      111000        NA
#> chr1         110       111
(named.list <- namedCapture::str_match_all_named(
  subject.vec, name.pattern, conversion.list))
#> [[1]]
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#>
#> [[2]]
#>      chromStart chromEnd
#> chrM     111000       NA
#>
#> [[3]]
#> data frame with 0 columns and 0 rows
#>
#> [[4]]
#> data frame with 0 columns and 0 rows
#>
#> [[5]]
#>      chromStart chromEnd
#> chr1        110      111
#> chr2        220      222
Note that in the above code we use try because it is an error if any name groups are missing (as they are for subjects 3 and 4).
The named output feature makes it easy to select particular elements of the extracted data by name, e.g.
named.mat["chr1", "chromStart"]
#> [1] 110
named.list[[5]]["chr2", "chromStart"]
#> [1] 220
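The rownames behavior and the error on missing names can be imitated in a few lines of base R. The sketch below is only an illustration: name_to_rownames is a hypothetical helper, the input data.frame is typed in by hand from the output shown earlier, and the error message is copied from the output above rather than taken from the package source.

```r
# Hand-typed copy of the converted matches, with the name as a column.
match.df <- data.frame(
  name = c("chr10", "chrM", "chr1"),
  chromStart = c(213054000L, 111000L, 110L),
  chromEnd = c(213055000L, NA, 111L),
  stringsAsFactors = FALSE)

# Hypothetical helper: move the name column into the rownames,
# refusing to continue if any name is missing.
name_to_rownames <- function(df) {
  if (anyNA(df$name)) {
    stop("the 'name' group should not be missing/NA")
  }
  rownames(df) <- df$name
  df[, names(df) != "name", drop = FALSE]
}

sketch.named <- name_to_rownames(match.df)
start.chr1 <- sketch.named["chr1", "chromStart"]
print(start.chr1)

# A missing name triggers the error, which try() lets us inspect.
bad.df <- data.frame(
  name = NA_character_, chromStart = 1L, stringsAsFactors = FALSE)
bad.result <- try(name_to_rownames(bad.df), silent = TRUE)
```

Filtering out non-matching subjects before extraction, as done with subject.vec[-(3:4)] above, avoids the error.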
Note that if the subject is named, its names will be used to name the output (rownames or list names).
named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
#>         chrom chromStart  chromEnd
#> ten     chr10  213054000 213055000
#> M        chrM     111000        NA
#> nomatch  <NA>         NA        NA
#> missing  <NA>         NA        NA
#> two      chr1        110       111
namedCapture::str_match_all_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
#> $ten
#>   chrom chromStart  chromEnd
#> 1 chr10  213054000 213055000
#>
#> $M
#>   chrom chromStart chromEnd
#> 1  chrM     111000       NA
#>
#> $nomatch
#> data frame with 0 columns and 0 rows
#>
#> $missing
#> data frame with 0 columns and 0 rows
#>
#> $two
#>   chrom chromStart chromEnd
#> 1  chr1        110      111
#> 2  chr2        220      222
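If the subject has names, and the name group is specified, then the subject names are used to name the output. For str_match_named, the subject names become the rownames and the name column is included in the output. For str_match_all_named, the subject names become the list names, while the rownames of each data.frame still come from the name group.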
named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, name.pattern, conversion.list)
#>          name chromStart  chromEnd
#> ten     chr10  213054000 213055000
#> M        chrM     111000        NA
#> nomatch  <NA>         NA        NA
#> missing  <NA>         NA        NA
#> two      chr1        110       111
namedCapture::str_match_all_named(
  named.subject.vec, name.pattern, conversion.list)
#> $ten
#>       chromStart  chromEnd
#> chr10  213054000 213055000
#>
#> $M
#>      chromStart chromEnd
#> chrM     111000       NA
#>
#> $nomatch
#> data frame with 0 columns and 0 rows
#>
#> $missing
#> data frame with 0 columns and 0 rows
#>
#> $two
#>      chromStart chromEnd
#> chr1        110      111
#> chr2        220      222
Next, read the “recommended variable argument syntax” vignette for information about how to use the *_variable functions.