The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase “Hello World!” can be translated to "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!".
This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. One could check that every entry in a column of dates follows a “nnnn-nn-nn” format or that a column of strings consists entirely of alphabetic characters (for example, that no zeros were entered in place of the letter O).
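As a minimal sketch of the kind of check described above, base R's grepl can confirm whether every entry follows a “nnnn-nn-nn” pattern once that pattern is known. The dates vector and the variable names are hypothetical and used only for illustration.

```r
# Hypothetical example data: the third entry uses slashes instead of hyphens.
dates <- c("2021-03-04", "2020-12-31", "2021/03/04")

# Anchored pattern written with the same POSIX character classes
# that appear in inverseRegex output.
datePattern <- "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"

# TRUE where an entry follows the expected format.
followsFormat <- grepl(datePattern, dates)

# Positions of the entries that need attention.
irregular <- which(!followsFormat)
```

Here irregular points at the third entry, the only one that does not match the pattern.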
The main function to use is inverseRegex(x), which will identify the different characters that make up the input object x. The different groups that can be identified are:
'[[:digit:]]'
'[[:lower:]]'
'[[:upper:]]'
'[[:alpha:]]'
'[[:alnum:]]'
'[[:space:]]'
'[[:punct:]]'
See ?regex for an explanation of their meanings.
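To make the groups concrete, the following sketch tests a handful of single characters against each class using base R's grepl. It does not use the package itself; the chars vector and the membership matrix are hypothetical names chosen for illustration, and the results assume plain ASCII input.

```r
# A few representative single characters (hypothetical example input).
chars <- c("7", "a", "A", " ", "!")

# The character classes listed above.
classes <- c("[[:digit:]]", "[[:lower:]]", "[[:upper:]]", "[[:alpha:]]",
             "[[:alnum:]]", "[[:space:]]", "[[:punct:]]")

# One column per class, one row per character; TRUE marks membership.
membership <- sapply(classes, function(cl) grepl(cl, chars))
rownames(membership) <- chars

membership
```

The matrix shows, for instance, that "a" belongs to [[:lower:]], [[:alpha:]], and [[:alnum:]] at once, which is why the combine arguments can merge several groups into one.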
By default the only groups that will be identified are [[:digit:]], [[:upper:]], and [[:lower:]], with any other characters being left as is. This can be altered with the following arguments:
combineCases: Use '[[:alpha:]]' instead of '[[:lower:]]' and '[[:upper:]]'.
combineAlphanumeric: Use '[[:alnum:]]' instead of '[[:digit:]]', '[[:lower:]]', '[[:upper:]]', and '[[:alpha:]]'.
combinePunctuation: Use '[[:punct:]]' instead of leaving punctuation characters as is.
combineSpace: Use '[[:space:]]' instead of leaving space characters as is.
Some examples of these arguments are below:
inverseRegex('1aA')
#> [1] "[[:digit:]][[:lower:]][[:upper:]]"
inverseRegex('1aA', combineCases = TRUE)
#> [1] "[[:digit:]][[:alpha:]]{2}"
inverseRegex('1aA', combineAlphanumeric = TRUE)
#> [1] "[[:alnum:]]{3}"
inverseRegex('Hello World!')
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE)
#> [1] "[[:upper:]][[:lower:]]{4}[[:space:]][[:upper:]][[:lower:]]{4}[[:punct:]]"
Users can also specify which run lengths will be identified. The inverseRegex function has an argument called numbersToKeep which allows the user to specify what lengths of repeated sequences should be identified explicitly. The default value is c(2, 3, 4, 5, 10). Run lengths not requested will be identified with a +.
inverseRegex('abcd1234567')
#> [1] "[[:lower:]]{4}[[:digit:]]+"
inverseRegex('abcd1234567', numbersToKeep = NULL)
#> [1] "[[:lower:]]+[[:digit:]]+"
inverseRegex('abcd1234567', numbersToKeep = 1:10)
#> [1] "[[:lower:]]{4}[[:digit:]]{7}"
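The categorise-then-run-length idea can be sketched in a few lines of base R. The function sketchPattern below is a hypothetical, simplified approximation written for illustration only; it is not the package's implementation, handles a single string, and covers only the three default groups.

```r
# Hypothetical helper: a simplified approximation of the approach,
# NOT the actual inverseRegex implementation.
sketchPattern <- function(x, numbersToKeep = c(2, 3, 4, 5, 10)) {
  # Split the string into individual characters.
  chars <- strsplit(x, "", fixed = TRUE)[[1]]

  # Replace each character with its group; other characters stay as is.
  groups <- chars
  groups[grepl("^[[:digit:]]$", chars)] <- "[[:digit:]]"
  groups[grepl("^[[:lower:]]$", chars)] <- "[[:lower:]]"
  groups[grepl("^[[:upper:]]$", chars)] <- "[[:upper:]]"

  # Collapse consecutive identical groups into runs.
  runs <- rle(groups)

  # Runs of one get no quantifier, requested lengths get {n}, the rest get +.
  suffix <- ifelse(
    runs$lengths == 1, "",
    ifelse(runs$lengths %in% numbersToKeep,
           paste0("{", runs$lengths, "}"), "+")
  )
  paste0(runs$values, suffix, collapse = "")
}

sketchPattern("abcd1234567")
sketchPattern("abcd1234567", numbersToKeep = NULL)
```

Under these assumptions the two calls reproduce the first two patterns shown above. Literal characters are passed through unescaped in this sketch.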
Many objects with a class other than character are supported, including logical, integer, numeric, Date, POSIXct, factor, matrix, data.frame, and tibble. They are all (except logical) converted to characters first, and the collection of regex patterns is then returned either as a character vector or, if the input was a matrix, data frame, or tibble, as an object of the same class as the input. See ?inverseRegex for a full description of how they are treated. If users need a different character conversion method they can apply it themselves prior to calling inverseRegex.
Numerics and data frames deserve special mention:
numeric
An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector inverseRegex will convert it to character using: vapply(x, format, character(1), nsmall = 1). This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example:
vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1)
#> [1] "1.0" "1.0" "1.1" "1.12" "1.123"
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10)
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
## Vectors of class integer are just converted using as.character.
inverseRegex(1L)
#> [1] "[[:digit:]]"
Numerics are treated differently if they are present in a matrix, data frame, or tibble. A matrix with a mode of numeric will be converted in its entirety to character using trimws(format(x)). For data frames and tibbles each column of type numeric will be converted using trimws(format(x)). This means that, unlike the numeric vectors described above, all numeric entries in a matrix, or in a given column of a data frame or tibble, will have the same number of decimal places.
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123))
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123)))
#> a
#> 1 [[:digit:]].[[:digit:]]{3}
#> 2 [[:digit:]].[[:digit:]]{3}
#> 3 [[:digit:]].[[:digit:]]{3}
#> 4 [[:digit:]].[[:digit:]]{3}
#> 5 [[:digit:]].[[:digit:]]{3}
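The difference between the two outputs above comes from the character conversion step, which can be reproduced in base R without the package. This sketch applies the two conversions quoted in the text to the same values; the variable names are hypothetical.

```r
x <- c(1, 1.0, 1.10, 1.12, 1.123)

# Element-wise conversion, as described for numeric vectors:
# each value keeps only the decimal places it needs (at least one).
perElement <- vapply(x, format, character(1), nsmall = 1)

# Whole-vector conversion, as described for data frame columns:
# format() pads every value to a common number of decimal places.
wholeColumn <- trimws(format(x))

perElement
wholeColumn
```

Because wholeColumn pads every entry to three decimal places, each row of the data frame result ends in [[:digit:]]{3}.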
data.frame
When given a data frame, inverseRegex will return a data frame of similar dimensions with each column representing an individual call to inverseRegex.
unique(inverseRegex(iris, numbersToKeep = 2:10))
#> Sepal.Length Sepal.Width
#> 1 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 51 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> Petal.Length Petal.Width Species
#> 1 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:lower:]]{6}
#> 51 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:lower:]]{10}
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:lower:]]{9}
One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function occurrencesLessThan which will call inverseRegex and return logical values, with TRUE giving the location of any regex patterns that occur fewer than a certain number of times.
occurrencesLessThan(c(LETTERS, 1))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE TRUE
## When called on a data frame occurrencesLessThan will assess each column individually.
x <- iris
x$Species <- as.character(x$Species)
x[27, 'Species'] <- 'set0sa'
unique(occurrencesLessThan(x))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 FALSE FALSE FALSE FALSE FALSE
#> 27 FALSE FALSE FALSE FALSE TRUE
What constitutes a “rare” pattern can be specified with the fraction or n arguments. See ?occurrencesLessThan for a full description.
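The underlying idea of flagging rare patterns can be sketched in base R by counting how often each pattern occurs. This is an illustration only, not the package's implementation: the patterns vector is typed in by hand rather than produced by inverseRegex, and the threshold of 2 is an assumed value, not the default of occurrencesLessThan.

```r
# Hypothetical pattern vector: 26 single upper-case letters and one digit.
patterns <- c(rep("[[:upper:]]", 26), "[[:digit:]]")

# For every entry, the number of times its pattern occurs in the vector.
counts <- ave(seq_along(patterns), patterns, FUN = length)

# Assumed threshold for illustration; not the package default.
threshold <- 2

# TRUE marks entries whose pattern occurs fewer than threshold times.
rare <- counts < threshold
which(rare)
```

Only the final entry is flagged, since its pattern occurs once while the other pattern occurs 26 times.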