inverseRegex

The inverseRegex package allows users to reverse engineer regular expression patterns for R objects. Individual characters that make up an object are categorised into common groups and encoded into run-lengths. For example, the phrase “Hello World!” can be translated to "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!".

This can be useful for summarising a dataset without viewing every entry, or as an aid to data cleaning. For example, one could check that a column of dates all follow a “nnnn-nn-nn” format, or that a column of strings consists entirely of alphabetic characters (no zeros entered in place of the letter O, for instance).
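The date check described above can be sketched in base R alone. The pattern below is written by hand to correspond to the “nnnn-nn-nn” format, and the `^` and `$` anchors are an addition made here so that the whole string must match; neither is produced by the package in this sketch.

```r
# Illustrative data: one well-formed date and three irregular entries.
dates <- c("2021-03-04", "2021-3-04", "20210304", "2O21-03-04")

# Hand-written pattern for "nnnn-nn-nn", anchored at both ends (assumption:
# anchoring is added here, it is not part of any inverseRegex output).
pattern <- "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"

ok <- grepl(pattern, dates)
dates[!ok]
#> [1] "2021-3-04"  "20210304"   "2O21-03-04"
```

The last entry contains the letter O instead of a zero, which is the kind of error that is hard to spot by eye.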

Usage

The main function to use is inverseRegex(x) which will identify the different characters that make up the input object x. The different groups that can be identified are

'[[:digit:]]'
'[[:lower:]]'
'[[:upper:]]'
'[[:alpha:]]'
'[[:alnum:]]'
'[[:space:]]'
'[[:punct:]]'

See ?regex for an explanation of their meanings.
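As a quick illustration of how these groups overlap, the following base R sketch tests a few sample characters against each class with `grepl`. The sample characters are chosen arbitrarily, and the sketch does not call the package.

```r
chars <- c("7", "a", "A", " ", "!")
classes <- c("[[:digit:]]", "[[:lower:]]", "[[:upper:]]", "[[:alpha:]]",
             "[[:alnum:]]", "[[:space:]]", "[[:punct:]]")

# One column per class, one row per sample character.
membership <- sapply(classes, function(cl) grepl(cl, chars))
rownames(membership) <- chars
membership
```

Each letter belongs to '[[:alpha:]]' and '[[:alnum:]]' as well as to its case-specific class, which is why the combining arguments described below can replace several groups with one.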

By default the only groups that will be identified are [[:digit:]], [[:upper:]], and [[:lower:]], with any other characters being left as is. This can be altered with the following arguments:

combineCases: Use '[[:alpha:]]' instead of '[[:lower:]]' and '[[:upper:]]'.
combineAlphanumeric: Use '[[:alnum:]]' instead of '[[:digit:]]', '[[:lower:]]', '[[:upper:]]', and '[[:alpha:]]'.
combinePunctuation: Use '[[:punct:]]' instead of leaving punctuation characters as is.
combineSpace: Use '[[:space:]]' instead of leaving space characters as is.

Some examples of these arguments are below:

inverseRegex('1aA')
#> [1] "[[:digit:]][[:lower:]][[:upper:]]"
inverseRegex('1aA', combineCases = TRUE)
#> [1] "[[:digit:]][[:alpha:]]{2}"
inverseRegex('1aA', combineAlphanumeric = TRUE)
#> [1] "[[:alnum:]]{3}"
inverseRegex('Hello World!')
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE)
#> [1] "[[:upper:]][[:lower:]]{4}[[:space:]][[:upper:]][[:lower:]]{4}[[:punct:]]"
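One way to see the effect of the combining arguments is to reuse the returned patterns as ordinary regular expressions. The sketch below copies two of the patterns shown above as string literals and tests them with base R's `grepl`; `anchored` is a hypothetical helper defined here, not a package function.

```r
# Hypothetical helper: require the pattern to match the whole string.
anchored <- function(p) paste0("^", p, "$")

# Pattern from inverseRegex('1aA', combineCases = TRUE)
grepl(anchored("[[:digit:]][[:alpha:]]{2}"), c("1aA", "1a", "aaA"))
#> [1]  TRUE FALSE FALSE

# Pattern from the combineSpace / combinePunctuation example
general <- "[[:upper:]][[:lower:]]{4}[[:space:]][[:upper:]][[:lower:]]{4}[[:punct:]]"
grepl(anchored(general), c("Hello World!", "Howdy\tWorld?", "hello world!"))
#> [1]  TRUE  TRUE FALSE
```

The more groups are combined, the more general the pattern becomes: the second pattern accepts a tab and a question mark where the default pattern would require a literal space and exclamation mark.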

Users can also control which run lengths are reported explicitly. The numbersToKeep argument of inverseRegex specifies the lengths of repeated sequences that should be written out as a count. The default value is c(2, 3, 4, 5, 10). Run lengths not requested will be identified with a +.

inverseRegex('abcd1234567')
#> [1] "[[:lower:]]{4}[[:digit:]]+"
inverseRegex('abcd1234567', numbersToKeep = NULL)
#> [1] "[[:lower:]]+[[:digit:]]+"
inverseRegex('abcd1234567', numbersToKeep = 1:10)
#> [1] "[[:lower:]]{4}[[:digit:]]{7}"
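The behaviour above can be approximated in base R with `rle`. This is a minimal sketch under stated assumptions, not the package's implementation: `sketchRuns` is a hypothetical name, only the three default groups are handled, and runs of length one are always left without a suffix, as in the '1aA' example.

```r
# Hypothetical re-implementation for illustration only.
sketchRuns <- function(s, numbersToKeep = c(2, 3, 4, 5, 10)) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # Map each character to its group; anything else is kept as is.
  classes <- ifelse(grepl("[[:digit:]]", chars), "[[:digit:]]",
             ifelse(grepl("[[:upper:]]", chars), "[[:upper:]]",
             ifelse(grepl("[[:lower:]]", chars), "[[:lower:]]", chars)))
  runs <- rle(classes)
  # Requested lengths become {n}; other runs longer than one become +.
  suffix <- ifelse(runs$lengths == 1, "",
            ifelse(runs$lengths %in% numbersToKeep,
                   paste0("{", runs$lengths, "}"), "+"))
  paste0(runs$values, suffix, collapse = "")
}

sketchRuns("abcd1234567")
#> [1] "[[:lower:]]{4}[[:digit:]]+"
sketchRuns("abcd1234567", numbersToKeep = NULL)
#> [1] "[[:lower:]]+[[:digit:]]+"
sketchRuns("abcd1234567", numbersToKeep = 1:10)
#> [1] "[[:lower:]]{4}[[:digit:]]{7}"
```

Keeping fewer run lengths groups more entries under the same pattern, which is useful when the exact length of a field is not of interest.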

Non-character Inputs

Many objects with a class other than character are supported, including logical, integer, numeric, Date, POSIXct, factor, matrix, data.frame, and tibble. All of them (except logical) are first converted to character. The resulting regex patterns are then returned as a character vector, or as an object of the same class as the input if it was a matrix, data frame, or tibble. See ?inverseRegex for a full description of how each class is treated. Users who need a different character conversion method can perform the conversion themselves before calling inverseRegex.

Special mention of numerics and data frames will be given here:

Inputs of Class `numeric`

An attempt has been made to convert numeric values into characters as directly as possible without losing or adding any information. When passed a numeric vector inverseRegex will convert it to character using: vapply(x, format, character(1), nsmall = 1). This will force at least one decimal place for all entries but will not add extra decimal places beyond that unless they were present in the individual input element; it will however remove trailing decimal zeros. For example:

vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1)
#> [1] "1.0"   "1.0"   "1.1"   "1.12"  "1.123"
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10)
#> [1] "[[:digit:]].[[:digit:]]"    "[[:digit:]].[[:digit:]]"   
#> [3] "[[:digit:]].[[:digit:]]"    "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"

## Vectors of class integer are just converted using as.character.
inverseRegex(1L)
#> [1] "[[:digit:]]"

Numerics are treated differently if they are present in a matrix, data frame, or tibble. In the case of a matrix if it has a mode of numeric then the entire object will be converted to character using trimws(format(x)). For data frames and tibbles each column of type numeric will be converted using trimws(format(x)). This means that unlike for numeric vectors described above, all numeric entries in matrices, data frames, and tibbles will have the same number of decimal places.

inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123))
#> [1] "[[:digit:]].[[:digit:]]"    "[[:digit:]].[[:digit:]]"   
#> [3] "[[:digit:]].[[:digit:]]"    "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123)))
#>                            a
#> 1 [[:digit:]].[[:digit:]]{3}
#> 2 [[:digit:]].[[:digit:]]{3}
#> 3 [[:digit:]].[[:digit:]]{3}
#> 4 [[:digit:]].[[:digit:]]{3}
#> 5 [[:digit:]].[[:digit:]]{3}
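The difference comes entirely from the two conversion calls named above, which can be compared directly in base R without the package:

```r
x <- c(1, 1.0, 1.10, 1.12, 1.123)

# Element-by-element conversion, as described for numeric vectors.
perElement <- vapply(x, format, character(1), nsmall = 1)
perElement
#> [1] "1.0"   "1.0"   "1.1"   "1.12"  "1.123"

# Whole-vector conversion, as described for matrices and data frame columns.
wholeVector <- trimws(format(x))
wholeVector
#> [1] "1.000" "1.000" "1.100" "1.120" "1.123"
```

Because `format` applied to the whole vector pads every entry to a common number of decimal places, all five entries yield the same pattern in the data frame case.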

Inputs of Class `data.frame`

When given a data frame, inverseRegex will return a data frame of the same dimensions, with each column being the result of an individual call to inverseRegex on the corresponding input column.

unique(inverseRegex(iris, numbersToKeep = 2:10))
#>                Sepal.Length             Sepal.Width
#> 1   [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 51  [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#>                Petal.Length             Petal.Width         Species
#> 1   [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]  [[:lower:]]{6}
#> 51  [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:lower:]]{10}
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]  [[:lower:]]{9}

Identifying Rare Patterns

One of the main use cases of the package is to identify irregular entries in a dataset. To this end there is a function occurrencesLessThan, which calls inverseRegex and returns logical values, with TRUE marking the location of any regex pattern that occurs fewer than a certain number of times.

occurrencesLessThan(c(LETTERS, 1))
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE  TRUE

## When called on a data frame occurrencesLessThan will assess each column individually.
x <- iris
x$Species <- as.character(x$Species)
x[27, 'Species'] <- 'set0sa'
unique(occurrencesLessThan(x))
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1         FALSE       FALSE        FALSE       FALSE   FALSE
#> 27        FALSE       FALSE        FALSE       FALSE    TRUE

What constitutes a “rare” pattern can be specified with the fraction or n arguments. See ?occurrencesLessThan for a full description.
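The counting step behind this idea can be sketched in base R with `table`. The pattern vector below is typed in by hand to mirror the c(LETTERS, 1) example, and the 5% cutoff is an arbitrary choice for illustration; it is an assumption here, not a statement of the package's default.

```r
# Patterns as they would look for c(LETTERS, 1): 26 letters and one digit.
patterns <- c(rep("[[:upper:]]", 26), "[[:digit:]]")

# Count how often each distinct pattern occurs.
counts <- table(patterns)

# Illustrative threshold (assumption): flag patterns seen in under 5% of entries.
threshold <- 0.05 * length(patterns)
isRare <- as.vector(counts[patterns]) < threshold

which(isRare)
#> [1] 27
```

Only the final entry is flagged, since its pattern occurs once while the other pattern occurs 26 times.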