Data Validation Rules

Duncan Garmonsway

2022-09-01

What data validation rules are

Data validation rules control what constants can be entered into a cell, e.g. any whole number between 0 and 9, or one of several values from another part of the spreadsheet.

‘xlsx_validation()’ returns each of the data validation rules in an xlsx file, and the ranges of cells to which each rule applies.

Here is a rule that restricts input to integers between 0 and 9 inclusive, or no value (blank). If any other value is attempted, then an error message is displayed with the imaginative title “message title”, the informative body text “message body”, and a “stop” symbol.

library(tidyxl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
examples <- system.file("extdata/examples.xlsx", package = "tidyxl")
glimpse(xlsx_validation(examples)[1, ])
#> Rows: 1
#> Columns: 14
#> $ sheet              <chr> "Sheet1"
#> $ ref                <chr> "A106"
#> $ type               <chr> "whole"
#> $ operator           <chr> "between"
#> $ formula1           <chr> "0"
#> $ formula2           <chr> "9"
#> $ allow_blank        <lgl> TRUE
#> $ show_input_message <lgl> TRUE
#> $ prompt_title       <chr> "message title"
#> $ prompt_body        <chr> "message body"
#> $ show_error_message <lgl> TRUE
#> $ error_title        <chr> "error title"
#> $ error_body         <chr> "error body"
#> $ error_symbol       <chr> "stop"

The gamut of possible rules is given in the examples for xlsx_validation().

as.data.frame(xlsx_validation(examples))
#>     sheet            ref       type           operator            formula1
#> 1  Sheet1           A106      whole            between                   0
#> 2  Sheet1           A108       list               <NA>              $B$108
#> 3  Sheet1           A110       date            between 2017-01-01 00:00:00
#> 4  Sheet1           A111       time            between            00:00:00
#> 5  Sheet1           A112 textLength            between                   0
#> 6  Sheet1           A114      whole         notBetween                   0
#> 7  Sheet1 A115,A121:A122      whole              equal                   0
#> 8  Sheet1           A116      whole           notEqual                   0
#> 9  Sheet1           A117      whole        greaterThan                   0
#> 10 Sheet1           A119      whole greaterThanOrEqual                   0
#> 11 Sheet1           A120      whole    lessThanOrEqual                   0
#> 12 Sheet1           A118      whole           lessThan                   0
#> 13 Sheet1           A107    decimal         notBetween                   0
#> 14 Sheet1           A113     custom               <NA>     A113<=LEN(B113)
#> 15 Sheet1           A109       list               <NA>              $B$108
#>               formula2 allow_blank show_input_message  prompt_title
#> 1                    9        TRUE               TRUE message title
#> 2                 <NA>        TRUE               TRUE          <NA>
#> 3  2017-01-09 09:00:00        TRUE               TRUE          <NA>
#> 4             09:00:00        TRUE               TRUE          <NA>
#> 5                    9        TRUE               TRUE          <NA>
#> 6                    9        TRUE               TRUE          <NA>
#> 7                 <NA>        TRUE               TRUE          <NA>
#> 8                 <NA>        TRUE               TRUE          <NA>
#> 9                 <NA>        TRUE               TRUE          <NA>
#> 10                <NA>        TRUE               TRUE          <NA>
#> 11                <NA>        TRUE               TRUE          <NA>
#> 12                <NA>        TRUE               TRUE          <NA>
#> 13                   9       FALSE              FALSE          <NA>
#> 14                <NA>        TRUE               TRUE          <NA>
#> 15                <NA>        TRUE               TRUE          <NA>
#>     prompt_body show_error_message error_title error_body error_symbol
#> 1  message body               TRUE error title error body         stop
#> 2          <NA>               TRUE        <NA>       <NA>      warning
#> 3          <NA>               TRUE        <NA>       <NA>         stop
#> 4          <NA>               TRUE        <NA>       <NA>         stop
#> 5          <NA>               TRUE        <NA>       <NA>         stop
#> 6          <NA>               TRUE        <NA>       <NA>         stop
#> 7          <NA>               TRUE        <NA>       <NA>         stop
#> 8          <NA>               TRUE        <NA>       <NA>         stop
#> 9          <NA>               TRUE        <NA>       <NA>         stop
#> 10         <NA>               TRUE        <NA>       <NA>         stop
#> 11         <NA>               TRUE        <NA>       <NA>         stop
#> 12         <NA>               TRUE        <NA>       <NA>         stop
#> 13         <NA>              FALSE        <NA>       <NA>         stop
#> 14         <NA>               TRUE        <NA>       <NA>         stop
#> 15         <NA>               TRUE        <NA>       <NA>  information

Joining rules to cells

There are no built-in functions for joining ranges like A1:D5,G8 to single cells like B3. For now, use the snippets in this section. In future I might develop a dplyr-like join function (this is hard currently because dplyr doesn’t yet join on arbitrary functions, or even the standard inequalities like >=). Help and advice would be gratefully accepted!

To join rules to cells, a naive method is to use the sheet and ref columns to match the sheet and address columns to the output of xlsx_cells().

rules <- xlsx_validation(examples)
cells <- filter(xlsx_cells(examples), row >= 106, col == 1)

rules
#> # A tibble: 15 × 14
#>    sheet  ref      type  opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#>    <chr>  <chr>    <chr> <chr>   <chr>   <chr>   <lgl>   <lgl>   <chr>   <chr>  
#>  1 Sheet1 A106     whole between 0       9       TRUE    TRUE    messag… messag…
#>  2 Sheet1 A108     list  <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>   
#>  3 Sheet1 A110     date  between 2017-0… 2017-0… TRUE    TRUE    <NA>    <NA>   
#>  4 Sheet1 A111     time  between 00:00:… 09:00:… TRUE    TRUE    <NA>    <NA>   
#>  5 Sheet1 A112     text… between 0       9       TRUE    TRUE    <NA>    <NA>   
#>  6 Sheet1 A114     whole notBet… 0       9       TRUE    TRUE    <NA>    <NA>   
#>  7 Sheet1 A115,A1… whole equal   0       <NA>    TRUE    TRUE    <NA>    <NA>   
#>  8 Sheet1 A116     whole notEqu… 0       <NA>    TRUE    TRUE    <NA>    <NA>   
#>  9 Sheet1 A117     whole greate… 0       <NA>    TRUE    TRUE    <NA>    <NA>   
#> 10 Sheet1 A119     whole greate… 0       <NA>    TRUE    TRUE    <NA>    <NA>   
#> 11 Sheet1 A120     whole lessTh… 0       <NA>    TRUE    TRUE    <NA>    <NA>   
#> 12 Sheet1 A118     whole lessTh… 0       <NA>    TRUE    TRUE    <NA>    <NA>   
#> 13 Sheet1 A107     deci… notBet… 0       9       FALSE   FALSE   <NA>    <NA>   
#> 14 Sheet1 A113     cust… <NA>    A113<=… <NA>    TRUE    TRUE    <NA>    <NA>   
#> 15 Sheet1 A109     list  <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>   
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> #   error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> #   ¹​operator, ²​formula1, ³​formula2, ⁴​allow_blank, ⁵​show_input_message,
#> #   ⁶​prompt_title, ⁷​prompt_body
cells
#> # A tibble: 93 × 24
#>    sheet  address   row   col is_blank content     data_…¹ error logical numeric
#>    <chr>  <chr>   <int> <int> <lgl>    <chr>       <chr>   <chr> <lgl>     <dbl>
#>  1 Sheet1 A106      106     1 FALSE    0           numeric <NA>  NA          0  
#>  2 Sheet1 A107      107     1 FALSE    0.1         numeric <NA>  NA          0.1
#>  3 Sheet1 A108      108     1 FALSE    137         charac… <NA>  NA         NA  
#>  4 Sheet1 A109      109     1 FALSE    137         charac… <NA>  NA         NA  
#>  5 Sheet1 A110      110     1 FALSE    42736       date    <NA>  NA         NA  
#>  6 Sheet1 A111      111     1 FALSE    0.35416666… date    <NA>  NA         NA  
#>  7 Sheet1 A112      112     1 FALSE    149         charac… <NA>  NA         NA  
#>  8 Sheet1 A113      113     1 FALSE    10          numeric <NA>  NA         10  
#>  9 Sheet1 A114      114     1 FALSE    -1          numeric <NA>  NA         -1  
#> 10 Sheet1 A115      115     1 FALSE    0           numeric <NA>  NA          0  
#> # … with 83 more rows, 14 more variables: date <dttm>, character <chr>,
#> #   character_formatted <list>, formula <chr>, is_array <lgl>,
#> #   formula_ref <chr>, formula_group <int>, comment <chr>, height <dbl>,
#> #   width <dbl>, row_outline_level <dbl>, col_outline_level <dbl>,
#> #   style_format <chr>, local_format_id <int>, and abbreviated variable name
#> #   ¹​data_type

inner_join(rules, cells, by = c("sheet" = "sheet", "ref" = "address"))
#> # A tibble: 9 × 36
#>   sheet  ref   type      opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#>   <chr>  <chr> <chr>     <chr>   <chr>   <chr>   <lgl>   <lgl>   <chr>   <chr>  
#> 1 Sheet1 A106  whole     between 0       9       TRUE    TRUE    messag… messag…
#> 2 Sheet1 A108  list      <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>   
#> 3 Sheet1 A110  date      between 2017-0… 2017-0… TRUE    TRUE    <NA>    <NA>   
#> 4 Sheet1 A111  time      between 00:00:… 09:00:… TRUE    TRUE    <NA>    <NA>   
#> 5 Sheet1 A112  textLeng… between 0       9       TRUE    TRUE    <NA>    <NA>   
#> 6 Sheet1 A114  whole     notBet… 0       9       TRUE    TRUE    <NA>    <NA>   
#> 7 Sheet1 A107  decimal   notBet… 0       9       FALSE   FALSE   <NA>    <NA>   
#> 8 Sheet1 A113  custom    <NA>    A113<=… <NA>    TRUE    TRUE    <NA>    <NA>   
#> 9 Sheet1 A109  list      <NA>    $B$108  <NA>    TRUE    TRUE    <NA>    <NA>   
#> # … with 26 more variables: show_error_message <lgl>, error_title <chr>,
#> #   error_body <chr>, error_symbol <chr>, row <int>, col <int>, is_blank <lgl>,
#> #   content <chr>, data_type <chr>, error <chr>, logical <lgl>, numeric <dbl>,
#> #   date <dttm>, character <chr>, character_formatted <list>, formula <chr>,
#> #   is_array <lgl>, formula_ref <chr>, formula_group <int>, comment <chr>,
#> #   height <dbl>, width <dbl>, row_outline_level <dbl>,
#> #   col_outline_level <dbl>, style_format <chr>, local_format_id <int>, and …

Notice that only 9 cells were joined, even though 15 rules were defined. Surely at least 15 cells ought to be joined? The reason why they are not is that the cells for the other 6 rules don’t exist – rules can be defined for cells that have no value, and cells with no value are not returned by xlsx_cells(), otherwise all 17179869184 cells in a worksheet must be returned.

A more subtle reason for certain cells not to have joined successfully is that the ref column of the rules sometimes refers to more than one cell, and can even refer to several, non-contiguous ranges of cells. Specifically, the seventh rule’s ref column has A115,A121:A122.

Special treatment is needed here. Ideally, some kind of join function would be defined that can compare indidual cells with ranges. But I haven’t written one, so what follows is a workaround. First, the two ranges of cells must be unnested into A115 and A121:122. Then the range A121:122 must be ‘unranged’ into A121 and A122.

unrange <- function(x) {
  limits <- cellranger::as.cell_limits(x)
  rows <- seq(limits$ul[1], limits$lr[1])
  cols <- seq(limits$ul[2], limits$lr[2])
  rowcol <- expand.grid(rows, cols)
  cell_addrs <- cellranger::cell_addr(rowcol[[1]], rowcol[[2]])
  cellranger::to_string(cell_addrs, fo = "A1", strict = FALSE)
}

unnest_ref <- function(x, ref) {
  UseMethod("unnest_ref")
}

unnest_ref.default <- function(x, ref_col = ref) {
  stopifnot(is.character(x), length(x) == 1L)
  refs <- unlist(strsplit(x, ",", fixed = TRUE))
  unlist(lapply(refs, unrange))
}

unrange("A121:A122")
#> [1] "A121" "A122"
unnest_ref("A115,A121:A122")
#> [1] "A115" "A121" "A122"

The unnest_ref() function can also be defined for whole data frames, unnesting them by a column of references.

unnest_ref.data.frame <- function(x, ref_col) {
  ref <- rlang::enquo(ref_col)
  x[[rlang::quo_name(ref)]] <- lapply(x[[rlang::quo_name(ref)]], unnest_ref)
  tidyr::unnest(x, rlang::UQ(ref))
}

(nested_rule <- slice(rules, 7))
#> # A tibble: 1 × 14
#>   sheet  ref       type  opera…¹ formu…² formu…³ allow…⁴ show_…⁵ promp…⁶ promp…⁷
#>   <chr>  <chr>     <chr> <chr>   <chr>   <chr>   <lgl>   <lgl>   <chr>   <chr>  
#> 1 Sheet1 A115,A12… whole equal   0       <NA>    TRUE    TRUE    <NA>    <NA>   
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> #   error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> #   ¹​operator, ²​formula1, ³​formula2, ⁴​allow_blank, ⁵​show_input_message,
#> #   ⁶​prompt_title, ⁷​prompt_body
unnest_ref(nested_rule, ref)
#> Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
#> Please use the non-prefixed form or `!!` instead.
#> 
#>   # Bad:
#>   rlang::expr(mean(rlang::UQ(var) * 100))
#> 
#>   # Ok:
#>   rlang::expr(mean(UQ(var) * 100))
#> 
#>   # Good:
#>   rlang::expr(mean(!!var * 100))
#> 
#> This warning is displayed once per session.
#> # A tibble: 3 × 14
#>   sheet  ref   type  operator formula1 formula2 allow_…¹ show_…² promp…³ promp…⁴
#>   <chr>  <chr> <chr> <chr>    <chr>    <chr>    <lgl>    <lgl>   <chr>   <chr>  
#> 1 Sheet1 A115  whole equal    0        <NA>     TRUE     TRUE    <NA>    <NA>   
#> 2 Sheet1 A121  whole equal    0        <NA>     TRUE     TRUE    <NA>    <NA>   
#> 3 Sheet1 A122  whole equal    0        <NA>     TRUE     TRUE    <NA>    <NA>   
#> # … with 4 more variables: show_error_message <lgl>, error_title <chr>,
#> #   error_body <chr>, error_symbol <chr>, and abbreviated variable names
#> #   ¹​allow_blank, ²​show_input_message, ³​prompt_title, ⁴​prompt_body

Finally the new data frame of rules can be joined to a data frame of cells in any of the usual ways, via the sheet and ref columns.

Problems with this approach occur with rules that are defined over large ranges of cells: the ‘unnesting’ of those ranges results in very long vectors of individual cell addresses, or (worse) huge data frames of rules. Such cases are commonplace, because rules are often defined for entire columns of a spreadsheet, and a column has 1048576 rows.