Case study: calculating geometric means

By far the most common use of assertions is for checking input to functions. Consider this function for calculating the geometric mean:

geomean <- function(x, na.rm = FALSE)
{
  exp(mean(log(x), na.rm = na.rm))
}

In a statically-typed language, we could enforce x being a numeric vector. R’s dynamic typing (while mostly helping us be more productive) gives us some rope to hang ourselves with: x and na.rm can be absolutely anything. We need to handle the cases when x is not numeric, or when x contains negative values, or when na.rm is not a single logical value.

The built-in functions exp, mean and log have some of their own logic for handling bad inputs, and it is possible to simply rely on that logic rather than writing your own. For demonstration purposes, let’s see how they behave, and then see if we can improve upon it.

If we pass a non-numeric x, we see:

geomean("a")
## Error in log(x): non-numeric argument to mathematical function

The error message is OK, but it since it is appearing to come from log(x), it isn’t totally clear to the user where the problem originates. The assertive fix isto include assert_is_numeric(x) in the function.

Where should this line go? In accordance with the first clause of the programming principle “fail early, fail often”, the assertion belongs at the start of the function.

geomean2 <- function(x, na.rm = FALSE)
{
  assert_is_numeric(x)
  exp(mean(log(x), na.rm = na.rm))
}
geomean2("a")
## Error in geomean2("a"): is_numeric : x is not of class 'numeric'; it has class 'character'.

The geometric mean doesn’t make any mathematical sense for (real) negative numbers, and will return NaN if the input contains any.

geomean(rnorm(20))
## Warning in log(x): NaNs produced
## [1] NaN

The warning here is not so informative (why were the NaNs produced?), and appears to come from log(x). We could be strict and throw an error if there are negative values by adding a call to assert_all_are_non_negative. To replicate the base-R behaviour we can define custom actions based upon the result of is_non_negative:

geomean3 <- function(x, na.rm = FALSE)
{
  assert_is_numeric(x)
  if(!all(is_non_negative(x), na.rm = TRUE)) # Don't worry about NAs here
  {
    warning("x contains negative values, so the geometric mean makes no sense.")
    return(NaN)
  }
  exp(mean(log(x), na.rm = na.rm))
}
geomean3(rnorm(20))
## Warning in geomean3(rnorm(20)): x contains negative values, so the geometric
## mean makes no sense.
## [1] NaN

For na.rm, the mean function coerces its input to be a logical value, warning if the value’s length is more than one.

x <- rlnorm(20)
x[sample(20, 5)] <- NA
geomean(x, c(1.5, 0))
## Warning in if (na.rm) x <- x[!is.na(x)]: the condition has length > 1 and only
## the first element will be used
## [1] 1.428326

The warning about the length is OK, but again its source is not totally clear for users. The coercion to logical happens silently, which isn’t ideal.

Again, we could be strict and throw an error when na.rm isn’t a scalar logical value using assert_is_a_bool. (This is a compound assertion checking both the type and the length of the object). In this case, to replicate the base-R behaviour, we will use some utility functions provided by assertive.

geomean4 <- function(x, na.rm = FALSE)
{
  assert_is_numeric(x)
  if(!all(is_non_negative(x), na.rm = TRUE)) # Don't worry about NAs here
  {
    warning("x contains negative values, so the geometric mean makes no sense.")
    return(NaN)
  }
  na.rm <- coerce_to(use_first(na.rm), "logical")
  exp(mean(log(x), na.rm = na.rm))
}

use_first takes the first element of an object, warning if it has length more than one. coerce_to checks and object’s class, then converts it to the requested type, with a warning, using an appropriate as.* function if it exists, or as if it doesn’t.

geomean4(x, c(1.5, 0))
## Warning: Only the first value of na.rm (= 1.5) will be used.
## Warning: Coercing use_first(na.rm) to class 'logical'.
## [1] 1.428326

Whether to throw an error on bad input or fix it is personal preference and depends upon context. For end-user functions should should usually try to be flexible and fix things unless they’ve done something very silly. For lower-level functions, you can often afford to be stricter.

Summary

  1. Check function inputs at the start of the function to “fail early”.
  2. Don’t be afriad of having lots of assertions to “fail often”.
  3. Use assert_ functions to throw errors for inputs that you can’t salvage.
  4. Use is_ functions in an if condition to provide custom behaviour.
  5. Fix inputs with use_first and coerce_to where possible.

Exercises

  1. The mad function in the stats package calculates the median absolute deviation. Type mad to see its contents, and run example(mad) to get a feel for how it works.
    Update the function (either copy and paste, or use fixInNamespace if you are feeling fancy) to include some assertions checking the inputs. Hint: Some of the inputs should be numeric; others should be logical. Some should be only allow a single value. [15 mins].