cleaner

cleaner: Fast and Easy Data Cleaning

Website of this package: https://msberends.github.io/cleaner

CRAN_Badge

A small R package for cleaning and checking data columns in a fast and easy way. It relies on very few dependencies and guesses sensible defaults, while letting you override any of them when needed.

It also provides two new data types that are not available in base R: currency and percentage.



Why this package

As a data scientist, I’m often served with data that is not clean, not tidy and consequently not ready for analysis at all. For tidying data, there’s of course the tidyverse (https://www.tidyverse.org), which lets you manipulate data in any way you can think of. But for cleaning, I think our community was still lacking a neat solution that makes data cleaning fast and easy with functions that kind of ‘think on their own’ to do that.

If the CRAN button at the top of this page is green, install the package with:

install.packages("cleaner")

Otherwise, or if you are looking for the latest stable development version, install the package with:

install.packages("remotes") # if you haven't already
remotes::install_github("msberends/cleaner")

How it works

This package provides two types of functions: cleaning and checking.

Cleaning

Use clean() to clean data. It guesses which data class best fits your input and then calls one of the class-specific cleaning functions, such as clean_logical(), clean_factor(), clean_numeric(), clean_character() or clean_Date(). These functions can also be used independently, and they always return the class in the function name (e.g. clean_Date() always returns class Date).
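
As a rough illustration of what "guessing the best-fitting class" can mean, the base R sketch below tries a few conversions in turn. guess_class() is a hypothetical helper written for this example; it is not part of cleaner and makes no claim about how clean() is implemented internally.

```r
# Hypothetical helper (not part of cleaner): pick a class for a character vector
guess_class <- function(x) {
  x <- trimws(x[!is.na(x)])
  # only yes/no-like values: treat as logical
  if (all(tolower(x) %in% c("yes", "no", "true", "false", "y", "n"))) {
    return("logical")
  }
  # only plain (optionally negative, optionally decimal) numbers: treat as numeric
  if (all(grepl("^-?[0-9]+([.][0-9]+)?$", x))) {
    return("numeric")
  }
  # as.Date() with an explicit format returns NA for strings that do not parse
  if (!anyNA(as.Date(x, format = "%Y-%m-%d"))) {
    return("Date")
  }
  "character"
}

guess_class(c("yes", "no", "Yes"))
guess_class(c("1", "2.5", "-3"))
guess_class(c("2020-01-31", "2019-12-01"))
guess_class(c("apple", "pear"))
```

The order of the checks matters in a sketch like this: the strictest formats are tested first, and character is the fallback when nothing else fits.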

Other cleaning

Checking

The easiest and most comprehensive way to check the data of a column/variable is to create a frequency table. Use freq() to do this. It supports many different classes (types of data) as well as weights, and can even be extended by other packages. In markdown documents (like this README file), its output is formatted as real markdown.

freq(unclean$gender)

Frequency table

Class: character
Length: 500
Available: 500 (100%, NA: 0 = 0%)
Unique: 5

Shortest: 1
Longest: 6

     Item      Count   Percent   Cum. Count   Cum. Percent
---  -------  ------  --------  -----------  -------------
1    male        240     48.0%          240          48.0%
2    female      220     44.0%          460          92.0%
3    man          22      4.4%          482          96.4%
4    m            15      3.0%          497          99.4%
5    F             3      0.6%          500         100.0%
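
To make the columns of this table concrete, here is a minimal base R sketch that computes the same counts, percentages and cumulative values. The gender vector is rebuilt from the counts shown above (the unclean data set itself is not reproduced here), and freq_sketch() is a hypothetical helper, not the implementation of freq().

```r
# Rebuild a vector with the same distribution as in the table above
gender <- rep(c("male", "female", "man", "m", "F"),
              times = c(240, 220, 22, 15, 3))

# Hypothetical helper (not part of cleaner); assumes x contains no NA values
freq_sketch <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # most frequent item first
  n <- as.integer(counts)
  data.frame(
    item        = names(counts),
    count       = n,
    percent     = round(100 * n / length(x), 1),
    cum_count   = cumsum(n),
    cum_percent = round(100 * cumsum(n) / length(x), 1),
    stringsAsFactors = FALSE
  )
}

freq_sketch(gender)
```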

Clean it and check again (using markdown = FALSE to show how it would look in the R console):

freq(clean_factor(unclean$gender, 
                  levels = c("^m" = "Male", "^f" = "Female")),
     markdown = FALSE)
#> Frequency table 
#> 
#> Class:      factor (numeric)
#> Length:     500
#> Levels:     2: Male, Female
#> Available:  500 (100%, NA: 0 = 0%)
#> Unique:     2
#> 
#>      Item      Count   Percent   Cum. Count   Cum. Percent
#> ---  -------  ------  --------  -----------  -------------
#> 1    Male        277     55.4%          277          55.4%
#> 2    Female      223     44.6%          500         100.0%
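
The levels argument above maps regular expressions to level names. A rough base R equivalent is sketched below, assuming case-insensitive matching (which the counts above suggest, since "F" ends up as Female). map_levels() is a hypothetical helper for illustration only and is not how clean_factor() is implemented; the gender vector is rebuilt from the counts in the first frequency table.

```r
# Rebuild a vector with the same distribution as the uncleaned column
gender <- rep(c("male", "female", "man", "m", "F"),
              times = c(240, 220, 22, 15, 3))

# Hypothetical helper (not part of cleaner): the first matching pattern wins
map_levels <- function(x, levels) {
  out <- rep(NA_character_, length(x))
  for (pattern in names(levels)) {
    hit <- is.na(out) & grepl(pattern, x, ignore.case = TRUE)
    out[hit] <- levels[[pattern]]
  }
  factor(out, levels = unname(levels))
}

cleaned <- map_levels(gender, c("^m" = "Male", "^f" = "Female"))
table(cleaned)
```

Values that match none of the patterns stay NA in this sketch, which makes leftovers easy to spot in a follow-up frequency table.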

This could also have been done with dplyr syntax, since freq() supports tidy evaluation:

library(dplyr) # provides %>% and pull()

unclean %>% 
  freq(clean_factor(gender,
                    levels = c("^m" = "Male", "^f" = "Female")))
# or:
unclean %>% 
  pull(gender) %>% 
  clean_factor(c("^m" = "Male", "^f" = "Female")) %>% 
  freq()

Speed

The cleaning functions are fast, because they rely on R’s own internal C libraries:

# Create a vector with 500,000 items
n <- 500000
values <- paste0(sample(c("yes", "no"), n, replace = TRUE), 
                 as.integer(runif(n, 0, 10000)))

# data looks like:
values[1:3]
#> [1] "no3697"  "yes1906" "yes6738"

clean_logical(values[1:3])
#> [1] FALSE  TRUE  TRUE

clean_character(values[1:3])
#> [1] "no"  "yes" "yes"

clean_numeric(values[1:3])
#> [1] 3697 1906 6738

# benchmark the cleaning based on 10 runs and show it in seconds:
microbenchmark::microbenchmark(logical = clean_logical(values),
                               character = clean_character(values),
                               numeric = clean_numeric(values),
                               times = 10,
                               unit = "s")
#> Unit: seconds
#> expr            min        lq      mean    median        uq       max neval
#> logical   0.2846163 0.2925479 0.3076008 0.3100244 0.3189712 0.3269428    10
#> character 0.4522698 0.4593437 0.4734631 0.4636837 0.4888959 0.5303473    10
#> numeric   0.6428362 0.6476207 0.6618845 0.6542312 0.6778215 0.6897005    10

Cleaning 500,000 values (!) takes only about 0.3 to 0.7 seconds per function on our system.

Invalid regular expressions

If invalid regular expressions are used, the cleaning functions will not throw errors, but instead will show a warning and will interpret the expression as a fixed value:

clean_character("0123test 0123[a-b] ")
#> [1] "test ab"

clean_character("0123test 0123[a-b] ", remove = "[a-b]")
#> [1] "0123test 0123[-]"

clean_character("0123test 0123[a-b] ", remove = "[a-b")
#> [1] "0123test 0123]"
#> Warning message:
#> invalid regular expression '[a-b', reason 'Missing ']'' - now interpreting as fixed value
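
The fallback described above can be sketched in base R as follows. Both helpers are hypothetical and written only for this example; they show one possible way to get this behaviour and are not taken from the cleaner source code.

```r
# Hypothetical helper (not part of cleaner): TRUE if `pattern` compiles as a regex
is_valid_regex <- function(pattern) {
  tryCatch({
    suppressWarnings(grepl(pattern, ""))
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical helper: remove `pattern` from `x`; an invalid regex is
# treated as literal text and a warning is shown instead of an error
remove_pattern <- function(x, pattern) {
  fixed <- !is_valid_regex(pattern)
  if (fixed) {
    warning("invalid regular expression '", pattern,
            "' - now interpreting as fixed value", call. = FALSE)
  }
  trimws(gsub(pattern, "", x, fixed = fixed))
}

# valid regex: removes the letters a and b
remove_pattern("0123test 0123[a-b] ", "[a-b]")

# invalid regex: removes the literal text "[a-b" and shows a warning
remove_pattern("0123test 0123[a-b] ", "[a-b")
```

Checking the pattern once up front keeps the actual replacement to a single gsub() call, with fixed = TRUE only when the regular expression cannot be compiled.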