library(framecleaner)
library(dplyr)
Data imported from excel and csv in business situations can have messy characteristics and data formats. This package provides functions to tidy your data frame using the power of tidyselect
.
create sample data
::tibble(
tibbledate = c("20190101", "20190305", "20201012"),
numeric_val = c(1, NA, 5),
char_val = c("", " val ", "-")
-> sample_table
)
sample_table
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 ""
## 2 20190305 NA " val "
## 3 20201012 5 "-"
Data occasionally has different ways to represent NA values. set_na
checks as default c("-", "", " ", "null")
but any values can be supplied to automatically be set to NA. This is helpful when you want to check the NA profile of a data frame using validata::diagnose
%>%
sample_table make_na()
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 <NA>
## 2 20190305 NA " val "
## 3 20201012 5 <NA>
remove whitespace from the ends of character variables that may be otherwise undetectable by inspection.
%>%
sample_table remove_whitespace()
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 ""
## 2 20190305 NA "val"
## 3 20201012 5 "-"
automatically convert character columns that should be dates.
%>%
sample_table set_date()
## # A tibble: 3 × 3
## date numeric_val char_val
## <date> <dbl> <chr>
## 1 2019-01-01 1 ""
## 2 2019-03-05 NA " val "
## 3 2020-10-12 5 "-"
relocates an unorganized dataframe using heuristics such as putting character and date columns first, and organizing by alphabetical order.
%>%
sample_table relocate_all()
## # A tibble: 3 × 3
## date char_val numeric_val
## <chr> <chr> <dbl>
## 1 20190101 "" 1
## 2 20190305 " val " NA
## 3 20201012 "-" 5
Wrapper function to apply all cleaning operations to a data frame using sensible defaults.
%>%
sample_table clean_frame()
## # A tibble: 3 × 3
## DATE NUMERIC_VAL CHAR_VAL
## <int> <int> <chr>
## 1 20190101 1 <NA>
## 2 20190305 NA " val "
## 3 20201012 5 <NA>
use tidyselect to fill NAs with a single value
%>%
sample_table fill_na()
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 ""
## 2 20190305 0 " val "
## 3 20201012 5 "-"