library(campfin)
packageVersion("campfin")
#> [1] '1.0.9'
library(dplyr)
library(readr)
library(stringr)
This vignette contains an example of the workflow used by the wranglers on The Accountability Project, a tool created by The Investigative Reporting Workshop in Washington, DC. The Accountability Project curates, cleans, and indexes public data to give journalists, researchers, and the public a simple way to search across otherwise siloed records. The data focuses on people, organizations and locations.
Since state-level campaign finance data is typically reported by the campaigns and provided to them by the contributors themselves, there is often great disparity in data quality. The campfin package was created to reduce this disparity in a consistent, confident, and programmatic way.
In this vignette, we will clean some exaggerated, messy fake contribution data from a fictitious campaign in Vermont. This data is found in the `vt_contribs.csv` file included with our package.
ex_file <- system.file("extdata", "vt_contribs.csv", package = "campfin")
id | cand | date | amount | name | address | city | state | zip |
---|---|---|---|---|---|---|---|---|
01 | Bob Miller | 02/09/2019 | 10 | Lisa Miller | 4 Sheffield square Road | Sheffield | VT | 05866 |
02 | Bob Miller | 03/09/2009 | 20 | Deb Brown | Requested | Requested | RE | 00000 |
03 | Chuck White | 04/09/2019 | 25 | N/A | p.o. box 567 | Midlebury | Vermont | 05753-0567 |
04 | Chuck White | 05/09/2019 | 100 | Josh Jones | sugarhouse SQU | e Corinth | VT | 5076 |
05 | Bob Miller | 02/09/2019 | 10 | Lisa Miller | 4 Sheffield square Road | Sheffield | VT | 05866 |
06 | Chuck White | 06/09/2019 | 1000 | Bob Taylor | 55 thisplace av | young america | mn | 55555 |
07 | Chuck White | 07/09/2019 | -600 | Alex Johnson | 11 Liberty AVN | Bristol, VT | VT | 99999 |
08 | Alice Walsh | 08/09/2019 | 0 | Ruth Smith | 2 Burlington sqre | Brulington | vt | 05401 |
09 | Alice Walsh | 09/09/2019 | 69 | Joe Garcia | 770 5th-st-nw | Washington | D.C. | 20001-2674 |
10 | Alice Walsh | 11/09/2019 | 222 | Dave Wilson | XXXXXXXXXXXXXXXXX | SA | Texas | 78202 |
What are some of the potential problems we can see in this data?

1. The `date` column is not parsed as an R date, making it impossible to perform mathematical calculations like `min()`.
2. There is one negative `amount` value and another that's zero. Not necessarily uncommon, but worth noting.
3. One `name` value is missing.
4. In `address` we see inconsistent capitalization and abbreviation, a value of "Requested", and a repeated-character placeholder, all of which should be normalized or treated as `NA`.
5. In `city` we see many of the same problems, plus misspellings, an abbreviation, and a value containing `state` information.
6. In `state` we see a mix of full and abbreviated state names.
7. In `zip` we see a missing leading zero, unnecessary ZIP+4 suffixes, and placeholder values like 00000 and 99999 that should be `NA` (NB: 55555 is a valid ZIP code, 99999 is not).

While this data is obviously much smaller and more riddled with errors than real campaign finance data, these errors are not uncommon and need to be addressed. The campfin package contains many of the tools we need to first find and then fix these common problems.
In most cases, the first step is to download and read the file from a state agency. When reading the data with the popular `readr::read_delim()` function, the `col_date_mdy()` function can be used as a quick shortcut for `readr::col_date(format = "%m/%d/%Y")`, the format most commonly found in U.S. campaign finance data.
vt <- read_csv(
  file = ex_file,
  trim_ws = FALSE,
  na = c("", "NA", "N/A"),
  col_types = cols(
    amount = col_number(),
    date = col_date_mdy()
  )
)
We can see how the new `date` column is an actual date object, allowing for mathematical manipulation.
min(vt$date)
#> [1] "2009-03-09"
Next, we should try to normalize our data as much as possible. We can use some simple counting functions and built-in vectors to check the cleanliness of our raw data.
prop_in(vt$city, str_to_lower(valid_city))
#> [1] 0.1
prop_in(vt$state, valid_state)
#> [1] 0.4
prop_in(vt$zip, valid_zip)
#> [1] 0.5
col_stats(vt, n_distinct)
#> # A tibble: 9 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 id <chr> 10 1
#> 2 cand <chr> 3 0.3
#> 3 date <date> 9 0.9
#> 4 amount <dbl> 9 0.9
#> 5 name <chr> 9 0.9
#> 6 address <chr> 9 0.9
#> 7 city <chr> 9 0.9
#> 8 state <chr> 7 0.7
#> 9 zip <chr> 9 0.9
col_stats(vt, count_na)
#> # A tibble: 9 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 id <chr> 0 0
#> 2 cand <chr> 0 0
#> 3 date <date> 0 0
#> 4 amount <dbl> 0 0
#> 5 name <chr> 1 0.1
#> 6 address <chr> 0 0
#> 7 city <chr> 0 0
#> 8 state <chr> 0 0
#> 9 zip <chr> 0 0
A typical benchmark is to reach greater than 95% valid. That is, we want to normalize our data enough that less than 5% of our data cannot be easily confirmed as valid using a fairly comprehensive list of cities, states, and ZIP codes.
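For instance, that benchmark can be expressed as a quick logical check using the same `prop_in()` calls shown above. This is just a sketch, and on the raw data every column falls well short of the threshold:

```r
# check each geographic column against the 95% benchmark (all FALSE so far)
c(
  city  = prop_in(vt$city, str_to_lower(valid_city)),
  state = prop_in(vt$state, valid_state),
  zip   = prop_in(vt$zip, valid_zip)
) >= 0.95
```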
We will first try to reach this threshold by normalizing our data. This process involves reducing inconsistencies through string manipulation. There are separate `normal_*()` functions for each of the four types of geographic variables. Typically we use `dplyr::mutate()` to create new, normalized versions of our messy columns, preserving the old data for transparency. Here, we will just overwrite our example data for simplicity.
vt <- vt %>%
  mutate(
    address = normal_address(
      address = address,
      abbs = usps_street,
      na = invalid_city,
      na_rep = TRUE
    ),
    city = normal_city(
      city = city,
      abbs = usps_city,
      states = "VT",
      na = invalid_city
    ),
    state = normal_state(
      state = state,
      abbreviate = TRUE,
      na_rep = TRUE,
      valid = valid_state
    ),
    zip = normal_zip(
      zip = zip,
      na_rep = TRUE
    )
  )
We can see how these functions and our built-in data were used to normalize the geographic contributor data and remove anything that didn't present real information. This format is much more easily explored and searched.
#> # A tibble: 10 × 4
#> address city state zip
#> <chr> <chr> <chr> <chr>
#> 1 4 SHEFFIELD SQUARE RD SHEFFIELD VT 05866
#> 2 <NA> <NA> <NA> <NA>
#> 3 PO BOX 567 MIDLEBURY VT 05753
#> 4 SUGARHOUSE SQ EAST CORINTH VT 5076
#> 5 4 SHEFFIELD SQUARE RD SHEFFIELD VT 05866
#> 6 55 THISPLACE AVE YOUNG AMERICA MN 55555
#> 7 11 LIBERTY AVE BRISTOL VT <NA>
#> 8 2 BURLINGTON SQ BRULINGTON VT 05401
#> 9 770 5THSTNW WASHINGTON DC 20001
#> 10 <NA> SA TX 78202
However, the problem has not been solved. City names are the most troublesome; there are so many city names and such great variety (compared to states and ZIP codes) that they can be difficult to normalize and difficult to assess.
Our `valid_city` vector contains many city names, but far fewer than exist in the country, especially when you account for neighborhoods that aren't really cities but shouldn't be changed (some of these are contained in our curated `extra_city` vector).
length(valid_city)
#> [1] 19083
sample(valid_city, 6)
#> [1] "WINSTON SALEM" "BAINBRIDGE" "MIDDLE RIVER" "CROZET"
#> [5] "KNIFLEY" "ALEPPO"
sample(extra_city, 6)
#> [1] "LATHRUP VILLAGE" "BIG CHIMNEY" "MOUNT PENN"
#> [4] "CLIFFWOOD BEACH" "BAY HARBOR ISLANDS" "CHERRY HILLS VILLAGE"
# combine both vectors
many_city <- c(valid_city, extra_city)
Still, checking against this list is a good way to check for values that need additional attention.
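The chunk that produces the check below isn't shown in this text; a minimal sketch, assuming we keep the geographic columns and flag any non-missing city absent from `many_city`, might look like this:

```r
# flag cities that cannot be found in the combined list of valid names
bad <- vt %>%
  select(id, city, state, zip) %>%
  mutate(valid = city %in% many_city) %>%
  filter(!valid, !is.na(city))
```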
#> # A tibble: 3 × 5
#> id city state zip valid
#> <chr> <chr> <chr> <chr> <lgl>
#> 1 03 MIDLEBURY VT 05753 FALSE
#> 2 08 BRULINGTON VT 05401 FALSE
#> 3 10 SA TX 78202 FALSE
It might not be clear what's actually wrong with these values. A common way to check is by comparing them against the expected city for a given ZIP code.
bad <- left_join(
  x = bad,
  y = zipcodes,
  by = c("zip", "state"),
  suffix = c("_raw", "_match")
)
id | city_raw | state | zip | city_match |
---|---|---|---|---|
03 | MIDLEBURY | VT | 05753 | MIDDLEBURY |
08 | BRULINGTON | VT | 05401 | BURLINGTON |
10 | SA | TX | 78202 | SAN ANTONIO |
Now the problems become clear. Two city names are misspelled and the third is an obvious abbreviation. When dealing with millions of city names, we need a way to check each raw value against its expected ZIP code match.
The `str_dist()` and `is_abbrev()` functions can be used to compare the value we have with the value we expect. By only checking against the city corresponding to that record's ZIP code, we are making extremely confident changes (as compared to the incredibly useful clustering algorithms like those provided by the refinr package).
First, we can use `str_dist()` to check the distance between the two strings; distance is defined as the number of changes we'd need to make to our normalized value to get our expected matched value. If that distance is small (usually 1 or 2), we can pretty confidently use the matched value.
str_dist("example", "xampel")
#> [1] 2
But string distance does not catch colloquial city abbreviations (e.g., NYC, BOS, LA, CHI, ABQ, BRNX, DFW, OKC). Many residents get so used to writing their city's name that they use abbreviations and assume them to be universally understood. The `is_abbrev()` function can be used to check whether one string might be an abbreviation for another. Every abbreviation generated by the `abbreviate()` function satisfies the requirements of `is_abbrev()`.
is_abbrev(abb = "NYC", full = "New York City")
#> [1] TRUE
is_abbrev(abb = "DC", full = "Washington")
#> [1] FALSE
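To illustrate that last claim, here is a small sketch (the example strings are made up for this illustration) pairing base R's `abbreviate()` with `is_abbrev()`:

```r
# abbreviate() drops characters but keeps the rest in their original order,
# which is what is_abbrev() looks for (per the claim above)
full <- c("New York City", "Washington")
is_abbrev(abb = abbreviate(full, minlength = 3), full = full)
```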
bad <- bad %>%
  mutate(
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match)
  )
id | city_raw | state | zip | city_match | match_dist | match_abb |
---|---|---|---|---|---|---|
03 | MIDLEBURY | VT | 05753 | MIDDLEBURY | 1 | FALSE |
08 | BRULINGTON | VT | 05401 | BURLINGTON | 1 | FALSE |
10 | SA | TX | 78202 | SAN ANTONIO | 9 | TRUE |
Here’s what this process would look like when employed on an entire data frame. It’s important to ensure that the number of rows in our campaign finance data is kept consistent throughout the wrangling process and that original columns are left unchanged.
vt <- vt %>%
  rename(city_raw = city) %>%
  # match city by ZIP
  left_join(zipcodes) %>%
  rename(city_match = city) %>%
  mutate(
    # check against match
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match),
    city = ifelse(match_abb | match_dist == 1, city_match, city_raw)
  ) %>%
  # remove intermediary columns
  select(-city_raw, -city_match, -match_dist, -match_abb)
#> Joining, by = c("state", "zip")
Now every `city`, `state`, and `zip` value is contained in our list of valid values.
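The confirmation below was generated with a chunk that isn't shown in this text; a minimal sketch, assuming we drop the rows with missing geographic values and test each column against its list, might look like this:

```r
# confirm every remaining value appears in the appropriate list of valid values
vt %>%
  select(id, state, zip, city) %>%
  filter(!is.na(city)) %>%
  mutate(
    all_valid = city %in% many_city & state %in% valid_state & zip %in% valid_zip
  )
```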
id | state | zip | city | all_valid |
---|---|---|---|---|
01 | VT | 05866 | SHEFFIELD | TRUE |
03 | VT | 05753 | MIDDLEBURY | TRUE |
05 | VT | 05866 | SHEFFIELD | TRUE |
06 | MN | 55555 | YOUNG AMERICA | TRUE |
08 | VT | 05401 | BURLINGTON | TRUE |
09 | DC | 20001 | WASHINGTON | TRUE |
10 | TX | 78202 | SAN ANTONIO | TRUE |
Once our data is as normal as we can confidently make it, we can begin to explore. First, we'll explore the data for missing values with `flag_na()`, which takes a tidyselect input of columns (or something like `dplyr::everything()`).
(vt <- flag_na(vt, name))
#> # A tibble: 10 × 10
#> id cand date amount name address state zip city na_flag
#> <chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 01 Bob Miller 2019-02-09 10 Lisa M… 4 SHEF… VT 05866 SHEF… FALSE
#> 2 02 Bob Miller 2009-03-09 20 Deb Br… <NA> <NA> <NA> <NA> FALSE
#> 3 03 Chuck White 2019-04-09 25 <NA> PO BOX… VT 05753 MIDD… TRUE
#> 4 04 Chuck White 2019-05-09 100 Josh J… SUGARH… VT 5076 <NA> FALSE
#> 5 05 Bob Miller 2019-02-09 10 Lisa M… 4 SHEF… VT 05866 SHEF… FALSE
#> 6 06 Chuck White 2019-06-09 1000 Bob Ta… 55 THI… MN 55555 YOUN… FALSE
#> 7 07 Chuck White 2019-07-09 -600 Alex J… 11 LIB… VT <NA> <NA> FALSE
#> 8 08 Alice Walsh 2019-08-09 0 Ruth S… 2 BURL… VT 05401 BURL… FALSE
#> 9 09 Alice Walsh 2019-09-09 69 Joe Ga… 770 5T… DC 20001 WASH… FALSE
#> 10 10 Alice Walsh 2019-11-09 222 Dave W… <NA> TX 78202 SAN … FALSE
Next, we'll want to check for duplicate rows using `flag_dupes()`, which takes the same kind of arguments. Here, we can ignore the supposedly unique `id` variable. It's possible for a person to make the same contribution on the same date, but we should flag them nonetheless.
(vt <- flag_dupes(vt, -id, .both = TRUE))
#> # A tibble: 10 × 11
#> id cand date amount name address state zip city na_flag
#> <chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 01 Bob Miller 2019-02-09 10 Lisa M… 4 SHEF… VT 05866 SHEF… FALSE
#> 2 02 Bob Miller 2009-03-09 20 Deb Br… <NA> <NA> <NA> <NA> FALSE
#> 3 03 Chuck White 2019-04-09 25 <NA> PO BOX… VT 05753 MIDD… TRUE
#> 4 04 Chuck White 2019-05-09 100 Josh J… SUGARH… VT 5076 <NA> FALSE
#> 5 05 Bob Miller 2019-02-09 10 Lisa M… 4 SHEF… VT 05866 SHEF… FALSE
#> 6 06 Chuck White 2019-06-09 1000 Bob Ta… 55 THI… MN 55555 YOUN… FALSE
#> 7 07 Chuck White 2019-07-09 -600 Alex J… 11 LIB… VT <NA> <NA> FALSE
#> 8 08 Alice Walsh 2019-08-09 0 Ruth S… 2 BURL… VT 05401 BURL… FALSE
#> 9 09 Alice Walsh 2019-09-09 69 Joe Ga… 770 5T… DC 20001 WASH… FALSE
#> 10 10 Alice Walsh 2019-11-09 222 Dave W… <NA> TX 78202 SAN … FALSE
#> # … with 1 more variable: dupe_flag <lgl>
This normalized data is now ready to be uploaded to the Accountability Project and searched alongside 1 billion other records! These cleaned names and addresses might bring up search results alongside one of our other sets of public data: campaign expenditures, registered voters, nonprofit organizations, stimulus spending, government contracts, lobbyist registrations, etc.
id | cand | date | amount | name | address | state | zip | city | na_flag | dupe_flag |
---|---|---|---|---|---|---|---|---|---|---|
01 | Bob Miller | 2019-02-09 | 10 | Lisa Miller | 4 SHEFFIELD SQUARE RD | VT | 05866 | SHEFFIELD | FALSE | TRUE |
02 | Bob Miller | 2009-03-09 | 20 | Deb Brown | NA | NA | NA | NA | FALSE | FALSE |
03 | Chuck White | 2019-04-09 | 25 | NA | PO BOX 567 | VT | 05753 | MIDDLEBURY | TRUE | FALSE |
04 | Chuck White | 2019-05-09 | 100 | Josh Jones | SUGARHOUSE SQ | VT | 5076 | NA | FALSE | FALSE |
05 | Bob Miller | 2019-02-09 | 10 | Lisa Miller | 4 SHEFFIELD SQUARE RD | VT | 05866 | SHEFFIELD | FALSE | TRUE |
06 | Chuck White | 2019-06-09 | 1000 | Bob Taylor | 55 THISPLACE AVE | MN | 55555 | YOUNG AMERICA | FALSE | FALSE |
07 | Chuck White | 2019-07-09 | -600 | Alex Johnson | 11 LIBERTY AVE | VT | NA | NA | FALSE | FALSE |
08 | Alice Walsh | 2019-08-09 | 0 | Ruth Smith | 2 BURLINGTON SQ | VT | 05401 | BURLINGTON | FALSE | FALSE |
09 | Alice Walsh | 2019-09-09 | 69 | Joe Garcia | 770 5THSTNW | DC | 20001 | WASHINGTON | FALSE | FALSE |
10 | Alice Walsh | 2019-11-09 | 222 | Dave Wilson | NA | TX | 78202 | SAN ANTONIO | FALSE | FALSE |