-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathREADME.Rmd
156 lines (125 loc) · 5.97 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
output: github_document
always_allow_html: yes
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
if (!interactive()) {
options(width = 99)
}
```
# campfin <img src="man/figures/logo.png" align="right" width="120" />
<!-- badges: start -->
[![Lifecycle: maturing][life_badge]][life_link]
[![CRAN status][cran_badge]][cran_link]
![Downloads][dl_badge]
[![Codecov test coverage][cov_badge]][cov_link]
[![R build status][gh_badge]][gh_link]
<!-- badges: end -->
The campfin package was created to facilitate the work being done on the
[The Accountability Project][tap], a tool created by
[The Investigative Reporting Workshop][irw] in Washington, DC. The
Accountability Project curates, cleans, and indexes public data to give
journalists, researchers and others a simple way to search across otherwise
siloed records.
The data focuses on people, organizations and locations. This package was
created specifically to help with state-level campaign finance data, although
the tools included are useful in general database exploration and normalization.
## Installation
You can install the released version of campfin from [CRAN][cran] with:
```{r install_cran, eval=FALSE}
install.packages("campfin")
```
The development version can be installed from [GitHub][gh] with:
```{r install_github, eval=FALSE}
# install.packages("remotes")
remotes::install_github("irworkshop/campfin")
```
## Normalize
The package was originally built to normalize geographic data using the
`normal_*()` functions, which take the messy self-reported geographic data of a
contributor, vendor, candidate, or committee and return [normalized text][txt]
that is more searchable. They are largely wrappers around the [stringr] package,
and can call other sub-functions to streamline normalization.
* `normal_address()` takes a _street_ address and reduces inconsistencies.
* `normal_zip()` takes [ZIP Codes][zip] and aims to return a valid 5-digit code.
* `normal_state()` takes US states and returns a [2 digit abbreviation][abbs].
* `normal_city()` takes cities and reduces inconsistencies.
* `normal_phone()` consistently formats US telephone numbers.
Please see the vignette on normalization for an example of how these functions
are used to fix a wide variety of string inconsistencies and make campaign
finance data more consistent.
## Data
```{r library, message=FALSE, warning=FALSE}
library(campfin)
library(tidyverse)
```
The campfin package contains a number of built in data frames and strings used
to help wrangle campaign finance data.
The `/data-raw` directory contains the code used to create the objects.
### zipcodes
The `zipcodes` (plural) table is a new version of the `zipcode` (singular) table
from the archived [zipcode] R package.
> This database was composed using ZIP code gazetteers from the US Census Bureau
from 1999 and 2000, augmented with additional ZIP code information The database
is believed to contain over 98% of the ZIP Codes in current use in the United
States. The remaining ZIP Codes absent from this database are entirely PO Box or
Firm ZIP codes added in the last five years, which are no longer published by
the Census Bureau, but in any event serve a very small minority of the
population (probably on the order of .1% or less). Although every attempt has
been made to filter them out, this data set may contain up to .5% false
positives, that is, ZIP codes that do not exist or are no longer in use but are
included due to erroneous data sources.
The included `valid_city` and `valid_zip` vectors are sorted, unique columns
from the `zipcodes` data frame.
```{r geo_df, collapse=TRUE, warning=FALSE, message=FALSE, error=FALSE}
sample_frac(zipcodes)
```
### `usps_*` and `valid_*`
The `usps_*` data frames were scraped from the official United States Postal
Service (USPS) [Postal Addressing Standards][pub]. These data frames are
designed to work with the abbreviation functionality of `normal_address()` and
`normal_city()` to replace common abbreviations with their full equivalent.
`usps_city` is a curated subset of `usps_state`, whose full version appear at
least once in the `valid_city` vector from `zipcodes`. The `valid_state` and
`valid_name` vectors contain the columns from `usps_state` and include
territories not found in R's build in `state.abb` and `state.name` vectors.
```{r usps}
sample_n(usps_street, 3)
sample_n(usps_state, 3)
setdiff(valid_state, state.abb)
```
*****
The campfin project is released with a [Contributor Code of Conduct][coc]. By
contributing, you agree to abide by its terms.
<!-- refs: start -->
[life_badge]: https://img.shields.io/badge/lifecycle-maturing-blue.svg
[life_link]: https://lifecycle.r-lib.org/articles/stages.html
[cran_badge]: https://www.r-pkg.org/badges/version/campfin
[cran_link]: https://CRAN.R-project.org/package=campfin
[dl_badge]: https://cranlogs.r-pkg.org/badges/grand-total/campfin
[cov_badge]: https://img.shields.io/codecov/c/github/irworkshop/campfin/master.svg
[cov_link]: https://app.codecov.io/gh/irworkshop/campfin?branch=master
[gh_badge]: https://github.com/irworkshop/campfin/workflows/R-CMD-check/badge.svg
[gh_link]: https://github.com/irworkshop/campfin/actions
[tap]: https://www.publicaccountability.org/
[irw]: https://investigativereportingworkshop.org/
[fin]: https://en.wikipedia.org/wiki/Campaign_finance
[cran]: https://cran.r-project.org/package=campfin
[gh]: https://github.com/irworkshop/campfin
[tidyverse]: https://www.tidyverse.org/
[txt]: https://en.wikipedia.org/wiki/Text_normalization
[stringr]: https://github.com/tidyverse/stringr
[zip]: https://en.wikipedia.org/wiki/ZIP_Code
[abbs]: https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations
[zipcode]: https://cran.r-project.org/src/contrib/Archive/zipcode/
[pub]: https://pe.usps.com/text/pub28/28apc_002.htm
[coc]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html
<!-- refs: end -->