forked from rstudio/pointblank
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
339 lines (274 loc) · 14 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
<img src="inst/graphics/pointblank_logo.png">
[](https://travis-ci.org/rich-iannone/pointblank)
[](https://codecov.io/github/rich-iannone/pointblank?branch=master)
[](https://cran.r-project.org/package=pointblank)
## We All Need to Validate our Data Sometimes
Tables can often be trustworthy. All the data seems to be there and we may feel we can count on these tables to deliver us the info we need. Still, sometimes, the tables we trust are hiding things from us. Malformed strings, numbers we don't expect, missing values that ought not to be missing. These abberations can be hiding almost in plain sight. Such inconsistencies can be downright insidious, and with time all of this makes us ask ourselves, "can we really trust any table?"
Sure, we can sit down with a table during a long interrogation session and rough it up with a little **SQL**. The problem is we have lots of tables, and we usually have a lot of columns in every one of these tables. Makes for long hours with many suspect tables...
We need a tool like **pointblank**. It lets us get up close with tables and unleash a fury of validation checks. Are some tables in remote databases? That's no problem, we'll interrogate them from afar. In essence, your DB tables can get the same line of questioning as your local data frames or those innocent-looking tibbles. Trust me, they'll start to talk and then they'll surely reveal what they're hiding after an intense **pointblank** session.
You don't have to type up a long report either, **pointblank** will take care of the paperwork. At the very least, you can get a `yes` or `no` as to whether everything checked out. With a little bit of planning, a very informative validation report can be regularly produced. We can even fire off emails or send messages to Slack if things get out of hand.
### Validating Local Data Frames
The **pointblank** package can validate data in local data frames, local tibble objects, in CSV and TSV files, and in database tables (**PostgreSQL** and **MySQL**). First, let's look at local tables with...
<img src="inst/graphics/example_workflow.png">
The above workflow relies on these code blocks:
1. Create 2 very simple **R** **tibble** objects:
```r
library(tibble)
tbl_1 <-
tibble::tribble(
~a, ~b, ~c,
1, 6, "h2",
2, 7, "h2",
3, 8, "h2",
4, 9, "d3",
5, 10, "h2")
tbl_2 <-
tibble::tribble(
~d, ~e, ~f,
"a", 0, 32,
"b", 0, 31,
"a", 1, 30,
"a", 1, 32,
"ae", -1, 39)
```
2. Create a **pointblank** pipeline for validating both the `tbl_1` and `tbl_2` tables (ending with `interrogate()`):
```r
library(pointblank)
agent <-
create_agent() %>% # (1)
focus_on(
tbl_name = "tbl_1") %>% # (2)
col_vals_gt(
column = a,
value = 0) %>% # (3)
rows_not_duplicated(
cols = a & b & c) %>% # (4)
col_vals_gte(
column = a + b,
value = 7) %>% # (5)
col_vals_lte(
column = b,
value = 10) %>% # (6)
col_vals_regex(
column = c,
regex = "[a-z][0-9]") %>% # (7)
col_vals_in_set(
column = c,
set = c("h2", "d3")) %>% # (8)
focus_on(
tbl_name = "tbl_2") %>% # (9)
col_vals_in_set(
column = d,
set = c("a", "b")) %>% # (10)
col_vals_not_in_set(
column = d,
set = c("a", "b")) %>% # (11)
col_vals_gte(
column = e,
value = 0) %>% # (12)
col_vals_null(
column = f) %>% # (13)
col_vals_not_null(
column = d) %>% # (14)
interrogate() # (15)
```
We can get a detailed summary report of the interrogation, showing how many individual tests in each validation step had passed or failed. The validation steps are classified with an `action` which indicates the type of action to perform based on user-defined thresholds (thresholds can be set globally, or, for each validation step). Each step can be given a `brief`, which is a short description of the validation step. If a `brief` is not provided, it will be autogenerated based on the input parameters.
```r
library(pointblank)
get_interrogation_summary(agent)[1:5]
#> # A tibble: 11 x 5
#> tbl_name db_type assertion_type column value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 tbl_1 local_df col_vals_gt a 0
#> 2 tbl_1 local_df rows_not_duplicated a, b, c NA
#> 3 tbl_1 local_df col_vals_gte a + b 7
#> 4 tbl_1 local_df col_vals_lte b 10
#> 5 tbl_1 local_df col_vals_regex c NA
#> 6 tbl_1 local_df col_vals_in_set c NA
#> 7 tbl_2 local_df col_vals_in_set d NA
#> 8 tbl_2 local_df col_vals_not_in_set d NA
#> 9 tbl_2 local_df col_vals_gte e 0
#> 10 tbl_2 local_df col_vals_null f NA
#> 11 tbl_2 local_df col_vals_not_null d NA
get_interrogation_summary(agent)[6:11]
#> # A tibble: 11 x 6
#> regex all_passed n n_passed f_passed action
#> <chr> <lgl> <dbl> <dbl> <dbl> <chr>
#> 1 <NA> TRUE 5 5 1.0 <NA>
#> 2 <NA> TRUE 5 5 1.0 <NA>
#> 3 <NA> TRUE 5 5 1.0 <NA>
#> 4 <NA> TRUE 5 5 1.0 <NA>
#> 5 [a-z][0-9] TRUE 5 5 1.0 <NA>
#> 6 <NA> TRUE 5 5 1.0 <NA>
#> 7 <NA> FALSE 5 4 0.8 warn
#> 8 <NA> FALSE 5 1 0.2 warn
#> 9 <NA> FALSE 5 4 0.8 warn
#> 10 <NA> FALSE 5 0 0.0 warn
#> 11 <NA> TRUE 5 5 1.0 <NA>
get_interrogation_summary(agent)[12]
#> # A tibble: 11 x 1
#> brief
#> <chr>
#> 1 Expect that values in `a` should be > 0
#> 2 Expect that rows from `a, b, c` have no duplicates
#> 3 Expect that values in `a + b` (computed column) should be > 7
#> 4 Expect that values in `b` should be > 10
#> 5 Expect that values in `c` should match the regex expression `[a-z][0-9]`
#> 6 Expect that values in `c` should be part of set `h2, d3`
#> 7 Expect that values in `d` should be part of set `a, b`
#> 8 Expect that values in `d` should not be part of set `a, b`
#> 9 Expect that values in `e` should be > 0
#> 10 Expect that values in `f` should be NULL
#> 11 Expect that values in `d` should not be NULL
```
A self-contained HTML report can be generated. It provides detailed information on the validation outcomes and it can be used as web content.
```r
library(pointblank)
get_html_summary(agent)
```
### Constraining Data in Validation Steps
Every validation function has a common set of options for constraining validations to certain conditions. This can occur through the use of computed columns (e.g, `column = col_a / col_b`) and also through `preconditions` that can allow you to target validations on only those rows that satisfy one or more conditions (e.g, `preconditions = col_a > 10 & col_b < 200`). When validating database tables, we have the option of modifying the table of focus more radically by supplying an SQL statement to `initial_sql`.
<img src="inst/graphics/function_options.png">
### Validating Tables in a Database
We can easily validate tables in a **PostgreSQL** or **MySQL**. To facilitate access to DB tables, we create a credentials file and supply it to each `focus_on()` step. The `create_creds_file()` allows for the creation of this file.
```r
library(pointblank)
# Create a credentials file for
# accessing a database
create_creds_file(
file = ".db_creds",
dbname = ***********,
host = ***********************,
port = ***,
user = ********,
password = **************)
```
The functional pipeline to validate database tables is not very different than that for local tables. With database tables, however, we have the additional option to supply an SQL statement (as `initial_sql`) to either the `focus_on()` function (this applies the SQL statement to table-in-focus for every subsequent validation step), or, to a specific validation step (this overrides any SQL supplied in `focus_on()`). For convenience, you can either supply an SQL fragment (usually a single `WHERE` statement), or a full-fledged SQL statement that can more greatly transform the table (e.g., using `GROUP BY`, performing table joins, creating new columns, etc.). Any new columns generated via `initial_sql` can be used as a column to validate. Below is an example of what could be done with a mix of `initial_sql` and `preconditions` on a hypothetical **PostgreSQL** table.
```r
library(pointblank)
library(glue)
library(lubridate)
agent <-
create_agent() %>%
focus_on(
tbl_name = "table_1",
db_type = "PostgreSQL",
creds_file = ".db_creds",
initial_sql =
glue(
"WHERE date > '{date}'",
date = today() - days(10))) %>%
col_vals_gte(
column = a,
value = 2) %>%
col_vals_between(
column = b + c + d,
left = 50,
right = 100,
preconditions =
c > d & !is.na(b)) %>%
col_vals_lte(
column = sum_c,
value = 100,
initial_sql = "
SELECT date, a, b, SUM(c) AS sum_c
FROM table_1
WHERE a < 10 AND b IS NOT NULL
GROUP BY date, a") %>%
interrogate()
```
### Creating and Accessing Row Data that Failed Validation
We can collect row data that didn't pass a validation step. The amount of non-passing row data collected can be configured in the `interrogate()` function call. When validating local data (data frames, tibbles, CSVs/TSVs), we can set `get_problem_rows = TRUE` and provide values to either of:
- `get_first_n`: Get the first `n` non-passing rows from each validation step.
- `sample_n`: Sample `n` non-passing rows from each validation step.
- `sample_frac`: Sample a fraction of the total non-passing rows from each validation step.
For remote tables, we cannot use any of the `sample_*` arguments to collect non-passing rows (only `get_first_n` currently works). In order to avoid sampling more rows than could reasonably be handled with `sample_frac`, the `sample_limit` argument allows us to provide a sensible limit to the number of rows returned.
The amount of row data available depends on both the fraction of rows that didn't pass a validation step and the level of sampling or explicit collection from that set of rows (this is defined within the `interrogate()` call).
Here is an example of how rows that didn't pass a validation step can be collected and accessed:
```r
library(pointblank)
# Set a seed
set.seed(23)
# Create a simple data frame with a
# column of numerical values
df <-
data.frame(
a = rnorm(
n = 100,
mean = 5,
sd = 2))
# Create 2 simple validation steps
# that test values of column `a`
agent <-
create_agent() %>%
focus_on(tbl_name = "df") %>%
col_vals_between(
column = a,
left = 4,
right = 6) %>%
col_vals_lte(
column = a,
value = 10) %>%
interrogate(
get_problem_rows = TRUE,
get_first_n = 10)
# Find out which of the two
# validation steps contain sample
# row data (it is step 1)
get_row_sample_info(agent)[, 1:5]
#> step tbl type n_fail n_sampled
#> 1 1 df col_vals_between 65 10
# Get row sample data for those rows
# in `df` that did not pass the first
# validation step (where column values
# for `col_vals_between()` were not
# between 4 and 6); the leading column,
# `pb_step_`, has been added to provide
# context on the validation step for
# which these rows failed to pass
agent %>%
get_row_sample_data(step = 1)
#> # A tibble: 10 x 2
#> pb_step_ a
#> <int> <dbl>
#> 1 1 6.826534
#> 2 1 8.586776
#> 3 1 6.993210
#> 4 1 7.214981
#> 5 1 7.038411
#> 6 1 8.151559
#> 7 1 2.906929
#> 8 1 2.567247
#> 9 1 3.959643
#> 10 1 3.801374
```
### Functions Available in the Package
These workflow examples provided a glimpse of some of the functions available. For sake of completeness, here's the entire set of functions:
<img src="inst/graphics/pointblank_functions.png">
## Installation
**pointblank** is used in an **R** environment. If you don't have an **R** installation, it can be obtained from the [**Comprehensive R Archive Network (CRAN)**](https://cran.r-project.org/).
The **CRAN** version of this package can be obtained using the following statement:
```r
install.packages("pointblank")
```
You can install the development version of **pointblank** from **GitHub** using the **devtools** package.
```r
devtools::install_github("rich-iannone/pointblank")
```
If you encounter a bug, have usage questions, or want to share ideas to make this package better, feel free to file an [issue](https://github.com/rich-iannone/pointblank/issues).
## Code of Conduct
[Contributor Code of Conduct](https://github.com/rich-iannone/pointblank/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.
## License
MIT © Richard Iannone