|
| 1 | +# PClean |
| 2 | + |
| 3 | +[Build status](https://travis-ci.com/probcomp/PClean) |
| 4 | + |
| 5 | +PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning |
| 6 | + |
| 7 | +*Warning: This is a rapidly evolving research prototype.* |
| 8 | + |
| 9 | +PClean was created at the [MIT Probabilistic Computing Project](http://probcomp.csail.mit.edu/). |
| 10 | + |
| 11 | +If you use PClean in your research, please cite our 2021 AISTATS paper: |
| 12 | + |
| 13 | +PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming. Lew, A. K.; Agrawal, M.; Sontag, D.; and Mansinghka, V. K. (2021, March). |
| 14 | +In International Conference on Artificial Intelligence and Statistics (pp. 1927-1935). PMLR. ([pdf](http://proceedings.mlr.press/v130/lew21a/lew21a.pdf)) |
| 15 | + |
| 16 | +## Using PClean |
| 17 | + |
| 18 | + |
| 19 | +To use PClean, create a Julia file with the following structure: |
| 20 | + |
| 21 | +```julia |
| 22 | +using PClean |
| 23 | +using DataFrames: DataFrame |
| 24 | +import CSV |
| 25 | + |
| 26 | +# Load data |
| 27 | +data = CSV.File(filepath) |> DataFrame |
| 28 | + |
| 29 | +# Define PClean model |
| 30 | +PClean.@model MyModel begin |
| 31 | + @class ClassName1 begin |
| 32 | + ... |
| 33 | + end |
| 34 | + |
| 35 | + ... |
| 36 | + |
| 37 | + @class ClassNameN begin |
| 38 | + ... |
| 39 | + end |
| 40 | +end |
| 41 | + |
| 42 | +# Align column names of CSV with variables in the model. |
| 43 | +# Format is ColumnName CleanVariable DirtyVariable, or, if |
| 44 | +# there is no corruption for a certain variable, one can omit |
| 45 | +# the DirtyVariable. |
| 46 | +query = @query MyModel.ClassNameN [ |
| 47 | + HospitalName hosp.name observed_hosp_name |
| 48 | + Condition metric.condition.desc observed_condition |
| 49 | + ... |
| 50 | +] |
| 51 | + |
| 52 | +# Configure observed dataset |
| 53 | +observations = [ObservedDataset(query, data)] |
| 54 | + |
| 55 | +# Configuration |
| 56 | +config = PClean.InferenceConfig(1, 2; use_mh_instead_of_pg=true) |
| 57 | + |
| 58 | +# SMC initialization |
| 59 | +state = initialize_trace(observations, config) |
| 60 | + |
| 61 | +# Rejuvenation sweeps |
| 62 | +run_inference!(state, config) |
| 63 | + |
| 64 | +# Evaluate accuracy, if ground truth is available |
| 65 | +ground_truth = CSV.File(filepath) |> DataFrame |
| 66 | +results = evaluate_accuracy(data, ground_truth, state, query) |
| 67 | + |
| 68 | +# Can print results.f1, results.precision, results.accuracy, etc. |
| 69 | +println(results) |
| 70 | + |
| 71 | +# Even without ground truth, can save the entire latent database to CSV files: |
| 72 | +PClean.save_results(dir, dataset_name, state, observations) |
| 73 | +``` |
| 74 | + |
| 75 | +Then, from this directory, run the Julia file. |
| 76 | + |
| 77 | +``` |
| 78 | +JULIA_PROJECT=. julia my_file.jl |
| 79 | +``` |
| 80 | + |
| 81 | +To learn to write a PClean model, see [our paper](http://proceedings.mlr.press/v130/lew21a/lew21a.pdf), but note |
| 82 | +the surface syntax changes described below. |
| 83 | + |
| 84 | +## Differences from the paper |
| 85 | + |
| 86 | +Because PClean is implemented as a DSL embedded in Julia, its surface syntax differs in a few ways |
| 87 | +from the stand-alone syntax presented in our paper: |
| 88 | + |
| 89 | +(1) Instead of `latent class C ... end`, we write `@class C begin ... end`. |
| 90 | + |
| 91 | +(2) Instead of `subproblem begin ... end`, inference hints are given using ordinary |
| 92 | + Julia `begin ... end` blocks. |
| 93 | + |
| 94 | +(3) Instead of `parameter x ~ d(...)`, we use `@learned x :: D{...}`. The set of |
| 95 | + distributions D for parameters is somewhat restricted. |
| 96 | + |
| 97 | +(4) Instead of `x ~ d(...) preferring E`, we write `x ~ d(..., E)`. |
| 98 | + |
| 99 | +(5) Instead of `observe x as y, ... from C`, write `@query ModelName.C [x y; ...]`. |
| 100 | + Clauses of the form `x z y` are also allowed, and tell PClean that the model variable |
| 101 | + `C.z` represents a clean version of `x`, whose observed (dirty) version is modeled |
| 102 | + as `C.y`. This is used when automatically reconstructing a clean, flat dataset. |
| 103 | + |
| 104 | +The names of built-in distributions may also be different, e.g. `AddTypos` instead of `typos`, |
| 105 | +and `ProportionsParameter` instead of `dirichlet`. |
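
Because each `@query` clause starts with a CSV column name, a misspelled or missing column is an easy mistake to make before inference even starts. The following is a minimal sketch of a pre-flight check, written in plain Julia with DataFrames only; it makes no PClean calls. The names `required_columns` and `missing_columns` and the toy table are hypothetical and are not part of PClean; only the column names `HospitalName` and `Condition` come from the example above.

```julia
using DataFrames: DataFrame

# Toy stand-in for the table loaded from CSV (hypothetical placeholder rows).
data = DataFrame(HospitalName = ["a", "b"],
                 Condition = ["x", "y"])

# Column names that the @query block is expected to refer to (hypothetical list).
required_columns = ["HospitalName", "Condition"]

# Return the required column names that are absent from the table.
missing_columns(df, required) = setdiff(required, names(df))

# Fail early, with a readable message, if any expected column is absent.
absent = missing_columns(data, required_columns)
isempty(absent) || error("CSV is missing columns: $(join(absent, ", "))")
```

Running a check like this before constructing the query keeps column-name errors separate from modeling errors; whether PClean itself reports such mismatches is not covered here.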