Parallel #21

Open

SpeedyRagou wants to merge 69 commits into master

Commits (69)
6d6b38d
created abstract classes for detection and correction. added parallel…
SpeedyRagou Nov 9, 2023
5f2e499
fixed issues with detection
SpeedyRagou Nov 9, 2023
ae20554
updated correction example
SpeedyRagou Nov 9, 2023
8e67bb9
updated gitignore to include results folder of detection and correction
SpeedyRagou Nov 9, 2023
c7626e8
updated parallel correction and detection to mirror interface and fix…
SpeedyRagou Nov 9, 2023
b396674
fixed DetectionParallel signature to match interface
SpeedyRagou Nov 9, 2023
7e7f270
added parallel example
SpeedyRagou Nov 9, 2023
adcda92
optimized imports
SpeedyRagou Nov 9, 2023
deed560
fixed bugs
SpeedyRagou Nov 9, 2023
8c84c33
fixed bug at parallel detection
SpeedyRagou Nov 9, 2023
6f3d5e7
updated init
SpeedyRagou Nov 9, 2023
03e0474
fixed bug
SpeedyRagou Nov 9, 2023
e69bab3
added parallel to init
SpeedyRagou Nov 9, 2023
57840fb
fixed import issue
SpeedyRagou Nov 9, 2023
f20feaa
updated parallel detection example
SpeedyRagou Nov 10, 2023
36229b1
added try catch for shutting down client
SpeedyRagou Nov 10, 2023
d1a7d2e
matched signature for CorrectionParallel to abstract parent class
SpeedyRagou Nov 10, 2023
1b7901e
fixed bug
SpeedyRagou Nov 10, 2023
bd6d436
added comment
SpeedyRagou Nov 10, 2023
a09aa5b
updated parallel detection run to use dataset instead of dictionary
SpeedyRagou Nov 10, 2023
d58a47e
added new comment
SpeedyRagou Nov 10, 2023
b7bac5b
updated baselines.py to use sequential dataset
SpeedyRagou Nov 10, 2023
8c67ba5
fixed a few import
SpeedyRagou Nov 10, 2023
761ad42
updated utilities.py to work with sequential raha
SpeedyRagou Nov 10, 2023
8c9a3a9
updated pipeline_1 to work with sequential raha and baran
SpeedyRagou Nov 10, 2023
8226d75
updated pipeline_2 to work with sequential raha and baran
SpeedyRagou Nov 10, 2023
7602d2e
updated pipeline_3 to work with sequential raha and baran
SpeedyRagou Nov 10, 2023
75c53fe
updated benchmark to work with sequential raha and baran
SpeedyRagou Nov 10, 2023
e716da2
added stub for dataset interface
SpeedyRagou Nov 10, 2023
a259d4b
added dataset interface and made dataset_parallel signatures match th…
SpeedyRagou Nov 12, 2023
d5012d9
updated requirements to include dask
SpeedyRagou Nov 12, 2023
ddec072
updated README.md
SpeedyRagou Nov 13, 2023
d399314
updated interface
SpeedyRagou Nov 13, 2023
823e273
fixed arguments
SpeedyRagou Nov 13, 2023
a4a2edd
fixed dataframe
SpeedyRagou Nov 13, 2023
67547f1
updated requirements.txt
SpeedyRagou Nov 13, 2023
3aaa28b
updated detection interface with comments
SpeedyRagou Nov 13, 2023
c3f4897
renamed folders. made sure parallel detection mirrors original detection
SpeedyRagou Nov 15, 2023
bf99289
added reference for dask version
SpeedyRagou Nov 15, 2023
d5fd1b2
fixed import issues after changing folder structure and class name
SpeedyRagou Nov 15, 2023
f30cb3e
fixed import issues after changing folder structure and class name
SpeedyRagou Nov 15, 2023
ed15e12
updated imports
SpeedyRagou Nov 15, 2023
9947649
updated dask folder to dask_version to stop name collision
SpeedyRagou Nov 15, 2023
022ab8c
fixed imports
SpeedyRagou Nov 15, 2023
eb053f8
reshuffled files
SpeedyRagou Nov 15, 2023
c4490e3
updated setup to have optional install for dask version
SpeedyRagou Nov 15, 2023
9fe30f0
updated MANIFEST.in
SpeedyRagou Nov 15, 2023
327c45b
added verbose support
SpeedyRagou Nov 16, 2023
5c552f2
updated README.md
SpeedyRagou Nov 16, 2023
df93c1b
added verbose
SpeedyRagou Nov 16, 2023
6e26a7b
fixed double assignment
SpeedyRagou Nov 16, 2023
d88d776
fixed correct path
SpeedyRagou Nov 16, 2023
af0f76d
added comment
SpeedyRagou Nov 16, 2023
385c82d
windows fix for original version
SpeedyRagou Nov 23, 2023
7cd013e
fixed too long sharedmemory names
SpeedyRagou Nov 24, 2023
6b023aa
quick fix
SpeedyRagou Nov 24, 2023
962da0d
quick fix
SpeedyRagou Nov 24, 2023
b94cba5
added constant.py usage for detection.py; added version of pandas to …
SpeedyRagou Dec 1, 2023
31c684d
fixed pre-loading
SpeedyRagou Jan 8, 2024
43b979c
added result_storing for dask_version
SpeedyRagou Jan 8, 2024
253ffa2
fixed strategy filtering
SpeedyRagou Jan 8, 2024
8278258
removed flags
SpeedyRagou Jan 12, 2024
49bc6c9
fixed sequential raha executions
SpeedyRagou Jan 14, 2024
2f6b5e4
baran tiny fix
SpeedyRagou Jan 14, 2024
b73dc81
raha dask_version -> does not try to create new folder if folder alre…
SpeedyRagou Jan 16, 2024
61a08c0
removed idea files from git
SpeedyRagou Jan 16, 2024
f0a876c
updated gitignore
SpeedyRagou Jan 16, 2024
df2f176
tiny baran fix
SpeedyRagou Jan 18, 2024
ac7e0be
tiny baran fix
SpeedyRagou Jan 18, 2024
.gitignore: 4 changes (3 additions & 1 deletion)
@@ -152,9 +152,11 @@ dmypy.json
# Cython debug symbols
cython_debug/

+# Raha
+raha-baran-results-*
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/
MANIFEST.in: 9 changes (6 additions & 3 deletions)
@@ -1,3 +1,6 @@
-recursive-include raha/tools/dBoost *
-recursive-include raha/tools/KATARA *
-include raha/tools/KATARA/knowledge-base/*
+recursive-include raha/original/tools/dBoost *
+recursive-include raha/original/tools/KATARA *
+include raha/original/tools/KATARA/knowledge-base/*
+recursive-include raha/dask_version/tools/dBoost *
+recursive-include raha/dask_version/tools/KATARA *
+include raha/dask_version/tools/KATARA/knowledge-base/*
README.md: 17 changes (14 additions & 3 deletions)
@@ -9,22 +9,32 @@ To install Raha and Baran, you can run:
```console
pip3 install raha
```
+To install Raha and Baran with Dask, you can run:
+```console
+pip3 install raha[dask]
+```

To install Raha and Baran using the github repository:
```console
git clone [email protected]:BigDaMa/raha.git
pip3 install -e raha
```

+To install Raha and Baran with Dask using the github repository:
+```console
+git clone [email protected]:BigDaMa/raha.git
+pip3 install -e raha[dask]
+```

To uninstall them, you can run:
```console
pip3 uninstall raha
```

## Usage
Running Raha and Baran is simple!
-- **Benchmarking**: If you have a dirty dataset and its corresponding clean dataset and you want to benchmark Raha and Baran, please check the sample code in `raha/benchmark.py`, `raha/detection.py`, and `raha/correction.py`.
-- **Interactive data cleaning with Raha and Baran**: If you have a dirty dataset and you want to interactively detect and correct data errors, please check our interactive Jupyter notebooks in the `raha` folder. The Jupyter notebooks provide graphical user interfaces.
+- **Benchmarking**: If you have a dirty dataset and its corresponding clean dataset and you want to benchmark Raha and Baran, please check the sample code in `raha/original/benchmark.py`, `raha/original/detection.py`, and `raha/original/correction.py`.
+- **Interactive data cleaning with Raha and Baran**: If you have a dirty dataset and you want to interactively detect and correct data errors, please check our interactive Jupyter notebooks in the `raha/original` folder. The Jupyter notebooks provide graphical user interfaces.
![Data Annotation](pictures/ui.png)
![Promising Strategies](pictures/ui_strategies.png)
![Drill Down](pictures/ui_clusters.png)
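For orientation, here is a minimal benchmarking sketch. It is an assumption-laden illustration, not code from this PR: it presumes the entry points under `raha/original` keep the interface of the published `raha` package (a `Detection` class whose `run` method takes a dataset dictionary), and the dataset name and paths are placeholders.

```python
import raha

# Placeholder dataset dictionary; the name and paths are hypothetical.
dataset_dictionary = {
    "name": "flights",
    "path": "datasets/flights/dirty.csv",
    "clean_path": "datasets/flights/clean.csv"
}

# Run error detection; the result is a dictionary keyed by the
# (row index, column index) positions of the detected cells.
detector = raha.Detection()
detection_dictionary = detector.run(dataset_dictionary)
print("{} cells flagged as potentially erroneous.".format(len(detection_dictionary)))
```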
@@ -61,7 +71,8 @@ You can find more information about this project and the authors [here](https://
publisher={VLDB Endowment}
}
```

+### Dask Version
+The implementation for Raha and Baran with Dask was created by Yusuf Mandirali. The original code can be found [here](https://github.com/yimlyim/DaskRaha).

## A Note on the Naming
Raha and Baran are Persian feminine names that are conceptually related to their corresponding error detection/correction systems. Raha (which means "free" in Persian) is assigned to our "configuration-free" error detection system. Baran (which means "rain" in Persian and rain washes/cleans everything) is assigned to our error correction system that "cleans" data.
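The parallel entry point described in the Dask Version note can be sketched the same way. This is a hypothetical illustration: the `DetectionParallel` class name comes from the commit messages, and its `run` signature is assumed to mirror the sequential `Detection` interface, as the commits matching both classes to a shared abstract parent suggest.

```python
import raha

# Placeholder dataset dictionary, as in the sequential sketch above.
dataset_dictionary = {
    "name": "flights",
    "path": "datasets/flights/dirty.csv",
    "clean_path": "datasets/flights/clean.csv"
}

# Assumed interface: DetectionParallel.run() accepts the same dataset
# dictionary as Detection.run() and distributes the strategy execution
# with Dask instead of running it sequentially.
parallel_detector = raha.DetectionParallel()
detection_dictionary = parallel_detector.run(dataset_dictionary)
```

The commit adding a try/except around client shutdown (36229b1) suggests the class manages its own `dask.distributed` client internally, so local runs should not require separate cluster setup.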
raha/__init__.py: 29 changes (25 additions & 4 deletions)
@@ -1,8 +1,29 @@
from .dataset import *
from .detection import *
from .correction import *
from .baselines import *
from .utilities import *
from .constants import *
from .dataset import *
from .benchmark import *
from .tools.KATARA.katara import *
from .tools.dBoost.dboost.imported_dboost import *

from .original.utilities import *

from .original.tools.KATARA import *
from .original.tools.KATARA.katara import *
from .original.tools.dBoost import *
from .original.tools.dBoost.dboost import *
from .original.tools.dBoost.dboost.imported_dboost import *
from .original.dataset import *
from .original.detection import *
from .original.correction import *
from .original import *

from .dask_version import *
from .dask_version.dataset_parallel import *
from .dask_version.tools.KATARA import *
from .dask_version.tools.KATARA.katara import *
from .dask_version.tools.dBoost import *
from .dask_version.tools.dBoost.dboost import *
from .dask_version.tools.dBoost.dboost.imported_dboost import *
from .dask_version.container import *
from .dask_version.detection_parallel import *
from .dask_version.correction_parallel import *
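If these wildcard re-exports resolve the way they do in the published package, both implementations become reachable from the package root. A minimal sketch, assuming `Detection` and `DetectionParallel` are exported by `raha.original.detection` and `raha.dask_version.detection_parallel` respectively:

```python
import raha

# Choose an implementation at the package root; the wildcard imports
# above re-export both the sequential and the Dask-based classes.
sequential_detector = raha.Detection()        # assumed from raha.original.detection
parallel_detector = raha.DetectionParallel()  # assumed from raha.dask_version.detection_parallel
```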
raha/baselines.py: 14 changes (7 additions & 7 deletions)
@@ -107,7 +107,7 @@ def run_dboost(self, dd):
print("------------------------------------------------------------------------\n"
"------------------------------Running dBoost----------------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
sp_folder_path = os.path.join(os.path.dirname(dd["path"]), "raha-baran-results-" + d.name, "strategy-profiling")
strategy_profiles_list = [pickle.load(open(os.path.join(sp_folder_path, strategy_file), "rb"))
for strategy_file in os.listdir(sp_folder_path)]
@@ -135,7 +135,7 @@ def run_nadeef(self, dd):
print("------------------------------------------------------------------------\n"
"------------------------------Running NADEEF----------------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
detection_dictionary = {}
for fd in self.DATASET_CONSTRAINTS[d.name]["functions"]:
l_attribute, r_attribute = fd
@@ -171,7 +171,7 @@ def run_katara(self, dd):
print("------------------------------------------------------------------------\n"
"------------------------------Running KATARA----------------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
sp_folder_path = os.path.join(os.path.dirname(dd["path"]), "raha-baran-results-" + d.name, "strategy-profiling")
strategy_profiles_list = [pickle.load(open(os.path.join(sp_folder_path, strategy_file), "rb"))
for strategy_file in os.listdir(sp_folder_path)]
@@ -190,7 +190,7 @@ def run_activeclean(self, dd, sampling_budget=20):
print("------------------------------------------------------------------------\n"
"----------------------------Running ActiveClean-------------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
actual_errors_dictionary = d.get_actual_errors_dictionary()
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=1, stop_words="english")
text = [" ".join(row) for row in d.dataframe.values.tolist()]
@@ -238,7 +238,7 @@ def run_min_k(self, dd):
print("------------------------------------------------------------------------\n"
"------------------------------Running Min-k-----------------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
sp_folder_path = os.path.join(os.path.dirname(dd["path"]), "raha-baran-results-" + d.name, "strategy-profiling")
strategy_profiles_list = [pickle.load(open(os.path.join(sp_folder_path, strategy_file), "rb"))
for strategy_file in os.listdir(sp_folder_path)]
@@ -272,7 +272,7 @@ def run_maximum_entropy(self, dd, sampling_budget=20):
print("------------------------------------------------------------------------\n"
"--------------------------Running Maximum Entropy-----------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
actual_errors_dictionary = d.get_actual_errors_dictionary()
sp_folder_path = os.path.join(os.path.dirname(dd["path"]), "raha-baran-results-" + d.name, "strategy-profiling")
strategy_profiles_list = [pickle.load(open(os.path.join(sp_folder_path, strategy_file), "rb"))
@@ -306,7 +306,7 @@ def run_metadata_driven(self, dd, sampling_budget=20):
print("------------------------------------------------------------------------\n"
"--------------------------Running Metadata Driven-----------------------\n"
"------------------------------------------------------------------------")
-d = raha.dataset.Dataset(dd)
+d = raha.original.dataset.Dataset(dd)
actual_errors_dictionary = d.get_actual_errors_dictionary()
dboost_output = self.run_dboost(dd)
nadeef_output = self.run_nadeef(dd)