Summarize2 generates an interactive HTML report highlighting key differences between two datasets. It has:
- Command-line interface
- User-defined data type handling for when the defaults fail (for example, age interpreted as a continuous variable)
- Intelligent guessing of time and date formats. It will also try to guess whether your dataset has any breaks in the time series.
- Graphical representation of the five distributions that differ most between the two datasets, across all combinations in any user-defined data slice that both datasets share.
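
To illustrate what "most different distributions" can mean, here is a minimal sketch that ranks shared numeric columns by the two-sample Kolmogorov-Smirnov statistic. This is an assumption for illustration only: the README does not say which distance measure Summarize2 uses, and the function name `rank_numeric_differences` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import ks_2samp


def rank_numeric_differences(left, right, top_n=5):
    """Rank shared numeric columns by KS statistic, largest first.

    Hypothetical helper, not part of Summarize2.
    """
    shared = [
        col for col in left.columns
        if col in right.columns
        and pd.api.types.is_numeric_dtype(left[col])
        and pd.api.types.is_numeric_dtype(right[col])
    ]
    scores = {}
    for col in shared:
        a = left[col].dropna()
        b = right[col].dropna()
        if a.empty or b.empty:
            continue  # nothing to compare for this column
        # The KS statistic is the largest gap between the two empirical CDFs.
        scores[col] = ks_2samp(a, b).statistic
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy data invented for this example.
original = pd.DataFrame({"age": [21, 35, 42, 58, 63], "income": [10, 20, 30, 40, 50]})
synthetic = pd.DataFrame({"age": [22, 34, 43, 57, 64], "income": [60, 70, 80, 90, 100]})

ranking = rank_numeric_differences(original, synthetic)
for column, statistic in ranking:
    print(f"{column}: KS statistic = {statistic:.2f}")
```

In this toy case `income` ranks first because its two samples do not overlap at all, while `age` differs only slightly. Categorical columns would need a different measure.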
Clone or download the repository, open the folder in your terminal and run `pip install .` (note the trailing dot). If you don't have all the dependencies already, run `pip install -r requirements.txt .` instead. To install an editable version of the tool, add the `-e` flag to pip: `pip install -e .`
Included in the repo are two pairs of sample datasets for comparison. The first pairs an original dataset with a test modelling dataset generated from it using the synthpop R package. The second is a basic example of manually tweaked data that "engineers" some of the key differences, such as the number of NAs or duplicates.
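
As a rough idea of the kind of differences the tweaked sample is meant to expose, the sketch below counts missing values and duplicate rows in two small DataFrames with pandas. The data and the `quality_counts` helper are invented for this example; they are not the files shipped in the repo, nor the way Summarize2 computes its report.

```python
import pandas as pd


def quality_counts(df):
    """Count rows, missing cells and fully duplicated rows (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data invented for this example: the tweaked copy repeats one row
# and blanks out one value.
original = pd.DataFrame({"id": [1, 2, 3, 4], "score": [0.5, 0.7, 0.9, 0.4]})
tweaked = pd.DataFrame({"id": [1, 2, 2, 4], "score": [0.5, 0.7, 0.7, None]})

# Side-by-side view: one column per dataset, one row per metric.
comparison = pd.DataFrame({
    "original": quality_counts(original),
    "tweaked": quality_counts(tweaked),
})
print(comparison)
```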
Summarize2 has the following Python dependencies:
- Pandas (with xlrd for Excel files)
- Bokeh
- Jinja2
- SciPy