⚡ Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io

ghjklw/soda-core

This branch is 1 commit ahead of sodadata/soda-core:main.

Soda Core

Data quality testing for SQL-, Spark-, and Pandas-accessible data.

License: Apache 2.0 Slack


Important

🚀 We're hiring! Are you passionate about open-source and love working on projects like Soda Core? Join our team as a Software Engineer and help shape the future of data quality tools. Apply now!


✔ An open-source CLI tool and Python library for data quality testing
✔ Compatible with the Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipelines and development workflows
✔ Integrates with data pipelines so that you can run a Soda scan in a pipeline, or run programmatic scans on a time-based schedule

Soda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries.

When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.
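To make "aggregated SQL queries" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module. The table contents and the SQL text are illustrative assumptions, not the queries Soda Core actually generates. The point is only that the metrics behind several checks on one dataset can be computed by a single aggregate query.

```python
import sqlite3

# Hypothetical in-memory stand-in for the dim_customer dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER, birth_date TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "1980-01-01"), (2, None), (3, "1992-07-15")],
)

# One aggregate query computes the metrics behind two checks:
#   row_count between 10 and 1000
#   missing_count(birth_date) = 0
row_count, missing_count = conn.execute(
    """
    SELECT COUNT(*),
           COUNT(CASE WHEN birth_date IS NULL THEN 1 END)
    FROM dim_customer
    """
).fetchone()

# Compare each metric against the threshold stated in the check.
results = {
    "row_count between 10 and 1000": 10 <= row_count <= 1000,
    "missing_count(birth_date) = 0": missing_count == 0,
}
for check, passed in results.items():
    print(f"{check}: {'pass' if passed else 'fail'}")
```

With the three sample rows, both checks fail: the table has fewer than 10 rows and one birth_date is missing.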

Soda Library

Consider migrating to Soda Library, an extension of Soda Core that offers more features and functionality, and enables you to connect to a Soda Cloud account to collaborate with your team on data quality.

Install Soda Library and get started with a 45-day free trial.


Get started

Soda Core currently supports connections to several data sources. See Compatibility for a complete list.

Requirements

  • Python 3.8 or greater
  • Pip 21.0 or greater

Install and run

  1. To get started, use the install command, replacing soda-core-postgres with the package that matches your data source. See Install Soda Core for a complete list.

    pip install soda-core-postgres
  2. Prepare a configuration.yml file to connect to your data source. Then, write data quality checks in a checks.yml file. See Configure Soda Core.

  3. Run a scan to see which checks passed, failed, or warned. See Run a Soda Core scan.

    soda scan -d your_datasource -c configuration.yml checks.yml
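If you want to trigger the same scan from a Python script, for example in a scheduled job, one option is to assemble the command shown above and hand it to subprocess. This is a sketch under stated assumptions: build_scan_command and run_scan are hypothetical helper names, and the sketch treats the process exit code only as "zero or non-zero" without interpreting specific values.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_scan_command(
    data_source: str,
    configuration: str,
    checks_files: Sequence[str],
) -> List[str]:
    """Build the argv list for the `soda scan` command from the steps above."""
    return ["soda", "scan", "-d", data_source, "-c", configuration, *checks_files]


def run_scan(command: List[str]) -> int:
    """Run the scan and return the process exit code (requires Soda Core installed)."""
    return subprocess.run(command, check=False).returncode


command = build_scan_command("your_datasource", "configuration.yml", ["checks.yml"])
# Print the command first so that it can be reviewed before anything is executed.
print(shlex.join(command))
```

Call run_scan(command) only in an environment where the soda executable and your configuration.yml are available.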

Example checks

# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0

# Checks for schema changes
checks for dim_product:
  - schema:
      name: Find forbidden, missing, or wrong type
      warn:
        when required column missing: [dealer_price, list_price]
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column index:
          model_name: 22
# Check for freshness 
  - freshness(start_date) < 1d

# Check for referential integrity
checks for dim_department_group:
  - values in (department_group_name) must exist in dim_employee (department_name)
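As a rough illustration of what a check such as freshness(start_date) < 1d measures, the sketch below computes the age of the most recent timestamp and compares it to a one-day threshold. The freshness function, the sample timestamps, and the fixed "now" value are assumptions made for this example. Soda Core evaluates freshness against the data source itself, so treat this only as a model of the idea.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def freshness(timestamps: Iterable[datetime], now: datetime) -> timedelta:
    """Return the age of the most recent timestamp relative to `now`."""
    return now - max(timestamps)


# Hypothetical scan time and start_date values.
now = datetime(2025, 2, 24, 12, 0, tzinfo=timezone.utc)
start_dates = [
    datetime(2025, 2, 22, 9, 0, tzinfo=timezone.utc),
    datetime(2025, 2, 24, 6, 30, tzinfo=timezone.utc),
]

age = freshness(start_dates, now)
# The check passes when the newest row is less than one day old.
passed = age < timedelta(days=1)
print(f"freshness(start_date) = {age}, check passed: {passed}")
```

Here the newest row is five and a half hours old, so the one-day threshold is met.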

Documentation
