This code assists in building CSV-schema-validating workflows. It validates CSVs against the source of truth requirements used in the Hydration and Rollout Manager (HRM) workflow. Pydantic does the heavy lifting of schema validation, and the module provides a few things out of the box:
- An opinionated Pydantic model structure to define and validate the schema of arbitrary user-defined CSV data AND HRM-required data
- A CLI workflow (for CI/CD and local dev)
- Ability to load external, user-provided schema models and check an arbitrary CSV file against them
- Optional coercion of data into a prescriptive, normalized serialization, dumped as an output file
To write models that integrate with this module and CLI, you should (ideally) have some experience with Python 3.12. You will also need to be comfortable writing Pydantic models; become familiar with the docs – start here and then read the concepts.
- Python 3.12+
- Pydantic
Ensure you're using Python 3.12.
The software (in src/csv_validator) is a module that may be installed using setuptools. To install the CLI into a virtual environment, do the following:
# create virtualenv
python3 -m venv .
source bin/activate
# install requirements
python3 -m pip install -r requirements.txt
# alternatively - from root, install directly (if not developing further)
python3 -m pip install .
# then invoke the CLI one of two ways:
validate_csv --help
python3 -m csv_validator --help
Note: If you are developing this further, see Development below.
See the Dockerfile for installing this module into a container.
The main library code is in the csv_validator module. The model.py file contains a basic BaseCluster model from which all models may (or rather should) subclass. It does a few things for free (a rough sketch of the base model follows this list):
- It checks required fields, such as cluster_name, cluster_group, and cluster_tags
- It provides (very) basic definitions of valid cluster groups and tags, which will almost certainly need to be extended, since the base set is only an example list of cluster groups
- For an example of how to extend BaseCluster via subclassing, see the example model models/example_model.py, class NewBase, line 82.
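For orientation, here is a rough sketch of the shape BaseCluster takes. The exact field types, constraints, and enum members live in csv_validator/model.py (for example, whether cluster_tags accepts a single tag or a list of tags), so treat this as an illustration rather than the real implementation:

import enum
import pydantic

class ValidClusterGroups(enum.StrEnum):
    """Stubbed example set of valid cluster group names."""
    prod_us = 'prod-us'
    nonprod_us = 'nonprod-us'

class ValidTags(enum.StrEnum):
    """Stubbed example set of valid cluster tags."""
    Corp = 'corp'
    DriveThru = 'drivethru'

class BaseCluster(pydantic.BaseModel):
    """Required columns for every source of truth CSV row."""
    cluster_name: str = pydantic.Field(min_length=1)  # must not be empty
    cluster_group: ValidClusterGroups                 # must be a known group
    cluster_tags: ValidTags                           # must be a known tag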
There are example models included that demonstrate how to consume and extend this tool. Each of these models has a set of both valid and invalid CSVs which may be used to exercise the model. The CSVs live in sources_of_truth. As an example, to validate a valid "cluster registry" CSV and to see how the tool behaves with an invalid CSV, run the following:
Valid:
$ validate_csv -v -m models/cluster_registry.py sources_of_truth/cluster_registry/cluster_reg_sot_valid.csv
INFO CSV is valid
Invalid:
$ validate_csv -v -m models/cluster_registry.py sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'platform_repository_sync_interval' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'platform_repository_branch' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
WARNING Optional field 'workload_repository_branch' is not present in sources_of_truth/cluster_registry/cluster_reg_sot_invalid.csv
ERROR Line 2, cluster US75911CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'prod-foobar'
ERROR Line 3, cluster US41273, column 'workload_repository_sync_interval', error: 'String should match pattern '^[0-9]*[hms]$'', received '300foo'
ERROR Line 4, cluster US64150CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'not-real'
ERROR Line 5, cluster US21646CLS01 , column 'cluster_tags', error: 'Input should be '24/7', 'corp', 'drivethru', 'drivethruduallane' or 'donotupgrade'', received 'ThisIsInvalid'
ERROR Line 8, cluster name unknown, column 'cluster_name', error: 'String should have at least 1 character', received ''
ERROR Line 13, cluster AU98342CLS01, column 'cluster_group', error: 'Input should be 'prod-us', 'nonprod-us', 'prod-au' or 'nonprod-au'', received 'prod-nz'
ERROR Line 15, cluster AU73291CLS01, column 'cluster_tags', error: 'Input should be '24/7', 'corp', 'drivethru', 'drivethruduallane' or 'donotupgrade'', received 'ThisTagIsInvalid'
Each of the models has a source of truth counterpart including both valid and invalid data for testing and demonstration purposes.
For a functional example of extending the base, see the example model: models/example_model.py, class NewBase.
The base model, class BaseCluster, provides a reference implementation of a Pydantic model for the required columns in any source of truth CSV file:
- cluster_name
- cluster_group
- cluster_tags
The base model's valid groups and tags are stubbed and incomplete. To use them fully, an engineer must perform one of the following steps:
- Update the ValidClusterGroups and ValidTags enumerations in the csv_validator module in place, or
- Subclass BaseCluster, implementing new enumerations that fully articulate all the possible valid tags and cluster groups. An example of this is demonstrated in models/example_model.py, and a sketch of the approach follows this list.
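As an illustration of the second option, below is a minimal sketch of subclassing BaseCluster with new enumerations. The names MyClusterGroups and MyTags and the extra members are hypothetical, and the exact wiring used by NewBase in models/example_model.py may differ:

import enum
from csv_validator.model import BaseCluster

class MyClusterGroups(enum.StrEnum):
    """Fully articulated cluster groups for this deployment (hypothetical)."""
    prod_us = 'prod-us'
    nonprod_us = 'nonprod-us'
    prod_ca = 'prod-ca'
    nonprod_ca = 'nonprod-ca'

class MyTags(enum.StrEnum):
    """Fully articulated tags for this deployment (hypothetical)."""
    Corp = 'corp'
    DriveThru = 'drivethru'
    Franchise = 'franchise'

class NewBase(BaseCluster):
    """Swap the stubbed enumerations out for the fully articulated ones."""
    cluster_group: MyClusterGroups
    cluster_tags: MyTags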
There are numerous examples of how one may write their own models:
- models/example_model.py shows how to create a new BaseCluster by subclassing it to customize valid groups and tags, then uses it to validate a number of example CSV columns each with unique properties
- models/cluster_registry.py, which subclasses and is based on the BaseCluster model
- models/platform.py, which subclasses and is based on the BaseCluster model
This CSV validator is a Python module providing a CLI, a workflow, and components to import and use in model development. As the examples show, you may import from, extend, and update this library. The goal was (and is) to enable users to write extensible, flexible, consumable Python code, so use it in the way that best suits your needs. To get the most out of the app, familiarity with Python is greatly beneficial.
Note: this presumes you have followed the installation instructions above and have the CLI (in the csv_validator module) installed into your development environment.
- Start by updating csv_validator/model.py. The BaseCluster class refers to enumerations ValidClusterGroups and ValidTags. These are stubbed. Enumerate (quite literally) the values that are valid for your use case:
class ValidClusterGroups(enum.StrEnum):
""" Contains all values considered to be valid cluster group names """
prod_us = 'prod-us'
nonprod_us = 'nonprod-us'
prod_au = 'prod-au'
nonprod_au = 'nonprod-au'
# adding groups
prod_ca = 'prod-ca'
prod_mx = 'prod-mx'
prod_uk = 'prod-uk'
...
Do the same for tags:
class ValidTags(enum.StrEnum):
""" Contains all values considered to be valid tags """
TwentyFourSeven = '24/7'
DriveThru = 'drivethru'
Corp = 'corp'
DoNotUpgrade = 'donotupgrade'
# adding tags
Franchise = 'franchise'
...
- Create a model Python file - in this example, custom.py. Import BaseCluster and pydantic:
import pydantic
from csv_validator.model import BaseCluster
- Create a custom model. Be sure it uses the class name SourceOfTruthModel and that it subclasses BaseCluster:
class SourceOfTruthModel(BaseCluster):
my_field: str
This creates a new required column called my_field with type str.
Validating schema models are simply Pydantic models. There is no magic: a model just needs to be valid Pydantic usage in Python code. For more information, refer to the Pydantic documentation on models.
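For illustration only, here is a richer variant of the model above, showing an optional column and a constrained column. The extra field names and the regex are borrowed from the cluster registry example output earlier; because these fields carry defaults, the custom.csv example below should still validate:

import pydantic
from csv_validator.model import BaseCluster

class SourceOfTruthModel(BaseCluster):
    """Custom columns layered on top of the required BaseCluster columns."""
    my_field: str
    # Optional column: per the example output above, the CLI warns rather than
    # errors when an optional column is missing from the CSV.
    workload_repository_branch: str | None = None
    # Constrained column: values must look like '300s', '5m', '1h', and so on.
    workload_repository_sync_interval: str = pydantic.Field(
        default='300s', pattern=r'^[0-9]*[hms]$'
    )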
- Provide a CSV that provides this column while adhering to the base BaseCluster model. Remember, we're using inheritance here to combine BaseCluster with your custom SourceOfTruthModel. Let's use custom.csv as the name:
cluster_name,cluster_group,cluster_tags,my_field
my-cluster,prod-us,"corp",this is my field
- Run the CLI, importing the model (custom.py), against the CSV:
validate_csv -m path/to/custom.py path/to/custom.csv
It should exit 0 if okay - you can run with -v for extra verbosity to verify!
$ validate_csv -m path/to/custom.py path/to/custom.csv
INFO CSV is valid
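The CLI is the intended entry point, but because the models are plain Pydantic classes you can also validate rows programmatically. A minimal sketch, assuming the custom.py model above is importable and using only the standard library csv module (this loop is not part of the csv_validator API):

import csv
import pydantic
from custom import SourceOfTruthModel  # the model written above

with open('path/to/custom.csv', newline='') as handle:
    # The header is line 1, so data rows start at line 2 (matching the CLI output).
    for line_number, row in enumerate(csv.DictReader(handle), start=2):
        try:
            SourceOfTruthModel(**row)
        except pydantic.ValidationError as exc:
            print(f'Line {line_number}: {exc}')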
Install the module in editable mode with dev dependencies:
pip install -e .[dev]
Need to wipe your virtualenv to install from scratch?
# ensure you're in the virtualenv
source bin/activate
pip uninstall -y csv_validator
pip uninstall -y -r <(pip freeze)
pip install -e .[dev]
Pylint and mypy checks are expected to pass:
pylint src
mypy src
Run unit tests from the repository root.
Note: Ensure tests are run with csv_validator installed in the current Python environment/virtualenv. See the installation section above.
python3 -m unittest tests/*.py -v
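Custom models can be covered the same way. A minimal sketch of such a test, using a hypothetical tests/test_custom_model.py and the SourceOfTruthModel from the walkthrough above (the existing tests in tests/ may be organized differently):

import unittest
import pydantic
from custom import SourceOfTruthModel

class TestSourceOfTruthModel(unittest.TestCase):
    def test_valid_row_parses(self):
        # Mirrors the valid custom.csv row from the walkthrough above.
        row = {
            'cluster_name': 'my-cluster',
            'cluster_group': 'prod-us',
            'cluster_tags': 'corp',
            'my_field': 'this is my field',
        }
        self.assertEqual(SourceOfTruthModel(**row).my_field, 'this is my field')

    def test_missing_custom_column_rejected(self):
        # my_field is required, so omitting it should raise a validation error.
        row = {
            'cluster_name': 'my-cluster',
            'cluster_group': 'prod-us',
            'cluster_tags': 'corp',
        }
        with self.assertRaises(pydantic.ValidationError):
            SourceOfTruthModel(**row)

if __name__ == '__main__':
    unittest.main()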
Build the container:
docker build --pull --no-cache -t csv-validator .
Test the container:
$ docker run -it csv-validator --help
usage: validate_csv [-h] [-m MODULE_OR_PYTHON_FILE] [-o output_source_of_truth.csv] [-v] SOURCE_OF_TRUTH.CSV
Validate source of truth CSV schemas and data using built-in and dynamically-imported validation models
...
An error similar to the following may occur while executing the command python3 -m pip install if you have multiple python versions installed.
ERROR: Package hydrate requires a different Python: 3.11.9 not in >=3.12
If so, execute the following command to point python3 to the correct version.
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1