
CI / Telemetry: Investigate how to add logic in the CI to track the quality of synthetic data generation #516

Open
courtneypacheco opened this issue Jan 28, 2025 · 2 comments
Labels
CI/CD (Affects CI/CD configuration) · enhancement (New feature or request)

Comments

@courtneypacheco (Contributor)

Background / Context

We currently do not capture or assess the quality of our synthetic data generation (SDG). As a result, we do not know whether the quality of the generated output has improved, regressed, or stayed the same over time, and nothing currently indicates whether a PR's proposed changes could impact quality.

Desired Outcomes

Going forward, we would like to increase our confidence in making more sweeping changes to SDG without manual testing and inspection of outputs.

This issue is fairly open-ended, but at a bare minimum, we need to:

  • Capture model quality
  • Find a place to store that quality information
  • Report the quality in an easily digestible format ("easily digestible" is subjective, though, so best to verify with the maintainers)

Capturing model quality will require some investigation on the assignee's part and will likely involve direct collaboration with the SDG maintainers. The same goes for reporting the quality: specifically, when, how, and where should it be reported?
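
For illustration only, here is a minimal sketch of what the capture → store → report flow could look like as a CI step. It assumes the generated dataset is a JSONL file with an `output` field; the file paths, metric choices, and baseline threshold are placeholders to be settled with the SDG maintainers, not an existing interface.

```python
# Hypothetical sketch: compute SDG quality metrics, store them as a report,
# and fail the job on a regression. Paths, fields, and thresholds are assumptions.
import json
import statistics
from pathlib import Path


def compute_metrics(samples: list[dict]) -> dict:
    """Summarize a generated dataset with a few simple quality signals."""
    lengths = [len(s.get("output", "")) for s in samples]
    return {
        "num_samples": len(samples),
        "mean_output_chars": statistics.mean(lengths) if lengths else 0,
        "stdev_output_chars": statistics.pstdev(lengths) if lengths else 0,
    }


def main() -> int:
    # Assumed layout: an earlier CI step wrote the generated samples as JSONL.
    samples = [
        json.loads(line)
        for line in Path("generated/samples.jsonl").read_text().splitlines()
        if line.strip()
    ]
    metrics = compute_metrics(samples)

    # "Store": write a report the CI job can upload as an artifact.
    Path("sdg_quality_report.json").write_text(json.dumps(metrics, indent=2))

    # "Report": compare against a checked-in baseline so a regression fails the job.
    baseline = json.loads(Path("ci/sdg_quality_baseline.json").read_text())
    if metrics["num_samples"] < 0.9 * baseline["num_samples"]:
        print("FAIL: generated sample count dropped more than 10% below baseline")
        return 1
    print("OK: metrics within the expected range of the baseline")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

The report file could then be uploaded as a CI artifact (or pushed somewhere longer-lived) so quality can be tracked across runs rather than only within a single PR.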

@courtneypacheco added the enhancement (New feature or request) and CI/CD (Affects CI/CD configuration) labels on Jan 28, 2025
@courtneypacheco (Contributor, Author)

@bbrowning let me know if I missed anything here!

@bbrowning (Contributor)

The only thing I'd add is it may not just be the model quality we care about, as we only see model quality after training. Our primary output is a generated dataset, and ideally we'd have some way to capture the quality of that dataset. This could be things like ensuring the number, length, and overall distribution of data samples matches some expectations. Or, it could be things like ensuring the generated samples are relevant to the topic, don't have extraneous characters or garbled text, and so on. I don't know exactly what we'd want to verify, but tracking the quality of our generated data over time on some known set of inputs would be a useful indicator.
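
To make that concrete, here is a hypothetical sketch of a couple of content-level checks along those lines (sample counts, garbled-text detection, and a crude topic-relevance heuristic). The sample schema, thresholds, and keyword-overlap approach are assumptions, not something SDG provides today.

```python
# Hypothetical dataset-quality checks, assuming each sample is a dict with an
# "output" string. Thresholds and heuristics are placeholders.
import re


def has_garbled_text(text: str, max_bad_ratio: float = 0.01) -> bool:
    """Flag outputs with a suspicious share of non-printable/control characters."""
    if not text:
        return True
    bad = sum(1 for ch in text if not ch.isprintable() and ch not in "\n\t")
    return bad / len(text) > max_bad_ratio


def is_on_topic(text: str, topic_keywords: set[str], min_overlap: int = 1) -> bool:
    """Crude relevance heuristic: require some word overlap with the seed topic."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & topic_keywords) >= min_overlap


def check_dataset(samples: list[dict], topic_keywords: set[str]) -> dict:
    """Roll per-sample checks up into dataset-level fractions CI could threshold on."""
    outputs = [s.get("output", "") for s in samples]
    n = max(len(outputs), 1)
    return {
        "num_samples": len(outputs),
        "garbled_fraction": sum(has_garbled_text(o) for o in outputs) / n,
        "off_topic_fraction": sum(not is_on_topic(o, topic_keywords) for o in outputs) / n,
    }
```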
