
CI / Telemetry: Investigate how to add logic in the CI to track the quality of synthetic data generation #516

Open
courtneypacheco opened this issue Jan 28, 2025 · 2 comments
Labels
CI/CD (Affects CI/CD configuration) · enhancement (New feature or request)

Comments

@courtneypacheco (Contributor)

Background / Context

We currently do not capture or assess the quality of our synthetic data generation (SDG). As a result, we do not know whether the quality of the generated output has improved, regressed, or stayed the same over time, and nothing currently indicates whether a PR's proposed changes could impact quality.

Desired Outcomes

Going forward, we would like to increase our confidence in making more sweeping changes to SDG without manual testing and inspection of outputs.

This issue is fairly open-ended, but at a bare minimum, we need to:

  • Capture model quality
  • Find a place to store that quality information
  • Report the quality in an easily digestible format ("easily digestible" is subjective, though, so best to verify with the maintainers)

Capturing model quality will require some investigation on the assignee's part and will likely involve direct collaboration with the SDG maintainers. The same goes for reporting the quality: specifically, when, how, and where should it be reported?
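
For illustration only, here is a minimal sketch of what the capture → store → report flow could look like as a CI step. It assumes the generated dataset is a JSONL file with an `output` field; the file paths, metric choices, and baseline threshold are placeholders to be settled with the SDG maintainers, not an existing interface.

```python
# Hypothetical sketch: compute SDG quality metrics, store them as a report,
# and fail the job on a regression. Paths, fields, and thresholds are assumptions.
import json
import statistics
from pathlib import Path


def compute_metrics(samples: list[dict]) -> dict:
    """Summarize a generated dataset with a few simple quality signals."""
    lengths = [len(s.get("output", "")) for s in samples]
    return {
        "num_samples": len(samples),
        "mean_output_chars": statistics.mean(lengths) if lengths else 0,
        "stdev_output_chars": statistics.pstdev(lengths) if lengths else 0,
    }


def main() -> int:
    # Assumed layout: an earlier CI step wrote the generated samples as JSONL.
    samples = [
        json.loads(line)
        for line in Path("generated/samples.jsonl").read_text().splitlines()
        if line.strip()
    ]
    metrics = compute_metrics(samples)

    # "Store": write a report the CI job can upload as an artifact.
    Path("sdg_quality_report.json").write_text(json.dumps(metrics, indent=2))

    # "Report": compare against a checked-in baseline so a regression fails the job.
    baseline = json.loads(Path("ci/sdg_quality_baseline.json").read_text())
    if metrics["num_samples"] < 0.9 * baseline["num_samples"]:
        print("FAIL: generated sample count dropped more than 10% below baseline")
        return 1
    print("OK: metrics within the expected range of the baseline")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

The report file could then be uploaded as a CI artifact (or pushed somewhere longer-lived) so quality can be tracked across runs rather than only within a single PR.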

@courtneypacheco added the enhancement (New feature or request) and CI/CD (Affects CI/CD configuration) labels on Jan 28, 2025
@courtneypacheco (Contributor, Author)

@bbrowning let me know if I missed anything here!

@bbrowning (Contributor)

The only thing I'd add is it may not just be the model quality we care about, as we only see model quality after training. Our primary output is a generated dataset, and ideally we'd have some way to capture the quality of that dataset. This could be things like ensuring the number, length, and overall distribution of data samples matches some expectations. Or, it could be things like ensuring the generated samples are relevant to the topic, don't have extraneous characters or garbled text, and so on. I don't know exactly what we'd want to verify, but tracking the quality of our generated data over time on some known set of inputs would be a useful indicator.
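
To make that concrete, here is a hypothetical sketch of a couple of content-level checks along those lines (sample counts, garbled-text detection, and a crude topic-relevance heuristic). The sample schema, thresholds, and keyword-overlap approach are assumptions, not something SDG provides today.

```python
# Hypothetical dataset-quality checks, assuming each sample is a dict with an
# "output" string. Thresholds and heuristics are placeholders.
import re


def has_garbled_text(text: str, max_bad_ratio: float = 0.01) -> bool:
    """Flag outputs with a suspicious share of non-printable/control characters."""
    if not text:
        return True
    bad = sum(1 for ch in text if not ch.isprintable() and ch not in "\n\t")
    return bad / len(text) > max_bad_ratio


def is_on_topic(text: str, topic_keywords: set[str], min_overlap: int = 1) -> bool:
    """Crude relevance heuristic: require some word overlap with the seed topic."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & topic_keywords) >= min_overlap


def check_dataset(samples: list[dict], topic_keywords: set[str]) -> dict:
    """Roll per-sample checks up into dataset-level fractions CI could threshold on."""
    outputs = [s.get("output", "") for s in samples]
    n = max(len(outputs), 1)
    return {
        "num_samples": len(outputs),
        "garbled_fraction": sum(has_garbled_text(o) for o in outputs) / n,
        "off_topic_fraction": sum(not is_on_topic(o, topic_keywords) for o in outputs) / n,
    }
```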
