feat(hallucination-tracker): populate enriched clustering fields during ingestion #162
…ng ingestion (closes #157)

Builds on PR #161 (issue #156) which added the M365 Product Feedback CSV importer. This PR extends ingestion to populate clustering fields so the analyzer has everything it needs for per-cluster aggregation without additional manual prep.

Clustering strategy:
- Structured keys plus deterministic hashes over app, feature, channel, category, and normalized feedback text
- Deterministic: identical inputs get identical cluster labels
- Graceful: malformed input falls back to per-record cluster (no crash)

Changes:
- hallucination-tracker/scripts/import_product_feedback_csv.py: extended to populate clustering fields per ingested record
- hallucination-tracker/tests/test_import_product_feedback_csv.py: 5 new regression tests
- hallucination-tracker/README.md and hallucination-tracker/docs/*.md: documented clustering fields + pipeline

Fields populated (per record):
- fsi_topicname
- fsi_topicid
- fsi_channelid
- fsi_feedbackcomment
- fsi_reportedat
- fsi_conversationid

Validation:
- pytest: pass
- ruff (scope): clean
- Manual smoke: importer + analyzer pipeline runs end-to-end on fixture

Closes #157

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
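For readers who want a concrete picture of the clustering strategy described in the commit message, here is a minimal sketch. It is not the importer's actual code: the helper names (`normalize`, `slug`, `digest_of`, `cluster_id`), the row keys, the SHA-256 choice, and the 12-character digest length are all assumptions made for illustration. Only the `m365pf-…-record-<digest>` shape and the `record-<hash>` fallback come from this PR.

```python
import hashlib
import re

# Hypothetical column names; the real CSV headers may differ.
FIELDS = ("app", "feature", "channel", "category")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different inputs match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def slug(text: str) -> str:
    """Reduce a value to a lowercase alphanumeric key ('unknown' when empty)."""
    return re.sub(r"[^a-z0-9]+", "", normalize(text)) or "unknown"


def digest_of(payload: str) -> str:
    """Short, stable hex digest (length chosen arbitrarily for this sketch)."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def cluster_id(row: dict) -> str:
    """Structured keys plus a deterministic hash; per-record fallback on bad input."""
    try:
        keys = [slug(row[name]) for name in FIELDS]
        comment = normalize(row["comment"])
        # Hash all five normalized components, separated by a unit separator.
        payload = "\x1f".join(keys + [comment])
        return f"m365pf-{'-'.join(keys)}-record-{digest_of(payload)}"
    except (KeyError, TypeError):
        # Missing column or non-string value: isolate the record, do not crash.
        return f"record-{digest_of(repr(sorted(row.items(), key=str)))}"
```

Because every component is normalized before hashing, rows that differ only in case or whitespace land in the same cluster, while a row with a missing column gets its own `record-…` label.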
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 944603c1f2
        f"m365pf-{app_key}-{feature_key}-{channel_key}-"
        f"{category_key}-record-{digest}"
    )
    return limit_length(cluster_id, MAX_TOPIC_ID_LENGTH)
Preserve the cluster hash when trimming topic IDs
When the app/feature/channel/category components are long enough to push the generated m365pf-* value over 200 characters, this final truncation drops the trailing digest that is meant to make each normalized feedback signal unique. In those CSVs, two distinct comments that share the same leading signal words can be assigned the same fsi_topicid, so the analyzer will over-group unrelated Product Feedback rows; shorten earlier components or reserve space for the hash before enforcing the Dataverse length limit.
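One way to act on this suggestion is to reserve room for the digest suffix first and trim only the structured prefix. The sketch below illustrates that idea under stated assumptions: `limit_preserving_digest` is a hypothetical helper, not the PR's `limit_length`, and the value 200 for `MAX_TOPIC_ID_LENGTH` is inferred from the comment above rather than read from the code.

```python
MAX_TOPIC_ID_LENGTH = 200  # assumed from the 200-character limit cited in the review


def limit_preserving_digest(
    prefix: str, digest: str, max_length: int = MAX_TOPIC_ID_LENGTH
) -> str:
    """Trim the structured prefix, never the trailing digest (hypothetical helper)."""
    suffix = f"-record-{digest}"
    if len(suffix) > max_length:
        raise ValueError("max_length is too small to hold the digest suffix")
    # Reserve space for the suffix, then cut the prefix to whatever remains.
    room = max_length - len(suffix)
    return prefix[:room].rstrip("-") + suffix


# Two long IDs that share a prefix still differ after trimming.
long_prefix = "m365pf-" + "-".join(["verylongcomponent" * 5] * 4)
first = limit_preserving_digest(long_prefix, "aaaaaaaaaaaa")
second = limit_preserving_digest(long_prefix, "bbbbbbbbbbbb")
```

With this ordering, the uniqueness of the ID depends only on the digest surviving, so long app, feature, channel, or category values can no longer cause unrelated rows to merge.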
Summary
Closes #157. Final piece of the #127 follow-up trilogy. Builds on PR #159 (#155 dedup), PR #160 (#154 ARM bump), and PR #161 (#156 importer) by enriching Product Feedback ingestion with analyzer-ready clustering fields.
Clustering strategy
Deterministic structured-key clustering computed during ingestion:
- fsi_topicid derived from a normalized hash of (app, feature, channel, category, comment)
- Malformed input falls back to a record-<hash> per-record cluster (no crash)
Fields populated during ingestion
- fsi_topicname: human-readable topic label
- fsi_topicid: deterministic cluster label
- fsi_channelid: defaults to m365copilot when the CSV doesn't specify one
- fsi_feedbackcomment: raw comment text
- fsi_reportedat: normalized timestamp from Date Submitted
- fsi_conversationid: preserved when the CSV provides it
Changes
- hallucination-tracker/scripts/import_product_feedback_csv.py: extended to populate clustering fields
- hallucination-tracker/scripts/analyze_patterns.py: adjusted to consume new field shape
- hallucination-tracker/scripts/create_ht_*.py: updated alongside
- hallucination-tracker/tests/test_import_product_feedback_csv.py: 5 new clustering tests
- hallucination-tracker/tests/fixtures/product-feedback-clustering.csv: fixture
- hallucination-tracker/scripts/test_analyze_patterns.py: updated to consume new fields
- hallucination-tracker/docs/{pattern-analysis,source-configuration,troubleshooting}.md: updated documentation
- hallucination-tracker/README.md + CHANGELOG.md: documentation
Tests added (5 clustering regressions)
- […] fsi_topicid
- […] fsi_topicid
- […] record-<hash> without crashing
- […] analyze_patterns.py end-to-end
Validation
- pytest -q hallucination-tracker/tests: 9 passed
- ruff check hallucination-tracker/scripts hallucination-tracker/tests: clean
Clarifications
- […] fsi_topicid per the existing schema; documented in commit + README
Trilogy complete
Plus #158 — the deferred-bucket documentation issue (no PR; just the docs).
Closes #157