feat(hallucination-tracker): populate enriched clustering fields during ingestion#162

Merged: judeper merged 1 commit into main from chore/157-enriched-clustering on May 18, 2026
@judeper (Owner) commented on May 18, 2026

Summary

Closes #157. Final piece of the #127 follow-up trilogy. Builds on PR #159 (#155 dedup), PR #160 (#154 ARM bump), and PR #161 (#156 importer) by enriching Product Feedback ingestion with analyzer-ready clustering fields.

Clustering strategy

Deterministic structured-key clustering computed during ingestion:

  • The cluster label is written to fsi_topicid. It combines structured keys with a deterministic hash over app, feature, channel, category, and the normalized comment text
  • Identical inputs produce identical cluster labels (deterministic)
  • Malformed or missing input falls back to a per-record cluster (record-<hash>) instead of crashing
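
The strategy above can be illustrated with a minimal sketch. It is not the importer's actual code: the row keys (`app`, `feature`, `channel`, `category`, `comment`), the helper names, the `cluster-` prefix, and the 12-character SHA-256 digest are all assumptions chosen for illustration.

```python
import hashlib
import re

CLUSTER_KEYS = ("app", "feature", "channel", "category", "comment")  # assumed column names


def normalize(value):
    """Lowercase and collapse whitespace; anything that is not a string becomes ''."""
    if not isinstance(value, str):
        return ""
    return re.sub(r"\s+", " ", value).strip().lower()


def short_digest(text):
    """First 12 hex characters of SHA-256 (the length is an arbitrary choice here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def cluster_label(row):
    """Hypothetical helper: deterministic cluster label for one feedback row."""
    parts = [normalize(row.get(key)) for key in CLUSTER_KEYS]
    if not all(parts):
        # Malformed or missing input: hash the whole row so the record
        # lands in its own per-record cluster instead of raising.
        fingerprint = repr(sorted((str(k), str(v)) for k, v in row.items()))
        return f"record-{short_digest(fingerprint)}"
    # A separator that cannot appear in normalized text keeps field boundaries unambiguous.
    return f"cluster-{short_digest(chr(31).join(parts))}"
```

Because normalization runs before hashing, rows that differ only in case or spacing map to the same label, while any change to a structured key produces a different one.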

Fields populated during ingestion

  • fsi_topicname — human-readable topic label
  • fsi_topicid — deterministic cluster label
  • fsi_channelid — defaults to m365copilot when CSV doesn't specify
  • fsi_feedbackcomment — raw comment text
  • fsi_reportedat — normalized timestamp from Date Submitted
  • fsi_conversationid — preserved when CSV provides it
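
As a rough sketch of how one CSV row might map onto these six fields: the function name, the `Channel`, `Comment`, and `Conversation Id` column names, the ISO 8601 date format, and the use of `None` for absent values are assumptions. Only `Date Submitted`, the `fsi_*` field names, and the `m365copilot` default come from this PR.

```python
from datetime import datetime, timezone

DEFAULT_CHANNEL = "m365copilot"  # default named in this PR


def to_record(row, topic_id, topic_name):
    """Hypothetical mapper from one CSV row (a dict) to the six clustering fields."""
    reported_at = None
    raw_date = (row.get("Date Submitted") or "").strip()
    if raw_date:
        try:
            parsed = datetime.fromisoformat(raw_date)  # assumes ISO 8601 input
            if parsed.tzinfo is None:
                # Assumption: naive timestamps are treated as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            reported_at = parsed.astimezone(timezone.utc).isoformat()
        except ValueError:
            reported_at = None  # unparseable date: leave empty rather than crash

    return {
        "fsi_topicname": topic_name,
        "fsi_topicid": topic_id,
        "fsi_channelid": (row.get("Channel") or "").strip() or DEFAULT_CHANNEL,
        "fsi_feedbackcomment": row.get("Comment") or "",
        "fsi_reportedat": reported_at,
        # Preserved when the CSV provides it, otherwise None.
        "fsi_conversationid": (row.get("Conversation Id") or "").strip() or None,
    }
```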

Changes

  • hallucination-tracker/scripts/import_product_feedback_csv.py — extended to populate clustering fields
  • hallucination-tracker/scripts/analyze_patterns.py — adjusted to consume new field shape
  • hallucination-tracker/scripts/create_ht_*.py — updated alongside
  • hallucination-tracker/tests/test_import_product_feedback_csv.py — 5 new clustering tests
  • hallucination-tracker/tests/fixtures/product-feedback-clustering.csv — fixture
  • hallucination-tracker/scripts/test_analyze_patterns.py — updated to consume new fields
  • hallucination-tracker/docs/{pattern-analysis,source-configuration,troubleshooting}.md — updated documentation
  • hallucination-tracker/README.md + CHANGELOG.md — documentation

Tests added (5 clustering regressions)

  1. Field population: every imported record has all 6 fsi_* clustering fields set
  2. Same-cluster determinism: identical input rows produce identical fsi_topicid
  3. Different-cluster separation: clearly-different rows produce different fsi_topicid
  4. Malformed fallback: missing/null field falls back to record-<hash> without crashing
  5. Analyzer consumability: importer output is consumable by analyze_patterns.py end-to-end
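
For readers who want a feel for tests 2 through 4, here is a sketch in pytest style. `topic_id_for` is a stand-in written for this example, not the importer's function, and the row keys are assumed; the real tests live in test_import_product_feedback_csv.py and run against the fixture CSV.

```python
import hashlib

FIELDS = ("app", "feature", "channel", "category", "comment")  # assumed row keys


def topic_id_for(row):
    """Stand-in for the importer's clustering logic (hypothetical)."""
    values = [str(row.get(field) or "").strip().lower() for field in FIELDS]
    prefix = "cluster" if all(values) else "record"  # fallback when a field is missing
    digest = hashlib.sha256("|".join(values).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}"


BASE_ROW = {
    "app": "Word",
    "feature": "Summarize",
    "channel": "m365copilot",
    "category": "Accuracy",
    "comment": "Summary cites the wrong date",
}


def test_same_cluster_determinism():
    # Identical input rows must produce identical topic IDs.
    assert topic_id_for(dict(BASE_ROW)) == topic_id_for(dict(BASE_ROW))


def test_different_cluster_separation():
    # Clearly different rows must not share a topic ID.
    other = dict(BASE_ROW, app="Excel", comment="Formula suggestion is invalid")
    assert topic_id_for(other) != topic_id_for(BASE_ROW)


def test_malformed_fallback():
    # A missing or null field falls back to a record-<hash> label without raising.
    assert topic_id_for(dict(BASE_ROW, comment=None)).startswith("record-")
    assert topic_id_for({}).startswith("record-")
```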

Validation

  • pytest -q hallucination-tracker/tests: 9 passed
  • ruff check hallucination-tracker/scripts hallucination-tracker/tests: clean
  • ✅ Manual smoke: importer + analyzer pipeline runs end-to-end on fixture

Clarifications

Trilogy complete

| Issue | PR | What it added |
| --- | --- | --- |
| #154 | #160 | CSA workbook apiVersion bump |
| #155 | #159 | Session dedup hardening |
| #156 | #161 | Product Feedback CSV importer |
| #157 | (this PR) | Enriched clustering during ingestion |

Plus #158 — the deferred-bucket documentation issue (no PR; just the docs).

Closes #157

…ng ingestion (closes #157)

Builds on PR #161 (issue #156) which added the M365 Product Feedback
CSV importer. This PR extends ingestion to populate clustering fields
so the analyzer has everything it needs for per-cluster aggregation
without additional manual prep.

Clustering strategy:
- Structured keys plus deterministic hashes over app, feature, channel, category, and normalized feedback text
- Deterministic: identical inputs get identical cluster labels
- Graceful: malformed input falls back to per-record cluster (no crash)

Changes:
- hallucination-tracker/scripts/import_product_feedback_csv.py:
  extended to populate clustering fields per ingested record
- hallucination-tracker/tests/test_import_product_feedback_csv.py:
  5 new regression tests
- hallucination-tracker/README.md and hallucination-tracker/docs/*.md:
  documented clustering fields + pipeline

Fields populated (per record):
- fsi_topicname
- fsi_topicid
- fsi_channelid
- fsi_feedbackcomment
- fsi_reportedat
- fsi_conversationid

Validation:
- pytest: pass
- ruff (scope): clean
- Manual smoke: importer + analyzer pipeline runs end-to-end on fixture

Closes #157

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 944603c1f2


f"m365pf-{app_key}-{feature_key}-{channel_key}-"
f"{category_key}-record-{digest}"
)
return limit_length(cluster_id, MAX_TOPIC_ID_LENGTH)

P2: Preserve the cluster hash when trimming topic IDs

When the app/feature/channel/category components are long enough to push the generated m365pf-* value over 200 characters, this final truncation drops the trailing digest that is meant to make each normalized feedback signal unique. In those CSVs, two distinct comments that share the same leading signal words can be assigned the same fsi_topicid, so the analyzer will over-group unrelated Product Feedback rows; shorten earlier components or reserve space for the hash before enforcing the Dataverse length limit.
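
One way to apply this suggestion is to reserve room for the digest suffix before trimming the leading components. The sketch below is an assumption about how that could look, not the merged fix: `build_topic_id` and its parameters are hypothetical, and only the `m365pf-…-record-<digest>` shape and the 200-character limit are taken from the snippet and comment above.

```python
MAX_TOPIC_ID_LENGTH = 200  # limit cited in the review comment


def build_topic_id(key_parts, digest, max_length=MAX_TOPIC_ID_LENGTH):
    """Hypothetical helper: trim the key components, never the digest."""
    suffix = f"-record-{digest}"
    budget = max_length - len(suffix)  # characters left for the readable prefix
    if budget <= 0:
        raise ValueError("max_length is too small to hold the digest suffix")
    # Truncate only the prefix, then drop a trailing hyphen left by the cut.
    head = "-".join(["m365pf", *key_parts])[:budget].rstrip("-")
    return head + suffix
```

With this shape, two rows whose long keys truncate to the same prefix still end in different digests, so they do not collapse into a single fsi_topicid.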


@judeper merged commit 2010602 into main on May 18, 2026 (10 checks passed)
@judeper deleted the chore/157-enriched-clustering branch on May 18, 2026 at 00:24
Development

Successfully merging this pull request may close these issues.

hallucination-tracker: populate enriched clustering fields during transcript feedback ingestion