feat(hallucination-tracker): populate enriched clustering fields during ingestion by judeper · Pull Request #162 · judeper/FSI-AgentGov-Solutions

judeper · 2026-05-18T00:20:16Z

Summary

Closes #157. Final piece of the #127 follow-up trilogy. Builds on PR #159 (#155 dedup), PR #160 (#154 ARM bump), and PR #161 (#156 importer) by enriching Product Feedback ingestion with analyzer-ready clustering fields.

Clustering strategy

Deterministic structured-key clustering computed during ingestion:

Cluster label is written to fsi_topicid, derived from a deterministic hash over (app, feature, channel, category, normalized comment)
Identical inputs produce identical cluster labels (deterministic)
Malformed or missing input falls back to a per-record cluster labeled record-<hash>, without crashing
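
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the importer's actual code: the helper names `normalize` and `cluster_label`, the row keys, the `topic-` prefix, and the 12-character SHA-256 digest are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical row keys; the real CSV column names may differ.
CLUSTER_KEYS = ("app", "feature", "channel", "category", "comment")


def normalize(value):
    """Lowercase, trim, and collapse whitespace; None becomes an empty string."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", str(value)).strip().lower()


def cluster_label(row):
    """Return a deterministic cluster label for one feedback row (illustrative)."""
    parts = [normalize(row.get(key)) for key in CLUSTER_KEYS]
    # A separator that cannot appear after normalization keeps
    # ("ab", "c") and ("a", "bc") from hashing to the same value.
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:12]
    if not all(parts):
        # Malformed or missing input: fall back to a per-record cluster.
        return f"record-{digest}"
    return f"topic-{digest}"
```

Because the label depends only on normalized input, rows that differ in case or whitespace land in the same cluster, while a row with any missing field gets a `record-` label instead of raising.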

Fields populated during ingestion

fsi_topicname — human-readable topic label
fsi_topicid — deterministic cluster label
fsi_channelid — defaults to m365copilot when CSV doesn't specify
fsi_feedbackcomment — raw comment text
fsi_reportedat — normalized timestamp from Date Submitted
fsi_conversationid — preserved when CSV provides it
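
One way the field mapping above might look in code is sketched below. The CSV column names other than Date Submitted, the helper names, and the assumption that timestamps arrive as ISO 8601 strings are hypothetical; the actual importer may parse a different format.

```python
from datetime import datetime, timezone

DEFAULT_CHANNEL = "m365copilot"


def normalize_timestamp(raw):
    """Parse an ISO 8601 string and return it in UTC, or None if unparseable."""
    try:
        parsed = datetime.fromisoformat((raw or "").strip())
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def build_record(row, topic_id, topic_name):
    """Map one CSV row to the six fsi_* clustering fields (illustrative)."""
    return {
        "fsi_topicname": topic_name,
        "fsi_topicid": topic_id,
        # Default the channel when the CSV leaves it blank.
        "fsi_channelid": (row.get("Channel") or "").strip() or DEFAULT_CHANNEL,
        "fsi_feedbackcomment": row.get("Comment") or "",
        "fsi_reportedat": normalize_timestamp(row.get("Date Submitted")),
        # Preserved only when the CSV provides it.
        "fsi_conversationid": (row.get("Conversation ID") or "").strip() or None,
    }
```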

Changes

hallucination-tracker/scripts/import_product_feedback_csv.py — extended to populate clustering fields
hallucination-tracker/scripts/analyze_patterns.py — adjusted to consume new field shape
hallucination-tracker/scripts/create_ht_*.py — updated alongside
hallucination-tracker/tests/test_import_product_feedback_csv.py — 5 new clustering tests
hallucination-tracker/tests/fixtures/product-feedback-clustering.csv — fixture
hallucination-tracker/scripts/test_analyze_patterns.py — updated to consume new fields
hallucination-tracker/docs/{pattern-analysis,source-configuration,troubleshooting}.md — updated documentation
hallucination-tracker/README.md + CHANGELOG.md — documentation

Tests added (5 clustering regressions)

Field population: every imported record has all 6 fsi_* clustering fields set
Same-cluster determinism: identical input rows produce identical fsi_topicid
Different-cluster separation: clearly-different rows produce different fsi_topicid
Malformed fallback: missing/null field falls back to record-<hash> without crashing
Analyzer consumability: importer output is consumable by analyze_patterns.py end-to-end

Validation

✅ pytest -q hallucination-tracker/tests: 9 passed
✅ ruff check hallucination-tracker/scripts hallucination-tracker/tests: clean
✅ Manual smoke: importer + analyzer pipeline runs end-to-end on fixture

Clarifications

Issue text mentioned "transcript feedback ingestion", but the re-scoped Product Feedback importer (PR #161) is the actual ingestion surface, so the clustering work is applied there
Inferred cluster label belongs in fsi_topicid per the existing schema; documented in commit + README

Trilogy complete

Issue	PR	What it added
#154	#160	CSA workbook apiVersion bump
#155	#159	Session dedup hardening
#156	#161	Product Feedback CSV importer
#157	(this PR)	Enriched clustering during ingestion

Plus #158 — the deferred-bucket documentation issue (no PR; just the docs).

Closes #157

…ng ingestion (closes #157)

Builds on PR #161 (issue #156) which added the M365 Product Feedback CSV importer. This PR extends ingestion to populate clustering fields so the analyzer has everything it needs for per-cluster aggregation without additional manual prep.

Clustering strategy:
- Structured keys plus deterministic hashes over app, feature, channel, category, and normalized feedback text
- Deterministic: identical inputs get identical cluster labels
- Graceful: malformed input falls back to per-record cluster (no crash)

Changes:
- hallucination-tracker/scripts/import_product_feedback_csv.py: extended to populate clustering fields per ingested record
- hallucination-tracker/tests/test_import_product_feedback_csv.py: 5 new regression tests
- hallucination-tracker/README.md and hallucination-tracker/docs/*.md: documented clustering fields + pipeline

Fields populated (per record):
- fsi_topicname
- fsi_topicid
- fsi_channelid
- fsi_feedbackcomment
- fsi_reportedat
- fsi_conversationid

Validation:
- pytest: pass
- ruff (scope): clean
- Manual smoke: importer + analyzer pipeline runs end-to-end on fixture

Closes #157

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 944603c1f2


chatgpt-codex-connector · 2026-05-18T00:21:54Z

+            f"m365pf-{app_key}-{feature_key}-{channel_key}-"
+            f"{category_key}-record-{digest}"
+        )
+    return limit_length(cluster_id, MAX_TOPIC_ID_LENGTH)


Preserve the cluster hash when trimming topic IDs

When the app/feature/channel/category components are long enough to push the generated m365pf-* value over 200 characters, this final truncation drops the trailing digest that is meant to make each normalized feedback signal unique. In those CSVs, two distinct comments that share the same leading signal words can be assigned the same fsi_topicid, so the analyzer will over-group unrelated Product Feedback rows; shorten earlier components or reserve space for the hash before enforcing the Dataverse length limit.
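
A possible shape for the fix suggested above is to reserve room for the digest and trim only the leading components. The sketch below is an assumption about how that could be done, not the repository's `limit_length`; the function name `limit_preserving_digest` is hypothetical, and the 200-character limit is taken from the review comment.

```python
MAX_TOPIC_ID_LENGTH = 200  # limit quoted in the review comment


def limit_preserving_digest(prefix, digest, max_length=MAX_TOPIC_ID_LENGTH):
    """Join prefix and digest with '-', trimming only the prefix when too long."""
    suffix = f"-{digest}"
    room = max_length - len(suffix)
    if room <= 0:
        raise ValueError("max_length is too small to hold the digest")
    # Trim the structured-key prefix, never the digest that makes the ID unique.
    return prefix[:room].rstrip("-") + suffix
```

With this approach two rows that share long app, feature, channel, and category values but differ in comment text still receive distinct IDs, because the differing digest always survives truncation.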

chatgpt-codex-connector Bot reviewed May 18, 2026

judeper merged commit 2010602 into main May 18, 2026
10 checks passed

judeper deleted the chore/157-enriched-clustering branch May 18, 2026 00:24

judeper merged 1 commit into main from chore/157-enriched-clustering
