
feat(semantic_dedup): configurable similarity threshold and shingle size #10

Open
rambalconnicz wants to merge 1 commit into open-compress:main from rambalconnicz:feat/configurable-dedup-threshold

@rambalconnicz

Summary

The SemanticDedup stage uses a hardcoded Jaccard similarity threshold of 0.8 and shingle size of 3. This works well for general text, but different content types benefit from different dedup aggressiveness:

  • Log-heavy output: looser threshold (~0.6) catches more repeated patterns
  • Code blocks: stricter threshold (~0.9) avoids merging blocks that differ in identifiers
  • Conversation dedup: smaller shingles (2) catch short repeated phrases better

This PR makes these parameters configurable via SemanticDedup constructor args while preserving the existing defaults.
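To make the effect of the two knobs concrete, here is a minimal sketch of shingle-based Jaccard matching. It is not the code from this PR: the helper names (`shingles`, `jaccard`, `is_duplicate`), the choice of word-level shingles, and the `>=` comparison are all assumptions made for illustration.

```python
from __future__ import annotations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of width `size` (assumed word-level)."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(x: str, y: str,
                 similarity_threshold: float = 0.8,
                 shingle_size: int = 3) -> bool:
    """Hypothetical helper: True when similarity reaches the threshold."""
    score = jaccard(shingles(x, shingle_size), shingles(y, shingle_size))
    return score >= similarity_threshold


# Two log lines that differ in a single token near the end.
log_a = "worker 7 finished job 42 with status ok after 3 retries"
log_b = "worker 7 finished job 42 with status ok after 5 retries"

# 7 of 11 distinct 3-word shingles are shared, so the score is 7/11 (about 0.64).
print(is_duplicate(log_a, log_b))                            # False at 0.8
print(is_duplicate(log_a, log_b, similarity_threshold=0.6))  # True at 0.6
```

Under these assumptions, the default threshold keeps both log lines while the looser 0.6 threshold treats them as duplicates, which is the log-heavy case described above.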

Changes

  • Add similarity_threshold, shingle_size, and min_block_chars constructor params to SemanticDedup
  • Thread parameters through _run_dedup() and _split_blocks()
  • All existing tests pass unchanged (defaults preserved)
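The shape of the change can be sketched roughly as follows. This is a hypothetical stand-in, not the real `SemanticDedup` class: the class name `ConfigurableDedupSketch`, the blank-line block splitting, the word-level shingles, the input validation, and the choice to keep short blocks untouched and drop later near-duplicates are all assumptions. Only the three parameter names, their defaults, and the method names `_run_dedup` and `_split_blocks` come from this PR.

```python
from __future__ import annotations

import re


class ConfigurableDedupSketch:
    """Illustrative stand-in showing parameters stored and threaded through."""

    def __init__(self, similarity_threshold: float = 0.8,
                 shingle_size: int = 3, min_block_chars: int = 50) -> None:
        # Validation is an assumption; the PR does not say whether it validates.
        if not 0.0 <= similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in [0, 1]")
        if shingle_size < 1:
            raise ValueError("shingle_size must be >= 1")
        if min_block_chars < 0:
            raise ValueError("min_block_chars must be >= 0")
        self.similarity_threshold = similarity_threshold
        self.shingle_size = shingle_size
        self.min_block_chars = min_block_chars

    def _split_blocks(self, text: str) -> list[str]:
        # Assumption: blocks are separated by one or more blank lines.
        blocks = (b.strip() for b in re.split(r"\n\s*\n", text))
        return [b for b in blocks if b]

    def _shingles(self, block: str) -> frozenset:
        words = block.lower().split()
        n = self.shingle_size
        if len(words) < n:
            return frozenset({tuple(words)})
        return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def _run_dedup(self, text: str) -> str:
        seen: list[frozenset] = []  # shingle sets of blocks kept so far
        kept: list[str] = []
        for block in self._split_blocks(text):
            if len(block) < self.min_block_chars:
                kept.append(block)  # too short to consider; keep as-is
                continue
            sh = self._shingles(block)
            if any(len(sh & other) / len(sh | other) >= self.similarity_threshold
                   for other in seen):
                continue  # near-duplicate of an earlier block; drop it
            seen.append(sh)
            kept.append(block)
        return "\n\n".join(kept)
```

Because every parameter has a default equal to the previously hardcoded value, constructing the class with no arguments behaves like the old code path, which is why existing tests would not need to change.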

Usage

from claw_compactor.fusion.semantic_dedup import SemanticDedup

# Aggressive dedup for logs
log_stage = SemanticDedup(similarity_threshold=0.6)

# Conservative dedup for code
code_stage = SemanticDedup(similarity_threshold=0.9, shingle_size=4)

…igurable

The SemanticDedup stage previously used hardcoded values for the Jaccard
similarity threshold (0.8) and shingle size (3). Different content types
benefit from different dedup aggressiveness -- e.g. log-heavy output works
better with a looser threshold (0.6) while code blocks need stricter matching
(0.9) to avoid false merges.

This adds optional constructor parameters:
- similarity_threshold: Jaccard cutoff (default 0.8, unchanged)
- shingle_size: n-gram width for fingerprinting (default 3)
- min_block_chars: minimum block length to consider (default 50)

All parameters are threaded through _run_dedup and _split_blocks. Existing
behavior is preserved when no arguments are passed.