feat: add fuzzy matching and semantic detection for swear words #2

Open
vaisahub wants to merge 1 commit into jithinolickal:main from vaisahub:feature/fuzzy-semantic-detection

@vaisahub

Implements Issue #1: Detect misspelled, obfuscated, and indirect swearing

Features:

  • ✅ Fuzzy matching with Levenshtein distance (f***k, fuuuck, fvck, $hit)
  • ✅ Obfuscation pattern detection (l33tspeak, asterisks, symbols)
  • ✅ Semantic analysis for indirect swearing ("what is wrong with you")
  • ✅ Frustration/hostility detection (patterns + keywords + punctuation)
  • ✅ Zero dependencies (pure TypeScript algorithms)
  • ✅ 86.4% detection rate on test cases
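
For readers unfamiliar with the technique named in the first bullet, here is a minimal sketch of Levenshtein distance in dependency-free TypeScript. It is illustrative only and is not taken from src/fuzzy.ts; the function name `levenshtein` is an assumption.

```typescript
// Classic dynamic-programming Levenshtein distance using two rolling rows.
// Illustrative sketch only; not the code from src/fuzzy.ts.
function levenshtein(a: string, b: string): number {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(
        Math.min(
          prev[j] + 1,        // deletion
          curr[j - 1] + 1,    // insertion
          prev[j - 1] + cost  // substitution
        )
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("fvck", "fuck"));   // 1
console.log(levenshtein("fuuuck", "fuck")); // 2
console.log(levenshtein("help", "hell"));   // 1
```

Note that an ordinary word such as "help" sits at distance 1 from a base word, the same as "fvck", so any distance threshold has to be paired with some other signal to avoid flagging normal English.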

New Files:

  • src/fuzzy.ts - Levenshtein distance, normalization, pattern matching
  • src/semantic.ts - Indirect swearing detection, frustration analysis
  • src/scanner-ai.ts - Optional AI-powered mode (future feature)
  • src/compare.ts - Comparison tool to test both approaches
  • DETECTION_COMPARISON.md - Full analysis and results

Detection Results:

  • Direct swearing: 100% (fuck, shit, damn)
  • Obfuscated: 87.5% (f***k, $hit, fuuuck)
  • Indirect/Semantic: 90% ("what is wrong with you", "this makes no sense")
  • False positive rate: ~5%

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@jithinolickal

Owner

Thanks for the contribution! The idea of detecting obfuscated swears (f***k, $hit, fuuuck) is interesting, but I have some concerns before this could be merged:

Over-engineering for the use case

This is a lightweight, fun CLI tool. Adding Levenshtein distance algorithms, semantic analysis, keyword weighting systems, and a composite scoring engine (800+ lines) is a lot of machinery for a novelty tool. The obfuscation detection could be achieved with ~20 lines of additional regex patterns in patterns.ts.

False positives

  • The PR itself documents "Thank you for helping" triggering a match (detects "hell" in "helping")
  • The h[e3][l1|]{1,2} obfuscation regex would also match "help", "held", "helm" etc.
  • Semantic detection flags normal phrases like "I give up", "I can't believe this", "Are you serious right now" as swearing. These aren't swearing; they're just English. This would massively inflate counts and make the tool less trustworthy and less fun.
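
The regex concern in the second bullet can be checked directly. The sketch below tests the quoted pattern against ordinary words, then tries a stricter variant; the stricter pattern and the names `unanchored` and `strict` are assumptions for illustration, not code from the PR.

```typescript
// The obfuscation regex quoted in the review, with no anchoring.
const unanchored = /h[e3][l1|]{1,2}/i;

// Hypothetical stricter variant: require two "l"-like characters and make sure
// the match is not embedded inside a longer alphanumeric word.
const strict = /(?:^|[^a-z0-9])h[e3][l1|]{2}(?![a-z0-9])/i;

const samples = ["help", "held", "helm", "Thank you for helping", "hell", "what the h3|| is this"];

for (const s of samples) {
  console.log(`${s}: unanchored=${unanchored.test(s)} strict=${strict.test(s)}`);
}
```

Under these assumptions the unanchored pattern matches every sample, while the stricter one matches only "hell" and "what the h3|| is this".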

Dead code

scanner-ai.ts (183 lines) is not wired up to anything. It is placeholder code for a future feature that doesn't exist yet, so it shouldn't be in this PR.

Documentation bloat

IMPLEMENTATION_SUMMARY.md (265 lines) and DETECTION_COMPARISON.md (162 lines) are implementation notes, not user-facing docs. IMPLEMENTATION_SUMMARY.md also contains your local machine path (/Users/vaisakhma/Documents/my-projects/).

Performance concern

Fuzzy matching runs Levenshtein distance on every word in every message against 19 base words. For users with large conversation histories this would be noticeably slower than the current instant regex approach.

What I'd suggest instead

If you want to contribute obfuscation detection, a much simpler approach would work:

  • Add a few targeted regex patterns to patterns.ts for common obfuscations (e.g., f*+k, sh*t, repeated chars like fuuu+ck)
  • Keep it minimal and in the existing pattern structure
  • No new files needed
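
A rough sketch of what such targeted patterns might look like follows. The layout of patterns.ts is not shown in this thread, so the `obfuscationPatterns` table and the `countObfuscated` helper are hypothetical names, and the two regexes are one possible reading of the examples above rather than a proposed final set.

```typescript
// Hypothetical pattern table; the real structure of patterns.ts may differ.
const obfuscationPatterns: { word: string; pattern: RegExp }[] = [
  // f*ck, f***k, fuuuck, fvck: asterisks, repeated "u", or "v" standing in for "u"
  { word: "fuck", pattern: /(?:^|[^a-z])f(?:u+c|\*+c?|vc)k(?![a-z])/gi },
  // sh*t, $hit, sh1t, shiiit: "$" for "s", asterisks or "1" for "i", repeated "i"
  { word: "shit", pattern: /(?:^|[^a-z])[s$]h(?:i+|\*+|1)t(?![a-z])/gi },
];

// Count matches across all patterns in a message.
function countObfuscated(text: string): number {
  let total = 0;
  for (const { pattern } of obfuscationPatterns) {
    const matches = text.match(pattern);
    if (matches) total += matches.length;
  }
  return total;
}

console.log(countObfuscated("f***k this, fuuuck that, fvck"));                     // 3
console.log(countObfuscated("sh*t, $hit and shiiit"));                             // 3
console.log(countObfuscated("Thank you for helping with the shirt and the fork")); // 0
```

Requiring a non-letter on both sides of each match is what keeps words such as "shirt" and "fork" from being counted.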

Happy to review a slimmed-down version!
