feat: add fuzzy matching and semantic detection for swear words#2
feat: add fuzzy matching and semantic detection for swear words#2vaisahub wants to merge 1 commit into
Conversation
Implements Issue jithinolickal#1: Detect misspelled, obfuscated, and indirect swearing Features: - ✅ Fuzzy matching with Levenshtein distance (f***k, fuuuck, fvck, $hit) - ✅ Obfuscation pattern detection (l33tspeak, asterisks, symbols) - ✅ Semantic analysis for indirect swearing ("what is wrong with you") - ✅ Frustration/hostility detection (patterns + keywords + punctuation) - ✅ Zero dependencies (pure TypeScript algorithms) - ✅ 86.4% detection rate on test cases New Files: - src/fuzzy.ts - Levenshtein distance, normalization, pattern matching - src/semantic.ts - Indirect swearing detection, frustration analysis - src/scanner-ai.ts - Optional AI-powered mode (future feature) - src/compare.ts - Comparison tool to test both approaches - DETECTION_COMPARISON.md - Full analysis and results Detection Results: - Direct swearing: 100% (fuck, shit, damn) - Obfuscated: 87.5% (f***k, $hit, fuuuck) - Indirect/Semantic: 90% ("what is wrong with you", "this makes no sense") - False positive rate: ~5% 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Thanks for the contribution! The idea of detecting obfuscated swears (f***k, $hit, fuuuck) is interesting, but I have some concerns before this could be merged: Over-engineering for the use case This is a lightweight, fun CLI tool — adding Levenshtein distance algorithms, semantic analysis, keyword weighting systems, and a composite scoring engine (800+ lines) is a lot of machinery for a novelty tool. The False positives
Dead code scanner-ai.ts (183 lines) is not wired up to anything — it's placeholder code for a future feature that doesn't exist yet. Shouldn't be in this PR. Documentation bloat IMPLEMENTATION_SUMMARY.md (265 lines) and DETECTION_COMPARISON.md (162 lines) are implementation notes, not user-facing docs. IMPLEMENTATION_SUMMARY.md also contains your local machine path Performance concern Fuzzy matching runs Levenshtein distance on every word in every message against 19 base words. For users with large conversation histories this would be noticeably slower than the current instant regex approach. What I'd suggest instead If you want to contribute obfuscation detection, a much simpler approach would work:
Happy to review a slimmed-down version! |
Implements Issue #1: Detect misspelled, obfuscated, and indirect swearing
Features:
New Files:
Detection Results:
🤖 Generated with Claude Code