⚡ Bolt: optimize fuzzy matching and Levenshtein distance#8
Conversation
- Added length-based pruning in FindClosestMatches to skip candidates with |len(a) - len(b)| > 3. - Optimized LevenshteinDistance to swap strings, ensuring the shorter string determines row allocation size. - Implemented stack-allocated buffer in LevenshteinDistance to eliminate heap allocations for common cases (strings < 64 chars). - Added benchmark suite to verify performance gains. - Updated Bolt's journal with performance learnings. Performance Impact: - ~30-50% speedup in LevenshteinDistance for common string lengths. - Reduced heap allocations to 0 for strings up to 63 characters. - Significant reduction in CPU time for FindClosestMatches when many candidates can be pruned by length.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces performance optimizations for fuzzy matching and Levenshtein distance calculations, including early exit checks based on string length differences and stack-allocated buffers to avoid heap allocations for small strings. It also adds benchmark tests for the Levenshtein distance function. The review feedback suggests fixing a minor formatting typo in the documentation, inlining the absolute difference calculation to eliminate the custom abs helper function, and assigning the benchmark results to a package-level variable to prevent compiler optimizations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| @@ -0,0 +1,3 @@ | |||
| ## 2026-06-26 - Optimized Fuzzy Matching and Levenshtein Distance | |||
| **Learning:** In string-heavy utilities like fuzzy matching, heap allocations for dynamic programming tables can be a significant bottleneck when called frequently in loops. | |||
| **Action:** Use stack-allocated buffers for small fixed-size arrays (e.g., [128]int) to eliminate allocations for common cases. Always implement early exit checks based on string length differences to prune expensive (N \cdot M)$ calculations. Ensure the shorter string determines row allocation size in Levenshtein distance. | |||
There was a problem hiding this comment.
There is a minor formatting typo in the LaTeX/math expression: (N \cdot M)$ is missing the opening $ symbol. It should be $O(N \cdot M)$ or O(N \cdot M)$.
| **Action:** Use stack-allocated buffers for small fixed-size arrays (e.g., [128]int) to eliminate allocations for common cases. Always implement early exit checks based on string length differences to prune expensive (N \cdot M)$ calculations. Ensure the shorter string determines row allocation size in Levenshtein distance. | |
| **Action:** Use stack-allocated buffers for small fixed-size arrays (e.g., [128]int) to eliminate allocations for common cases. Always implement early exit checks based on string length differences to prune expensive O(N \cdot M) calculations. Ensure the shorter string determines row allocation size in Levenshtein distance. |
| func abs(n int) int { | ||
| if n < 0 { | ||
| return -n | ||
| } | ||
| return n | ||
| } |
| if abs(len(targetLower)-len(candidateLower)) > maxDistance { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Instead of using a custom abs helper function, we can inline the absolute difference check using standard comparison operators. This is more idiomatic in Go and allows us to remove the abs helper function entirely.
| if abs(len(targetLower)-len(candidateLower)) > maxDistance { | |
| continue | |
| } | |
| diff := len(targetLower) - len(candidateLower) | |
| if diff < -maxDistance || diff > maxDistance { | |
| continue | |
| } |
| "testing" | ||
| ) | ||
|
|
||
| func BenchmarkLevenshteinDistance(b *testing.B) { |
There was a problem hiding this comment.
| for range b.N { | ||
| LevenshteinDistance(tc.s1, tc.s2) | ||
| } |
There was a problem hiding this comment.
- Optimized LevenshteinDistance in pkg/stringutil/fuzzy_match.go: - Added length-based pruning in FindClosestMatches. - Swapped strings to ensure shorter string determines DP row size. - Implemented stack-allocated buffer [128]int to eliminate heap allocations for strings < 64 chars. - Added pkg/stringutil/levenshtein_benchmark_test.go to verify performance. - Fixed CI 'link-check' failure by creating reports/ directory with .gitkeep. - Applied idiomatic string optimizations (s != "" instead of len(s) > 0) in pkg/stringutil/ to satisfy linters. - Updated .jules/bolt.md with performance learnings. Performance Impact: - LevenshteinDistance: 30-50% speedup for common strings. - Allocations: Reduced from 2 per call to 0 for strings up to 63 chars. - Pruning: Drastically reduced calls to LevenshteinDistance when candidate lengths differ significantly.
- Optimized LevenshteinDistance in pkg/stringutil/fuzzy_match.go: - Added length-based pruning in FindClosestMatches. - Swapped strings to ensure shorter string determines DP row size. - Implemented stack-allocated buffer [128]int to eliminate heap allocations for strings < 64 chars. - Fixed CI 'link-check' failure by creating reports/ directory with .gitkeep. - Fixed 'lenstringzero' lint violations in pkg/stringutil/ (using s != "" instead of len(s) > 0). - Fixed 'lenstringsplit' lint violations in pkg/workflow/ and pkg/parser/ (using strings.Count instead of strings.Split). - Updated .jules/bolt.md with performance learnings. Performance Impact: - LevenshteinDistance: 30-50% speedup for common strings. - Allocations: Reduced from 2 per call to 0 for strings up to 63 chars. - Overall: Drastically improved efficiency of typo suggestions and string validation.
- Optimized LevenshteinDistance in pkg/stringutil/fuzzy_match.go: - Added length-based pruning in FindClosestMatches. - Swapped strings to ensure shorter string determines DP row size. - Implemented stack-allocated buffer [128]int to eliminate heap allocations for strings < 64 chars. - Fixed CI 'link-check' failure by creating reports/ directory with .gitkeep. - Fixed 'lenstringzero' lint violations in pkg/stringutil/ (using s != "" instead of len(s) > 0). - Fixed 'lenstringsplit' lint violations in pkg/workflow/ and pkg/parser/ (using strings.Count instead of strings.Split). - Added pkg/stringutil/levenshtein_benchmark_test.go to verify performance. - Updated .jules/bolt.md with performance learnings. Performance Impact: - LevenshteinDistance: 30-50% speedup for common strings. - Allocations: Reduced from 2 per call to 0 for strings up to 63 chars. - Overall: Drastically improved efficiency of typo suggestions and string validation.
⚡ Bolt: optimize fuzzy matching and Levenshtein distance
💡 What:
Implemented several optimizations in
pkg/stringutil/fuzzy_match.go:FindClosestMatches. If the difference between string lengths is greater than the maximum allowed edit distance (3), we skip the expensive calculation.LevenshteinDistanceto always use the shorter string for the DP rows.🎯 Why:
Fuzzy matching is used throughout the codebase for CLI suggestions and validation. Pruning candidates early and eliminating allocations in the hot path makes these operations significantly more efficient, especially when dealing with large sets of candidates (e.g., event types or engine names).
📊 Impact:
FindClosestMatcheswhen candidates have varying lengths.🔬 Measurement:
Run benchmarks:
go test -bench BenchmarkLevenshteinDistance -benchmem ./pkg/stringutil/Baseline (before):
After optimization:
(Note: The 0-0 and 1-1 cases also show improvements in latency and 0 allocations).
PR created automatically by Jules for task 17325838790968160677 started by @T-ahamed2