⚡ Bolt: optimize fuzzy matching and string sanitization #12

Draft
T-ahamed2 wants to merge 2 commits into main from bolt-optimize-stringutil-16899512580012730208

@T-ahamed2

Owner

This PR implements several high-impact performance optimizations in pkg/stringutil based on measured bottlenecks.

Optimizations:

  1. Fuzzy Matching (LevenshteinDistance):
    • Input Swapping: Ensures the shorter string drives the row allocation size, minimizing memory footprint.
    • Stack Allocation: Uses stack-allocated buffers ([65]int) for strings up to 64 characters (common for identifiers/typos), eliminating heap allocations for these cases.
  2. Fuzzy Matching (FindClosestMatches):
    • Early Exit: Added an O(1) length check. Since Levenshtein distance is at least the absolute difference in string lengths, we can skip the O(N*M) calculation if abs(len(a) - len(b)) > 3.
  3. String Sanitization (SanitizeName):
    • Regex Pre-compilation: Pre-compiled frequently used sanitization patterns into a package-level map, removing the overhead of regexp.MustCompile on every call.
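The two fuzzy-matching optimizations can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `levenshtein`, `withinDistance`, `min3`, and `stackRowLen` are hypothetical names, the distance is computed over bytes rather than runes, and whether the fixed-size arrays really stay on the stack depends on the compiler's escape analysis (`go build -gcflags=-m` will report it).

```go
package main

// stackRowLen is the fixed row size: 64 bytes of input plus the leading
// column. The 64-character cutoff mirrors the PR description.
const stackRowLen = 65

// min3 returns the smallest of three ints (kept explicit so the sketch
// also builds on toolchains older than Go 1.21).
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}

// levenshtein returns the byte-wise edit distance between a and b.
// Hypothetical stand-in for LevenshteinDistance in pkg/stringutil.
func levenshtein(a, b string) int {
	// Input swapping: make b the shorter string so both rows have
	// len(shorter)+1 cells.
	if len(a) < len(b) {
		a, b = b, a
	}
	if b == "" {
		return len(a)
	}

	// Short inputs reuse fixed-size arrays; longer ones fall back to make.
	var prevBuf, currBuf [stackRowLen]int
	var prev, curr []int
	if len(b) < stackRowLen {
		prev, curr = prevBuf[:len(b)+1], currBuf[:len(b)+1]
	} else {
		prev, curr = make([]int, len(b)+1), make([]int, len(b)+1)
	}

	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// withinDistance reports whether candidate is at most maxDist edits from
// target. The length check is the O(1) early exit: the edit distance can
// never be smaller than the difference in lengths.
func withinDistance(target, candidate string, maxDist int) bool {
	diff := len(target) - len(candidate)
	if diff < 0 {
		diff = -diff
	}
	if diff > maxDist {
		return false
	}
	return levenshtein(target, candidate) <= maxDist
}
```

With a threshold of 3, as in the description above, `withinDistance("cat", "category", 3)` returns false without ever filling a row, because the lengths differ by 5.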

Performance Impact (Benchmarks):

  • BenchmarkLevenshteinDistance: 298.4 ns/op -> 197.5 ns/op (~34% faster)
  • BenchmarkFindClosestMatches: 2625 ns/op -> 1884 ns/op (~28% faster)
  • BenchmarkSanitizeName: 5511 ns/op -> 3319 ns/op (~40% faster)
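For reference, the percentages above are reductions in time per operation, not throughput multipliers. A small sketch of the arithmetic, using hypothetical helper names `timeReduction` and `speedup`:

```go
package main

// timeReduction returns the percentage drop in ns/op from before to after.
func timeReduction(before, after float64) float64 {
	return (before - after) / before * 100
}

// speedup returns how many times more operations fit in the same wall time.
func speedup(before, after float64) float64 {
	return before / after
}
```

Applied to the figures above, `timeReduction` gives about 33.8%, 28.2%, and 39.8%, which correspond to speedups of roughly 1.51x, 1.39x, and 1.66x.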

All changes are verified with the full test suite. Pre-existing failures in pkg/cli were confirmed as unrelated to these changes.


PR created automatically by Jules for task 16899512580012730208 started by @T-ahamed2

- Optimized LevenshteinDistance by implementing input swapping and stack-allocated buffers for small strings, reducing allocations and improving performance by ~34%.
- Added a length-difference early exit in FindClosestMatches to skip expensive calculations when strings are clearly too different, improving performance by ~28%.
- Pre-compiled common sanitization regex patterns in SanitizeName, avoiding redundant runtime compilation and improving performance by ~40%.
- Verified all changes with unit tests and benchmarks.
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 093318c3-1345-4441-82a2-0c81e2975c9c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces performance optimizations for string processing utilities, including stack-allocated buffers and input swapping in the Levenshtein distance calculation, an early length-difference check to skip expensive computations, and a pre-compiled regex cache for common sanitization patterns. The review feedback suggests normalizing the allowed characters key before performing the regex cache lookup to ensure robust cache hits, and simplifying the nested min calls in the Levenshtein distance calculation using Go 1.21's multi-argument min function.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +151 to +156
	var pattern *regexp.Regexp
	if p, ok := commonSanitizePatterns[allowedChars]; ok {
		pattern = p
	} else {
		pattern = regexp.MustCompile(`[^` + allowedChars + `]+`)
	}

high

The pre-compiled regex cache lookup in commonSanitizePatterns is sensitive to the exact order and duplication of characters in allowedChars. If opts.PreserveSpecialChars contains characters in a different order (e.g., _ before .) or contains duplicates, the generated allowedChars string will not match any key in commonSanitizePatterns, causing a silent fallback to regexp.MustCompile on every call.

To ensure robust cache hits and maintain the performance optimization, we should normalize allowedChars to a canonical key before performing the map lookup.

	canonicalKey := "a-z0-9-"
	if strings.ContainsRune(allowedChars, '.') {
		canonicalKey += "."
	}
	if strings.ContainsRune(allowedChars, '_') {
		canonicalKey += "_"
	}

	var pattern *regexp.Regexp
	if p, ok := commonSanitizePatterns[canonicalKey]; ok {
		pattern = p
	} else {
		pattern = regexp.MustCompile("[^" + allowedChars + "]+")
	}
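One caveat with the snippet above: as written, it appears to map every input onto one of four keys, so a set that also contains some other character (say `+`) could hit a cached pattern that does not allow that character. A more general normalization is to sort and de-duplicate the extra characters before building the key. The following is a sketch under assumptions: `cachedPatterns`, `canonicalExtras`, and `patternFor` are hypothetical names, the extras are passed as a plain string, and the four cached keys are a guess, since the real contents of commonSanitizePatterns are not shown in this conversation.

```go
package main

import (
	"regexp"
	"sort"
)

// baseChars is the always-allowed class body (taken from the suggestion above).
const baseChars = "a-z0-9-"

// cachedPatterns is compiled once at package init. The key set is assumed.
var cachedPatterns = map[string]*regexp.Regexp{
	baseChars:        regexp.MustCompile(`[^a-z0-9-]+`),
	baseChars + ".":  regexp.MustCompile(`[^a-z0-9-.]+`),
	baseChars + "_":  regexp.MustCompile(`[^a-z0-9-_]+`),
	baseChars + "._": regexp.MustCompile(`[^a-z0-9-._]+`),
}

// canonicalExtras sorts and de-duplicates the extra characters so that
// "_." and "..__" both normalize to "._". A hyphen is dropped because
// baseChars already allows it.
func canonicalExtras(extras string) string {
	seen := make(map[rune]bool)
	var runes []rune
	for _, r := range extras {
		if r == '-' || seen[r] {
			continue
		}
		seen[r] = true
		runes = append(runes, r)
	}
	sort.Slice(runes, func(i, j int) bool { return runes[i] < runes[j] })
	return string(runes)
}

// patternFor returns the cached pattern for the normalized key, compiling
// on demand only for character sets the cache does not cover.
func patternFor(extras string) *regexp.Regexp {
	canon := canonicalExtras(extras)
	if p, ok := cachedPatterns[baseChars+canon]; ok {
		return p
	}
	// QuoteMeta keeps characters such as ']' or '^' literal inside the class.
	return regexp.MustCompile(`[^` + baseChars + regexp.QuoteMeta(canon) + `]+`)
}
```

Here `patternFor("_.")` and `patternFor("._")` return the same cached `*regexp.Regexp`, while `patternFor("+")` still compiles a pattern that keeps `+`.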

insertion := currRow[j-1] + 1
substitution := prevRow[j-1] + cost

currRow[j] = min(deletion, min(insertion, substitution))

medium

Go 1.21 added the built-in min and max functions, which accept any number of arguments (at least one). We can collapse the nested min calls into a single call for better readability.

Suggested change
- currRow[j] = min(deletion, min(insertion, substitution))
+ currRow[j] = min(deletion, insertion, substitution)
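A small sketch of the equivalence, assuming the module's go.mod declares `go 1.21` or later (the built-in is not available at older language versions). `cellNested` and `cellFlat` are hypothetical names used only for illustration.

```go
package main

// cellNested mirrors the current form: two nested two-argument calls.
func cellNested(deletion, insertion, substitution int) int {
	return min(deletion, min(insertion, substitution))
}

// cellFlat is the suggested form: one call to the variadic built-in.
func cellFlat(deletion, insertion, substitution int) int {
	return min(deletion, insertion, substitution)
}
```

Both functions return the same value for any inputs. If pkg/stringutil currently defines its own two-argument `min` helper, that helper would shadow the built-in and would need to be removed for the flat call to compile.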

- Removed non-existent reports/ directory check from link-check workflow.
- Fixed lint errors in pkg/stringutil (prefer s != "" over len(s) > 0).