perf: use multiply-31 hash combining for 11% speed up #16
Conversation
Replace DefaultHasher with simple `hash * 31 + value` combining in maximize_collision. Reduces runtime from 22.0s to 19.5s on large workloads while maintaining identical results.

Reference: Knuth TAOCP Vol. 3 §6.4 on hashing. Refer to "The Art of Computer Programming, Volume 3 (2nd ed.)", Section 6.4 — Hashing (pp. 506–549). Relevant subsections:

- §6.4 "Hashing by Multiplication and Division" (pp. 513–519)
- §6.4 "Hashing by Polynomial Accumulation" (pp. 521–523)

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
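The combining scheme described above can be sketched as follows. This is a minimal illustration of the `hash * 31 + value` idea with wrapping arithmetic, not the project's actual `maximize_collision` code; the function name `combine_hashes` is hypothetical:

```rust
// Polynomial accumulation with constant 31, as described in the PR.
// `wrapping_*` avoids overflow panics in debug builds.
fn combine_hashes(values: &[u64]) -> u64 {
    let mut hash_key: u64 = 0;
    for &v in values {
        hash_key = hash_key.wrapping_mul(31).wrapping_add(v);
    }
    hash_key
}

fn main() {
    let a = combine_hashes(&[1, 2, 3]);
    let b = combine_hashes(&[3, 2, 1]);
    // Order matters with polynomial accumulation: different sequences
    // generally produce different keys.
    assert_ne!(a, b);
    println!("a = {a:#x}, b = {b:#x}");
}
```

Unlike `DefaultHasher`, this needs no hasher state object per combination, which is where the speedup plausibly comes from.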
📝 Walkthrough

Adds CLI flags to exclude directories and print version, updates file discovery to skip canonicalized excluded directories, introduces a build script (vergen-gix) to embed Git metadata, and bumps the package version.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI (user)
    participant Main as main.rs
    participant Discover as files_to_process
    participant FS as Filesystem
    CLI->>Main: parse args (globs, exclude_dirs, version)
    alt version flag set
        Main->>Main: print_version()
        Main-->>CLI: exit
    else
        Main->>Discover: files_to_process(globs, exclude_dirs)
        Discover->>Discover: canonicalize exclude_dirs
        Discover->>FS: glob & iterate files
        loop for each file
            Discover->>FS: canonicalize file path
            alt file path starts with excluded dir
                Discover-->>FS: skip file
            else
                Discover-->>Main: register file
            end
        end
        Discover-->>Main: return files
    end
```
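The skip decision in the diagram above boils down to a path-prefix check. A hedged sketch (illustrative names; the real `files_to_process` canonicalizes both sides via the filesystem first):

```rust
use std::path::Path;

// A file is excluded when its canonical path starts with any canonical
// excluded directory. `Path::starts_with` compares whole components, so
// "/repo/vendorx" does not match an excluded "/repo/vendor".
fn is_excluded(file: &Path, excluded: &[&Path]) -> bool {
    excluded.iter().any(|dir| file.starts_with(dir))
}

fn main() {
    let excluded = [Path::new("/repo/vendor")];
    assert!(is_excluded(Path::new("/repo/vendor/lib.rs"), &excluded));
    assert!(!is_excluded(Path::new("/repo/src/main.rs"), &excluded));
}
```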
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/main.rs (2)
291-291: Note: Changed hash values will be visible to users.

The `key` field is displayed to users (line 346) and used for ignore file matching (line 337). Users will see different hash signatures for the same duplicates after this change. Consider adding a note in the help text or documentation about hash value changes between versions if users might have saved or shared hash signatures.
293-293: Nice simplification!Direct vector initialization is cleaner and more efficient than creating an intermediate vector. The code is more readable and avoids an unnecessary allocation.
The comment on line 265 could be more precise—compiler optimization of `x * 31` to `(x << 5) - x` is not guaranteed and may not provide benefits on modern CPUs, where integer multiplication is highly optimized. Consider simplifying to just explain the algorithm choice:

```diff
-    // Multiply-add combining - compiler optimizes 31 to (hash << 5) - hash
+    // Multiply-add combining using constant 31 (from Knuth)
     let mut hash_key: u64 = 0;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/main.rs (2 hunks)
🔇 Additional comments (1)
src/main.rs (1)
264-276: Confirm: Hash implementation is correct, but breaking change requires attention.

The multiply-31 combining with wrapping arithmetic is sound. However, this definitively changes all hash values—users with existing ignore files must regenerate them.
Verified findings:
- `Collision.key` stores the multiply-31 combined hash (line 291)
- Ignore files are matched via exact hash comparison: `ignore_hashes.contains_key(&p.key)` (line 337)
- No version markers, migration logic, or format compatibility checks exist in the codebase
- "Identical results" in the PR description means the same duplicates are detected, not bit-identical hashes
Required before merge:
- Confirm regression tests pass with new hash computation
- Add a note in release documentation about the ignore file format change
- Consider adding a clear error message if users attempt to load old ignore files (optional but recommended)
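A small sketch of why exact hash matching makes this a breaking change. The map name mirrors `ignore_hashes` from the review, but the key values here are made up:

```rust
use std::collections::HashMap;

fn main() {
    // Ignore entries are looked up by the exact combined hash, so any change
    // to the hash function silently invalidates previously saved entries.
    let mut ignore_hashes: HashMap<u64, ()> = HashMap::new();

    let old_key: u64 = 0xdead_beef; // key written by a previous version
    ignore_hashes.insert(old_key, ());

    let new_key: u64 = 0x1234_5678; // same duplicate, new hash function
    assert!(ignore_hashes.contains_key(&old_key));
    assert!(!ignore_hashes.contains_key(&new_key)); // old ignore file no longer matches
}
```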
Sometimes it's useful to exclude directories when working through a new code base, so that you can focus initial efforts. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
This is useful while I'm developing the tool and using custom local builds for testing, so I can remember what the binary was built from. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/main.rs (1)
265-275: Harden collision grouping key to reduce accidental merges.

`hash * 31 + value` is fast but weaker; `results_hash` grouping by `key` (see line 404) can merge unrelated collisions on rare key clashes. Consider adding a secondary discriminator (e.g., include `num_lines` and boundary hashes) in the grouping key.

Also applies to: 291-293
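One possible shape for the suggested secondary discriminator, as a hedged sketch. `grouping_key`, its parameters, and the mixing steps are illustrative only and do not correspond to the actual src/main.rs symbols:

```rust
// Multiply-31 combining, hardened by folding in the span length and
// boundary hashes so collisions that differ in length or at the edges
// get distinct grouping keys.
fn grouping_key(line_hashes: &[u64], num_lines: u64) -> u64 {
    let mut key: u64 = 0;
    for &h in line_hashes {
        key = key.wrapping_mul(31).wrapping_add(h);
    }
    // Secondary discriminators: total line count plus first/last hashes.
    key = key.wrapping_mul(31).wrapping_add(num_lines);
    if let (Some(&first), Some(&last)) = (line_hashes.first(), line_hashes.last()) {
        key = key.wrapping_mul(31).wrapping_add(first ^ last.rotate_left(17));
    }
    key
}

fn main() {
    let k1 = grouping_key(&[5, 6], 2);
    let k2 = grouping_key(&[5, 6], 3);
    // Same line hashes, different lengths: keys now differ.
    assert_ne!(k1, k2);
    println!("{k1:#x} vs {k2:#x}");
}
```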
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 265 - 275, The current collision grouping key built via hash_key (in the loop computing hash_key using wrapping_mul(31).wrapping_add(...)) is too weak; modify the key construction (and any place that uses results_hash grouping by key) to include a secondary discriminator: incorporate the total number of lines (e.g., l_info.num_lines or a shared num_lines value) and boundary hashes (e.g., first/last element hashes from l_h and r_h or hashes of the first/last lines) into the final key before inserting into results_hash to reduce accidental merges; update the code paths that compute hash_key (the loop using l_h[r_index]/r_h[l_index] and the similar block around lines 291-293) to combine these additional fields (e.g., fold them into hash_key with wrapping operations or a small keyed mix) so results_hash grouping by key uniquely separates different-length or boundary-different collisions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Line 11: Remove the ".github" entry from the repository-level .gitignore so CI
workflows and repository templates are tracked; locate the ".gitignore" file
that contains the line ".github" and delete that line (if you need to ignore
files only inside specific subfolders, add localized .gitignore files inside
those subfolders instead) to ensure workflow templates and actions in the
.github directory are committed and versioned.
In `@src/main.rs`:
- Around line 589-597: The code currently swallows canonicalize errors in the
filter_map closure (canonicalize(dir)) and prints a warning; instead treat
invalid --exclude-dir paths as input errors by failing fast: replace the
filter_map/canonicalize pattern with a mapping that returns Result<PathBuf,
io::Error> (e.g., map(|dir| canonicalize(dir).map_err(|e| format!("Invalid
--exclude-dir '{}': {}", dir, e)))) and then collect into a Result<Vec<PathBuf>,
_> (or propagate the first error) so the error is returned from main (or causes
a non-zero exit) rather than being converted to None; update the code paths that
consume the excluded dirs to handle the Result accordingly and ensure the error
message references the offending dir and canonicalize error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8e7b55ab-6a1b-4bfd-a97a-0e38f0fdc0dd
📒 Files selected for processing (4)
.gitignore, Cargo.toml, build.rs, src/main.rs
```
# These are backup files generated by rustfmt
**/*.rs.bk
.github
```
Do not ignore .github at the repository level.
This prevents tracking workflows/templates in version control and can silently break team CI/release updates.
Suggested fix
```diff
-.github
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
.filter_map(|dir| match canonicalize(dir) {
    Ok(path) => Some(path),
    Err(e) => {
        eprintln!(
            "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
            dir, e
        );
        None
    }
```
Treat invalid --exclude-dir paths as input errors, not warnings.

Current behavior warns and proceeds, which can silently include directories the user explicitly intended to exclude.

Suggested fix

```diff
 .filter_map(|dir| match canonicalize(dir) {
     Ok(path) => Some(path),
     Err(e) => {
         eprintln!(
             "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
             dir, e
         );
-        None
+        process::exit(1);
     }
 })
```

📝 Committable suggestion
```rust
.filter_map(|dir| match canonicalize(dir) {
    Ok(path) => Some(path),
    Err(e) => {
        eprintln!(
            "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
            dir, e
        );
        process::exit(1);
    }
})
```
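The fail-fast alternative could also be expressed without `process::exit`, by collecting into a `Result` so the first bad path propagates as an error. A hedged sketch; `canonicalize_excludes` is a hypothetical helper, not code from the PR:

```rust
use std::fs::canonicalize;
use std::path::PathBuf;

// Collecting an iterator of Result<PathBuf, String> into
// Result<Vec<PathBuf>, String> stops at the first Err, so one invalid
// --exclude-dir aborts instead of being silently dropped.
fn canonicalize_excludes(exclude_dirs: &[String]) -> Result<Vec<PathBuf>, String> {
    exclude_dirs
        .iter()
        .map(|dir| {
            canonicalize(dir).map_err(|e| format!("Invalid --exclude-dir '{dir}': {e}"))
        })
        .collect()
}

fn main() {
    // "." always canonicalizes; a nonexistent path yields Err.
    assert!(canonicalize_excludes(&[".".to_string()]).is_ok());
    assert!(canonicalize_excludes(&["definitely/not/a/real/dir".to_string()]).is_err());
}
```

The caller can then `?` the result from `main() -> Result<(), String>` for a non-zero exit with a clear message.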
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 43: The README claims excluded directories must be relative but the code
in main.rs canonicalizes any path; update files_to_process to enforce the
relative-path rule: validate each entry of exclude_dirs (the input vector used
to build excluded_paths) and return an error (or skip with a clear message) if a
provided path is absolute or resolves outside the start directory; use
Path::is_absolute() to reject absolute inputs and after canonicalize ensure the
canonicalized path starts_with the start directory canonical path before adding
to excluded_paths (references: files_to_process, exclude_dirs, excluded_paths,
canonicalize).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6939d3ff-bc4d-40db-9d50-536d8672b1c3
📒 Files selected for processing (2)
Cargo.toml, README.md
🚧 Files skipped from review as they are similar to previous changes (1)
- Cargo.toml
```
-f, --file <pattern or specific file>   pattern or file eg. "**/*.[h|c]" recursive, "*.py", "file.ext", can repeat [required]
-i, --ignore <file name>                file containing hash values to ignore, one per line
-t, --threads <thread number>           number of threads to utilize. Set to 0 to match #cpu cores [default: 4]
-x, --exclude-dir <directory path>      directory to exclude (repeatable). Path must be relative to the start directory
```
🧩 Analysis chain

🏁 Script executed:

```shell
# Check the README.md documentation
head -50 README.md | tail -20
```

Repository: tasleson/duplihere
Length of output: 1250

🏁 Script executed:

```shell
# Search for the canonicalize logic in src/main.rs and surrounding context
rg -n -A10 -B5 'exclude_dirs' src/main.rs | head -60
```

Repository: tasleson/duplihere
Length of output: 2036

🏁 Script executed:

```shell
# Check for any path validation logic for relative paths
rg -n 'is_relative|is_absolute|relative|canonicalize' src/main.rs
```

Repository: tasleson/duplihere
Length of output: 399

🏁 Script executed:

```shell
# Get the specific implementation around lines 711-740
sed -n '700,750p' src/main.rs
```

Repository: tasleson/duplihere
Length of output: 1509
Documentation claims relative paths are required, but implementation accepts any path.

The documentation at line 755 states "Path must be relative to the start directory," but the implementation in src/main.rs:587-590 canonicalizes all excluded directories to absolute paths without enforcing or validating the relative path requirement:

```rust
let excluded_paths: Vec<std::path::PathBuf> = exclude_dirs
    .iter()
    .filter_map(|dir| match canonicalize(dir) {
        Ok(path) => Some(path),
        // ...
    })
    .collect();
```

The code accepts both relative and absolute paths and converts them all to canonical absolute paths. No validation exists to enforce the relative path constraint.
Either:

- Remove the relative path requirement from the documentation, or
- Add validation in `files_to_process` to enforce that input paths are relative
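The validation option could look roughly like the sketch below. `validate_exclude_dir` is a hypothetical helper that rejects absolute inputs and paths escaping the start directory; it is not code from the repository:

```rust
use std::fs::canonicalize;
use std::path::{Path, PathBuf};

// Enforce the documented "relative to the start directory" rule:
// 1. reject absolute inputs up front via Path::is_absolute(),
// 2. canonicalize relative to the start directory,
// 3. verify the canonical result stays under the canonical start dir.
fn validate_exclude_dir(start: &Path, dir: &str) -> Result<PathBuf, String> {
    if Path::new(dir).is_absolute() {
        return Err(format!("--exclude-dir '{dir}' must be relative"));
    }
    let canonical = canonicalize(start.join(dir))
        .map_err(|e| format!("--exclude-dir '{dir}': {e}"))?;
    let start_canonical =
        canonicalize(start).map_err(|e| format!("start dir: {e}"))?;
    if !canonical.starts_with(&start_canonical) {
        return Err(format!("--exclude-dir '{dir}' escapes the start directory"));
    }
    Ok(canonical)
}

fn main() {
    // "." resolves to the start directory itself, so it passes;
    // ".." resolves outside it, so it is rejected.
    assert!(validate_exclude_dir(Path::new("."), ".").is_ok());
    assert!(validate_exclude_dir(Path::new("."), "..").is_err());
}
```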
Summary by CodeRabbit

- New Features
- Chores
- Documentation
- Refactor