perf: use multiply-31 hash comb. for 11% speed up by tasleson · Pull Request #16 · tasleson/duplihere

tasleson · 2025-11-05T14:17:02Z

Replace DefaultHasher with simple hash * 31 + value combining in maximize_collision. Reduces runtime from 22.0s to 19.5s on large workloads while maintaining identical results.

reference Knuth TAOCP Vol.3 §6.4 on hashing

Refer to "The Art of Computer Programming, Volume 3 (2nd ed.)", Section 6.4 — Hashing (pp. 506–549).

Relevant subsections:

6.4 "Hashing by Multiplication and Division" (pp. 513–519)
6.4 "Hashing by Polynomial Accumulation" (pp. 521–523)

Summary by CodeRabbit

New Features
- Add directory exclusion option to skip specified folders when processing files.
- Add a version flag to print build/version information.
Chores
- Add build-time metadata generation.
- Update project ignore rules and bump package version.
Documentation
- Update usage/help text to document new flags.
Refactor
- Adjusted internal collision/key handling for improved behavior.

Replace DefaultHasher with simple `hash * 31 + value` combining in maximize_collision. Reduces runtime from 22.0s to 19.5s on large workloads while maintaining identical results. reference Knuth TAOCP Vol.3 §6.4 on hashing Refer to "The Art of Computer Programming, Volume 3 (2nd ed.)", Section 6.4 — Hashing (pp. 506–549). Relevant subsections: - 6.4 "Hashing by Multiplication and Division" (pp. 513–519) - 6.4 "Hashing by Polynomial Accumulation" (pp. 521–523) Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

coderabbitai · 2025-11-05T14:17:12Z

📝 Walkthrough

Walkthrough

Adds CLI flags to exclude directories and print version, updates file discovery to skip canonicalized excluded directories, introduces a build script (vergen-gix) to embed Git metadata, bumps package version, and adds .github to .gitignore. (49 words)

Changes

Cohort / File(s)	Summary
CLI, options & discovery `src/main.rs`	Adds `exclude_dirs: Vec<String>` and `version: bool` to `Options`; new `print_version()`; CLI flags `-x/--exclude-dir` and `-v/--version`; `files_to_process` signature now accepts `exclude_dirs` and skips files whose canonical paths start with any excluded dir; propagates exclusions through main.
Build script & manifest `build.rs`, `Cargo.toml`	Adds `build.rs` using `vergen_gix` to emit Git metadata at build time; bumps package version to 0.10.0 and adds `vergen-gix = "9"` under `[build-dependencies]`.
Docs & repo config `README.md`, `.gitignore`	Updates README usage/help to document `-x/--exclude-dir` and `-v/--version`; adds `.github` to `.gitignore`.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (user)
    participant Main as main.rs
    participant Discover as files_to_process
    participant FS as Filesystem

    CLI->>Main: parse args (globs, exclude_dirs, version)
    alt version flag set
        Main->>Main: print_version()
        Main-->>CLI: exit
    else
        Main->>Discover: files_to_process(globs, exclude_dirs)
        Discover->>Discover: canonicalize exclude_dirs
        Discover->>FS: glob & iterate files
        loop for each file
            Discover->>FS: canonicalize file path
            alt file path starts with excluded dir
                Discover-->>FS: skip file
            else
                Discover-->>Main: register file
            end
        end
        Discover-->>Main: return files
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through flags and paths today,
Excluding burrows along the way.
Built with git-marks, version in paw,
Skipping folders I simply saw.
A nimble rabbit, CLI hooray! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The PR title focuses on a performance optimization using multiply-31 hash combining, but the changeset includes significant feature additions (exclude_dirs, version flag) and infrastructure changes (build script, version bump) unrelated to hashing.	Update the title to reflect the primary changes: consider 'Add exclude_dirs support and version flag' or similar, as the hashing optimization appears secondary to the main feature additions in this changeset.
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch perf_improvement

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/main.rs (2)
291-291: Note: Changed hash values will be visible to users.

The key field is displayed to users (line 346) and used for ignore file matching (line 337). Users will see different hash signatures for the same duplicates after this change.

Consider adding a note in the help text or documentation about hash value changes between versions if users might have saved or shared hash signatures.

293-293: Nice simplification!

Direct vector initialization is cleaner and more efficient than creating an intermediate vector. The code is more readable and avoids an unnecessary allocation.

The comment on line 265 could be more precise—compiler optimization of x * 31 to (x << 5) - x is not guaranteed and may not provide benefits on modern CPUs where integer multiplication is highly optimized. Consider simplifying to just explain the algorithm choice:
-    // Multiply-add combining - compiler optimizes 31 to (hash << 5) - hash
+    // Multiply-add combining using constant 31 (from Knuth)
     let mut hash_key: u64 = 0;

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb16065 and f986be4.

📒 Files selected for processing (1)

src/main.rs (2 hunks)

🔇 Additional comments (1)

src/main.rs (1)

264-276: Confirm: Hash implementation is correct, but breaking change requires attention.

The multiply-31 combining with wrapping arithmetic is sound. However, this definitively changes all hash values—users with existing ignore files must regenerate them.

Verified findings:

Collision.key stores the multiply-31 combined hash (line 291)

Ignore files are matched via exact hash comparison: ignore_hashes.contains_key(&p.key) (line 337)

No version markers, migration logic, or format compatibility checks exist in the codebase

"Identical results" in the PR description means the same duplicates are detected, not bit-identical hashes

Required before merge:

Confirm regression tests pass with new hash computation

Add a note in release documentation about the ignore file format change

Consider adding a clear error message if users attempt to load old ignore files (optional but recommended)

Sometimes it's useful to exclude directories when working through a new code base, so that you can focus initial efforts. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

This is useful while I'm developing the tool and using custom local builds for testing. To remember what the binary was built from. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/main.rs (1)

265-275: Harden collision grouping key to reduce accidental merges.

hash * 31 + value is fast but weaker; results_hash grouping by key (see Line 404) can merge unrelated collisions on rare key clashes. Consider adding a secondary discriminator (e.g., include num_lines and boundary hashes) in the grouping key.

Also applies to: 291-293

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 265 - 275, The current collision grouping key built
via hash_key (in the loop computing hash_key using
wrapping_mul(31).wrapping_add(...)) is too weak; modify the key construction
(and any place that uses results_hash grouping by key) to include a secondary
discriminator: incorporate the total number of lines (e.g., l_info.num_lines or
a shared num_lines value) and boundary hashes (e.g., first/last element hashes
from l_h and r_h or hashes of the first/last lines) into the final key before
inserting into results_hash to reduce accidental merges; update the code paths
that compute hash_key (the loop using l_h[r_index]/r_h[l_index] and the similar
block around lines 291-293) to combine these additional fields (e.g., fold them
into hash_key with wrapping operations or a small keyed mix) so results_hash
grouping by key uniquely separates different-length or boundary-different
collisions.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 11: Remove the ".github" entry from the repository-level .gitignore so CI
workflows and repository templates are tracked; locate the ".gitignore" file
that contains the line ".github" and delete that line (if you need to ignore
files only inside specific subfolders, add localized .gitignore files inside
those subfolders instead) to ensure workflow templates and actions in the
.github directory are committed and versioned.

In `@src/main.rs`:
- Around line 589-597: The code currently swallows canonicalize errors in the
filter_map closure (canonicalize(dir)) and prints a warning; instead treat
invalid --exclude-dir paths as input errors by failing fast: replace the
filter_map/canonicalize pattern with a mapping that returns Result<PathBuf,
io::Error> (e.g., map(|dir| canonicalize(dir).map_err(|e| format!("Invalid
--exclude-dir '{}': {}", dir, e)))) and then collect into a Result<Vec<PathBuf>,
_> (or propagate the first error) so the error is returned from main (or causes
a non-zero exit) rather than being converted to None; update the code paths that
consume the excluded dirs to handle the Result accordingly and ensure the error
message references the offending dir and canonicalize error.

---

Nitpick comments:
In `@src/main.rs`:
- Around line 265-275: The current collision grouping key built via hash_key (in
the loop computing hash_key using wrapping_mul(31).wrapping_add(...)) is too
weak; modify the key construction (and any place that uses results_hash grouping
by key) to include a secondary discriminator: incorporate the total number of
lines (e.g., l_info.num_lines or a shared num_lines value) and boundary hashes
(e.g., first/last element hashes from l_h and r_h or hashes of the first/last
lines) into the final key before inserting into results_hash to reduce
accidental merges; update the code paths that compute hash_key (the loop using
l_h[r_index]/r_h[l_index] and the similar block around lines 291-293) to combine
these additional fields (e.g., fold them into hash_key with wrapping operations
or a small keyed mix) so results_hash grouping by key uniquely separates
different-length or boundary-different collisions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8e7b55ab-6a1b-4bfd-a97a-0e38f0fdc0dd

📥 Commits

Reviewing files that changed from the base of the PR and between f986be4 and dd0dd98.

📒 Files selected for processing (4)

.gitignore
Cargo.toml
build.rs
src/main.rs

coderabbitai · 2026-03-04T16:17:31Z

.gitignore


 # These are backup files generated by rustfmt
 **/*.rs.bk
+.github


⚠️ Potential issue | 🟠 Major

Do not ignore .github at the repository level.

This prevents tracking workflows/templates in version control and can silently break team CI/release updates.

Suggested fix

-.github

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

.github

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.gitignore at line 11, Remove the ".github" entry from the repository-level .gitignore so CI workflows and repository templates are tracked; locate the ".gitignore" file that contains the line ".github" and delete that line (if you need to ignore files only inside specific subfolders, add localized .gitignore files inside those subfolders instead) to ensure workflow templates and actions in the .github directory are committed and versioned.

coderabbitai · 2026-03-04T16:17:31Z

src/main.rs

+        .filter_map(|dir| match canonicalize(dir) {
+            Ok(path) => Some(path),
+            Err(e) => {
+                eprintln!(
+                    "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
+                    dir, e
+                );
+                None
+            }


⚠️ Potential issue | 🟡 Minor

Treat invalid --exclude-dir paths as input errors, not warnings.

Current behavior warns and proceeds, which can silently include directories the user explicitly intended to exclude.

Suggested fix

.filter_map(|dir| match canonicalize(dir) { Ok(path) => Some(path), Err(e) => { eprintln!( "WARNING: Unable to canonicalize excluded directory {}, reason: {}", dir, e ); - None + process::exit(1); } })

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

.filter_map(|dir| match canonicalize(dir) {

Ok(path) => Some(path),

Err(e) => {

eprintln!(

"WARNING: Unable to canonicalize excluded directory {}, reason: {}",

dir, e

);

None

}

.filter_map(|dir| match canonicalize(dir) {

Ok(path) => Some(path),

Err(e) => {

eprintln!(

"WARNING: Unable to canonicalize excluded directory {}, reason: {}",

dir, e

);

process::exit(1);

}

})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/main.rs` around lines 589 - 597, The code currently swallows canonicalize errors in the filter_map closure (canonicalize(dir)) and prints a warning; instead treat invalid --exclude-dir paths as input errors by failing fast: replace the filter_map/canonicalize pattern with a mapping that returns Result<PathBuf, io::Error> (e.g., map(|dir| canonicalize(dir).map_err(|e| format!("Invalid --exclude-dir '{}': {}", dir, e)))) and then collect into a Result<Vec<PathBuf>, _> (or propagate the first error) so the error is returned from main (or causes a non-zero exit) rather than being converted to None; update the code paths that consume the excluded dirs to handle the Result accordingly and ensure the error message references the offending dir and canonicalize error.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Line 43: The README claims excluded directories must be relative but the code
in main.rs canonicalizes any path; update files_to_process to enforce the
relative-path rule: validate each entry of exclude_dirs (the input vector used
to build excluded_paths) and return an error (or skip with a clear message) if a
provided path is absolute or resolves outside the start directory; use
Path::is_absolute() to reject absolute inputs and after canonicalize ensure the
canonicalized path starts_with the start directory canonical path before adding
to excluded_paths (references: files_to_process, exclude_dirs, excluded_paths,
canonicalize).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6939d3ff-bc4d-40db-9d50-536d8672b1c3

📥 Commits

Reviewing files that changed from the base of the PR and between dd0dd98 and 6b9d372.

📒 Files selected for processing (2)

Cargo.toml
README.md

🚧 Files skipped from review as they are similar to previous changes (1)

Cargo.toml

coderabbitai · 2026-03-04T16:20:38Z

README.md

    -f, --file <pattern or specific file>        pattern or file eg. "**/*.[h|c]" recursive, "*.py", "file.ext", can repeat [required]
    -i, --ignore <file name>                     file containing hash values to ignore, one per line
    -t, --threads <thread number>                number of threads to utilize. Set to 0 to match #cpu cores [default: 4]
+    -x, --exclude-dir <directory path>           directory to exclude (repeatable). Path must be relative to the start directory


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check the README.md documentation head -50 README.md | tail -20

Repository: tasleson/duplihere

Length of output: 1250

🏁 Script executed:

# Search for the canonicalize logic in src/main.rs and surrounding context rg -n -A10 -B5 'exclude_dirs' src/main.rs | head -60

Repository: tasleson/duplihere

Length of output: 2036

🏁 Script executed:

# Check for any path validation logic for relative paths rg -n 'is_relative|is_absolute|relative|canonicalize' src/main.rs

Repository: tasleson/duplihere

Length of output: 399

🏁 Script executed:

# Get the specific implementation around lines 711-740 sed -n '700,750p' src/main.rs

Repository: tasleson/duplihere

Length of output: 1509

Documentation claims relative paths are required, but implementation accepts any path.

The documentation at line 755 states "Path must be relative to the start directory," but the implementation in src/main.rs:587-590 canonicalizes all excluded directories to absolute paths without enforcing or validating the relative path requirement:

let excluded_paths: Vec<std::path::PathBuf> = exclude_dirs .iter() .filter_map(|dir| match canonicalize(dir) { Ok(path) => Some(path), // ... }) .collect();

The code accepts both relative and absolute paths and converts them all to canonical absolute paths. No validation exists to enforce the relative path constraint.

Either:

Remove the relative path requirement from the documentation, or

Add validation in files_to_process to enforce that input paths are relative

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@README.md` at line 43, The README claims excluded directories must be relative but the code in main.rs canonicalizes any path; update files_to_process to enforce the relative-path rule: validate each entry of exclude_dirs (the input vector used to build excluded_paths) and return an error (or skip with a clear message) if a provided path is absolute or resolves outside the start directory; use Path::is_absolute() to reject absolute inputs and after canonicalize ensure the canonicalized path starts_with the start directory canonical path before adding to excluded_paths (references: files_to_process, exclude_dirs, excluded_paths, canonicalize).

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

tasleson added 5 commits March 4, 2026 09:41

Add ability to exclude directories

50f091a

Sometimes it's useful to exclude directories when working through a new code base, so that you can focus initial efforts. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

Add build metadata to binary

033fc0b

This is useful while I'm developing the tool and using custom local builds for testing. To remember what the binary was built from. Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

gitignore: add .github

dd0dd98

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

Bump package version

895192f

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

README.md: Update

6b9d372

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: use multiply-31 hash comb. for 11% speed up#16

perf: use multiply-31 hash comb. for 11% speed up#16
tasleson wants to merge 6 commits intomainfrom
perf_improvement

tasleson commented Nov 5, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 5, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 4, 2026

Uh oh!

coderabbitai bot Mar 4, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tasleson commented Nov 5, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tasleson commented Nov 5, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 5, 2025 •

edited

Loading