
perf: use multiply-31 hash comb. for 11% speed up#16

Open
tasleson wants to merge 6 commits into main from perf_improvement

tasleson (Owner) commented Nov 5, 2025

Replace DefaultHasher with simple hash * 31 + value combining in maximize_collision. Reduces runtime from 22.0s to 19.5s on large workloads while maintaining identical results.

reference Knuth TAOCP Vol.3 §6.4 on hashing

Refer to "The Art of Computer Programming, Volume 3 (2nd ed.)", Section 6.4 — Hashing (pp. 506–549).

Relevant subsections:

  • 6.4 "Hashing by Multiplication and Division" (pp. 513–519)
  • 6.4 "Hashing by Polynomial Accumulation" (pp. 521–523)
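As a rough illustration of the combining step described above, here is a minimal sketch. The helper name `combine_hashes` and the slice-based signature are assumptions made for this example; the PR itself only states that the key starts at 0 and is updated with `wrapping_mul(31).wrapping_add(...)`.

```rust
/// Combine a sequence of per-line hashes into a single key using
/// `hash * 31 + value`, with wrapping (mod 2^64) arithmetic so that
/// overflow never panics.
/// NOTE: `combine_hashes` is a hypothetical name, not taken from the PR.
fn combine_hashes(values: &[u64]) -> u64 {
    values
        .iter()
        .fold(0u64, |hash, &v| hash.wrapping_mul(31).wrapping_add(v))
}
```

The result depends on order: `combine_hashes(&[1, 2, 3])` is `(1 * 31 + 2) * 31 + 3 = 1026`, while the reversed input gives 2946.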

Summary by CodeRabbit

  • New Features

    • Add directory exclusion option to skip specified folders when processing files.
    • Add a version flag to print build/version information.
  • Chores

    • Add build-time metadata generation.
    • Update project ignore rules and bump package version.
  • Documentation

    • Update usage/help text to document new flags.
  • Refactor

    • Adjusted internal collision/key handling for improved behavior.

Replace DefaultHasher with simple `hash * 31 + value` combining in
maximize_collision. Reduces runtime from 22.0s to 19.5s on large
workloads while maintaining identical results.

reference Knuth TAOCP Vol.3 §6.4 on hashing

Refer to "The Art of Computer Programming, Volume 3 (2nd ed.)",
Section 6.4 — Hashing (pp. 506–549).

Relevant subsections:
- 6.4 "Hashing by Multiplication and Division" (pp. 513–519)
- 6.4 "Hashing by Polynomial Accumulation" (pp. 521–523)

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

coderabbitai bot commented Nov 5, 2025

Walkthrough

Adds CLI flags to exclude directories and print version, updates file discovery to skip canonicalized excluded directories, introduces a build script (vergen-gix) to embed Git metadata, bumps package version, and adds .github to .gitignore.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI, options & discovery (src/main.rs) | Adds exclude_dirs: Vec<String> and version: bool to Options; new print_version(); CLI flags -x/--exclude-dir and -v/--version; files_to_process signature now accepts exclude_dirs and skips files whose canonical paths start with any excluded dir; propagates exclusions through main. |
| Build script & manifest (build.rs, Cargo.toml) | Adds build.rs using vergen_gix to emit Git metadata at build time; bumps package version to 0.10.0 and adds vergen-gix = "9" under [build-dependencies]. |
| Docs & repo config (README.md, .gitignore) | Updates README usage/help to document -x/--exclude-dir and -v/--version; adds .github to .gitignore. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (user)
    participant Main as main.rs
    participant Discover as files_to_process
    participant FS as Filesystem

    CLI->>Main: parse args (globs, exclude_dirs, version)
    alt version flag set
        Main->>Main: print_version()
        Main-->>CLI: exit
    else
        Main->>Discover: files_to_process(globs, exclude_dirs)
        Discover->>Discover: canonicalize exclude_dirs
        Discover->>FS: glob & iterate files
        loop for each file
            Discover->>FS: canonicalize file path
            alt file path starts with excluded dir
                Discover-->>FS: skip file
            else
                Discover-->>Main: register file
            end
        end
        Discover-->>Main: return files
    end
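
The skip step in the diagram can be sketched roughly as follows. The function name `is_excluded` and its signature are assumptions for illustration; the walkthrough only says that files whose canonical paths start with an excluded directory are skipped.

```rust
use std::path::{Path, PathBuf};

/// Return true when an already-canonicalized file path lies under any
/// already-canonicalized excluded directory.
/// NOTE: hypothetical helper; the real check lives in files_to_process.
fn is_excluded(canonical_file: &Path, excluded: &[PathBuf]) -> bool {
    // Path::starts_with compares whole path components, so
    // "/repo/target2" is not treated as being under "/repo/target".
    excluded.iter().any(|dir| canonical_file.starts_with(dir))
}
```

Comparing by path component rather than by string prefix is what keeps a sibling directory with a similar name from being excluded by accident.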

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through flags and paths today,
Excluding burrows along the way.
Built with git-marks, version in paw,
Skipping folders I simply saw.
A nimble rabbit, CLI hooray! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title focuses on a performance optimization using multiply-31 hash combining, but the changeset includes significant feature additions (exclude_dirs, version flag) and infrastructure changes (build script, version bump) unrelated to hashing. | Update the title to reflect the primary changes: consider 'Add exclude_dirs support and version flag' or similar, as the hashing optimization appears secondary to the main feature additions in this changeset. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/main.rs (2)

291-291: Note: Changed hash values will be visible to users.

The key field is displayed to users (line 346) and used for ignore file matching (line 337). Users will see different hash signatures for the same duplicates after this change.

Consider adding a note in the help text or documentation about hash value changes between versions if users might have saved or shared hash signatures.


293-293: Nice simplification!

Direct vector initialization is cleaner and more efficient than creating an intermediate vector. The code is more readable and avoids an unnecessary allocation.

The comment on line 265 could be more precise—compiler optimization of x * 31 to (x << 5) - x is not guaranteed and may not provide benefits on modern CPUs where integer multiplication is highly optimized. Consider simplifying to just explain the algorithm choice:

-    // Multiply-add combining - compiler optimizes 31 to (hash << 5) - hash
+    // Multiply-add combining using constant 31 (from Knuth)
     let mut hash_key: u64 = 0;
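
For reference, the two forms mentioned in that code comment produce the same value under wrapping arithmetic, whether or not the compiler rewrites one into the other. A small sketch, with function names invented for this example:

```rust
/// Multiply by 31 with wrapping (mod 2^64) semantics.
fn mul31(x: u64) -> u64 {
    x.wrapping_mul(31)
}

/// Shift-and-subtract form: (x << 5) is x * 32 mod 2^64, and
/// subtracting x leaves x * 31 mod 2^64.
fn shift_sub31(x: u64) -> u64 {
    (x << 5).wrapping_sub(x)
}
```

Because both are equal for every `u64`, the choice between them affects only code generation, not the resulting hash keys.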
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb16065 and f986be4.

📒 Files selected for processing (1)
  • src/main.rs (2 hunks)
🔇 Additional comments (1)
src/main.rs (1)

264-276: Confirm: Hash implementation is correct, but breaking change requires attention.

The multiply-31 combining with wrapping arithmetic is sound. However, this definitively changes all hash values—users with existing ignore files must regenerate them.

Verified findings:

  • Collision.key stores the multiply-31 combined hash (line 291)
  • Ignore files are matched via exact hash comparison: ignore_hashes.contains_key(&p.key) (line 337)
  • No version markers, migration logic, or format compatibility checks exist in the codebase
  • "Identical results" in the PR description means the same duplicates are detected, not bit-identical hashes

Required before merge:

  1. Confirm regression tests pass with new hash computation
  2. Add a note in release documentation about the ignore file format change
  3. Consider adding a clear error message if users attempt to load old ignore files (optional but recommended)
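
To make the compatibility concern concrete, here is a minimal sketch of exact-match ignore handling. It assumes the ignore file holds one decimal u64 per line and uses a `HashSet` where the real code uses a map (`ignore_hashes.contains_key`); the actual on-disk format and the names `parse_ignore_hashes` and `is_ignored` are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Parse an ignore file body: one hash value per line.
/// ASSUMPTION: values are decimal u64; the real format may differ.
/// Blank lines and lines that do not parse are skipped.
fn parse_ignore_hashes(text: &str) -> HashSet<u64> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .filter_map(|line| line.parse::<u64>().ok())
        .collect()
}

/// Exact comparison: a key computed by a different hash function
/// will simply never be found.
fn is_ignored(ignore: &HashSet<u64>, key: u64) -> bool {
    ignore.contains(&key)
}
```

Under exact matching, entries written with the previous hasher are silently ineffective rather than rejected, which is why the review asks for a release note or an explicit error.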

tasleson added 5 commits March 4, 2026 09:41
Sometimes it's useful to exclude directories when working through
a new code base, so that you can focus initial efforts.

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
This is useful while I'm developing the tool and using custom
local builds for testing, to remember what the binary was
built from.

Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Signed-off-by: Tony Asleson <tony.asleson@gmail.com>
Signed-off-by: Tony Asleson <tony.asleson@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
src/main.rs (1)

265-275: Harden collision grouping key to reduce accidental merges.

hash * 31 + value is fast but weaker; results_hash grouping by key (see Line 404) can merge unrelated collisions on rare key clashes. Consider adding a secondary discriminator (e.g., include num_lines and boundary hashes) in the grouping key.

Also applies to: 291-293

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 265 - 275, The current collision grouping key built
via hash_key (in the loop computing hash_key using
wrapping_mul(31).wrapping_add(...)) is too weak; modify the key construction
(and any place that uses results_hash grouping by key) to include a secondary
discriminator: incorporate the total number of lines (e.g., l_info.num_lines or
a shared num_lines value) and boundary hashes (e.g., first/last element hashes
from l_h and r_h or hashes of the first/last lines) into the final key before
inserting into results_hash to reduce accidental merges; update the code paths
that compute hash_key (the loop using l_h[r_index]/r_h[l_index] and the similar
block around lines 291-293) to combine these additional fields (e.g., fold them
into hash_key with wrapping operations or a small keyed mix) so results_hash
grouping by key uniquely separates different-length or boundary-different
collisions.
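
One possible shape for such a secondary discriminator is a composite grouping key. This is only a sketch: the `GroupKey` struct, its field names, and `count_groups` are hypothetical and do not come from the PR, which folds everything into a single u64.

```rust
use std::collections::HashMap;

/// Composite grouping key: the multiply-31 hash plus cheap
/// discriminators (length and boundary hashes).
/// NOTE: hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct GroupKey {
    combined: u64,
    num_lines: usize,
    first: u64,
    last: u64,
}

fn group_key(line_hashes: &[u64]) -> GroupKey {
    let combined = line_hashes
        .iter()
        .fold(0u64, |h, &v| h.wrapping_mul(31).wrapping_add(v));
    GroupKey {
        combined,
        num_lines: line_hashes.len(),
        first: line_hashes.first().copied().unwrap_or(0),
        last: line_hashes.last().copied().unwrap_or(0),
    }
}

/// Count how many distinct groups a set of blocks falls into.
fn count_groups(blocks: &[Vec<u64>]) -> usize {
    let mut groups: HashMap<GroupKey, Vec<usize>> = HashMap::new();
    for (i, block) in blocks.iter().enumerate() {
        groups.entry(group_key(block)).or_default().push(i);
    }
    groups.len()
}
```

The inputs `[31]`, `[1, 0]`, and `[0, 31]` all combine to 31 under multiply-31, yet they land in three separate groups here because their lengths or boundary values differ. A struct key would change the `key` type that is displayed and matched against ignore files, so it is a larger change than folding the extra fields into the existing u64.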

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8e7b55ab-6a1b-4bfd-a97a-0e38f0fdc0dd

📥 Commits

Reviewing files that changed from the base of the PR and between f986be4 and dd0dd98.

📒 Files selected for processing (4)
  • .gitignore
  • Cargo.toml
  • build.rs
  • src/main.rs


# These are backup files generated by rustfmt
**/*.rs.bk
.github

⚠️ Potential issue | 🟠 Major

Do not ignore .github at the repository level.

This prevents tracking workflows/templates in version control and can silently break team CI/release updates.

Suggested fix
-.github
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 11, Remove the ".github" entry from the repository-level
.gitignore so CI workflows and repository templates are tracked; locate the
".gitignore" file that contains the line ".github" and delete that line (if you
need to ignore files only inside specific subfolders, add localized .gitignore
files inside those subfolders instead) to ensure workflow templates and actions
in the .github directory are committed and versioned.

Comment on lines +589 to +597
        .filter_map(|dir| match canonicalize(dir) {
            Ok(path) => Some(path),
            Err(e) => {
                eprintln!(
                    "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
                    dir, e
                );
                None
            }

⚠️ Potential issue | 🟡 Minor

Treat invalid --exclude-dir paths as input errors, not warnings.

Current behavior warns and proceeds, which can silently include directories the user explicitly intended to exclude.

Suggested fix
         .filter_map(|dir| match canonicalize(dir) {
             Ok(path) => Some(path),
             Err(e) => {
                 eprintln!(
                     "WARNING: Unable to canonicalize excluded directory {}, reason: {}",
                     dir, e
                 );
-                None
+                process::exit(1);
             }
         })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` around lines 589 - 597, The code currently swallows canonicalize
errors in the filter_map closure (canonicalize(dir)) and prints a warning;
instead treat invalid --exclude-dir paths as input errors by failing fast:
replace the filter_map/canonicalize pattern with a mapping that returns
Result<PathBuf, io::Error> (e.g., map(|dir| canonicalize(dir).map_err(|e|
format!("Invalid --exclude-dir '{}': {}", dir, e)))) and then collect into a
Result<Vec<PathBuf>, _> (or propagate the first error) so the error is returned
from main (or causes a non-zero exit) rather than being converted to None;
update the code paths that consume the excluded dirs to handle the Result
accordingly and ensure the error message references the offending dir and
canonicalize error.
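
The collect-into-Result approach described in that prompt might look roughly like the sketch below. The function name `canonicalize_excluded` is invented for this example, and the error type is `String` because the prompt's own `map_err(... format!(...))` produces a string rather than an `io::Error`.

```rust
use std::fs::canonicalize;
use std::path::PathBuf;

/// Canonicalize every excluded directory, stopping at the first failure.
/// NOTE: hypothetical helper; the caller decides whether to return the
/// error from main or exit with a non-zero status.
fn canonicalize_excluded(exclude_dirs: &[String]) -> Result<Vec<PathBuf>, String> {
    exclude_dirs
        .iter()
        .map(|dir| {
            canonicalize(dir)
                .map_err(|e| format!("Invalid --exclude-dir '{}': {}", dir, e))
        })
        // Collecting an iterator of Results short-circuits on the first Err.
        .collect()
}
```

Compared with calling `process::exit(1)` inside the closure, this keeps the exit decision in one place and lets the message name the offending directory.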


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6939d3ff-bc4d-40db-9d50-536d8672b1c3

📥 Commits

Reviewing files that changed from the base of the PR and between dd0dd98 and 6b9d372.

📒 Files selected for processing (2)
  • Cargo.toml
  • README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • Cargo.toml

-f, --file <pattern or specific file> pattern or file eg. "**/*.[h|c]" recursive, "*.py", "file.ext", can repeat [required]
-i, --ignore <file name> file containing hash values to ignore, one per line
-t, --threads <thread number> number of threads to utilize. Set to 0 to match #cpu cores [default: 4]
-x, --exclude-dir <directory path> directory to exclude (repeatable). Path must be relative to the start directory

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check the README.md documentation
head -50 README.md | tail -20

Repository: tasleson/duplihere

Length of output: 1250


🏁 Script executed:

# Search for the canonicalize logic in src/main.rs and surrounding context
rg -n -A10 -B5 'exclude_dirs' src/main.rs | head -60

Repository: tasleson/duplihere

Length of output: 2036


🏁 Script executed:

# Check for any path validation logic for relative paths
rg -n 'is_relative|is_absolute|relative|canonicalize' src/main.rs

Repository: tasleson/duplihere

Length of output: 399


🏁 Script executed:

# Get the specific implementation around lines 711-740
sed -n '700,750p' src/main.rs

Repository: tasleson/duplihere

Length of output: 1509


Documentation claims relative paths are required, but implementation accepts any path.

The documentation at line 755 states "Path must be relative to the start directory," but the implementation in src/main.rs:587-590 canonicalizes all excluded directories to absolute paths without enforcing or validating the relative path requirement:

let excluded_paths: Vec<std::path::PathBuf> = exclude_dirs
    .iter()
    .filter_map(|dir| match canonicalize(dir) {
        Ok(path) => Some(path),
        // ...
    })
    .collect();

The code accepts both relative and absolute paths and converts them all to canonical absolute paths. No validation exists to enforce the relative path constraint.

Either:

  1. Remove the relative path requirement from the documentation, or
  2. Add validation in files_to_process to enforce that input paths are relative
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 43, The README claims excluded directories must be
relative but the code in main.rs canonicalizes any path; update files_to_process
to enforce the relative-path rule: validate each entry of exclude_dirs (the
input vector used to build excluded_paths) and return an error (or skip with a
clear message) if a provided path is absolute or resolves outside the start
directory; use Path::is_absolute() to reject absolute inputs and after
canonicalize ensure the canonicalized path starts_with the start directory
canonical path before adding to excluded_paths (references: files_to_process,
exclude_dirs, excluded_paths, canonicalize).
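
If the second option (enforcing the documented rule) were chosen, the validation could be sketched as below. The function name `validate_exclude_dir`, the explicit `start` parameter, and the error wording are assumptions for illustration; how the tool defines its start directory is not shown in this conversation.

```rust
use std::fs::canonicalize;
use std::path::{Path, PathBuf};

/// Accept only relative exclude paths that resolve inside `start`.
/// NOTE: hypothetical helper, not the tool's actual implementation.
fn validate_exclude_dir(start: &Path, dir: &str) -> Result<PathBuf, String> {
    let candidate = Path::new(dir);
    if candidate.is_absolute() {
        return Err(format!("--exclude-dir '{}' must be a relative path", dir));
    }
    let start = canonicalize(start)
        .map_err(|e| format!("Invalid start directory: {}", e))?;
    // Resolve relative to the start directory, not the process cwd.
    let resolved = canonicalize(start.join(candidate))
        .map_err(|e| format!("Invalid --exclude-dir '{}': {}", dir, e))?;
    if !resolved.starts_with(&start) {
        return Err(format!(
            "--exclude-dir '{}' resolves outside the start directory",
            dir
        ));
    }
    Ok(resolved)
}
```

With this check, an input such as `..` is rejected even though it is relative, because its canonical form falls outside the start directory.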
