⚡ Bolt: Hoist regex patterns and avoid repeated string allocations in ATS Generator#317

Open
anchapin wants to merge 1 commit into main from bolt/ats-generator-perf-1347257136309222624

Conversation

@anchapin
Owner

@anchapin anchapin commented May 23, 2026

💡 What:

  • Pre-compiled 9 regular expressions (_TABLE_PATTERN, _SPECIAL_CHARS_PATTERN, _EMAIL_PATTERN, _PHONE_PATTERN, _QUANTIFIABLE_PATTERN, _ACRONYM_PATTERN, _JSON_PATTERN, _RESUME_KEYWORD_PATTERN, _SUMMARY_KEYWORD_PATTERN) as module-level constants in cli/generators/ats_generator.py.
  • Hoisted the static list _ACTION_VERBS to a module-level constant.
  • Updated _get_all_text to preserve string case, resolving previous regressions around acronym searches which require uppercase characters.
  • Modified _check_readability to cache .lower() text outside loop comprehensions.
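As a rough illustration of the hoisting described in the list above, here is a minimal sketch. The email expression and the has_valid_email helper are hypothetical stand-ins; the real patterns live in cli/generators/ats_generator.py and are not reproduced in this description.

```python
import re

# Hypothetical stand-in for one of the hoisted constants; the actual
# expression in ats_generator.py is not shown in this PR description.
_EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def has_valid_email(contact: dict) -> bool:
    """Check the email field with the shared, module-level pattern object."""
    email = contact.get("email") or ""
    # The pattern object is created once at import time and reused on every call.
    return _EMAIL_PATTERN.search(email) is not None
```

The pattern is built once when the module is imported, so each call only runs the match itself.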

🎯 Why:
The ATS Generator repeated avoidable work on every call. Regex patterns were passed as strings to re functions on each parse cycle, and _check_readability evaluated sum(1 for verb in action_verbs if verb in all_text.lower()), which re-runs .lower() on the full text once per verb.
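The difference between the two forms of the verb count can be sketched as follows. The verb list here is a hypothetical subset and the function names are illustrative only; they are not taken from the module.

```python
# Hypothetical subset of the hoisted action verb list.
_ACTION_VERBS = ["led", "built", "designed", "improved"]


def count_action_verbs_before(all_text: str) -> int:
    # The condition is re-evaluated for every verb, so .lower() runs
    # len(_ACTION_VERBS) times on the same text.
    return sum(1 for verb in _ACTION_VERBS if verb in all_text.lower())


def count_action_verbs_after(all_text: str) -> int:
    # Lowercase once, then reuse the result for every membership check.
    all_text_lower = all_text.lower()
    return sum(1 for verb in _ACTION_VERBS if verb in all_text_lower)
```

Both functions return the same count; only the number of .lower() calls changes.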

📊 Impact:

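  • Avoids handing pattern strings to the re module on every call. Python caches recently compiled patterns internally, so the saving is mainly the per-call cache lookup, plus recompilation whenever a pattern has dropped out of that cache.
  • Removes the per-call rebuild of the action verb list and the repeated .lower() calls inside the loop.
  • Lowercasing is applied only at the call sites that need it.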

🔬 Measurement:

  • pytest tests/test_ats_generator.py (passes)
  • make format and make lint confirm no breakages.
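The measurements above confirm correctness but do not include timings. One way the effect could be checked is a small timeit comparison like the sketch below. The phone expression, the sample string, and the function names are assumptions made for illustration, and no results are claimed here.

```python
import re
import timeit

# Hypothetical pattern and sample text, not taken from ats_generator.py.
_PHONE_SOURCE = r"\+?\d[\d\s().-]{7,}\d"
_PHONE_PATTERN = re.compile(_PHONE_SOURCE)
SAMPLE = "Call +1 (555) 010-9999 after 5pm"


def inline_search():
    # Pattern string passed on every call; re looks it up in its internal cache.
    return re.search(_PHONE_SOURCE, SAMPLE)


def hoisted_search():
    # Pre-compiled pattern object reused directly.
    return _PHONE_PATTERN.search(SAMPLE)


def compare(number: int = 50_000) -> dict:
    """Return total seconds for each variant over `number` calls."""
    return {
        "inline": timeit.timeit(inline_search, number=number),
        "hoisted": timeit.timeit(hoisted_search, number=number),
    }


if __name__ == "__main__":
    print(compare())
```

Because the re module caches compiled patterns, any gap between the two variants is likely to be small per call and is best judged against how often the checks actually run.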

PR created automatically by Jules for task 1347257136309222624 started by @anchapin

Summary by Sourcery

Hoist frequently used regex patterns and static data in the ATS generator to module scope and adjust text handling to improve performance and correctness.

Enhancements:

  • Pre-compile ATS generator regex patterns and reuse them across formatting, contact validation, readability, and keyword extraction checks.
  • Promote the action verb list to a module-level constant and cache lowercased text when scanning for verbs to avoid repeated string allocations.
  • Preserve original casing in aggregated resume text while limiting lowercasing to call sites that require it for matching.
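To show why case preservation matters for acronym detection, here is a small sketch. The expression is a hypothetical uppercase-run pattern, not necessarily the one used for _ACRONYM_PATTERN in the module.

```python
import re

# Hypothetical acronym pattern: two or more consecutive uppercase letters
# forming a whole word.
_ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text: str) -> list:
    """Return uppercase acronyms found in case-preserved text."""
    return _ACRONYM_PATTERN.findall(text)


original = "Built REST services on AWS"
print(find_acronyms(original))          # acronyms are visible in the original text
print(find_acronyms(original.lower()))  # nothing left to match after lowercasing
```

If the aggregated text is lowercased before this step, an uppercase-only pattern has nothing to match, which is the kind of regression the description refers to.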

Documentation:

  • Extend the Bolt engineering notes with guidance on hoisting regex patterns and static arrays in the ATS generator to reduce redundant work.

Tests:

  • Update ATS generator tests to account for case-preserving behavior in _get_all_text.

By hoisting regular expressions out of repeatedly called methods, we avoid the overhead of dynamically compiling and re-allocating them during execution. Also hoisted the static action verbs list and refactored the lowercase text caching to only do what's required without mutating cases too early, preventing logical bugs like missing acronyms.

Co-authored-by: anchapin <6326294+anchapin@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@sourcery-ai

sourcery-ai Bot commented May 23, 2026

Reviewer's Guide

Hoists frequently used regex patterns and static lists in the ATS generator to module-level constants, adjusts string handling to preserve case where needed, and reduces repeated lowercasing and regex compilation in hot paths, with associated test updates and documentation of the performance learnings.

Flow diagram for updated text processing and regex usage in ATS generator

flowchart TD
    A["resume_data"] --> B["_get_all_text(resume_data)"]
    B --> C["all_text (case preserved)"]

    %% Acronym path uses original casing
    C --> D["_ACRONYM_PATTERN.findall(all_text)"]
    D --> E["Acronym heuristics in _check_readability"]

    %% Readability checks using lowercased text
    C --> F["all_text_lower = all_text.lower()"]
    F --> G["sum(1 for verb in _ACTION_VERBS if verb in all_text_lower)"]
    F --> H["_QUANTIFIABLE_PATTERN.search(all_text)"]

    %% Format parsing using precompiled patterns
    C --> I["_TABLE_PATTERN.search(all_text)"]
    C --> J["_SPECIAL_CHARS_PATTERN.findall(all_text)"]

    %% Contact info using precompiled patterns
    K["contact.get('email')"] --> L["_EMAIL_PATTERN.search(email)"]
    M["contact.get('phone')"] --> N["_PHONE_PATTERN.search(phone)"]

    %% Keyword extraction using precompiled patterns
    O["response from _call_openai"] --> P["_JSON_PATTERN.search(response)"]
    Q["resume bullets text.lower()"] --> R["_RESUME_KEYWORD_PATTERN.findall(text)"]
    S["summary.lower()"] --> T["_SUMMARY_KEYWORD_PATTERN.findall(summary)"]

File-Level Changes

Change: Pre-compile ATS generator regexes at module scope and replace dynamic regex construction with shared pattern objects.
Details:
  • Introduce module-level compiled regex constants for table detection, special character detection, email and phone validation, quantifiable achievements, acronym detection, JSON extraction, and keyword extraction.
  • Replace inline re.search/re.findall calls with calls to the corresponding pre-compiled pattern objects throughout _check_format_parsing, _check_contact_info, _check_readability, _extract_job_keywords, and _extract_resume_keywords.
  • Use regex flags such as re.IGNORECASE and re.DOTALL in the compiled patterns instead of repeating equivalent behavior at call sites.
Files: cli/generators/ats_generator.py

Change: Reduce repeated string allocation and lowercasing work in readability checks while preserving case where needed for acronym logic.
Details:
  • Change _get_all_text to return case-preserved concatenated text instead of lowercasing eagerly so that acronym regexes can operate on the original casing.
  • Cache a lowercased version of the full text once in _check_readability and reuse it when counting action verbs instead of calling .lower() inside a comprehension.
  • Hoist the static list of action verbs to a module-level _ACTION_VERBS constant and reference it from _check_readability.
Files: cli/generators/ats_generator.py

Change: Align tests and internal documentation with the new text casing behavior and pre-compilation pattern guidance.
Details:
  • Update test_get_all_text_from_nested_dict to explicitly lowercase the returned text before assertions now that _get_all_text preserves case.
  • Extend .jules/bolt.md with a new learning entry explaining the discovery about ATS generator regex and string handling and documenting the hoisting and pre-compilation pattern as best practice.
Files: tests/test_ats_generator.py, .jules/bolt.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.


@sourcery-ai sourcery-ai Bot left a comment


Hey - I've left some high level feedback:

  • The _JSON_PATTERN = re.compile(r"\[.*\]", flags=re.DOTALL) is a greedy match and can overrun or be unnecessarily expensive on long responses; consider making it non-greedy (e.g. r"\[.*?\]") or anchoring it more tightly around the expected JSON segment.
  • Now that _get_all_text returns case-preserved text, it might be worth quickly scanning other call sites in this module to see if any still implicitly rely on the old lowercased behavior and should explicitly call .lower() like _check_readability does.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `_JSON_PATTERN = re.compile(r"\[.*\]", flags=re.DOTALL)` is a greedy match and can overrun or be unnecessarily expensive on long responses; consider making it non-greedy (e.g. `r"\[.*?\]"`) or anchoring it more tightly around the expected JSON segment.
- Now that `_get_all_text` returns case-preserved text, it might be worth quickly scanning other call sites in this module to see if any still implicitly rely on the old lowercased behavior and should explicitly call `.lower()` like `_check_readability` does.

