feat: add Indeed Ireland scanner (scan-indeed.py) #719

Open
kejiali wants to merge 1 commit into santifer:main from kejiali:feat/scan-indeed-ireland

Conversation

@kejiali

@kejiali kejiali commented May 21, 2026

Closes #720

What

Adds a Python-based scanner for ie.indeed.com (Ireland) using Scrapling to bypass Cloudflare bot protection. Searches Indeed Ireland for configured queries, applies title_filter from portals.yml, deduplicates against existing history, and appends new offers to pipeline.md + scan-history.tsv.
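
The filter-then-deduplicate flow described above could be sketched roughly as follows. This is an illustration only, not the code in scan-indeed.py: the names `title_matches` and `select_new_jobs`, the `negative` keyword list, and the shape of the job dicts are assumptions.

```python
def title_matches(title, positive, negative=()):
    """Return True if the title contains a positive keyword and no negative one."""
    lowered = title.lower()
    if any(word.lower() in lowered for word in negative):
        return False
    return any(word.lower() in lowered for word in positive)


def select_new_jobs(jobs, seen_urls, positive, negative=()):
    """Split scraped jobs into (new, skipped_title), dropping already-seen URLs."""
    seen = set(seen_urls)  # copy so the caller's collection is not mutated
    new, skipped = [], []
    for job in jobs:
        if job["url"] in seen:
            continue  # already in history, or repeated earlier in this run
        seen.add(job["url"])
        if title_matches(job["title"], positive, negative):
            new.append(job)
        else:
            skipped.append(job)
    return new, skipped
```

Tracking URLs in a single `seen` set handles both kinds of duplicate the description mentions: entries already in history and repeats across pages of the same run.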

Why

The existing scan.mjs covers Greenhouse/Ashby/Lever APIs but misses Indeed Ireland — Ireland's largest job search website and a primary source for local roles. Indeed uses Cloudflare bot protection that blocks standard HTTP clients, so Scrapling is needed.

Changes

  • New: scan-indeed.py — the scanner script (Python)
  • New: npm run scan:indeed script in package.json
  • New: indeed_queries section in templates/portals.example.yml
  • Updated: AGENTS.md — main files table
  • Updated: modes/_shared.md — tools + web scraping escalation ladder
  • Updated: docs/SCRIPTS.md — script reference
  • Updated: docs/SETUP.md — optional Python dependency

Indeed Bot Protection — Tested Approaches

| Tool | Indeed Result | Notes |
|---|---|---|
| web_fetch | ✅ Works | Surprisingly passes for search results |
| Jina Reader | ❌ Blocked | Cloudflare 403 |
| Scrapling Fetcher | ✅ Works | Used in this PR: fast, lightweight, bypasses Cloudflare |
| Scrapling StealthyFetcher | ✅ Works | Fallback if Fetcher stops working (real browser) |
| CloakBrowser | ✅ Works | Nuclear fallback: 58 C++ source-level Chromium patches, humanize=True |
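
The table implies an escalation order (Fetcher, then StealthyFetcher, then CloakBrowser). One way such a ladder might be expressed is sketched below without calling any real library: each rung is a plain callable that returns HTML or raises. The names `FetchBlocked`, `fetch_with_escalation`, the stub fetchers, and the example URL are all hypothetical; in practice each rung would wrap the corresponding tool.

```python
class FetchBlocked(Exception):
    """Raised by a rung when the request is blocked (for example an HTTP 403)."""


def fetch_with_escalation(url, rungs):
    """Try each (name, fetch) rung in order; return (name, html) from the first success."""
    errors = []
    for name, fetch in rungs:
        try:
            return name, fetch(url)
        except FetchBlocked as exc:
            errors.append(f"{name}: {exc}")  # remember why this rung failed
    raise FetchBlocked("all rungs failed: " + "; ".join(errors))


# Stub rungs standing in for the real tools (assumptions, for illustration only).
def light_fetch(url):
    raise FetchBlocked("403")


def browser_fetch(url):
    return f"<html>results for {url}</html>"


name, html = fetch_with_escalation(
    "https://example.com/jobs",
    [("light", light_fetch), ("browser", browser_fetch)],
)
print(name)  # prints: browser
```

Keeping the cheapest rung first means the heavier browser-based options only run when the lighter one is actually blocked.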

Dependencies

  • scrapling (Python, pip)
  • pyyaml (Python, pip)

Usage

python3 scan-indeed.py              # scan all queries, 2 pages each
python3 scan-indeed.py --dry-run    # preview without writing files
python3 scan-indeed.py --pages 3    # scan up to 3 pages per query
npm run scan:indeed                 # same via npm

Tested

  • Verified against live ie.indeed.com (38 results, 18 passed filters)
  • Dedup against existing scan-history/pipeline/applications works
  • --dry-run mode works without writing files

Security

Scanned with ASH v3.2.5 (Bandit + Semgrep + detect-secrets): 0 findings.

Summary by CodeRabbit

  • New Features

    • Added an Indeed Ireland job scanner that fetches and filters job listings, dedupes against scan history, and appends new matches to the pipeline.
    • CLI support for dry-run and pagination; script exposed via a package script.
  • Documentation

    • Updated setup prerequisites for optional Python tooling.
    • Added scanner usage docs, query configuration guidance, and a web-scraping escalation guide.

@github-actions
Contributor

Welcome to career-ops, @kejiali! Thanks for your first PR.

A few things to know:

  • Tests will run automatically — check the status below
  • Make sure you've linked a related issue (required for features)
  • Read CONTRIBUTING.md if you haven't

We'll review your PR soon. Join our Discord if you have questions.

@coderabbitai

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

Added scan-indeed.py, a Python scanner for Indeed Ireland job listings using Scrapling to bypass Cloudflare. The script loads query and title-filter configuration from portals.yml, scrapes search results, deduplicates against existing pipeline/history, and appends new jobs to pipeline.md and scan-history.tsv. Integrated via npm script, documented in setup and mode guides.

Changes

Indeed Ireland Scanner Implementation

| Layer / File(s) | Summary |
|---|---|
| Portal configuration and prerequisites (templates/portals.example.yml, docs/SETUP.md) | Added indeed_queries section with four example Dublin/Ireland job search configurations and updated setup docs to mention Python 3.10+ with scrapling and pyyaml. |
| Core scanner implementation (scan-indeed.py) | Implemented complete Indeed Ireland scraper: module entrypoint and paths, load_portals()/load_seen_urls(), build_title_filter(), get_indeed_queries(), _first() helper, scrape_indeed_page() using Scrapling Fetcher, CLI main() with --dry-run/--pages, in-run dedupe and title filtering, and file writes to data/pipeline.md and data/scan-history.tsv. |
| Project integration and documentation (package.json, AGENTS.md, docs/SCRIPTS.md, modes/_shared.md) | Added scan:indeed npm script, updated AGENTS.md Main Files table, added SCRIPTS.md quick-reference row and full scan-indeed section (usage, deps, exit codes), and extended modes/_shared.md Tools list and Web Scraping Escalation Ladder. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: adding a new Indeed Ireland scanner script. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #720 are met: scan-indeed.py implements Scrapling-based scraping of Indeed Ireland, supports configurable queries with title_filter fallback, applies deduplication, provides --dry-run mode, writes to pipeline.md and scan-history.tsv, and documents escalation paths. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing the Indeed Ireland scanner: the main script, npm script entry, configuration template, and documentation updates for tools and setup. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@kejiali kejiali force-pushed the feat/scan-indeed-ireland branch 2 times, most recently from cdb7f4d to f87f6f7 on May 21, 2026 21:33

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Line 68: Update AGENTS.md to include a Gemini-specific subsection alongside
the existing agent entries (e.g., near the `scan-indeed.py` line) that documents
any Gemini-only requirements: specify protocol differences (URL scheme and
recommended ports), content formatting expectations (text/gemini MIME or plain
text rules), recommended user-agent and rate-limiting behavior for Gemini
crawlers, and any portal-specific title_filter handling differences for Gemini
resources; reference “Gemini” explicitly so reviewers can find the guidance
easily and ensure it only contains items that apply exclusively to Gemini
agents.

In `@docs/SCRIPTS.md`:
- Line 21: Update the command examples in SCRIPTS.md to use the npm script alias
instead of the raw Python invocation: replace occurrences of "python3
scan-indeed.py" with "npm run scan:indeed" (and similarly update the other
instance noted around lines 194-202) so the docs match the project's
package.json script names; search for any remaining "scan-indeed.py" examples
and standardize them to the npm form used by the project.

In `@scan-indeed.py`:
- Line 237: The prints use unnecessary f-strings (e.g., print(f"📊 Results:"),
print(f"…"), print(f"…")) which triggers Ruff F541; remove the leading 'f' on
those string literals so they become normal strings (e.g., print("📊 Results:"))
for all three occurrences (the three print(...) calls shown in the diff) to
resolve the lint errors.
- Around line 229-231: The append to SCAN_HISTORY_PATH (inside the if not
args.dry_run block that writes job['url'] ... skipped_title) can create the file
without a header when it doesn't exist; before any writes to SCAN_HISTORY_PATH
ensure the file exists and has the TSV header (e.g., check
os.path.exists(SCAN_HISTORY_PATH) or open with 'x' to create) and write the
header row once (use the same header used later around the code that currently
initializes the header), then proceed to append; apply the same pre-check/header
initialization for the other append site referenced around the later lines (the
second write at lines 264-265) so both write sites rely on the same
header-initialization logic.
- Around line 103-105: The fallback query creation uses keywords[0] and will
IndexError if title_filter.positive (the source of keywords) is empty; update
the logic around queries/keywords (the block that sets queries when not queries)
to first check whether keywords (or title_filter.positive) is non-empty and only
create a fallback query when a keyword exists, otherwise leave queries empty so
the subsequent "no queries configured" handling runs; reference the
variables/values keywords, queries and the title_filter.positive source to
locate the code to change.
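
Two of the findings above (the missing TSV header and the `keywords[0]` IndexError) lend themselves to small guard helpers. The sketch below is one possible shape, not the fix applied in the PR: the helper names, the header column names, and the `"q"` key of the fallback query are assumptions.

```python
import os

# Assumed column names; the real header in scan-indeed.py may differ.
HISTORY_HEADER = "url\tfirst_seen\tportal\ttitle\tcompany\tstatus\n"


def ensure_history_header(path, header=HISTORY_HEADER):
    """Create the TSV with a header row if it is missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "w", encoding="utf-8") as f:
            f.write(header)


def append_history_row(path, row):
    """Append one pre-formatted TSV row, writing the header first if needed."""
    ensure_history_header(path)
    with open(path, "a", encoding="utf-8") as f:
        f.write(row)


def fallback_queries(queries, keywords):
    """Return configured queries, or one fallback built from the first keyword.

    Returns an empty list when neither is available, so the caller's
    "no queries configured" branch can handle it.
    """
    if queries:
        return queries
    if keywords:
        return [{"q": keywords[0]}]  # "q" is a hypothetical key
    return []
```

Routing both write sites through a single `append_history_row` means the header check lives in one place, which is what the review comment asks for.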

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 44a38f2b-fc19-48bb-a1a9-258ca31d048d

📥 Commits

Reviewing files that changed from the base of the PR and between 82f0c2e and cdb7f4d.

📒 Files selected for processing (7)
  • AGENTS.md
  • docs/SCRIPTS.md
  • docs/SETUP.md
  • modes/_shared.md
  • package.json
  • scan-indeed.py
  • templates/portals.example.yml

Comment thread AGENTS.md
| `followup-cadence.mjs` | Follow-up cadence calculator (JSON output) |
| `data/follow-ups.md` | Follow-up history tracker |
| `scan.mjs` | Zero-token portal scanner — hits Greenhouse/Ashby/Lever APIs directly, zero LLM cost |
| `scan-indeed.py` | Indeed Ireland scanner — uses Scrapling to bypass Cloudflare, applies title_filter from portals.yml |

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add Gemini-specific guidance in this AGENTS.md update.

This change touches AGENTS.md, but it does not add Gemini-only instructions required by repository policy.

As per coding guidelines, AGENTS.md: “Add Gemini-specific guidelines that other agents don't need to the AGENTS.md documentation”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AGENTS.md` at line 68, Update AGENTS.md to include a Gemini-specific
subsection alongside the existing agent entries (e.g., near the `scan-indeed.py`
line) that documents any Gemini-only requirements: specify protocol differences
(URL scheme and recommended ports), content formatting expectations (text/gemini
MIME or plain text rules), recommended user-agent and rate-limiting behavior for
Gemini crawlers, and any portal-specific title_filter handling differences for
Gemini resources; reference “Gemini” explicitly so reviewers can find the
guidance easily and ensure it only contains items that apply exclusively to
Gemini agents.

Comment thread docs/SCRIPTS.md
| `npm run rollback` | `update-system.mjs rollback` | Rollback last update |
| `npm run liveness` | `check-liveness.mjs` | Test if job URLs are still active |
| `npm run scan` | `scan.mjs` | Zero-token portal scanner |
| `python3 scan-indeed.py` | `scan-indeed.py` | Indeed Ireland scanner (Scrapling) |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align command naming with package.json script alias.

The docs show python3 scan-indeed.py, but the project script is npm run scan:indeed. Adding the npm form here avoids command mismatch for users following the scripts reference.

Also applies to: 194-202

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/SCRIPTS.md` at line 21, Update the command examples in SCRIPTS.md to use
the npm script alias instead of the raw Python invocation: replace occurrences
of "python3 scan-indeed.py" with "npm run scan:indeed" (and similarly update the
other instance noted around lines 194-202) so the docs match the project's
package.json script names; search for any remaining "scan-indeed.py" examples
and standardize them to the npm form used by the project.

Comment thread scan-indeed.py Outdated
Comment thread scan-indeed.py
Comment thread scan-indeed.py Outdated

# Summary
print(f"\n{'━' * 50}")
print(f"📊 Results:")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove constant f-strings to resolve lint errors.

Lines 237, 244, and 271 use f-strings with no placeholders (Ruff F541), which can fail CI if lint is enforced.

Suggested fix
-    print(f"📊 Results:")
+    print("📊 Results:")
@@
-        print(f"\n✅ New offers:")
+        print("\n✅ New offers:")
@@
-    print(f"\n→ Run /career-ops pipeline to evaluate new offers.")
+    print("\n→ Run /career-ops pipeline to evaluate new offers.")

Also applies to: 244-244, 271-271

🧰 Tools
🪛 Ruff (0.15.13)

[error] 237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan-indeed.py` at line 237, The prints use unnecessary f-strings (e.g.,
print(f"📊 Results:"), print(f"…"), print(f"…")) which triggers Ruff F541;
remove the leading 'f' on those string literals so they become normal strings
(e.g., print("📊 Results:")) for all three occurrences (the three print(...)
calls shown in the diff) to resolve the lint errors.

Adds a Python-based scanner for ie.indeed.com using Scrapling to bypass
Cloudflare bot protection. Searches Indeed Ireland for configured queries,
applies title_filter from portals.yml, deduplicates against existing
history, and appends new offers to pipeline.md + scan-history.tsv.

- New script: scan-indeed.py (Python, requires scrapling + pyyaml)
- New npm script: scan:indeed
- New portals.yml section: indeed_queries (optional, auto-generates if absent)
- Docs: AGENTS.md, SCRIPTS.md, SETUP.md, _shared.md updated
- Web scraping escalation ladder documented (Fetcher → StealthyFetcher → CloakBrowser)

Usage:
  python3 scan-indeed.py              # scan all queries
  python3 scan-indeed.py --dry-run    # preview without writing
  npm run scan:indeed                 # same via npm
@kejiali kejiali force-pushed the feat/scan-indeed-ireland branch from f87f6f7 to 7c02e8a on May 21, 2026 21:35

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan-indeed.py`:
- Around line 174-175: The script currently accepts non-positive --pages values
which makes range(args.pages) silently skip work; after parsing
(parser.add_argument("--pages"...), args = parser.parse_args()) validate that
args.pages is an integer > 0 and reject otherwise, e.g. call parser.error or
sys.exit with a clear message like "--pages must be a positive integer" so
callers get a failure instead of misleading successful output; update the
validation near where args.pages is used (the range(args.pages) loop) to rely on
this check.
- Around line 232-233: The TSV history write is inserting raw job['title'] and
job['company'], which can contain '\t' or '\n' and will break parsing;
sanitize/escape these fields before writing to SCAN_HISTORY_PATH. Add a helper
like sanitize_tsv_field(value) that normalizes None to empty string and replaces
tabs/newlines (e.g., map '\t' and '\n' to a single space or escape sequence) or
switch the write to use Python's csv.writer with delimiter='\t' to handle
quoting, and use that when writing the f.write in the block that writes
f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n").
Apply the same fix to the analogous write at the other location (the block
around the 269-271 pattern).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ff9226bd-5e0b-48f2-be31-5db430fac61a

📥 Commits

Reviewing files that changed from the base of the PR and between cdb7f4d and 7c02e8a.

📒 Files selected for processing (7)
  • AGENTS.md
  • docs/SCRIPTS.md
  • docs/SETUP.md
  • modes/_shared.md
  • package.json
  • scan-indeed.py
  • templates/portals.example.yml

Comment thread scan-indeed.py
Comment on lines +174 to +175
parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
args = parser.parse_args()

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate --pages to reject non-positive values.

At line 206, range(args.pages) is empty when --pages is 0 or negative, so the scan is silently skipped while the run still reports success.

Suggested fix
 parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
 args = parser.parse_args()
+if args.pages < 1:
+    parser.error("--pages must be >= 1")

Also applies to: 206-207
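
An alternative to the post-parse check is to validate the value during parsing with a custom argparse `type` callable, so a bad `--pages` fails with the standard usage error. This is a sketch under that assumption; `positive_int` and `build_parser` are hypothetical names, not functions from scan-indeed.py.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers >= 1."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not an integer")
    if value < 1:
        raise argparse.ArgumentTypeError("must be >= 1")
    return value


def build_parser():
    """Build a parser mirroring the two flags described in this PR."""
    parser = argparse.ArgumentParser(description="Indeed Ireland scanner (sketch)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview without writing files")
    parser.add_argument("--pages", type=positive_int, default=2,
                        help="Pages per query (default: 2)")
    return parser
```

With this in place, `--pages 0` makes argparse exit with status 2 and a usage message, so the scanning loop never sees an empty range.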

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan-indeed.py` around lines 174 - 175, The script currently accepts
non-positive --pages values which makes range(args.pages) silently skip work;
after parsing (parser.add_argument("--pages"...), args = parser.parse_args())
validate that args.pages is an integer > 0 and reject otherwise, e.g. call
parser.error or sys.exit with a clear message like "--pages must be a positive
integer" so callers get a failure instead of misleading successful output;
update the validation near where args.pages is used (the range(args.pages) loop)
to rely on this check.

Comment thread scan-indeed.py
Comment on lines +232 to +233
with open(SCAN_HISTORY_PATH, "a") as f:
f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escape TSV fields before writing history rows.

job['title'] and job['company'] are written raw into tab-separated rows. If either contains \t or \n, scan-history.tsv becomes malformed and downstream parsing/dedup can break.

Suggested fix
+def _tsv_safe(value):
+    return str(value).replace("\t", " ").replace("\n", " ").strip()
+
 ...
-                    f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")
+                    f.write(
+                        f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+                        f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tskipped_title\n"
+                    )
 ...
-                f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tadded\n")
+                f.write(
+                    f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+                    f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tadded\n"
+                )

Also applies to: 269-271
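
The review prompt also mentions `csv.writer` with a tab delimiter as an option. A minimal sketch of that route, combined with whitespace normalization, might look like the following; `tsv_safe` and `write_history_row` are hypothetical helpers, and the column order simply follows the `f.write` call quoted above.

```python
import csv


def tsv_safe(value):
    """Collapse any run of whitespace (tabs, newlines, spaces) to one space.

    None becomes an empty string.
    """
    if value is None:
        return ""
    return " ".join(str(value).split())


def write_history_row(handle, job, today, status, portal="indeed-ie"):
    """Write one history row through csv.writer using a tab delimiter."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([
        tsv_safe(job.get("url")),
        today,
        portal,
        tsv_safe(job.get("title")),
        tsv_safe(job.get("company")),
        status,
    ])
```

One trade-off to keep in mind: with the default quoting, `csv.writer` wraps a field in double quotes if it contains a quote character, so any reader of scan-history.tsv would need to parse with the `csv` module as well rather than splitting on tabs.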

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scan-indeed.py` around lines 232 - 233, The TSV history write is inserting
raw job['title'] and job['company'], which can contain '\t' or '\n' and will
break parsing; sanitize/escape these fields before writing to SCAN_HISTORY_PATH.
Add a helper like sanitize_tsv_field(value) that normalizes None to empty string
and replaces tabs/newlines (e.g., map '\t' and '\n' to a single space or escape
sequence) or switch the write to use Python's csv.writer with delimiter='\t' to
handle quoting, and use that when writing the f.write in the block that writes
f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n").
Apply the same fix to the analogous write at the other location (the block
around the 269-271 pattern).

@kejiali
Author

kejiali commented May 21, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Development

Successfully merging this pull request may close these issues.

feat: Indeed Ireland scanner support

1 participant