feat: add Indeed Ireland scanner (scan-indeed.py) by kejiali · Pull Request #719 · santifer/career-ops

kejiali · 2026-05-21T21:25:00Z

Closes #720

What

Adds a Python-based scanner for ie.indeed.com (Ireland) using Scrapling to bypass Cloudflare bot protection. Searches Indeed Ireland for configured queries, applies title_filter from portals.yml, deduplicates against existing history, and appends new offers to pipeline.md + scan-history.tsv.
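The filter-then-dedupe flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in scan-indeed.py: the `negative` keyword list, the case-insensitive substring matching, and the `select_new` helper are assumptions made for the example.

```python
def build_title_filter(positive, negative=()):
    """Return a predicate that decides whether a job title passes.

    Assumed semantics (not confirmed by the PR): a title passes when it
    contains at least one positive keyword and no negative keyword,
    compared case-insensitively.
    """
    pos = [k.lower() for k in positive]
    neg = [k.lower() for k in negative]

    def passes(title):
        t = title.lower()
        if any(k in t for k in neg):
            return False
        # An empty positive list accepts every title (assumption).
        return not pos or any(k in t for k in pos)

    return passes


def select_new(jobs, seen_urls, passes):
    """Keep jobs that pass the filter and whose URL has not been seen."""
    seen = set(seen_urls)
    fresh = []
    for job in jobs:
        if job["url"] in seen:
            continue
        seen.add(job["url"])  # also dedupes repeats within the same run
        if passes(job["title"]):
            fresh.append(job)
    return fresh
```

Checking the URL before the title means a rejected listing is still remembered for the rest of the run, so the same posting appearing under two queries is only considered once.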

Why

The existing scan.mjs covers Greenhouse/Ashby/Lever APIs but misses Indeed Ireland — Ireland's largest job search website and a primary source for local roles. Indeed uses Cloudflare bot protection that blocks standard HTTP clients, so Scrapling is needed.

Changes

New: scan-indeed.py — the scanner script (Python)
New: npm run scan:indeed script in package.json
New: indeed_queries section in templates/portals.example.yml
Updated: AGENTS.md — main files table
Updated: modes/_shared.md — tools + web scraping escalation ladder
Updated: docs/SCRIPTS.md — script reference
Updated: docs/SETUP.md — optional Python dependency

Indeed Bot Protection — Tested Approaches

| Tool | Indeed Result | Notes |
|------|---------------|-------|
| `web_fetch` | ✅ Works | Surprisingly passes for search results |
| Jina Reader | ❌ Blocked | Cloudflare 403 |
| Scrapling `Fetcher` | ✅ Works | Used in this PR: fast, lightweight, bypasses Cloudflare |
| Scrapling `StealthyFetcher` | ✅ Works | Fallback if Fetcher stops working (real browser) |
| CloakBrowser | ✅ Works | Nuclear fallback: 58 C++ source-level Chromium patches, `humanize=True` |
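The ordering in the table (lightweight fetcher first, heavier browser-based tools only as fallbacks) could be expressed as a small escalation helper. The sketch below is hypothetical and deliberately library-agnostic: `fetch_with_escalation` and the callables passed to it are placeholders, and wiring them to Scrapling's `Fetcher`, `StealthyFetcher`, or CloakBrowser is left to the caller.

```python
def fetch_with_escalation(url, fetchers):
    """Try each (name, fetch) pair in order and return the first success.

    `fetchers` is a list of (name, callable) pairs ordered from cheapest
    to heaviest. Each callable takes a URL and returns page text, or
    raises on failure. The callables are placeholders for whatever
    wrappers the project uses around its real fetch tools.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # blocked request, timeout, etc.
            errors.append(f"{name}: {exc}")
            continue
        if html:  # treat an empty body as a soft failure
            return name, html
        errors.append(f"{name}: empty response")
    raise RuntimeError("all fetchers failed: " + "; ".join(errors))
```

Returning the name of the tool that succeeded makes it easy to log when the cheap path has stopped working and a heavier fallback is carrying the scan.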

Dependencies

scrapling (Python, pip)
pyyaml (Python, pip)

Usage

python3 scan-indeed.py              # scan all queries, 2 pages each
python3 scan-indeed.py --dry-run    # preview without writing files
python3 scan-indeed.py --pages 3    # scan up to 3 pages per query
npm run scan:indeed                 # same via npm

Tested

Verified against live ie.indeed.com (38 results, 18 passed filters)
Dedup against existing scan-history/pipeline/applications works
--dry-run mode works without writing files
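Deduplication against several existing files can be approximated by collecting every URL that already appears in them. The sketch below is an assumption about how such a loader might look; the regular expression and the idea of scanning whole files for URLs are illustrative, and the real `load_seen_urls()` in scan-indeed.py may parse each file differently.

```python
import re
from pathlib import Path

# Matches http(s) URLs up to whitespace or a common closing delimiter.
URL_RE = re.compile(r"https?://[^\s|)\]>]+")


def load_seen_urls(paths):
    """Collect every URL found in the given files; missing files are skipped."""
    seen = set()
    for path in map(Path, paths):
        if not path.exists():
            continue
        seen.update(URL_RE.findall(path.read_text(encoding="utf-8")))
    return seen
```

Because the loader only looks for URLs, it works for both the TSV history and the markdown pipeline without knowing their column layouts.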

Security

Scanned with ASH v3.2.5 (Bandit + Semgrep + detect-secrets): 0 findings.

Summary by CodeRabbit

New Features
- Added an Indeed Ireland job scanner that fetches and filters job listings, dedupes against scan history, and appends new matches to the pipeline.
- CLI support for dry-run and pagination; script exposed via a package script.
Documentation
- Updated setup prerequisites for optional Python tooling.
- Added scanner usage docs, query configuration guidance, and a web-scraping escalation guide.

github-actions · 2026-05-21T21:25:09Z

Welcome to career-ops, @kejiali! Thanks for your first PR.

A few things to know:

Tests will run automatically — check the status below
Make sure you've linked a related issue (required for features)
Read CONTRIBUTING.md if you haven't

We'll review your PR soon. Join our Discord if you have questions.

coderabbitai · 2026-05-21T21:26:24Z

📝 Walkthrough


Added scan-indeed.py, a Python scanner for Indeed Ireland job listings using Scrapling to bypass Cloudflare. The script loads query and title-filter configuration from portals.yml, scrapes search results, deduplicates against existing pipeline/history, and appends new jobs to pipeline.md and scan-history.tsv. Integrated via npm script, documented in setup and mode guides.

Changes

Indeed Ireland Scanner Implementation

| Layer / File(s) | Summary |
|-----------------|---------|
| Portal configuration and prerequisites: `templates/portals.example.yml`, `docs/SETUP.md` | Added `indeed_queries` section with four example Dublin/Ireland job search configurations and updated setup docs to mention Python 3.10+ with `scrapling` and `pyyaml`. |
| Core scanner implementation: `scan-indeed.py` | Implemented complete Indeed Ireland scraper: module entrypoint and paths, `load_portals()`/`load_seen_urls()`, `build_title_filter()`, `get_indeed_queries()`, `_first()` helper, `scrape_indeed_page()` using Scrapling Fetcher, CLI `main()` with `--dry-run`/`--pages`, in-run dedupe and title filtering, and file writes to `data/pipeline.md` and `data/scan-history.tsv`. |
| Project integration and documentation: `package.json`, `AGENTS.md`, `docs/SCRIPTS.md`, `modes/_shared.md` | Added `scan:indeed` npm script, updated AGENTS.md Main Files table, added SCRIPTS.md quick-reference row and full `scan-indeed` section (usage, deps, exit codes), and extended `modes/_shared.md` Tools list and Web Scraping Escalation Ladder. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: adding a new Indeed Ireland scanner script. |
| Linked Issues check | ✅ Passed | All coding requirements from issue `#720` are met: scan-indeed.py implements Scrapling-based scraping of Indeed Ireland, supports configurable queries with title_filter fallback, applies deduplication, provides --dry-run mode, writes to pipeline.md and scan-history.tsv, and documents escalation paths. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing the Indeed Ireland scanner: the main script, npm script entry, configuration template, and documentation updates for tools and setup. |


coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Line 68: Update AGENTS.md to include a Gemini-specific subsection alongside
the existing agent entries (e.g., near the `scan-indeed.py` line) that documents
any Gemini-only requirements: specify protocol differences (URL scheme and
recommended ports), content formatting expectations (text/gemini MIME or plain
text rules), recommended user-agent and rate-limiting behavior for Gemini
crawlers, and any portal-specific title_filter handling differences for Gemini
resources; reference “Gemini” explicitly so reviewers can find the guidance
easily and ensure it only contains items that apply exclusively to Gemini
agents.

In `@docs/SCRIPTS.md`:
- Line 21: Update the command examples in SCRIPTS.md to use the npm script alias
instead of the raw Python invocation: replace occurrences of "python3
scan-indeed.py" with "npm run scan:indeed" (and similarly update the other
instance noted around lines 194-202) so the docs match the project's
package.json script names; search for any remaining "scan-indeed.py" examples
and standardize them to the npm form used by the project.

In `@scan-indeed.py`:
- Line 237: The prints use unnecessary f-strings (e.g., print(f"📊 Results:"),
print(f"…"), print(f"…")) which triggers Ruff F541; remove the leading 'f' on
those string literals so they become normal strings (e.g., print("📊 Results:"))
for all three occurrences (the three print(...) calls shown in the diff) to
resolve the lint errors.
- Around line 229-231: The append to SCAN_HISTORY_PATH (inside the if not
args.dry_run block that writes job['url'] ... skipped_title) can create the file
without a header when it doesn't exist; before any writes to SCAN_HISTORY_PATH
ensure the file exists and has the TSV header (e.g., check
os.path.exists(SCAN_HISTORY_PATH) or open with 'x' to create) and write the
header row once (use the same header used later around the code that currently
initializes the header), then proceed to append; apply the same pre-check/header
initialization for the other append site referenced around the later lines (the
second write at lines 264-265) so both write sites rely on the same
header-initialization logic.
- Around line 103-105: The fallback query creation uses keywords[0] and will
IndexError if title_filter.positive (the source of keywords) is empty; update
the logic around queries/keywords (the block that sets queries when not queries)
to first check whether keywords (or title_filter.positive) is non-empty and only
create a fallback query when a keyword exists, otherwise leave queries empty so
the subsequent "no queries configured" handling runs; reference the
variables/values keywords, queries and the title_filter.positive source to
locate the code to change.
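The header-initialization and empty-keyword findings above could be addressed along these lines. This is a hedged sketch rather than the fix that landed: the column names in `HISTORY_HEADER` are inferred from the row format quoted in the review, and the `{"query": ...}` shape of a query entry is hypothetical.

```python
import os

# Column names are an assumption inferred from the rows the script writes.
HISTORY_HEADER = "url\tfirst_seen\tportal\ttitle\tcompany\tstatus\n"


def append_history_row(path, row):
    """Append one TSV row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8") as f:
        if needs_header:
            f.write(HISTORY_HEADER)
        f.write(row)


def fallback_queries(queries, keywords):
    """Return configured queries, or one fallback built from the first keyword."""
    if queries:
        return queries
    if not keywords:
        return []  # let the caller's "no queries configured" handling run
    return [{"query": keywords[0]}]  # hypothetical query shape
```

Routing both write sites through one helper means the header check lives in a single place, which is the point of the review comment.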


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 44a38f2b-fc19-48bb-a1a9-258ca31d048d

📥 Commits

Reviewing files that changed from the base of the PR and between 82f0c2e and cdb7f4d.

📒 Files selected for processing (7)

AGENTS.md
docs/SCRIPTS.md
docs/SETUP.md
modes/_shared.md
package.json
scan-indeed.py
templates/portals.example.yml

coderabbitai · 2026-05-21T21:33:50Z

 | `followup-cadence.mjs` | Follow-up cadence calculator (JSON output) |
 | `data/follow-ups.md` | Follow-up history tracker |
 | `scan.mjs` | Zero-token portal scanner — hits Greenhouse/Ashby/Lever APIs directly, zero LLM cost |
+| `scan-indeed.py` | Indeed Ireland scanner — uses Scrapling to bypass Cloudflare, applies title_filter from portals.yml |


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add Gemini-specific guidance in this AGENTS.md update.

This change touches AGENTS.md, but it does not add Gemini-only instructions required by repository policy.

As per coding guidelines, AGENTS.md: “Add Gemini-specific guidelines that other agents don't need to the AGENTS.md documentation”.


coderabbitai · 2026-05-21T21:33:50Z

 | `npm run rollback` | `update-system.mjs rollback` | Rollback last update |
 | `npm run liveness` | `check-liveness.mjs` | Test if job URLs are still active |
 | `npm run scan` | `scan.mjs` | Zero-token portal scanner |
+| `python3 scan-indeed.py` | `scan-indeed.py` | Indeed Ireland scanner (Scrapling) |


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align command naming with package.json script alias.

The docs show python3 scan-indeed.py, but the project script is npm run scan:indeed. Adding the npm form here avoids command mismatch for users following the scripts reference.

Also applies to: 194-202


coderabbitai · 2026-05-21T21:33:51Z

+
+    # Summary
+    print(f"\n{'━' * 50}")
+    print(f"📊 Results:")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove constant f-strings to resolve lint errors.

Lines 237, 244, and 271 use f-strings with no placeholders (Ruff F541), which can fail CI if lint is enforced.

Suggested fix

- print(f"📊 Results:")
+ print("📊 Results:")
@@
- print(f"\n✅ New offers:")
+ print("\n✅ New offers:")
@@
- print(f"\n→ Run /career-ops pipeline to evaluate new offers.")
+ print("\n→ Run /career-ops pipeline to evaluate new offers.")

Also applies to: 244-244, 271-271

🧰 Tools

🪛 Ruff (0.15.13)

[error] 237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)


Adds a Python-based scanner for ie.indeed.com using Scrapling to bypass Cloudflare bot protection. Searches Indeed Ireland for configured queries, applies title_filter from portals.yml, deduplicates against existing history, and appends new offers to pipeline.md + scan-history.tsv.

- New script: scan-indeed.py (Python, requires scrapling + pyyaml)
- New npm script: scan:indeed
- New portals.yml section: indeed_queries (optional, auto-generates if absent)
- Docs: AGENTS.md, SCRIPTS.md, SETUP.md, _shared.md updated
- Web scraping escalation ladder documented (Fetcher → StealthyFetcher → CloakBrowser)

Usage:

python3 scan-indeed.py              # scan all queries
python3 scan-indeed.py --dry-run    # preview without writing
npm run scan:indeed                 # same via npm

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scan-indeed.py`:
- Around line 174-175: The script currently accepts non-positive --pages values
which makes range(args.pages) silently skip work; after parsing
(parser.add_argument("--pages"...), args = parser.parse_args()) validate that
args.pages is an integer > 0 and reject otherwise, e.g. call parser.error or
sys.exit with a clear message like "--pages must be a positive integer" so
callers get a failure instead of misleading successful output; update the
validation near where args.pages is used (the range(args.pages) loop) to rely on
this check.
- Around line 232-233: The TSV history write is inserting raw job['title'] and
job['company'], which can contain '\t' or '\n' and will break parsing;
sanitize/escape these fields before writing to SCAN_HISTORY_PATH. Add a helper
like sanitize_tsv_field(value) that normalizes None to empty string and replaces
tabs/newlines (e.g., map '\t' and '\n' to a single space or escape sequence) or
switch the write to use Python's csv.writer with delimiter='\t' to handle
quoting, and use that when writing the f.write in the block that writes
f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n").
Apply the same fix to the analogous write at the other location (the block
around the 269-271 pattern).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ff9226bd-5e0b-48f2-be31-5db430fac61a

📥 Commits

Reviewing files that changed from the base of the PR and between cdb7f4d and 7c02e8a.

📒 Files selected for processing (7)

AGENTS.md
docs/SCRIPTS.md
docs/SETUP.md
modes/_shared.md
package.json
scan-indeed.py
templates/portals.example.yml

coderabbitai · 2026-05-21T21:38:51Z

+    parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
+    args = parser.parse_args()


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate --pages to reject non-positive values.

At Line 206, range(args.pages) with 0 or negative values silently skips scanning and reports misleadingly successful output.

Suggested fix

 parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
 args = parser.parse_args()
+if args.pages < 1:
+    parser.error("--pages must be >= 1")

Also applies to: 206-207
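An alternative to checking `args.pages` after parsing is to push the validation into an argparse type function, so that invalid values are rejected during parsing itself. The sketch below is illustrative only: `positive_int` and `build_parser` are hypothetical names, and the two flags mirror the ones listed in the PR description.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers >= 1."""
    value = int(text)  # a ValueError here is reported by argparse as invalid
    if value < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Indeed Ireland scanner (sketch)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview without writing files")
    parser.add_argument("--pages", type=positive_int, default=2,
                        help="Pages per query (default: 2)")
    return parser
```

With this approach `--pages 0` exits with a usage error before any scanning starts, so the later `range(args.pages)` loop never sees a non-positive value.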


coderabbitai · 2026-05-21T21:38:51Z

+                with open(SCAN_HISTORY_PATH, "a") as f:
+                    f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escape TSV fields before writing history rows.

job['title'] and job['company'] are written raw into tab-separated rows. If either contains \t or \n, scan-history.tsv becomes malformed and downstream parsing/dedup can break.

Suggested fix

+def _tsv_safe(value):
+    return str(value).replace("\t", " ").replace("\n", " ").strip()
+
...
- f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")
+ f.write(
+     f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+     f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tskipped_title\n"
+ )
...
- f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tadded\n")
+ f.write(
+     f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+     f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tadded\n"
+ )

Also applies to: 269-271
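A slightly broader variant of the suggested helper might also normalize `None` and carriage returns, and build the whole row in one place so both write sites share it. The names `sanitize_tsv_field` and `history_row` below are hypothetical, and collapsing all whitespace runs to a single space is a design choice for this sketch, not something the PR specifies.

```python
def sanitize_tsv_field(value):
    """Return value as a single-line, tab-free string; None becomes ''."""
    if value is None:
        return ""
    # str.split() with no arguments splits on any whitespace run, so tabs,
    # newlines, and carriage returns all collapse to a single space.
    return " ".join(str(value).split())


def history_row(job, today, status, portal="indeed-ie"):
    """Build one scan-history row from a job dict (field order as in the PR)."""
    fields = [job.get("url"), today, portal,
              job.get("title"), job.get("company"), status]
    return "\t".join(sanitize_tsv_field(f) for f in fields) + "\n"
```

For example, `history_row(job, today, "skipped_title")` and `history_row(job, today, "added")` would replace the two raw f-string writes flagged in the review.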


kejiali · 2026-05-21T21:41:16Z

@coderabbitai review

coderabbitai · 2026-05-21T21:41:25Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

github-actions Bot added 📄 docs ⚙️ ci 🔴 core-architecture ⚠️ agent-behavior 🔧 scripts 📦 dependencies 📊 dashboard labels May 21, 2026

kejiali mentioned this pull request May 21, 2026

feat: Indeed Ireland scanner support #720

Open

kejiali force-pushed the feat/scan-indeed-ireland branch 2 times, most recently from cdb7f4d to f87f6f7 Compare May 21, 2026 21:33

coderabbitai Bot reviewed May 21, 2026


kejiali force-pushed the feat/scan-indeed-ireland branch from f87f6f7 to 7c02e8a Compare May 21, 2026 21:35

coderabbitai Bot reviewed May 21, 2026
