feat: add Indeed Ireland scanner (scan-indeed.py) #719
Welcome to career-ops, @kejiali! Thanks for your first PR. A few things to know:
We'll review your PR soon. Join our Discord if you have questions.
📝 Walkthrough
Added scan-indeed.py, a Python scanner for Indeed Ireland job listings using Scrapling to bypass Cloudflare. The script loads query and title-filter configuration from …
Changes
Indeed Ireland Scanner Implementation
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
cdb7f4d to f87f6f7
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@AGENTS.md`:
- Line 68: Update AGENTS.md to include a Gemini-specific subsection alongside
the existing agent entries (e.g., near the `scan-indeed.py` line) that documents
any Gemini-only requirements: specify protocol differences (URL scheme and
recommended ports), content formatting expectations (text/gemini MIME or plain
text rules), recommended user-agent and rate-limiting behavior for Gemini
crawlers, and any portal-specific title_filter handling differences for Gemini
resources; reference “Gemini” explicitly so reviewers can find the guidance
easily and ensure it only contains items that apply exclusively to Gemini
agents.
In `@docs/SCRIPTS.md`:
- Line 21: Update the command examples in SCRIPTS.md to use the npm script alias
instead of the raw Python invocation: replace occurrences of "python3
scan-indeed.py" with "npm run scan:indeed" (and similarly update the other
instance noted around lines 194-202) so the docs match the project's
package.json script names; search for any remaining "scan-indeed.py" examples
and standardize them to the npm form used by the project.
In `@scan-indeed.py`:
- Line 237: The prints use unnecessary f-strings (e.g., print(f"📊 Results:"),
print(f"…"), print(f"…")) which triggers Ruff F541; remove the leading 'f' on
those string literals so they become normal strings (e.g., print("📊 Results:"))
for all three occurrences (the three print(...) calls shown in the diff) to
resolve the lint errors.
- Around line 229-231: The append to SCAN_HISTORY_PATH (inside the if not
args.dry_run block that writes job['url'] ... skipped_title) can create the file
without a header when it doesn't exist; before any writes to SCAN_HISTORY_PATH
ensure the file exists and has the TSV header (e.g., check
os.path.exists(SCAN_HISTORY_PATH) or open with 'x' to create) and write the
header row once (use the same header used later around the code that currently
initializes the header), then proceed to append; apply the same pre-check/header
initialization for the other append site referenced around the later lines (the
second write at lines 264-265) so both write sites rely on the same
header-initialization logic.
- Around line 103-105: The fallback query creation uses keywords[0] and will
IndexError if title_filter.positive (the source of keywords) is empty; update
the logic around queries/keywords (the block that sets queries when not queries)
to first check whether keywords (or title_filter.positive) is non-empty and only
create a fallback query when a keyword exists, otherwise leave queries empty so
the subsequent "no queries configured" handling runs; reference the
variables/values keywords, queries and the title_filter.positive source to
locate the code to change.
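The two `scan-indeed.py` findings about the missing TSV header and the empty-keyword fallback could be addressed roughly as follows. This is a minimal sketch, not the PR's code: `HISTORY_HEADER`, `ensure_history_header`, `build_queries`, and the query dict shape (`query`, `location`) are hypothetical names chosen for illustration, and the real column names live in the script.

```python
import os
import tempfile

# Hypothetical header; the actual column names are defined in scan-indeed.py.
HISTORY_HEADER = "url\tfirst_seen\tportal\ttitle\tcompany\tstatus\n"


def ensure_history_header(path, header=HISTORY_HEADER):
    """Create the TSV with its header row if the file is missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "w", encoding="utf-8") as f:
            f.write(header)


def build_queries(queries, keywords, location="Ireland"):
    """Return configured queries, or one fallback query if a keyword exists."""
    if queries:
        return list(queries)
    if keywords:
        # keywords[0] is only indexed once the list is known to be non-empty.
        return [{"query": keywords[0], "location": location}]
    # Empty result lets the caller's "no queries configured" branch run.
    return []
```

Calling `ensure_history_header(SCAN_HISTORY_PATH)` once before either append site would let both writes share the same initialization, which is what the review comment asks for.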
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 44a38f2b-fc19-48bb-a1a9-258ca31d048d
📒 Files selected for processing (7)
- AGENTS.md
- docs/SCRIPTS.md
- docs/SETUP.md
- modes/_shared.md
- package.json
- scan-indeed.py
- templates/portals.example.yml
| `followup-cadence.mjs` | Follow-up cadence calculator (JSON output) |
| `data/follow-ups.md` | Follow-up history tracker |
| `scan.mjs` | Zero-token portal scanner — hits Greenhouse/Ashby/Lever APIs directly, zero LLM cost |
| `scan-indeed.py` | Indeed Ireland scanner — uses Scrapling to bypass Cloudflare, applies title_filter from portals.yml |
Add Gemini-specific guidance in this AGENTS.md update.
This change touches AGENTS.md, but it does not add Gemini-only instructions required by repository policy.
As per coding guidelines, AGENTS.md: “Add Gemini-specific guidelines that other agents don't need to the AGENTS.md documentation”.
| `npm run rollback` | `update-system.mjs rollback` | Rollback last update |
| `npm run liveness` | `check-liveness.mjs` | Test if job URLs are still active |
| `npm run scan` | `scan.mjs` | Zero-token portal scanner |
| `python3 scan-indeed.py` | `scan-indeed.py` | Indeed Ireland scanner (Scrapling) |
Align command naming with package.json script alias.
The docs show python3 scan-indeed.py, but the project script is npm run scan:indeed. Adding the npm form here avoids command mismatch for users following the scripts reference.
Also applies to: 194-202
# Summary
print(f"\n{'━' * 50}")
print(f"📊 Results:")
Remove constant f-strings to resolve lint errors.
Lines 237, 244, and 271 use f-strings with no placeholders (Ruff F541), which can fail CI if lint is enforced.
Suggested fix
- print(f"📊 Results:")
+ print("📊 Results:")
@@
- print(f"\n✅ New offers:")
+ print("\n✅ New offers:")
@@
- print(f"\n→ Run /career-ops pipeline to evaluate new offers.")
+ print("\n→ Run /career-ops pipeline to evaluate new offers.")
Also applies to: 244-244, 271-271
🧰 Tools
🪛 Ruff (0.15.13)
[error] 237-237: f-string without any placeholders
Remove extraneous f prefix
(F541)
Adds a Python-based scanner for ie.indeed.com using Scrapling to bypass Cloudflare bot protection. Searches Indeed Ireland for configured queries, applies title_filter from portals.yml, deduplicates against existing history, and appends new offers to pipeline.md + scan-history.tsv.

- New script: scan-indeed.py (Python, requires scrapling + pyyaml)
- New npm script: scan:indeed
- New portals.yml section: indeed_queries (optional, auto-generates if absent)
- Docs: AGENTS.md, SCRIPTS.md, SETUP.md, _shared.md updated
- Web scraping escalation ladder documented (Fetcher → StealthyFetcher → CloakBrowser)

Usage:
  python3 scan-indeed.py            # scan all queries
  python3 scan-indeed.py --dry-run  # preview without writing
  npm run scan:indeed               # same via npm
f87f6f7 to 7c02e8a
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scan-indeed.py`:
- Around line 174-175: The script currently accepts non-positive --pages values
which makes range(args.pages) silently skip work; after parsing
(parser.add_argument("--pages"...), args = parser.parse_args()) validate that
args.pages is an integer > 0 and reject otherwise, e.g. call parser.error or
sys.exit with a clear message like "--pages must be a positive integer" so
callers get a failure instead of misleading successful output; update the
validation near where args.pages is used (the range(args.pages) loop) to rely on
this check.
- Around line 232-233: The TSV history write is inserting raw job['title'] and
job['company'], which can contain '\t' or '\n' and will break parsing;
sanitize/escape these fields before writing to SCAN_HISTORY_PATH. Add a helper
like sanitize_tsv_field(value) that normalizes None to empty string and replaces
tabs/newlines (e.g., map '\t' and '\n' to a single space or escape sequence) or
switch the write to use Python's csv.writer with delimiter='\t' to handle
quoting, and use that when writing the f.write in the block that writes
f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n").
Apply the same fix to the analogous write at the other location (the block
around the 269-271 pattern).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ff9226bd-5e0b-48f2-be31-5db430fac61a
📒 Files selected for processing (7)
- AGENTS.md
- docs/SCRIPTS.md
- docs/SETUP.md
- modes/_shared.md
- package.json
- scan-indeed.py
- templates/portals.example.yml
parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
args = parser.parse_args()
Validate --pages to reject non-positive values.
At Line 206, range(args.pages) with 0 or negative values silently skips scanning and reports misleadingly successful output.
Suggested fix
 parser.add_argument("--pages", type=int, default=2, help="Pages per query (default: 2)")
 args = parser.parse_args()
+if args.pages < 1:
+    parser.error("--pages must be >= 1")
Also applies to: 206-207
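An alternative to the post-parse check is to validate inside argparse itself with a custom `type` callable, so the error is raised during parsing. A minimal sketch, assuming a hypothetical `positive_int` helper and a `parse_args` wrapper that are not part of the PR:

```python
import argparse


def positive_int(value):
    """argparse type callable: accept only integers >= 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{value!r} is not an integer")
    if number < 1:
        raise argparse.ArgumentTypeError("--pages must be >= 1")
    return number


def parse_args(argv=None):
    """Parse CLI flags; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Indeed Ireland scanner (sketch)")
    parser.add_argument("--pages", type=positive_int, default=2,
                        help="Pages per query (default: 2)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview without writing files")
    return parser.parse_args(argv)
```

With this shape, `--pages 0` makes argparse print a usage error and exit with status 2 before any scanning starts, so `range(args.pages)` never sees a non-positive value.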
with open(SCAN_HISTORY_PATH, "a") as f:
    f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")
Escape TSV fields before writing history rows.
job['title'] and job['company'] are written raw into tab-separated rows. If either contains \t or \n, scan-history.tsv becomes malformed and downstream parsing/dedup can break.
Suggested fix
+def _tsv_safe(value):
+    return str(value).replace("\t", " ").replace("\n", " ").strip()
+
...
-    f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tskipped_title\n")
+    f.write(
+        f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+        f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tskipped_title\n"
+    )
...
-    f.write(f"{job['url']}\t{today}\tindeed-ie\t{job['title']}\t{job['company']}\tadded\n")
+    f.write(
+        f"{_tsv_safe(job['url'])}\t{today}\tindeed-ie\t"
+        f"{_tsv_safe(job['title'])}\t{_tsv_safe(job['company'])}\tadded\n"
+    )
Also applies to: 269-271
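The other option mentioned in the review prompt, Python's `csv.writer` with `delimiter='\t'`, could look like the sketch below. `write_history_row` is a hypothetical helper, and the column order simply mirrors the `f.write` calls shown in the diff.

```python
import csv
import io


def write_history_row(handle, url, date, title, company, status, portal="indeed-ie"):
    """Write one history row; csv.writer quotes fields containing tabs or newlines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # None is normalized to an empty string so the column count stays fixed.
    writer.writerow([url, date, portal, title or "", company or "", status])


# Round trip through an in-memory buffer to show the quoting is reversible.
buffer = io.StringIO()
write_history_row(buffer, "https://example.com/job/1", "2025-01-01",
                  "Dev\tOps\nLead", None, "added")
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
```

One trade-off to keep in mind: quoting preserves the original characters, but only readers that parse with the `csv` module will understand it. If downstream code splits lines on `\t` by hand, the replace-based `_tsv_safe` approach from the suggested fix is the simpler match.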
@coderabbitai review
✅ Actions performed
Review triggered.
Closes #720
What
Adds a Python-based scanner for ie.indeed.com (Ireland) using Scrapling to bypass Cloudflare bot protection. Searches Indeed Ireland for configured queries, applies title_filter from portals.yml, deduplicates against existing history, and appends new offers to pipeline.md + scan-history.tsv.
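The filter-then-deduplicate flow described above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function names, the `negative` keyword list, the job dict keys (`url`, `title`), and a history file whose first column is the URL with a header cell named `url` are all hypothetical.

```python
def matches_title(title, positive, negative=()):
    """Case-insensitive match: at least one positive keyword, no negative keyword."""
    lowered = title.lower()
    if not any(keyword.lower() in lowered for keyword in positive):
        return False
    return not any(keyword.lower() in lowered for keyword in negative)


def load_seen_urls(history_lines):
    """Collect URLs from the first TSV column, skipping an assumed 'url' header cell."""
    seen = set()
    for line in history_lines:
        url = line.split("\t", 1)[0].strip()
        if url and url != "url":
            seen.add(url)
    return seen


def select_new_jobs(jobs, seen, positive, negative=()):
    """Keep jobs that pass the title filter and have not been seen before."""
    new_jobs = []
    for job in jobs:
        if job["url"] in seen:
            continue
        if not matches_title(job["title"], positive, negative):
            continue
        seen.add(job["url"])  # also removes duplicates within the same run
        new_jobs.append(job)
    return new_jobs
```

Keeping the seen URLs in a set makes each membership check constant time, which matters little for a few pages of results but keeps the loop simple.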
Why
The existing `scan.mjs` covers Greenhouse/Ashby/Lever APIs but misses Indeed Ireland, Ireland's largest job search website and a primary source for local roles. Indeed uses Cloudflare bot protection that blocks standard HTTP clients, so Scrapling is needed.
Changes
- `scan-indeed.py`: the scanner script (Python)
- `npm run scan:indeed` script in package.json
- `indeed_queries` section in `templates/portals.example.yml`
- `AGENTS.md`: main files table
- `modes/_shared.md`: tools + web scraping escalation ladder
- `docs/SCRIPTS.md`: script reference
- `docs/SETUP.md`: optional Python dependency
Indeed Bot Protection: Tested Approaches
- `web_fetch`
- `Fetcher`
- `StealthyFetcher`
- `humanize=True`
Dependencies
- `scrapling` (Python, pip)
- `pyyaml` (Python, pip)
Usage
Tested
- `--dry-run` mode works without writing files
Security
Scanned with ASH v3.2.5 (Bandit + Semgrep + detect-secrets): 0 findings.
Summary by CodeRabbit
New Features
Documentation