Skip to content

Upgrade to Python 3.14, enable comprehensive ruff linting, and harden CI#5

Merged
brylie merged 7 commits into
mainfrom
feature/upgrade-dependencies-and-documentation
Jun 27, 2026
Merged

Upgrade to Python 3.14, enable comprehensive ruff linting, and harden CI#5
brylie merged 7 commits into
mainfrom
feature/upgrade-dependencies-and-documentation

Conversation

@brylie

@brylie brylie commented Jun 27, 2026

Copy link
Copy Markdown
Member

Summary

  • Upgrade the project to Python 3.14 only (devcontainer, CI matrix, mypy/ruff/pyrefly configs) and refresh the full dependency tree (langchain 1.x, gradio 6.x, pydantic 2.13, etc.) to versions with 3.14 wheels.
  • Switch from pre-commit to prek (faster, drop-in compatible) as the git hook runner, and run those same checks in CI so they're enforced regardless of whether a contributor has the local hook installed.
  • Expand ruff's lint selection from a minimal E/F/I/UP set to ~50 rule categories (bugbear, bandit, pathlib, pydocstyle, simplify, pylint, tryceratops, type-checking, and more) and fix every resulting finding — every ignore (global and per-file) carries an inline comment explaining why.
  • Surface CI results natively in the PR UI: inline Ruff annotations on "Files changed", a job summary with a pass/fail table for every check, and independent (non-fail-fast) checks so one push surfaces all problems at once.

Notable fixes beyond mechanical cleanup

  • Logging: f-string log calls converted to lazy %-formatting; root-logger calls replaced with named loggers; .error().exception() in except blocks so tracebacks are actually captured.
  • Pathlib migration: ~176 os.path call sites converted to pathlib.Path across the crawler/parser/vectorizer/config modules.
  • Reverted an attempt to hide imports behind TYPE_CHECKING (TC001-003) after it caused a real NameError — gradio's event-handler wiring eagerly resolves type hints, and this project has no from __future__ import annotations to make that safe. Documented and ignored instead.
  • Removed ChromaStore's reliance on langchain's private _collection attribute, rewriting query_with_embedding/get_document against the public Chroma API.
  • Implemented multi-turn chat history support in LLMService/RAGOrchestrator: the history parameter was flagged as unused (ARG002) but was actually wired in from the Gradio app and never consumed. Threaded it into the Ollama messages list (capped via MAX_HISTORY_MESSAGES) and fixed a double-counting bug where respond_stream included the just-appended current message in history.
  • Reduced real complexity (C901/PLR0911/PLR0915) in check_model_availability, BaseCrawler._crawl_url, and the parse CLI command via extraction, not threshold-raising.
  • CI: replaced mutating ruff check --fix/ruff format steps (which silently passed PRs with violations) with prek run --all-files, the same gate used locally.

Test plan

  • uv run pytest — 179 passed
  • uv run ruff check . / uv run ruff format --check . — clean
  • uv run mypy tapio — clean
  • uv run pyrefly check — clean
  • prek run --all-files — all hooks pass
  • CI run on this PR (verify the new annotations/job-summary/independent-checks behavior renders as expected)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved chat responses by preserving prior conversation context in more places.
    • Updated document retrieval to return more useful results with relevance scores and citation details.
  • Bug Fixes

    • Refined error handling across the app, crawler, parser, and vector search for more reliable failures and clearer fallback behavior.
    • Improved file/path handling for markdown, prompts, and crawl outputs.
  • Chores

    • Upgraded the project to Python 3.14 and refreshed development, testing, and CI settings.
    • Updated recommended editor extensions and developer tooling.

brylie and others added 5 commits June 27, 2026 13:06
Bumps the project to Python 3.14 across devcontainer, CI, mypy, ruff,
and pyrefly configs, and upgrades the full dependency tree (langchain
1.x, gradio 6.x, pydantic 2.13, etc.) to versions with 3.14 wheels.

- Replace pre-commit with prek (faster, drop-in compatible) as the
  git hook runner; bump the ruff/pyupgrade hook revs and pin
  default_language_version to python3.14 so hook environments match
  the project interpreter.
- gradio 6 dropped Chatbot(type="messages") now that the messages
  format is the only one; remove the obsolete kwarg.
- langchain 1.x removed the langchain.schema compat shim; switch to
  langchain_core.documents.Document.
- langchain-huggingface now gates sentence-transformers/transformers
  behind a "full" extra; request it explicitly (including in the mypy
  hook's additional_dependencies) so HuggingFaceEmbeddings keeps
  working and type-checks cleanly.
- click 8.2+ no longer mixes stderr into CliRunner stdout by default;
  update CLI exception-message assertions to check result.output.
- Apply ruff's py314-target modernizations (PEP 604/758 syntax,
  simplified Generator[] hints).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expands ruff's lint selection from a minimal E/F/I/UP set to ~50 rule
categories (bugbear, bandit, pathlib, pydocstyle, simplify, pylint,
tryceratops, type-checking, and more), with every ignore -- global and
per-file -- carrying an inline comment explaining why.

Highlights beyond mechanical cleanup:

- Logging: all f-string log calls converted to lazy %-formatting; bare
  root-logger calls replaced with named loggers; .error() -> .exception()
  in except blocks so tracebacks are actually captured.
- Migrated ~176 os.path call sites to pathlib.Path across the crawler,
  parser, vectorizer, and config modules; updated tests to mock Path
  methods instead of os.* accordingly.
- Reverted an initial attempt to move imports behind TYPE_CHECKING
  (TC001-003): it caused a real NameError because gradio's event-handler
  wiring eagerly resolves type hints, and this project has no
  `from __future__ import annotations` to make deferred imports safe.
  Documented and ignored instead.
- Removed ChromaStore's reliance on langchain's private `_collection`
  attribute, rewriting query_with_embedding/get_document against the
  public Chroma API.
- Implemented multi-turn chat history support in LLMService/
  RAGOrchestrator: the `history` parameter was flagged as unused
  (ARG002) but was actually wired in from the Gradio app and simply
  never consumed. Threads it into the ollama messages list (capped via
  MAX_HISTORY_MESSAGES) and fixes a double-counting bug where
  respond_stream included the just-appended current message in history.
- Reduced real complexity (C901/PLR0911/PLR0915) in
  check_model_availability, BaseCrawler._crawl_url, and the `parse` CLI
  command via extraction, rather than raising thresholds.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous CI lint steps used `ruff check --fix` and `ruff format`,
which mutate files in place instead of failing the build -- a PR with
unformatted or lint-violating code would pass CI silently. Since not
every contributor runs the local prek hook, replace those steps with
`prek run --all-files`, which fails on any finding (including the
yaml/json/toml checks, pyupgrade, and trailing-comma hooks that weren't
being run in CI at all). mypy and pyrefly stay as separate steps against
the project's own already-synced venv, rather than have prek spin up a
second, duplicate dependency set just for those two hooks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributors unfamiliar with GitHub Actions shouldn't need to open
raw step logs to see what failed. Adds three native GitHub Actions
features for in-PR visibility:

- Ruff findings now also run with --output-format=github, producing
  inline annotations on the "Files changed" tab.
- Every check (prek, mypy, pyrefly, pytest, coverage) now runs with
  continue-on-error so a single push surfaces all problems at once,
  instead of stopping at the first failure and requiring a fix/push/fail
  cycle to discover the next one.
- A job summary step renders a pass/fail table for all checks plus the
  coverage percentage on the workflow run's Summary tab.
- A final gating step explicitly fails the job if any check's outcome
  wasn't "success", preserving correct CI enforcement despite the
  continue-on-error used to let every check run independently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@brylie, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 36 minutes and 42 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: bd2ea044-1a58-4629-a626-4b0c9f186f80

📥 Commits

Reviewing files that changed from the base of the PR and between c19b70f and fc53795.

📒 Files selected for processing (8)
  • .github/workflows/ci.yaml
  • .vscode/extensions.json
  • README.md
  • tapio/crawler/crawler.py
  • tapio/parser/parser.py
  • tapio/services/document_retrieval_service.py
  • tapio/utils/text_utils.py
  • tests/vectorstore/test_chroma_store.py
📝 Walkthrough

Walkthrough

The PR updates project settings to Python 3.14, refactors file I/O to pathlib.Path, propagates chat history through the RAG flow, and adjusts vector-store and embedding APIs plus tests.

Changes

Python 3.14 upgrade and runtime refactors

Layer / File(s) Summary
Toolchain and CI baseline
.devcontainer/*, .github/workflows/ci.yaml, .pre-commit-config.yaml, .python-version, .vscode/extensions.json, README.md, mise.toml, mypy.ini, pyproject.toml, pyrefly.toml, pytest.ini
Project settings move to Python 3.14, refresh devcontainer and editor recommendations, update pre-commit and dependency pins, expand Ruff settings, and replace the CI lint/test sequence with prek plus summary and failure steps.
Config and markdown loaders
tapio/config/*, tapio/prompts/*, tapio/utils/markdown_utils.py, tests/config/*, tests/prompts/*, tests/utils/test_markdown_utils.py
Configuration loading, config validation, prompt loading, and markdown discovery use pathlib.Path I/O and exception logging; the matching tests patch Path methods.
Parser and crawler core
tapio/crawler/*, tapio/parser/*, tests/crawler/*, tests/parser/*
Crawler and parser storage, link extraction, and save/load helpers switch to Path-based handling and UTC timestamps, with crawler runner logging and parser tests updated to the new filesystem and logging behavior.
CLI command wiring
tapio/cli.py, tests/test_cli.py
CLI commands centralize crawled-site discovery and parse/vectorize routing with Path-based checks and standardized exits; the CLI tests follow the new traversal mocks and output assertions.
RAG history and chat flow
tapio/app.py, tapio/models/*, tapio/services/*, tests/integration/test_rag_pipeline.py, tests/services/*, tests/test_gradio_app.py
The RAG path forwards conversation history through the orchestrator, LLM service, and Gradio app, while the model and service exports, document retrieval, and tests align with the updated interfaces.
Vector store and embedding pipeline
tapio/factories.py, tapio/utils/__init__.py, tapio/utils/embedding_utils.py, tapio/utils/text_utils.py, tapio/vectorstore/*, tests/utils/test_text_utils.py, tests/vectorstore/*
The embedding, chunking, and vector-store code switches to Path-based metadata handling, LangChain document queries, citation enrichment, and revised logging; the vector-store and text-utility tests adjust to the new public methods.

Sequence Diagram(s)

sequenceDiagram
  participant TapioAssistantApp
  participant RAGOrchestrator
  participant LLMService
  participant Ollama
  TapioAssistantApp->>RAGOrchestrator: query_stream(query_text, history=prior_turns)
  RAGOrchestrator->>LLMService: generate_response_stream(prompt, system_prompt, history=prior_turns)
  LLMService->>Ollama: chat(messages from prompt, system_prompt, history)
  Ollama-->>LLMService: streamed chunks
  LLMService-->>RAGOrchestrator: streamed chunks
  RAGOrchestrator-->>TapioAssistantApp: streamed response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Finntegrate/tapio#1: Shares the same RAG app, orchestrator, and history-aware response flow changes.

Suggested labels

Review effort 4/5

Poem

A rabbit hopped through 3.14 dew,
With paths and history streaming through.
The carrots of code now log and glint,
While vector bunnies leave a citation hint.
Thump-thump—this burrow sings anew 🐇

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: Python 3.14 upgrade, broader Ruff linting, and CI hardening.
Docstring Coverage ✅ Passed Docstring coverage is 96.08% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/upgrade-dependencies-and-documentation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Direct equality comparisons of floats are fragile in general, even
though this particular value passes through a mock unchanged. Use the
idiomatic pytest.approx instead.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (3)
tests/vectorstore/test_chroma_store.py (1)

202-204: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

SonarCloud float-equality hint here is effectively a false positive. score is the literal 0.1 returned by the mock and flows through unchanged (no arithmetic), so the exact comparison is deterministic. Optionally switch to pytest.approx(0.1) to silence the rule.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/vectorstore/test_chroma_store.py` around lines 202 - 204, The Chroma
store test is using a direct float equality check on score, which SonarCloud
flags even though the mock returns the literal value unchanged. Update the
assertion in the test around the results[0] unpacking to use a float-tolerant
comparison such as pytest.approx for score, while keeping the citation_url
assertion unchanged.

Source: Linters/SAST tools

.vscode/extensions.json (1)

2-7: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Recommend Ruff here too.

.devcontainer/devcontainer.json already includes charliermarsh.ruff, but the new workspace recommendations omit it. That leaves non-container contributors without the editor extension this PR standardizes on for lint/format workflows.

Suggested change
     "recommendations": [
         "anthropic.claude-code",
         "ms-vscode-remote.remote-containers",
         "github.vscode-pull-request-github",
-        "ms-python.python"
+        "ms-python.python",
+        "charliermarsh.ruff"
     ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.vscode/extensions.json around lines 2 - 7, The workspace recommendations in
extensions.json are missing the Ruff extension that this PR standardizes on.
Update the recommendations list in the extensions.json configuration to include
charliermarsh.ruff alongside the existing entries so non-container users get the
same lint/format support as the devcontainer setup.
.pre-commit-config.yaml (1)

16-19: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Update pyupgrade to target Python 3.14

.pre-commit-config.yaml:19 still uses --py313-plus; switch it to --py314-plus so pre-commit and CI apply 3.14-specific rewrites.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.pre-commit-config.yaml around lines 16 - 19, The pyupgrade pre-commit hook
is still targeting Python 3.13, so update the hook configuration in
.pre-commit-config.yaml to use the pyupgrade args for Python 3.14 instead.
Locate the pyupgrade hook entry and change the existing --py313-plus setting to
the 3.14 equivalent so pre-commit and CI apply the correct rewrites.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yaml:
- Around line 34-38: The workflow uses floating action tags, so update the
third-party action references in the CI workflow to full commit SHAs instead of
version tags. Replace the existing uses entries such as actions/cache@v4 with
pinned SHAs, and apply the same change to the other external actions in this
workflow so the job definitions remain stable and reproducible.

In `@pyproject.toml`:
- Around line 110-124: The Ruff per-file ignores for tests are only applied to
files directly under tests, so nested test modules still get linted. Update the
[tool.ruff.lint.per-file-ignores] pattern from tests/* to a recursive
tests/**/*.py pattern so all test files under tests, including subdirectories,
inherit the same ignores.

In `@README.md`:
- Line 125: The README sentence in the all-contributors section is grammatically
awkward and should be rewritten. Update the text to use a natural phrasing like
“Contributions of any kind are welcome!” and keep the surrounding
all-contributors description unchanged.

In `@tapio/crawler/crawler.py`:
- Around line 331-338: The containment check in crawler.py should compare path
components rather than string prefixes, since the current resolve() plus
startswith() logic in the path validation block can be bypassed by paths sharing
the same text prefix. Update the check around abs_full_path and abs_output_dir
to use abs_full_path.is_relative_to(abs_output_dir) on Python 3.9+, with a
relative_to() fallback if needed, and keep the existing ValueError path for
rejected URLs.

In `@tapio/parser/parser.py`:
- Around line 558-560: The path normalization in the nested-path branch of
parser.py is removing every ".html" substring instead of only the file
extension. Update the logic around the rel_path handling in the nested path
return so it strips only a trailing ".html" suffix, preserving any mid-path
occurrences, and keep the behavior localized to that nested-path branch.

In `@tapio/services/document_retrieval_service.py`:
- Line 56: The INFO log in DocumentRetrievalService.retrieve_documents is
persisting the raw user-supplied query_text, which should be avoided. Update the
logger.info call to stop printing the full query and instead log a request
identifier plus non-sensitive metadata such as query length or a redacted/hash
value. Keep the change localized to the retrieval logging in
DocumentRetrievalService so the request can still be traced without exposing
PII.

In `@tapio/utils/markdown_utils.py`:
- Line 71: The Python 2-style exception clause is invalid under Python 3 and
prevents module import; update the affected except blocks in markdown_utils and
the matching clause in utils/__init__.py to use a proper multi-exception form.
Locate the bad syntax in the exception-handling logic around the markdown
utility parsing path and any shared import-time helpers, then replace the
comma-separated exception list with a tuple so the modules remain importable.

In `@tapio/utils/text_utils.py`:
- Around line 75-80: In chunk_html_content, validate that chunk_overlap is
strictly smaller than chunk_size before computing estimated_chunks or any
fallback step size. Add an early fail-fast check near the top of the function
(before the chunk estimation logic) that raises ValueError for chunk_overlap >=
chunk_size, so the division and later chunking paths never see a zero or
negative step.

---

Nitpick comments:
In @.pre-commit-config.yaml:
- Around line 16-19: The pyupgrade pre-commit hook is still targeting Python
3.13, so update the hook configuration in .pre-commit-config.yaml to use the
pyupgrade args for Python 3.14 instead. Locate the pyupgrade hook entry and
change the existing --py313-plus setting to the 3.14 equivalent so pre-commit
and CI apply the correct rewrites.

In @.vscode/extensions.json:
- Around line 2-7: The workspace recommendations in extensions.json are missing
the Ruff extension that this PR standardizes on. Update the recommendations list
in the extensions.json configuration to include charliermarsh.ruff alongside the
existing entries so non-container users get the same lint/format support as the
devcontainer setup.

In `@tests/vectorstore/test_chroma_store.py`:
- Around line 202-204: The Chroma store test is using a direct float equality
check on score, which SonarCloud flags even though the mock returns the literal
value unchanged. Update the assertion in the test around the results[0]
unpacking to use a float-tolerant comparison such as pytest.approx for score,
while keeping the citation_url assertion unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b71b1f42-a291-433a-8a0a-d05ea94b788f

📥 Commits

Reviewing files that changed from the base of the PR and between 9eafdbb and c19b70f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (50)
  • .devcontainer/Dockerfile
  • .devcontainer/devcontainer.json
  • .github/workflows/ci.yaml
  • .pre-commit-config.yaml
  • .python-version
  • .vscode/extensions.json
  • README.md
  • mise.toml
  • mypy.ini
  • pyproject.toml
  • pyrefly.toml
  • pytest.ini
  • tapio/app.py
  • tapio/cli.py
  • tapio/config/config_manager.py
  • tapio/config/config_models.py
  • tapio/crawler/__init__.py
  • tapio/crawler/crawler.py
  • tapio/crawler/runner.py
  • tapio/factories.py
  • tapio/models/__init__.py
  • tapio/models/document.py
  • tapio/parser/__init__.py
  • tapio/parser/parser.py
  • tapio/prompts/__init__.py
  • tapio/prompts/prompt_loader.py
  • tapio/services/__init__.py
  • tapio/services/document_retrieval_service.py
  • tapio/services/llm_service.py
  • tapio/services/rag_orchestrator.py
  • tapio/utils/__init__.py
  • tapio/utils/embedding_utils.py
  • tapio/utils/markdown_utils.py
  • tapio/utils/text_utils.py
  • tapio/vectorstore/chroma_store.py
  • tapio/vectorstore/vectorizer.py
  • tests/config/test_config_manager.py
  • tests/crawler/test_crawler.py
  • tests/integration/test_rag_pipeline.py
  • tests/parser/test_parser.py
  • tests/parser/test_relative_links.py
  • tests/prompts/test_prompt_loader.py
  • tests/services/test_document_retrieval_service.py
  • tests/services/test_llm_service.py
  • tests/test_cli.py
  • tests/test_gradio_app.py
  • tests/utils/test_markdown_utils.py
  • tests/utils/test_text_utils.py
  • tests/vectorstore/test_chroma_store.py
  • tests/vectorstore/test_vectorizer.py
💤 Files with no reviewable changes (2)
  • tests/parser/test_relative_links.py
  • pytest.ini

Comment thread .github/workflows/ci.yaml
Comment thread pyproject.toml
Comment on lines +110 to +124
[tool.ruff.lint.per-file-ignores]
"tests/*" = [
"ANN", # tests are excluded from mypy and don't need type annotations
"ARG001", # unused fixture/mock arguments are normal in tests
"ARG002", # unused mock-override arguments are normal in tests
"D", # docstrings aren't required on test functions
"INP001", # tests intentionally use implicit namespace packages
"PLC0415", # local imports for mocking/patching timing are idiomatic in tests
"PLR0913", # test functions naturally take many fixture arguments
"PLR2004", # magic values in test assertions/fixtures are fine
"PT009", # some test classes intentionally use unittest.TestCase-style assertions
"PTH", # os.path-based temp file/dir handling in test fixtures is fine
"S101", # assert is the standard pytest idiom
"SLF001", # tests intentionally exercise private members
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Ruff, does a per-file-ignorespattern oftests/match nested files liketests/services/test_llm_service.py, or is tests/**/.py required?

💡 Result:

In Ruff, a per-file-ignores pattern of tests/* is generally insufficient for matching files in nested subdirectories because it is a non-recursive, single-level glob pattern [1]. While some users have reported behavior where single-level patterns occasionally match subdirectories depending on specific implementation details or environment conditions [2], the standard and recommended way to ensure a pattern applies to an entire directory tree (including all nested files) is to use the recursive wildcard `` [3][1]. To ensure files like tests/services/test_llm_service.py are matched, you should use: [tool.ruff.lint.per-file-ignores] "tests//.py" = ["<RULE_CODE>"] Using `tests//.py` explicitly enables recursive matching, ensuring that all Python files within the `tests/` directory and its subdirectories are covered [1]. For debugging or verifying whether a specific file is being matched as you expect, you can run `ruff check --show-settings <path/to/file.py>` to see how the patterns are resolved [4].

Citations:


🌐 Web query:

Ruff documentation per-file-ignores glob patterns recursive directory matching tests/* tests/**/*.py

💡 Result:

In Ruff, the per-file-ignores setting supports glob patterns to match files or directories for which specific lint rules should be disabled [1][2]. For recursive directory matching—where you want to apply ignores to all files within a directory and its subdirectories—the standard approach is to use the double wildcard pattern [2][3]. Recommended Patterns: 1. Recursive Matching (Recommended for most test directories): To ignore rules for all Python files throughout an entire directory tree, use the pattern "tests//.py" [2]. [tool.ruff.lint.per-file-ignores] "tests//.py" = ["D"] 2. Non-Recursive Matching: Using "tests/" targets files directly within the "tests/" directory [2]. Depending on your specific Ruff version and configuration, this may not consistently cover nested subdirectories [2][4]. Key Implementation Details: - Location: This setting is typically configured in your pyproject.toml or ruff.toml file under the [tool.ruff.lint.per-file-ignores] section [5][1][2]. - Syntax: The patterns are evaluated as glob patterns [1][2]. You can specify a list of rule codes or prefixes to ignore for the matched files [1]. - Negation: You can use an initial '!' to negate a file pattern if needed [1]. - Consistency: While some simpler glob patterns might behave differently across tool settings (like exclude vs. per-file-ignores), using the explicit recursive syntax (/.py) is the most reliable way to target all files in a directory hierarchy [2][6].

Citations:


🌐 Web query:

site:docs.astral.sh/ruff per-file-ignores glob pattern recursive nested subdirectories tests/*

💡 Result:

To apply per-file-ignores recursively to all files within a tests directory and its nested subdirectories, you should use the globstar pattern in your configuration. In your configuration file (e.g., pyproject.toml or ruff.toml), define the per-file-ignores setting using a pattern like the following [1][2][3]: [tool.ruff.lint.per-file-ignores] "tests/" = ["RULE_CODE"] The pattern ensures that the rule is applied recursively to all files within the tests directory and any of its nested subdirectories [4]. If you need to target specific file types within those directories, you can append the extension, for example: "tests//*.py" = ["RULE_CODE"] This configuration instructs the linter to exclude the specified rule codes for any matching file found at any depth under the tests directory [2].

Citations:


🌐 Web query:

Ruff glob rules recursive pattern per-file-ignores tests/**/*.py tests/** documentation

💡 Result:

In Ruff, the per-file-ignores setting uses glob patterns to identify files for which specific lint rules should be disabled [1]. You can achieve recursive matching using the double-asterisk () wildcard [2]. For recursive matching in your examples, the following patterns are correct: - To target all Python files recursively within a `tests` directory: `"tests//*.py" = [...]` [2]. - To target an entire directory tree (such as `tests` or `documentation`), you can use `"tests/" = [...]` or `"documentation/" = [...]` [3][2]. Key details for configuring `per-file-ignores`: Glob Pattern Behavior: - Single-level vs. Recursive: Patterns like `tests/*` are non-recursive and match only files immediately within the `tests/` directory [2]. Patterns using (e.g., tests//*.py or tests/) are recursive and match the entire directory hierarchy [2]. - Brace Expansion: You can group multiple paths using brace expansion to avoid repetition, for example: "/{tests,documentation}/*" = [...] [3][4]. - Negation: An initial ! negates the pattern [1]. For instance, "!src/.py" = ["D"] ignores D rules everywhere except for the src/ directory [1]. Configuration: - You define these mappings in your pyproject.toml or ruff.toml file under [tool.ruff.lint.per-file-ignores] [1][4]. - Patterns are generally resolved relative to the configuration file [5]. Note on path matching: Ruff's globbing implementation for per-file-ignores may occasionally behave differently than standard exclude patterns regarding directory boundaries [6][7]. If you encounter unexpected matching, verify your settings by running ruff check --show-settings to inspect how the glob patterns are resolved [6][5].

Citations:


🌐 Web query:

Ruff per-file-ignores glob syntax tests/** tests/ tests/**/*.py official docs

💡 Result:

Ruff's per-file-ignores setting supports glob patterns to match files or directories for which specific lint rules should be ignored [1][2][3]. These patterns are defined in your configuration file (e.g., pyproject.toml or ruff.toml) under the [tool.ruff.lint.per-file-ignores] section [1][4][5]. Key syntax points: - Recursive matching: Use to match files recursively through subdirectories [3][5]. For example, "tests//.py" matches all Python files within the tests directory and any of its subdirectories [5]. - Brace expansion: Ruff supports brace expansion, allowing you to group multiple paths or extensions [1][6]. For example, "/{tests,docs,tools}/" ignores violations in any of those directories [1][4]. - Negation: You can use a leading! to negate a pattern [2][3]. This is useful for "ignore everywhere except" scenarios, such as "!src/.py" = ["D"] to ignore Pydocstyle (D) rules in every file except those in the src directory [2][3]. - Path context: Patterns are relative to the project root (the directory containing your configuration file) [7]. - Configuration: You map these patterns to a list of rule codes or prefixes [2][5]. Example configuration (pyproject.toml): [tool.ruff.lint.per-file-ignores] # Ignore D (Pydocstyle) in all files within the tests directory recursively "tests//.py" = ["D"] # Ignore E402 in specific directories using brace expansion "/{tests,docs,tools}/" = ["E402"] # Ignore D everywhere except the src directory "!src/.py" = ["D"] Note that file pattern syntax in Ruff can occasionally behave differently depending on whether you are using per-file-ignores or the top-level exclude setting [8]. It is recommended to verify your configuration using ruff show-settings if you encounter unexpected matching behavior [8].

Citations:


🌐 Web query:

site:docs.astral.sh/ruff per-file-ignores tests/* recursive glob official

💡 Result:

In the context of the Ruff linter, you can use recursive globbing within the per-file-ignores setting [1][2]. Ruff supports standard glob patterns for file matching in the per-file-ignores configuration [3][1]. To ignore rules recursively across all files within a specific directory, you can use the double asterisk globstar pattern [1][2]. For example, to ignore rule E402 for all files within a tests directory and its subdirectories, you can use the following configuration in your pyproject.toml: [tool.ruff.lint.per-file-ignores] "/tests/" = ["E402"] Alternatively, you can group multiple directories: [tool.ruff.lint.per-file-ignores] "/{tests,docs}/" = ["E402"] Note that while the original Flake8 tool also supports per-file-ignores, its pattern matching capabilities are more limited compared to Ruff, and it does not natively support recursive globstar patterns in the same way. If you are specifically looking for recursive support, Ruff is the recommended, modern alternative that provides this functionality [1][2].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '%s\n' '--- pyproject.toml excerpt ---'
sed -n '110,124p' pyproject.toml

printf '\n%s\n' '--- tests tree (depth 3) ---'
find tests -maxdepth 3 -type f | sort | sed -n '1,200p'

Repository: Finntegrate/tapio

Length of output: 1895


🌐 Web query:

site:docs.astral.sh/ruff glob patterns recursive double asterisk per-file-ignores official docs

💡 Result:

In glob patterns, the double asterisk is known as a globstar or recursive wildcard [1]. It matches any directory, subdirectory, or file recursively within the specified path [1]. When used in the per-file-ignores configuration of tools like the Ruff linter, the pattern allows you to apply rule exclusions to files across a directory tree [2][3]. For example, a pattern like "/{tests,docs,tools}/" will match and apply the specified ignore rules to all files nested within the tests, docs, or tools directories, regardless of their depth in the project structure [2][3]. Key points regarding the recursive double asterisk in this context: - Recursive Matching: Unlike a single asterisk (), which typically matches files only within a single directory level, enables deep, recursive traversal of the file system [1]. - Configuration Usage: It is commonly used in configuration files (such as pyproject.toml or ruff.toml) to selectively disable specific linting rules for large portions of a project, such as test suites or generated documentation, without needing to list every individual file or subdirectory [2][3]. - Paths: Patterns are generally treated as relative to the project root, meaning at the start of a pattern refers to the entire directory structure relative to that root [4][2].

Citations:


🌐 Web query:

site:docs.astral.sh/ruff "per-file-ignores" "tests/*" Ruff

💡 Result:

In Ruff, you can use the lint.per-file-ignores setting to disable specific lint rules for files or directories that match a given pattern [1][2]. To ignore rules for files within the tests directory, you can add an entry to your pyproject.toml or ruff.toml file under the [tool.ruff.lint.per-file-ignores] section [1][2]. Example configuration: [tool.ruff.lint.per-file-ignores] "tests/" = ["E402", "F401"] In this example, the rules E402 and F401 will be ignored for all files contained within the tests directory [1][2]. Key details: - Pattern Matching: Ruff uses glob patterns [2]. "tests/" will match files directly inside the tests directory, while "/{tests}/" or "tests//" can be used to match files recursively within the tests directory and its subdirectories [1][2]. - Configuration: This setting must be placed under [tool.ruff.lint.per-file-ignores] if using pyproject.toml [1][2]. - CLI: While there is no dedicated CLI flag for every configuration, you can use the --per-file-ignores flag or the --config flag to override these settings from the command line [1].

Citations:


🌐 Web query:

Ruff glob matching does * match slash per-file-ignores official docs

💡 Result:

In Ruff, the behavior of glob matching—specifically whether the asterisk () character matches path separators (slashes)—can be complex and has been a subject of ongoing discussion and refinement [1][2]. Officially, Ruff's glob implementation uses the globset crate [1]. Historically, the default behavior of the globset library allows * to match path separators, which means a pattern like src/.py might inadvertently match files in subdirectories (e.g., src/subdir/file.py) in addition to files directly within the directory [1][2]. Key points regarding Ruff's glob matching in per-file-ignores: 1. Path Separator Matching: Because * matches path separators by default, patterns intended to be single-level often behave as multi-level, recursive matches [1][2]. Users have reported this as an issue where, for example, src/.py matches files deeper in the directory tree than expected [2]. 2. Limitations and Configuration: There is no simple toggle in the current user-facing configuration to disable this "match-slash" behavior [1]. While internal efforts to apply literal_separator(true) to glob patterns (which restricts * from matching slashes) have been discussed and implemented in some parts of the codebase, behavior may remain inconsistent across different settings like exclude and per-file-ignores [2][3]. 3. Recommended Practice: To avoid ambiguity or unexpected matches, use explicit recursive patterns like /.py if you intend to match all files in a tree, or target files more specifically [4][5]. If you need to match files only at a specific depth, the current implementation's tendency to treat * as matching slashes necessitates caution [2]. Because Ruff's internal glob handling is evolving, and behaviors can vary depending on the specific version and configuration context (e.g., CLI vs. file-based settings), it is best to verify your patterns using the specific Ruff version you are running if you require strict path-level matching [6][7][2].

Citations:


🌐 Web query:

site:docs.astral.sh/ruff/settings/ "per-file-ignores" "tests/*" "subdirectories"

💡 Result:

In tools like Ruff, the per-file-ignores setting supports recursive matching within subdirectories by using glob patterns [1]. To apply ignore rules to all files within a specific directory and its subdirectories, you can use the double-star glob pattern [1]. For example, to ignore specific rule codes for all files within a tests directory and any of its subdirectories, you would use the following configuration in your pyproject.toml or ruff.toml file [1]: [tool.ruff.lint.per-file-ignores] "tests//.py" = ["RULE1", "RULE2"] Using "tests/" typically matches files directly within the tests directory, whereas "tests//*.py" ensures that the pattern recurses into all nested subdirectories [1]. You can also use the! prefix to negate patterns or define complex inclusion/exclusion logic as needed [1].

Citations:


Make the Ruff test ignore recursive. tests/* only matches files directly under tests/; nested test files under tests/config/, tests/services/, etc. still need these ignores. Use tests/**/*.py here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` around lines 110 - 124, The Ruff per-file ignores for tests
are only applied to files directly under tests, so nested test modules still get
linted. Update the [tool.ruff.lint.per-file-ignores] pattern from tests/* to a
recursive tests/**/*.py pattern so all test files under tests, including
subdirectories, inherit the same ignores.

Comment thread README.md Outdated
Comment thread tapio/crawler/crawler.py
Comment thread tapio/parser/parser.py Outdated
Comment thread tapio/services/document_retrieval_service.py Outdated
markdown_files.append(file_path)
except Exception as e:
logger.error(f"Error finding markdown files: {e}")
except ValueError, IndexError:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Critical: invalid except syntax breaks module import.

except ValueError, IndexError: is Python 2 syntax and raises a SyntaxError under Python 3 (parsed as except ValueError as IndexError, which is itself invalid here). This makes the entire markdown_utils module unimportable, which cascades into the vectorizer flow and every test importing it. Wrap the exception types in a tuple.

🐛 Proposed fix
-                except ValueError, IndexError:
+                except (ValueError, IndexError):
                     # Skip files that don't match the site structure
                     pass

The graph context shows the same construct in tapio/utils/__init__.py. Please verify and fix it there too:

#!/bin/bash
# Find the invalid Python 2-style except clause across the repo
rg -nP 'except\s+\w+\s*,\s*\w+\s*:' --type=py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tapio/utils/markdown_utils.py` at line 71, The Python 2-style exception
clause is invalid under Python 3 and prevents module import; update the affected
except blocks in markdown_utils and the matching clause in utils/__init__.py to
use a proper multi-exception form. Locate the bad syntax in the
exception-handling logic around the markdown utility parsing path and any shared
import-time helpers, then replace the comma-separated exception list with a
tuple so the modules remain importable.

Comment thread tapio/utils/text_utils.py
- crawler.py: fix path-traversal check that used string-prefix matching
  on resolved paths (abs_full_path.startswith(abs_output_dir)), which a
  sibling directory sharing the same prefix could bypass. Use
  Path.is_relative_to() instead.
- parser.py: stop replacing every ".html" substring in the output path
  with str.replace(); strip only the trailing extension via
  Path.with_suffix("").
- document_retrieval_service.py: stop logging raw user query text at
  INFO level; log only its length.
- text_utils.py: fail fast with ValueError when chunk_overlap >=
  chunk_size, instead of letting it produce a zero/negative chunking
  step size downstream.
- ci.yaml: pin third-party actions (checkout, setup-uv, cache) to full
  commit SHAs instead of floating version tags, for supply-chain
  stability.
- README.md: fix awkward grammar in the all-contributors sentence.
- .vscode/extensions.json: recommend the Ruff extension, matching the
  linter this project now standardizes on.

Verified and left unchanged (false positives / not actionable):
- pyproject.toml per-file-ignores "tests/*": empirically confirmed this
  already matches nested test files (e.g. tests/config/test_*.py) --
  ruff's glob matching for per-file-ignores isn't shell-glob semantics.
- markdown_utils.py "except ValueError, IndexError:": valid Python 3.14
  syntax (PEP 758 parenthesis-free multi-exception except), not Python
  2-style. Confirmed the module imports and the test suite exercises it
  successfully; this project requires Python >=3.14.
- .pre-commit-config.yaml pyupgrade --py313-plus: the pinned pyupgrade
  version (3.21.2) has no --py314-plus flag yet; --py313-plus is its
  highest supported target.
- test_chroma_store.py float comparison: already fixed in a prior
  commit (uses pytest.approx).
@brylie

brylie commented Jun 27, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai

Verified as false positives / not actionable, left unchanged:

tests/* per-file-ignore pattern — empirically confirmed it already matches nested test files; ruff's glob semantics aren't shell-glob
except ValueError, IndexError: in markdown_utils.py — valid Python 3.14 PEP 758 syntax, not Python 2-style; confirmed via successful import and test execution
pyupgrade --py314-plus — that flag doesn't exist in the pinned pyupgrade version (3.21.2 tops out at --py313-plus)
test_chroma_store.py float comparison — already fixed in a prior commit

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@brylie Thanks for verifying these. Agreed — those are false positives / not actionable for this PR, and the stale test_chroma_store.py note was already handled in an earlier commit. I’m withdrawing those concerns.

🐇📝


✏️ Learnings added
Learnt from: brylie
URL: https://github.com/Finntegrate/tapio/pull/5

Timestamp: 2026-06-27T12:32:50.615Z
Learning: In this repository's `pyproject.toml`, Ruff `per-file-ignores` patterns such as `tests/*` should be interpreted using Ruff's own glob semantics, which already match nested files under `tests/`; do not assume shell-glob behavior when reviewing these patterns.

Learnt from: brylie
URL: https://github.com/Finntegrate/tapio/pull/5

Timestamp: 2026-06-27T12:32:50.615Z
Learning: In `tapio/utils/markdown_utils.py`, unparenthesized multiple exception types in an `except` clause (for example, `except ValueError, IndexError:`) are valid Python 3.14 syntax under PEP 758; do not flag this as Python 2-style syntax in this repository.

Learnt from: brylie
URL: https://github.com/Finntegrate/tapio/pull/5

Timestamp: 2026-06-27T12:32:50.615Z
Learning: This repository pins `pyupgrade` to version `3.21.2`, whose highest available Python target flag is `--py313-plus`; do not suggest `--py314-plus` unless the pinned `pyupgrade` version is upgraded to one that supports it.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@brylie brylie merged commit ee69a63 into main Jun 27, 2026
3 checks passed
@brylie brylie deleted the feature/upgrade-dependencies-and-documentation branch June 27, 2026 12:34
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant