feat: Async parser + release GIL on pybind functions #142

Draft
wants to merge 11 commits into
base: main

cau-git commented Jun 24, 2025

This PR establishes an async parser interface so that documents and pages can be loaded asynchronously. The GIL is released in the relevant pybind methods, which allows true parallelism when they are run from an asyncio loop.
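As a rough illustration of the pattern described above (not the code in this PR), the sketch below offloads a blocking call to worker threads from an asyncio loop. `load_page_blocking` is a hypothetical stand-in for a pybind-bound function that releases the GIL; `time.sleep` is used only because it also releases the GIL while it waits.

```python
import asyncio
import time


def load_page_blocking(page_no: int) -> dict:
    """Hypothetical stand-in for a GIL-releasing pybind call (not the docling-parse API)."""
    time.sleep(0.05)  # time.sleep releases the GIL, as a native call could
    return {"page_no": page_no}


async def load_page(page_no: int) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(load_page_blocking, page_no)


async def load_pages_parallel(page_nos: list[int]) -> list[dict]:
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(load_page(n) for n in page_nos))


if __name__ == "__main__":
    pages = asyncio.run(load_pages_parallel(list(range(1, 10))))
    print([p["page_no"] for p in pages])
```

The worker threads only overlap in practice if the native call drops the GIL; otherwise they would take turns holding it.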

⚠️ The current C++ implementation is not fully thread-safe and therefore segfaults under concurrent use. The tests below expose the problem:

# 1. Async sequential
pytest tests/test_parse.py::test_async_sequential_page_loading_sync_wrapper -s -v  # works: only uses the async interface sequentially

# 2. Async parallel
pytest tests/test_parse.py::test_async_parallel_page_loading_sync_wrapper -v -s  # breaks: uses the async interface with actual concurrency

Expected failure

tests/test_parse.py::test_async_parallel_page_loading_sync_wrapper Document loaded successfully with 9 pages
Attempting parallel page loading (expected to trigger thread-safety issues)...
Created 9 parallel tasks for pages [1, 2, 3, 4, 5, 6, 7, 8, 9]
Executing parallel page loading (this may crash due to C-backend thread-safety issues)...
Python(70213,0x17014b000) malloc: *** error for object 0x125e925d8: pointer being freed was not allocated
Python(70213,0x17014b000) malloc: *** set a breakpoint in malloc_error_break to debug
Fatal Python error: Aborted

Thread 0x0000000172163000 (most recent call first):
  <no Python frame>

Thread 0x0000000171157000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/selector_events.py", line 152 in _write_to_self
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 847 in call_soon_threadsafe
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 407 in _call_set_state
  File ???Fatal Python error: Segmentation fault

Task

  • Find out where thread safety is violated in the C++ backend. It could be related to the font cache methods.

Reminder: Update the dev environment

docling-parse was just switched to uv, so a few setup commands need to be redone. Delete the old venv first and make sure you have a fresh shell without any loaded poetry env.

uv venv venv --python 3.12
source venv/bin/activate

uv sync --all-extras # install the new deps
python -m build # do not forget, uv sync will not build the binaries, unlike poetry install.

mergify bot commented Jun 24, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

github-actions bot commented Jun 24, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

@cau-git cau-git requested a review from PeterStaar-IBM June 24, 2025 13:44
Signed-off-by: Christoph Auer <[email protected]>
@cau-git cau-git added the enhancement New feature or request label Jun 24, 2025