Connection Pool Limits Cause Sequential Processing Instead of Concurrent Execution
Summary
BAML appears to have connection pool limits that cause high-concurrency requests to be processed sequentially rather than concurrently, despite correct use of asyncio.gather(). This manifests as a distinctive timing pattern where requests complete in sequential batches rather than truly in parallel.
Environment
- BAML Version: 0.208.5 (latest as of issue creation: 0.211.0)
- Python Version: 3.12.5
- OS: macOS
- Usage Pattern: 20+ concurrent requests via asyncio.gather()
Issue Details
Expected Behavior
When making multiple concurrent BAML calls with asyncio.gather(), requests should execute in parallel, with completion times distributed according to actual API response times.
Actual Behavior
Requests are processed in sequential batches (~6 at a time), creating this pattern:
- First ~6 requests: Complete sequentially with 1.5-2s gaps between each
- Sudden burst: 6+ requests complete within milliseconds of each other
- Pattern repeats: Indicating connection pool cycling rather than true concurrency
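The three-phase pattern above can be confirmed mechanically by parsing the httpx log timestamps and computing inter-completion gaps; a bimodal gap distribution (clusters near ~1.5s and near ~0ms) is the signature of pool-limited execution rather than ordinary API latency variation. A small stdlib sketch (log lines abbreviated for illustration):

```python
from datetime import datetime

def completion_gaps(log_lines):
    """Parse 'YYYY-MM-DD HH:MM:SS,mmm ...' prefixes and return gaps in seconds."""
    times = [
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        for line in log_lines
    ]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

lines = [
    "2025-10-08 17:22:26,888 INFO httpx HTTP Request: POST ...",
    "2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST ...",
    "2025-10-08 17:22:30,228 INFO httpx HTTP Request: POST ...",
]
print(completion_gaps(lines))  # → [1.631, 1.709]
```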
Evidence from Production Logs
Sequential Processing Phase:
2025-10-08 17:22:26,888 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.631s]
2025-10-08 17:22:30,228 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.709s]
2025-10-08 17:22:31,689 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.461s]
2025-10-08 17:22:33,466 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.777s]
2025-10-08 17:22:35,298 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1.832s]
Then Sudden Concurrent Burst:
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 0ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 0ms]
2025-10-08 17:22:47,932 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
2025-10-08 17:22:47,933 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK" [Gap: 1ms]
User Code (Correctly Implemented)
```python
async def concurrent_simplified_generation(queries, context_chunks_list, baml_options):
    """From backend/backend/core/agents/helpers.py - correctly uses asyncio.gather"""
    tasks = []
    for query, context_chunks in zip(queries, context_chunks_list, strict=True):
        task = simplified_baml_qa_response(query, ..., baml_options=baml_options)
        tasks.append(task)
    # This should enable true concurrency, but BAML appears to serialize internally
    return await asyncio.gather(*tasks)
```
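The hypothesized behavior is easy to simulate: gating otherwise-concurrent tasks behind a pool of six reproduces the observed waves of completions even though asyncio.gather() is used correctly. A self-contained sketch (the pool size of 6 and the 0.1s "request" duration are assumptions chosen for illustration, not measured values):

```python
import asyncio
import time

async def main(n_tasks=20, pool_limit=6, duration=0.1):
    # Semaphore stands in for the suspected internal connection pool.
    pool = asyncio.Semaphore(pool_limit)

    async def fake_request(i):
        async with pool:                   # queue when the pool is exhausted
            await asyncio.sleep(duration)  # simulated network round-trip
            return time.monotonic()

    start = time.monotonic()
    done = await asyncio.gather(*(fake_request(i) for i in range(n_tasks)))
    # Completion times cluster into ceil(20 / 6) = 4 discrete waves,
    # matching the sequential-batch pattern in the logs above.
    return sorted(round((t - start) / duration) for t in done)

print(asyncio.run(main()))
```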
Relationship to Previous Work
Acknowledgment: The BAML team has already addressed several connection pool issues:
- PR #1027 (add limit on connection pool to prevent stalling issues in pyo3 and other FFI boundaries) and PR #1028 (set pool timeout for all clients): fixed idle-connection stalling at FFI boundaries
- PR #2205 (add a pool timeout to try to fix open file descriptor issue, as in Deno): fixed file descriptor leaks with pool timeouts
This issue is different:
- Previous fixes addressed idle connections and resource leaks
- This issue is about active connection limits preventing true concurrency
- The distinctive timing pattern suggests connection pool exhaustion rather than idle timeouts
Root Cause Analysis
BAML issues requests through httpx internally (per the logs above) but appears to apply connection pool limits unsuited to high-concurrency scenarios. The observed batching suggests the current configuration allows roughly six concurrent connections, causing additional requests to queue rather than execute in parallel.
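For reference, httpx already exposes exactly this knob via httpx.Limits; the ask here is for BAML to surface an equivalent option. The values below are illustrative placeholders, not recommended defaults:

```python
import httpx

# Pool sizing is configurable on an httpx client; the numbers
# here are placeholders, not suggested defaults for BAML.
limits = httpx.Limits(max_connections=50, max_keepalive_connections=20)
client = httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(30.0))
```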
Impact
- Performance degradation: 20 concurrent requests that should complete in ~3-5s take 30-50s
- Poor resource utilization: CPU and network remain idle while requests queue
- Unpredictable latency: Request completion depends on queue position, not actual processing
Proposed Solutions
- Expose connection pool configuration in BAML client options
- Increase default connection limits for modern high-concurrency use cases
- Add configuration similar to the existing timeout proposal in #1630 (Feature Proposal: Configurable LLM Client Timeouts)
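Until such configuration exists, an application-level workaround is to cap in-flight requests so queueing happens predictably in user code rather than opaquely inside the client. A minimal sketch (gather_bounded is a hypothetical helper name, not a BAML API):

```python
import asyncio

async def gather_bounded(coros, limit=6):
    """Run coroutines concurrently with at most `limit` in flight.

    Matching `limit` to the suspected pool size makes latency
    predictable instead of depending on opaque internal queueing.
    """
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(run(c) for c in coros))

async def demo():
    async def job(i):
        await asyncio.sleep(0.01)
        return i * 2
    return await gather_bounded((job(i) for i in range(10)), limit=3)

print(asyncio.run(demo()))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```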
Additional Context
- Issue becomes pronounced with 10+ concurrent requests
- Observed on BAML 0.208.5; reviewing changes through 0.211.0 shows no related fixes
- This significantly impacts batch processing and parallel generation workflows
- Related to #1630 (Feature Proposal: Configurable LLM Client Timeouts), which covers configurable timeouts; this issue is specifically about connection limits
Reproducible: Yes, consistently observed across multiple test runs and production usage