
[bug] Sequential Processing due to Connection Pool Limits #2594

@justinthelaw

Description

Connection Pool Limits Cause Sequential Processing Instead of Concurrent Execution

Summary

BAML appears to have connection pool limits that cause high-concurrency requests to be processed sequentially rather than concurrently, despite correct usage of asyncio.gather(). This manifests as a distinctive timing pattern where requests complete in sequential batches rather than truly in parallel.

Environment

  • BAML Version: 0.208.5 (latest as of issue creation: 0.211.0)
  • Python Version: 3.12.5
  • OS: macOS
  • Usage Pattern: 20+ concurrent requests via asyncio.gather()

Issue Details

Expected Behavior

When making multiple concurrent BAML calls with asyncio.gather(), requests should execute in parallel with completion times distributed based on actual API response times.

Actual Behavior

Requests are processed in sequential batches (~6 at a time), creating this pattern:

  1. First ~6 requests: Complete sequentially with 1.5-2s gaps between each
  2. Sudden burst: 6+ requests complete within milliseconds of each other
  3. Pattern repeats: Indicating connection pool cycling rather than true concurrency

Evidence from Production Logs

Sequential Processing Phase:

2025-10-08 17:22:26,888 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.631s]
2025-10-08 17:22:30,228 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.709s] 
2025-10-08 17:22:31,689 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.461s]
2025-10-08 17:22:33,466 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.777s]
2025-10-08 17:22:35,298 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.832s]

Then Sudden Concurrent Burst:

2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 0ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 0ms]
2025-10-08 17:22:47,932 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]
2025-10-08 17:22:47,933 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]
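
For reference, the [Gap: …] annotations can be recomputed from the httpx log timestamps with a short parsing sketch like the one below (the regex assumes the default Python logging timestamp format shown in the excerpts):

import re
from datetime import datetime

# Matches httpx completion lines like the excerpts above, e.g.
# "2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO httpx")

def completion_gaps(log_lines):
    """Yield seconds elapsed between consecutive httpx request completions."""
    prev = None
    for line in log_lines:
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            yield (ts - prev).total_seconds()
        prev = ts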

User Code (Correctly Implemented)

import asyncio

async def concurrent_simplified_generation(queries, context_chunks_list, baml_options):
    """From backend/backend/core/agents/helpers.py - correctly uses asyncio.gather"""
    tasks = []
    for query, context_chunks in zip(queries, context_chunks_list, strict=True):
        # Calling the async function returns a coroutine; gather schedules
        # them all concurrently on the event loop.
        task = simplified_baml_qa_response(query, ..., baml_options=baml_options)
        tasks.append(task)

    # This should enable true concurrency, but BAML appears to serialize internally
    return await asyncio.gather(*tasks)
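
A generic timing harness along these lines reproduces the batching pattern; make_call is a placeholder for the real BAML invocation (e.g. a lambda wrapping simplified_baml_qa_response) and is not part of BAML's API:

import asyncio
import time
from typing import Awaitable, Callable

async def _timed(coro: Awaitable, done: list[float]):
    out = await coro
    done.append(time.monotonic())  # record completion order on the shared list
    return out

async def measure_concurrency(make_call: Callable[[], Awaitable], n: int = 20):
    """Fire n calls via asyncio.gather and report inter-completion gaps.

    Truly concurrent execution should show completions clustered near the
    API latency; pool-limited execution shows the staircase pattern above.
    """
    done: list[float] = []
    start = time.monotonic()
    await asyncio.gather(*(_timed(make_call(), done) for _ in range(n)))
    gaps = [b - a for a, b in zip(done, done[1:])]
    print(f"total {done[-1] - start:.2f}s; gaps {[round(g, 3) for g in gaps]}")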

Relationship to Previous Work

Acknowledgment: the BAML team has already addressed several connection pool issues in earlier releases. This issue is different:

  • Previous fixes addressed idle connections and resource leaks
  • This issue is about active connection limits preventing true concurrency
  • The distinctive timing pattern suggests connection pool exhaustion rather than idle timeouts

Root Cause Analysis

BAML appears to make its HTTP calls through httpx internally (the httpx log lines above come from inside the same process), with connection pool limits that aren't suited to high-concurrency workloads. The observed behavior is consistent with a cap of roughly 6 concurrent connections, causing additional requests to queue rather than execute in parallel.
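
For comparison, httpx itself exposes this knob at client construction time (values below are illustrative; the defaults documented by httpx are max_connections=100 and max_keepalive_connections=20). If BAML pins its internal limits without surfacing them, callers have no way to raise the cap:

import httpx

# Pool limits are set when the client is constructed; they cannot be
# changed on a client owned by a library unless the library exposes them.
limits = httpx.Limits(
    max_connections=200,           # hard cap on simultaneous connections
    max_keepalive_connections=50,  # idle connections kept warm for reuse
    keepalive_expiry=5.0,          # seconds before an idle connection closes
)
client = httpx.AsyncClient(limits=limits)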

Impact

  • Performance degradation: 20 concurrent requests that should complete in ~3-5s instead take 30-50s (roughly 20 × the ~1.5-2s serialized gap seen in the logs)
  • Poor resource utilization: CPU and network remain idle while requests queue
  • Unpredictable latency: Request completion depends on queue position, not actual processing

Proposed Solutions

  1. Expose connection pool configuration in BAML client options (a hypothetical sketch follows this list)
  2. Increase default connection limits for modern high-concurrency use cases
  3. Add configuration similar to the existing timeout proposal in Feature Proposal: Configurable LLM Client Timeouts #1630
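
Purely as a hypothetical shape for solution 1 (the pool-related keys below do not exist in baml_py today), pool limits could be accepted alongside the existing client options in the ClientRegistry:

from baml_py import ClientRegistry

cr = ClientRegistry()
cr.add_llm_client(
    name="HighConcurrency",
    provider="openai-generic",
    options={
        "base_url": "http://localhost:8080/v1",
        "model": "my-model",
        # HYPOTHETICAL keys, mirroring httpx.Limits -- not currently supported:
        "max_connections": 200,
        "max_keepalive_connections": 50,
    },
)
cr.set_primary("HighConcurrency")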

Additional Context

  • Issue becomes pronounced with 10+ concurrent requests
  • Observed on BAML 0.208.5; a review of releases through 0.211.0 shows no related fixes
  • This significantly impacts batch processing and parallel generation workflows
  • Related to Feature Proposal: Configurable LLM Client Timeouts #1630 (configurable timeouts) but specifically about connection limits

Reproducible: Yes, consistently observed across multiple test runs and production usage
