Open-source RAG prompt compression middleware. Keep the signal. Drop the noise.
Built with FastAPI, LLMLingua-2, and tiktoken. MIT licensed, self-hostable, pip installable.
🌐 trywinnow.vercel.app · 📦 PyPI · 🤗 HuggingFace Space · ⭐ GitHub
Winnow sits between your vector database and your LLM. It takes raw retrieved document chunks, compresses them using LLMLingua-2 token-level scoring guided by your query, and returns a shorter context that preserves answer-relevant content - cutting token costs by ~50% with less than 3% accuracy loss.
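The core idea - score each token against the query and keep only the highest-scoring ones - can be sketched in plain Python. This is an illustrative toy, not the actual LLMLingua-2 scorer (which uses a trained transformer classifier); it stands in with simple lexical overlap:

```python
# Toy sketch of query-guided extractive compression.
# Real Winnow uses LLMLingua-2's trained token scorer; this stand-in
# ranks words by lexical overlap with the query, then keeps the top
# ratio-fraction of words in their original order.

def toy_compress(context: str, query: str, ratio: float = 0.5) -> str:
    query_terms = {w.lower().strip(".,?") for w in query.split()}
    words = context.split()
    # Query-overlapping words rank first; ties resolved by position.
    ranked = sorted(
        range(len(words)),
        key=lambda i: (words[i].lower().strip(".,?") not in query_terms, i),
    )
    keep = sorted(ranked[: max(1, int(len(words) * ratio))])  # restore order
    return " ".join(words[i] for i in keep)

chunk = "The warranty period for this product is two years from purchase date"
print(toy_compress(chunk, "What is the warranty period?", ratio=0.5))
```

Half the words are dropped, but the query-relevant tokens survive the cut - the same behavior the real scorer delivers with far better judgment about which filler to discard.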
- 🗜️ Token Compression: Cuts retrieved context by ~50% using LLMLingua-2
- 🎯 Query-Guided: Compression is steered by your question - relevant tokens survive
- 🔒 Protected Words: Mark phrases that must never be removed
- ⚙️ Ratio Control: Tune aggressiveness from 0.1 (light) to 0.9 (heavy)
- 🔌 OpenAI-Compatible Proxy: Drop-in `/v1/chat/completions` endpoint - zero code changes
- 🦜 LangChain Integration: Native `WinnowRetriever` drop-in wrapper
- 🐳 Self-Hostable: Single Docker command, no API key required
- 📦 Pip Installable: `pip install winnow-compress`
Tested on SQuAD with LLMLingua-2. Baseline F1: 78.4. Avg latency: ~85ms.
| Preset | Ratio | Tokens In | Tokens Out | Reduction | F1 Score | F1 Drop |
|---|---|---|---|---|---|---|
| Light | 0.7 | 420 | 294 | ~30% | 77.6 | <1 pt |
| Balanced | 0.5 | 420 | 210 | ~50% | 76.1 | 2.3 pt |
| Aggressive | 0.3 | 420 | 147 | ~65% | 73.4 | 5.0 pt |
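The savings math is straightforward. Assuming gpt-4o input pricing of $2.50 per million tokens (consistent with the `estimated_savings_usd` example elsewhere in this README), the Balanced preset on a 420-token context works out to:

```python
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed gpt-4o input price: $2.50 / 1M tokens

tokens_in, ratio = 420, 0.5
tokens_out = int(tokens_in * ratio)    # 210
tokens_saved = tokens_in - tokens_out  # 210
savings_usd = tokens_saved * PRICE_PER_TOKEN
print(f"{savings_usd:.6f}")            # 0.000525
```

Fractions of a cent per request, but across thousands of RAG calls per day the reduction compounds - and it also frees context-window headroom for more retrieved chunks.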
```bash
# Self-host in one command
docker run -p 8000:8000 itsaryanchauhan/winnow
```

API live at http://localhost:8000 · Docs at http://localhost:8000/docs
📖 Full Integration Examples (Docker, pip, LangChain, REST, OpenAI Proxy)
```bash
docker run -p 8000:8000 itsaryanchauhan/winnow
```

```bash
pip install winnow-compress
```

```python
from winnow import Winnow

client = Winnow()
result = client.compress(
    text=input_text,
    compression_ratio=0.5,
    rag_mode=True,
    question="What is the warranty period?"
)

print(result["output"])
print(result["original_tokens"])        # e.g. 420
print(result["compressed_tokens"])      # e.g. 210
print(result["ratio"])                  # e.g. 0.5
print(result["estimated_savings_usd"])  # e.g. 0.000525
```

```python
from winnow.langchain import WinnowRetriever

retriever = WinnowRetriever(
    base_retriever,
    compression_ratio=0.5
)
docs = retriever.get_relevant_documents("your question")
```

```bash
curl -X POST http://localhost:8000/v1/compress \
  -H "Content-Type: application/json" \
  -d '{
    "input": "your retrieved chunks here",
    "compression_ratio": 0.5,
    "rag_mode": true,
    "question": "what is the capital of France?",
    "protected_strings": ["Paris", "France"]
  }'
```

Zero code changes if you already use the OpenAI SDK - just swap the base URL:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": your_prompt}]
)
```

| Field | Type | Required | Description |
|---|---|---|---|
| `input` | string | ✅ | Text or context to compress |
| `question` | string | ❌ | Optional query for RAG-guided compression |
| `compression_ratio` | float | ❌ | Compression ratio 0.1–0.9. Default: 0.5 |
| `protected_strings` | string[] | ❌ | Words/phrases that must not be removed |
| `rag_mode` | boolean | ❌ | Enable question-guided compression. Default: false |
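A client-side helper that builds a `/v1/compress` request body and enforces the field constraints above might look like this (an illustrative sketch, not part of the winnow package):

```python
def build_compress_payload(input_text, question=None, compression_ratio=0.5,
                           protected_strings=None, rag_mode=False):
    """Validate fields against the /v1/compress schema and build the JSON body."""
    if not input_text:
        raise ValueError("'input' is required")
    if not 0.1 <= compression_ratio <= 0.9:
        raise ValueError("compression_ratio must be within 0.1-0.9")
    payload = {
        "input": input_text,
        "compression_ratio": compression_ratio,
        "rag_mode": rag_mode,
    }
    # Optional fields are omitted rather than sent as null.
    if question is not None:
        payload["question"] = question
    if protected_strings:
        payload["protected_strings"] = list(protected_strings)
    return payload
```

Catching an out-of-range ratio before the request leaves your process saves a round trip to the server's own validation.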
| Field | Type | Description |
|---|---|---|
| `output` | string | The compressed output |
| `original_tokens` | int | Token count before compression |
| `compressed_tokens` | int | Token count after compression |
| `ratio` | float | Actual ratio achieved |
| `estimated_savings_usd` | float | Estimated USD saved (gpt-4o pricing) |
```bash
curl -X POST http://localhost:8000/v1/compress/batch \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": ["chunk one...", "chunk two..."],
    "compression_ratio": 0.5,
    "rag_mode": true,
    "question": "your query"
  }'
```

```
Winnow/
├── app/              # FastAPI application
│   └── main.py       # API routes and server
├── winnow/           # pip package
│   ├── client.py     # Winnow
│   └── langchain.py  # WinnowRetriever
├── benchmarks/       # SQuAD benchmark scripts and results
├── tests/            # Test suite
├── website/          # Next.js website (trywinnow.vercel.app)
├── Dockerfile
└── pyproject.toml
```
| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/compress` | Compress a single context |
| POST | `/v1/compress/batch` | Compress multiple contexts |
| POST | `/v1/chat/completions` | OpenAI-compatible proxy with auto-compression |
| GET | `/health` | Health check |
- API: FastAPI + Python
- Compression: microsoft/llmlingua-2-xlm-roberta-large-meetingbank
- Tokenizer: tiktoken (cl100k_base)
- Deploy: Docker + HuggingFace Spaces
Created by Aryan Chauhan (@itsaryanchauhan)
Have questions or suggestions? Open an issue on the GitHub repository.
MIT License · Live Demo · HuggingFace Space