Skip to content

itsaryanchauhan/Winnow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

23 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Winnow 🌾

Open-source RAG prompt compression middleware. Keep the signal. Drop the noise.

Built with FastAPI, LLMLingua-2, and tiktoken. MIT licensed, self-hostable, pip installable.

🌐 trywinnow.vercel.app Β· πŸ“¦ PyPI Β· πŸ€— HuggingFace Space Β· ⭐ GitHub


🎯 What is Winnow?

Winnow sits between your vector database and your LLM. It takes raw retrieved document chunks, compresses them using LLMLingua-2 token-level scoring guided by your query, and returns a shorter context that preserves answer-relevant content - cutting token costs by ~50% with less than 3% accuracy loss.

✨ Key Features

  • πŸ—œοΈ Token Compression: Cuts retrieved context by ~50% using LLMLingua-2
  • 🎯 Query-Guided: Compression is steered by your question - relevant tokens survive
  • πŸ”’ Protected Words: Mark phrases that must never be removed
  • βš–οΈ Ratio Control: Tune aggressiveness from 0.1 (light) to 0.9 (heavy)
  • πŸ”Œ OpenAI-Compatible Proxy: Drop-in /v1/chat/completions - zero code changes
  • 🦜 LangChain Integration: Native WinnowRetriever drop-in wrapper
  • 🐳 Self-Hostable: Single Docker command, no API key required
  • πŸ“¦ Pip Installable: pip install winnow-compress

πŸ“Š Benchmarks

Tested on SQuAD with LLMLingua-2. Baseline F1: 78.4. Avg latency: ~85ms.

Preset Ratio Tokens In Tokens Out Reduction F1 Score F1 Drop
Light 0.7 420 294 ~30% 77.6 <1 pt
Balanced 0.5 420 210 ~50% 76.1 2.3 pt
Aggressive 0.3 420 147 ~65% 73.4 5.0 pt

πŸš€ Quick Start

# Self-host in one command
docker run -p 8000:8000 itsaryanchauhan/winnow

API live at http://localhost:8000 Β· Docs at http://localhost:8000/docs

πŸ“– Full Integration Examples (Docker, pip, LangChain, REST, OpenAI Proxy)

Option 1 - Self-host with Docker

docker run -p 8000:8000 itsaryanchauhan/winnow

Option 2 - pip install

pip install winnow-compress
from winnow import Winnow

client = Winnow()

result = client.compress(
    text=input_text,
    compression_ratio=0.5,
    rag_mode=True,
    question="What is the warranty period?"
)

print(result["output"])
print(result["original_tokens"])     # e.g. 420
print(result["compressed_tokens"])   # e.g. 210
print(result["ratio"])               # e.g. 0.5
print(result["estimated_savings_usd"])    # e.g. 0.000525

Option 3 - LangChain Drop-in

from winnow.langchain import WinnowRetriever

retriever = WinnowRetriever(
    base_retriever,
    compression_ratio=0.5
)
docs = retriever.get_relevant_documents("your question")

Option 4 - REST API (curl)

curl -X POST http://localhost:8000/v1/compress \
  -H "Content-Type: application/json" \
  -d '{
    "input": "your retrieved chunks here",
    "compression_ratio": 0.5,
    "rag_mode": true,
    "question": "what is the capital of France?",
    "protected_strings": ["Paris", "France"]
  }'

Option 5 - OpenAI-Compatible Proxy

Zero code changes if you already use the OpenAI SDK - just swap the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": your_prompt}]
)

Request Parameters

Field Type Required Description
input string βœ… Text or context to compress
question string ❌ Optional query for RAG-guided compression
compression_ratio float ❌ Compression ratio 0.1–0.9. Default: 0.5
protected_strings string[] ❌ Words/phrases that must not be removed
rag_mode boolean ❌ Enable question-guided compression. Default: false

Response Fields

Field Type Description
output string The compressed output
original_tokens int Token count before compression
compressed_tokens int Token count after compression
ratio float Actual ratio achieved
estimated_savings_usd float Estimated USD saved (gpt-4o pricing)

Batch Compression

curl -X POST http://localhost:8000/v1/compress/batch \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": ["chunk one...", "chunk two..."],
    "compression_ratio": 0.5,
    "rag_mode": true,
    "question": "your query"
  }'

πŸ“ Project Structure

Winnow/
β”œβ”€β”€ app/               # FastAPI application
β”‚   └── main.py        # API routes and server
β”œβ”€β”€ winnow/            # pip package
β”‚   β”œβ”€β”€ client.py      # Winnow
β”‚   └── langchain.py   # WinnowRetriever
β”œβ”€β”€ benchmarks/        # SQuAD benchmark scripts and results
β”œβ”€β”€ tests/             # Test suite
β”œβ”€β”€ website/           # Next.js website (trywinnow.vercel.app)
β”œβ”€β”€ Dockerfile
└── pyproject.toml

πŸ”Œ API Endpoints

Method Endpoint Description
POST /v1/compress Compress a single context
POST /v1/compress/batch Compress multiple contexts
POST /v1/chat/completions OpenAI-compatible proxy with auto-compression
GET /health Health check

πŸ› οΈ Built With

  • API: FastAPI + Python
  • Compression: microsoft/llmlingua-2-xlm-roberta-large-meetingbank
  • Tokenizer: tiktoken (cl100k_base)
  • Deploy: Docker + HuggingFace Spaces

πŸ‘€ Author

Created by Aryan Chauhan (@itsaryanchauhan)

πŸ“ž Get in Touch

Have questions or suggestions? Open an issue on the GitHub repository.


MIT License Β· Live Demo Β· HuggingFace Space

About

Keep the signal. Drop the noise.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors