One model is an opinion. Three is research.
Artemis Consensus is a multi-LLM ensemble system that doesn't trust any single AI model to give you the right answer.
Instead, it does something smarter:
- Asks three models the same question — Llama 3 (via Groq), Gemini Flash, and Qwen-72B (via HuggingFace) — all at the same time.
- Has each model critique the others — every model scores and reviews the other models' answers across five quality dimensions.
- Detects when the models disagree — using TF-IDF cosine similarity to flag conflicting responses before you ever see them.
- Synthesizes one final answer — pulling the best, most accurate parts from each response, fixing errors the critiques found, and citing which model contributed what.
The result is a single answer you can actually trust, with a confidence label and full transparency into how it was built.
- Python 3.10+
- API keys for Groq, Google Gemini, and HuggingFace
git clone https://github.com/umran666/Artemis-Consensus.git
cd Artemis-Consensus
pip install -r requirements.txt

Create a .env file in the project root:
GROQ_API_KEY=your_groq_key_here
GEMINI_API_KEY=your_gemini_key_here
HUGGINGFACE_API_KEY=your_huggingface_key_here

Launch the app:

streamlit run app.py

The sidebar shows a live status of which API keys are connected before you run anything.
All three models receive your question simultaneously via asyncio.gather(). No model waits for another — they run in parallel for speed.
| Model | Provider | Default Model ID |
|---|---|---|
| Llama 3 (70B) | Groq | llama-3.3-70b-versatile |
| Gemini Flash Lite | Google | gemini-2.5-flash-lite |
| Qwen 72B Instruct | HuggingFace | Qwen/Qwen2.5-72B-Instruct |
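
The parallel fan-out described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `fake_model_call`, `ask_all`, the short model names, and the delays are all hypothetical stand-ins, and `asyncio.sleep` takes the place of a real provider request.

```python
import asyncio
import time


async def fake_model_call(name: str, delay: float, question: str) -> dict:
    """Hypothetical stand-in for a provider call; sleeps instead of hitting an API."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return {
        "model": name,
        "answer": f"{name} answer to: {question}",
        "latency": time.perf_counter() - start,
    }


async def ask_all(question: str) -> list:
    # Model names and delays are made up for illustration.
    calls = [
        fake_model_call("llama", 0.05, question),
        fake_model_call("gemini", 0.03, question),
        fake_model_call("qwen", 0.04, question),
    ]
    # return_exceptions=True hands back an exception object in place of a
    # failed call's result instead of raising, so successful answers survive.
    return await asyncio.gather(*calls, return_exceptions=True)


results = asyncio.run(ask_all("What is 2 + 2?"))
for r in results:
    print(r["model"], round(r["latency"], 3))
```

`asyncio.gather()` returns results in the order the calls were passed in, regardless of which one finishes first, so downstream steps can rely on a stable model order.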
Before critiquing, the system checks: do the models even agree?
It vectorizes all answers using TF-IDF, computes pairwise cosine similarity, and flags a disagreement if the average similarity drops below 0.6. When this happens, you'll see a warning in the UI — a heads-up that the question might be contentious or nuanced.
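
To make the check concrete, here is a dependency-free sketch of the same idea. The project uses scikit-learn for this step; the version below is a hand-rolled stand-in with a smoothed IDF and L2 normalisation, so its exact similarity values may differ from the library's. The function names are hypothetical, and only the 0.6 threshold comes from the configuration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Build L2-normalised TF-IDF vectors (smoothed IDF) as term -> weight dicts."""
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n = len(docs)
    # Document frequency: in how many answers each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def detect_disagreement(answers, threshold=0.6):
    """Return (average pairwise similarity, disagreement flag)."""
    pairs = list(combinations(tfidf_vectors(answers), 2))
    if not pairs:  # fewer than two answers: nothing to compare
        return 1.0, False
    avg = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return avg, avg < threshold


print(detect_disagreement(["Paris is the capital of France."] * 3))
print(detect_disagreement(["cats purr softly", "stock markets fell sharply", "rain tomorrow"]))
```

Because TF-IDF compares vocabulary rather than meaning, two answers that agree but are phrased very differently can still score low, so the flag is best read as a prompt to look closer rather than proof of conflict.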
This is where Artemis gets interesting. Each model evaluates the other models' answers — not its own. The critique produces a structured JSON score across five dimensions (more on that below), plus written feedback identifying the biggest weakness.
The scores from all critics are averaged per target model, giving you a fair, multi-perspective evaluation.
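
The pairing and averaging can be sketched like this. It is an illustration under assumptions: the function names and the `(critic, target)` dictionary layout are hypothetical, scores are assumed to lie between 0 and 1, and the real critique step calls a model and parses its JSON rather than receiving ready-made numbers.

```python
from collections import defaultdict
from itertools import permutations
from statistics import mean

DIMENSIONS = ["factuality", "completeness", "confidence", "consistency", "reasoning"]


def critique_pairs(models):
    """Every (critic, target) pair; a model never reviews its own answer."""
    return list(permutations(models, 2))


def average_scores(critiques):
    """Average each dimension per target model.

    `critiques` maps (critic, target) to a dict of dimension scores.
    """
    by_target = defaultdict(list)
    for (_critic, target), scores in critiques.items():
        by_target[target].append(scores)
    return {
        target: {d: mean(s[d] for s in score_list) for d in DIMENSIONS}
        for target, score_list in by_target.items()
    }


# Made-up scores: two critics review the hypothetical "llama" answer.
critiques = {
    ("gemini", "llama"): {d: 0.6 for d in DIMENSIONS},
    ("qwen", "llama"): {d: 1.0 for d in DIMENSIONS},
}
print(critique_pairs(["llama", "gemini", "qwen"]))
print(average_scores(critiques))
```

With three models this yields six critiques, so each answer is scored by two independent critics before averaging.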
The highest-scoring answers are fed to a synthesizer model along with all critique feedback. It produces one final answer that:
- Pulls the most accurate parts from each response
- Fixes factual errors identified during critique
- Includes inline citations like [Llama 3 (Groq)] so you know where each fact came from
- Treats all model answers as untrusted evidence, so prompt injection inside model responses is explicitly rejected
Every answer is evaluated on five dimensions, each weighted to reflect its importance:
| Dimension | Weight | What It Measures |
|---|---|---|
| Factuality | 35% | Are the claims accurate and well-supported? |
| Completeness | 20% | Does it fully address the question? |
| Confidence | 15% | Is the confidence level appropriate — not over or under? |
| Consistency | 15% | Is the answer internally coherent? |
| Reasoning | 15% | Is the logic sound and well-structured? |
These combine into a weighted total that maps to a confidence label:
| Score Range | Label |
|---|---|
| ≥ 0.80 | 🟢 High Confidence |
| ≥ 0.60 | 🟡 Medium Confidence |
| ≥ 0.40 | 🟠 Low Confidence |
| < 0.40 | 🔴 Needs Verification |
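
As a worked example of the two tables above, the sketch below combines per-dimension scores into a weighted total and maps it to a label. The weights and thresholds are taken from the tables; the function names and the sample scores are hypothetical, and scores are assumed to lie between 0 and 1.

```python
WEIGHTS = {
    "factuality": 0.35,
    "completeness": 0.20,
    "confidence": 0.15,
    "consistency": 0.15,
    "reasoning": 0.15,
}


def weighted_total(scores):
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)


def confidence_label(total):
    """Map a weighted total to the label from the score-range table."""
    if total >= 0.80:
        return "High Confidence"
    if total >= 0.60:
        return "Medium Confidence"
    if total >= 0.40:
        return "Low Confidence"
    return "Needs Verification"


# Made-up scores for one answer.
sample = {
    "factuality": 0.9,
    "completeness": 0.8,
    "confidence": 0.7,
    "consistency": 0.9,
    "reasoning": 0.8,
}
total = weighted_total(sample)
print(round(total, 3), confidence_label(total))
```

Here the total works out to 0.35 × 0.9 + 0.20 × 0.8 + 0.15 × 0.7 + 0.15 × 0.9 + 0.15 × 0.8 = 0.835, which falls in the High Confidence band.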
Here's an actual example of Artemis processing a contentious question through all three models:
Question: "What causes more deaths annually — sharks or vending machines?"
Artemis Final Answer: 🟢 High Confidence (0.87)
Vending machines cause more deaths annually than sharks [Gemini Flash]. While shark attacks are rare and highly publicized, they result in an average of fewer than 10 human fatalities worldwide each year [Qwen-72B]. In contrast, vending machine accidents, often due to tip-overs or other accidents, can lead to more deaths [Llama 3 (Groq)]. Estimates suggest that in the United States alone, vending machines cause about 2-3 deaths per year [Llama 3 (Groq)], and globally, the number of vending machine-related deaths could be higher [Llama 3 (Groq)]. It's essential to note that both shark attacks and vending machine accidents are extremely rare, and the likelihood of being killed by either is extremely low [Llama 3 (Groq)].
Each model is scored across the five quality dimensions by the other models. The radar chart below shows how scores compare at a glance:
The ensemble consistently outperforms any single model. Below is a sample TruthfulQA run showing individual model accuracy versus the synthesized Artemis answer:
Key takeaway: The ensemble doesn't just pick the best model — it combines the strengths of all three, producing answers that are more accurate than any individual contributor.
The interface is organized into five tabs:
| Tab | What You See |
|---|---|
| Final Answer | The synthesized answer with a confidence badge and disagreement warning |
| Model Responses | Each model's raw output with success/failure status and latency |
| Critique | Written feedback from each critic explaining weaknesses |
| Scores | Heatmap table + line chart comparing all models across the five dimensions |
| Metrics | Total latency, total tokens, and a per-model performance breakdown |
You can evaluate the ensemble against standard datasets from the command line:
# TruthfulQA — tests factual accuracy
python run_benchmarks.py --benchmark_name truthfulqa --samples 100
# GSM8K — tests mathematical reasoning
python run_benchmarks.py --benchmark_name gsm8k --samples 50

Results are saved as CSV files to data/results/, and a summary with per-model and ensemble accuracy is printed to the console.
Artemis-Research/
│
├── app.py # Streamlit UI entry point
├── config.yaml # All tuneable parameters
├── requirements.txt # Python dependencies
├── run_benchmarks.py # CLI benchmark runner
├── .env # API keys (not committed)
│
├── Models/
│ ├── base.py # Abstract base class + ModelResponse
│ ├── groq.py # Groq / Llama 3 integration
│ ├── gemini.py # Google Gemini integration
│ └── huggingface.py # HuggingFace Inference API
│
├── pipeline/
│ ├── ensemble.py # Core orchestrator — runs the full pipeline
│ ├── critique.py # Cross-critique logic and prompt engineering
│ ├── scorer.py # Score dataclass and JSON response parser
│ └── synthesis.py # Final answer generation with citations
│
├── evaluation/
│ └── benchmark.py # TruthfulQA & GSM8K evaluation harness
│
├── data/ # Benchmark output directory
├── images/ # Architecture diagrams and assets
├── logs/ # Runtime logs
└── utils/ # Extensible utility modules
Everything tuneable lives in config.yaml:
pipeline:
  mode: cross_critique          # critique strategy
  disagreement_threshold: 0.6   # cosine similarity cutoff
  timeout_seconds: 30           # per-model timeout
  retry_attempts: 2             # retries on failure
scoring:
  weights:
    factuality: 0.35
    confidence: 0.15
    completeness: 0.20
    consistency: 0.15
    reasoning: 0.15
evaluation:
  benchmarks: [truthfulqa, gsm8k]
  sample_size: 100
  output_dir: data/results

| What | Why |
|---|---|
| Streamlit | Interactive research UI without frontend overhead |
| asyncio | Parallel model calls — no waiting in sequence |
| scikit-learn | TF-IDF + cosine similarity for disagreement detection |
| Matplotlib & Pandas | Score visualisation and data handling |
| HuggingFace Datasets | Standard benchmark datasets (TruthfulQA, GSM8K) |
| Loguru | Structured logging |
| python-dotenv | Secure API key management |
This project is intended for research and educational purposes.
Artemis Consensus — because the best answer is the one that survives peer review.

