Artemis Consensus

One model is an opinion. Three is research.

Quick Start • How It Works • Scoring • Benchmarks • Configuration

What Is This?

Artemis Consensus is a multi-LLM ensemble system that doesn't trust any single AI model to give you the right answer.

Instead, it does something smarter:

Asks three models the same question — Llama 3 (via Groq), Gemini Flash, and Qwen-72B (via HuggingFace) — all at the same time.
Has each model critique the others — every model scores and reviews the other models' answers across five quality dimensions.
Detects when the models disagree — using TF-IDF cosine similarity to flag conflicting responses before you ever see them.
Synthesizes one final answer — pulling the best, most accurate parts from each response, fixing errors the critiques found, and citing which model contributed what.

The result is a single answer you can actually trust, with a confidence label and full transparency into how it was built.

Quick Start

Prerequisites

Python 3.10+
API keys for Groq, Google Gemini, and HuggingFace

Installation

git clone https://github.com/umran666/Artemis-Consensus.git
cd Artemis-Consensus
pip install -r requirements.txt

Environment Setup

Create a .env file in the project root:

GROQ_API_KEY=your_groq_key_here
GEMINI_API_KEY=your_gemini_key_here
HUGGINGFACE_API_KEY=your_huggingface_key_here

Launch

streamlit run app.py

The sidebar will show you a live status of which API keys are connected before you run anything.

How It Works

Step 1 — Parallel Generation

All three models receive your question simultaneously via asyncio.gather(). No model waits for another — they run in parallel for speed.

Model	Provider	Default Model ID
Llama 3 (70B)	Groq	`llama-3.3-70b-versatile`
Gemini Flash Lite	Google	`gemini-2.5-flash-lite`
Qwen 72B Instruct	HuggingFace	`Qwen/Qwen2.5-72B-Instruct`

Step 2 — Disagreement Detection

Before critiquing, the system checks: do the models even agree?

It vectorizes all answers using TF-IDF, computes pairwise cosine similarity, and flags a disagreement if the average similarity drops below 0.6. When this happens, you'll see a warning in the UI — a heads-up that the question might be contentious or nuanced.

Step 3 — Cross-Critique

This is where Artemis gets interesting. Each model evaluates the other models' answers — not its own. The critique produces a structured JSON score across five dimensions (more on that below), plus written feedback identifying the biggest weakness.

The scores from all critics are averaged per target model, giving you a fair, multi-perspective evaluation.

Step 4 — Synthesis

The highest-scoring answers are fed to a synthesizer model along with all critique feedback. It produces one final answer that:

Pulls the most accurate parts from each response
Fixes factual errors identified during critique
Includes inline citations like [Llama 3 (Groq)] so you know where each fact came from
Treats all model answers as untrusted evidence — prompt injection inside model responses is explicitly rejected

Scoring

Every answer is evaluated on five dimensions, each weighted to reflect its importance:

Dimension	Weight	What It Measures
Factuality	35%	Are the claims accurate and well-supported?
Completeness	20%	Does it fully address the question?
Confidence	15%	Is the confidence level appropriate — not over or under?
Consistency	15%	Is the answer internally coherent?
Reasoning	15%	Is the logic sound and well-structured?

These combine into a weighted total that maps to a confidence label:

Score Range	Label
≥ 0.80	🟢 High Confidence
≥ 0.60	🟡 Medium Confidence
≥ 0.40	🟠 Low Confidence
< 0.40	🔴 Needs Verification

Sample Output

Here's an actual example of Artemis processing a contentious question through all three models:

Question: "What causes more deaths annually — sharks or vending machines?"

Artemis Final Answer: 🟢 High Confidence (0.87)

Vending machines cause more deaths annually than sharks [Gemini Flash]. While shark attacks are rare and highly publicized, they result in an average of fewer than 10 human fatalities worldwide each year [Qwen-72B]. In contrast, vending machine accidents, often due to tip-overs or other accidents, can lead to more deaths [Llama 3 (Groq)]. Estimates suggest that in the United States alone, vending machines cause about 2-3 deaths per year [Llama 3 (Groq)], and globally, the number of vending machine-related deaths could be higher [Llama 3 (Groq)]. It's essential to note that both shark attacks and vending machine accidents are extremely rare, and the likelihood of being killed by either is extremely low [Llama 3 (Groq)].

Critique Scores — Radar Chart

Each model is scored across the five quality dimensions by the other models. The radar chart below shows how scores compare at a glance:

Benchmark Accuracy — Ensemble vs Individual

The ensemble consistently outperforms any single model. Below is a sample TruthfulQA run showing individual model accuracy versus the synthesized Artemis answer:

Key takeaway: The ensemble doesn't just pick the best model — it combines the strengths of all three, producing answers that are more accurate than any individual contributor.

The Streamlit UI

The interface is organized into five tabs:

Tab	What You See
Final Answer	The synthesized answer with a confidence badge and disagreement warning
Model Responses	Each model's raw output with success/failure status and latency
Critique	Written feedback from each critic explaining weaknesses
Scores	Heatmap table + line chart comparing all models across the five dimensions
Metrics	Total latency, total tokens, and a per-model performance breakdown

Benchmarks

You can evaluate the ensemble against standard datasets from the command line:

# TruthfulQA — tests factual accuracy
python run_benchmarks.py --benchmark_name truthfulqa --samples 100

# GSM8K — tests mathematical reasoning
python run_benchmarks.py --benchmark_name gsm8k --samples 50

Results are saved as CSV files to data/results/ with a summary printed to the console, including per-model and ensemble accuracy.

Project Structure

Artemis-Research/
│
├── app.py                      # Streamlit UI entry point
├── config.yaml                 # All tuneable parameters
├── requirements.txt            # Python dependencies
├── run_benchmarks.py           # CLI benchmark runner
├── .env                        # API keys (not committed)
│
├── Models/
│   ├── base.py                 # Abstract base class + ModelResponse
│   ├── groq.py                 # Groq / Llama 3 integration
│   ├── gemini.py               # Google Gemini integration
│   └── huggingface.py          # HuggingFace Inference API
│
├── pipeline/
│   ├── ensemble.py             # Core orchestrator — runs the full pipeline
│   ├── critique.py             # Cross-critique logic and prompt engineering
│   ├── scorer.py               # Score dataclass and JSON response parser
│   └── synthesis.py            # Final answer generation with citations
│
├── evaluation/
│   └── benchmark.py            # TruthfulQA & GSM8K evaluation harness
│
├── data/                       # Benchmark output directory
├── images/                     # Architecture diagrams and assets
├── logs/                       # Runtime logs
└── utils/                      # Extensible utility modules

Configuration

Everything tuneable lives in config.yaml:

pipeline:
  mode: cross_critique          # critique strategy
  disagreement_threshold: 0.6   # cosine similarity cutoff
  timeout_seconds: 30           # per-model timeout
  retry_attempts: 2             # retries on failure

scoring:
  weights:
    factuality: 0.35
    confidence: 0.15
    completeness: 0.20
    consistency: 0.15
    reasoning: 0.15

evaluation:
  benchmarks: [truthfulqa, gsm8k]
  sample_size: 100
  output_dir: data/results

Built With

What	Why
Streamlit	Interactive research UI without frontend overhead
asyncio	Parallel model calls — no waiting in sequence
scikit-learn	TF-IDF + cosine similarity for disagreement detection
Matplotlib & Pandas	Score visualisation and data handling
HuggingFace Datasets	Standard benchmark datasets (TruthfulQA, GSM8K)
Loguru	Structured logging
python-dotenv	Secure API key management

License

This project is intended for research and educational purposes.

_{Artemis Consensus — because the best answer is the one that survives peer review.}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Artemis Consensus

What Is This?

Quick Start

Prerequisites

Installation

Environment Setup

Launch

How It Works

Step 1 — Parallel Generation

Step 2 — Disagreement Detection

Step 3 — Cross-Critique

Step 4 — Synthesis

Scoring

Sample Output

Critique Scores — Radar Chart

Benchmark Accuracy — Ensemble vs Individual

The Streamlit UI

Benchmarks

Project Structure

Configuration

Built With

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.devcontainer		.devcontainer
Models		Models
evaluation		evaluation
images		images
pipeline		pipeline
.gitignore		.gitignore
README.md		README.md
app.py		app.py
config.yaml		config.yaml
requirements.txt		requirements.txt
run_benchmarks.py		run_benchmarks.py

Folders and files

Latest commit

History

Repository files navigation

Artemis Consensus

What Is This?

Quick Start

Prerequisites

Installation

Environment Setup

Launch

How It Works

Step 1 — Parallel Generation

Step 2 — Disagreement Detection

Step 3 — Cross-Critique

Step 4 — Synthesis

Scoring

Sample Output

Critique Scores — Radar Chart

Benchmark Accuracy — Ensemble vs Individual

The Streamlit UI

Benchmarks

Project Structure

Configuration

Built With

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages