A LangChain ecosystem chatbot ("Chat LangChain Lite") with intentional bugs, built to demonstrate LangSmith Engine's ability to identify issues in agent traces and propose fixes via PR. The agent answers questions about LangChain, LangGraph, LangSmith, and Deep Agents using three tools: lookup_concept, get_setup_guide, and get_security_advice.
- Engine identifies bugs — the agent has bugs in the prompt and code that cause bad responses
- Engine proposes a PR fix — targets the root cause code and opens a PR on your fork
- Engine proposes offline examples and online evals to add — expand dataset coverage and monitoring with one click
- Offline evals in CI/CD — the PR can't merge until eval scores pass a threshold
- Before/after scores in LangSmith — both "before" and "after" experiments created automatically by CI when Engine opens a PR
Bugs are spread across three files so Engine has to reason about code, not just prompts:
| Bug | File / Location | Effect | Caught by |
|---|---|---|---|
| "Never use tools, never decline" instruction | LangSmith Context Hub (chat-lc-lite-agent-robert / AGENTS.md) — fix in the Context Hub UI, not the repo |
Answers any topic; answers from memory instead of calling tools | tool_usage, scope_adherence |
| Casual / emoji voice | LangSmith Context Hub (chat-lc-lite-agent-robert / AGENTS.md) — fix in the Context Hub UI, not the repo |
Every response starts with "Hey there! 👋", uses emojis throughout, ends with "Happy building! 🚀" | professional_tone |
| Wrong docs URL in SAFE_PATTERNS | agent/tools.py |
Agent recommends stale python.langchain.com / js.langchain.com links instead of docs.langchain.com |
security_advice |
| Wrong LangGraph min Python version | agent/tools.py |
Returns "3.7+" instead of the correct "3.10+" | factual_accuracy |
max_tokens=300 |
agent/agent.py |
Truncates responses on complex technical questions | response_completeness |
1. Fork and clone this repo
2. Create a virtual environment
uv sync
source .venv/bin/activateOr with pip:
python3 -m venv .venv
source .venv/bin/activate
pip install -e .3. Configure environment
cp .env.example .envEdit .env:
ANTHROPIC_API_KEY=your-key
LANGSMITH_API_KEY=your-demo-workspace-api-key
LANGSMITH_PROJECT=chat-lc-lite
LANGSMITH_WORKSPACE_ID=your-demo-workspace-id
LANGSMITH_TRACING=true
If multiple presenters share a LangSmith workspace, use a unique
LANGSMITH_PROJECTper person (e.g.chat-lc-lite-morgan) to avoid mixing traces and online evaluators. The project is created automatically on first use.
4. Run one-shot setup
python -m scripts.setupThis does three things in one command:
- Creates the LangSmith project by sending one trace (required before online evaluators can be registered)
- Creates the dataset
chat-lc-lite-scope-<your-name>with 3 curated test cases, then tags that version asbaselinein LangSmith - Creates 5 online evaluators in the LangSmith Evaluators UI at 100% sampling rate — every future trace is automatically scored for
security_advice,scope_adherence,tool_usage,response_completeness, andfactual_accuracy. Their run rule IDs are saved to.demo_state.jsonso cleanup can tell them apart from evaluators Engine adds.
Only needs to be run once. Between demos, run python -m scripts.cleanup instead.
5. Generate traces
python -m scripts.generate_tracesRuns 13 single-turn queries and 3 multi-turn threaded conversations through the buggy agent to populate LangSmith with trace and thread variety beyond the dataset examples.
6. Add GitHub secrets (for CI/CD)
In your fork: Settings → Secrets → Actions → add ANTHROPIC_API_KEY, LANGSMITH_API_KEY, LANGSMITH_PROJECT, and LANGSMITH_WORKSPACE_ID.
Important: When pasting secrets, make sure there are no trailing newlines or spaces.
7. Enable GitHub Actions
In your fork: Actions → (if prompted) enable workflows. GitHub disables Actions on forks by default — this step is required for offline evals to run on PRs.
8. Connect Engine
In LangSmith Engine, connect your LangSmith project (LANGSMITH_PROJECT) and your GitHub fork so Engine can read traces and open PRs against your repo.
# One-shot setup: creates dataset, sets up online evaluators
python -m scripts.setup
# Generate more traces including threads
python -m scripts.generate_traces
# Start the chat UI
streamlit run app.py- Show Chat LangChain Lite UI — ask questions (concept lookups, setup guides, security advice, etc.)
- Show traces in LangSmith with online eval scores (
security_advice,scope_adherence, etc.) - Engine analyzes traces and identifies root causes across prompt and code
- Add Engine-suggested offline examples — show ability to edit in annotation queue
- Engine opens a PR on your fork
- GitHub Actions runs evals on main (before experiment) and the PR branch (after experiment) — after scores pass ✅
- Merge the PR
- Add Engine-suggested online eval
- Show the experiments in LangSmith — before/after score comparison
python -m scripts.cleanup| Script | What it does |
|---|---|
python -m scripts.setup |
One-shot setup: creates dataset and creates 5 online evaluators |
python -m scripts.generate_traces |
Runs 13 single-turn queries + 3 multi-turn threads through the buggy agent |
python -m scripts.run_evals |
Runs offline evals against the dataset and prints scores |
python -m scripts.run_evals --skip-dataset |
Re-runs evals against existing dataset (used in CI) |
python -m scripts.run_evals --threshold 0.7 |
Exits with code 1 if scores < 0.7 (used in CI) |
python -m scripts.cleanup |
Resets demo to clean state — see Cleanup section |
python -m scripts.cleanup --full |
Same, plus deletes the LangSmith project (so Engine sees a fresh project on the next demo). Re-run scripts.setup after. |
streamlit run app.py |
Start the Chat LangChain Lite UI |
Two LLM-as-judge evaluators run in CI (offline). Claude Haiku scores each 0 or 1:
tool_selection— did the agent ground its response in tool output rather than answering from memory? Goes 0→1 when the bad system prompt is fixed.scope_adherence— did the agent stay LangChain-ecosystem-only and decline off-topic questions?
Online evaluators run automatically on every trace as it arrives in LangSmith. This gives Engine a continuous signal on live traffic, not just offline evals on a fixed dataset.
Five online evaluators are registered by python -m scripts.setup: security_advice, scope_adherence, tool_usage, response_completeness, and factual_accuracy.
.github/workflows/evals.yml runs automatically on every PR to main.
Add these secrets to your repo (Settings → Secrets → Actions):
ANTHROPIC_API_KEYLANGSMITH_API_KEYLANGSMITH_PROJECTLANGSMITH_WORKSPACE_ID
LANGSMITH_PROJECT should match what you used locally — that's the project the agent traces against.
PR opened → GitHub Actions → run_evals --skip-dataset --threshold 0.7
↓
scores < 0.7 → ❌ blocks merge
scores ≥ 0.7 → ✅ mergeable
CI runs evals on both the base branch (creating the "before" experiment) and the PR branch (creating the "after" experiment) in LangSmith automatically. Because --skip-dataset fetches the existing dataset from LangSmith by name, any examples Engine adds to the dataset are included in the eval run automatically.
context/
└── __init__.py # get_prompt() — pulls the agent's system prompt from
# LangSmith Context Hub at runtime. The prompt content
# lives in the hub, not this repo (Bugs 1 & 5 are fixed
# in the Context Hub UI, not via code PR).
agent/
├── tools.py # concept lookup, setup guides, security advice (Bugs 2 & 3)
└── agent.py # create_agent + FilesystemMiddleware (Bug 4 — max_tokens)
utils/
└── context_hub.py # setup-time push helper. Holds the *initial seed* for
# Context Hub only; not the runtime source of truth.
evals/
├── dataset.py # creates per-user LangSmith dataset (3 curated examples)
└── evaluators.py # 2 LLM-as-judge offline evaluators (used in CI)
scripts/
├── setup.py # one-shot setup: dataset + online evaluators + Context Hub
├── generate_traces.py # populate LangSmith with extra traces and threads
├── run_evals.py # offline evals + CI threshold check
└── cleanup.py # resets demo to clean state after presentation
.github/workflows/
├── evals.yml # CI/CD: label-gated offline evals on PRs to main
└── auto-label-engine-prs.yml # auto-tags Engine PRs with 'run-evals'
app.py # Chat LangChain Lite UI (Streamlit)
Run after the demo to reset everything for the next presenter:
python -m scripts.cleanupThis does four things:
- Resets dataset to original 3 examples — deletes all examples and re-uploads the canonical 3, removing anything Engine added
- Deletes all experiments — CI/CD generates fresh before/after experiments on every PR, so nothing needs to be preserved between demos
- Removes Engine-added online evaluators — uses saved run rule IDs from
.demo_state.jsonto delete only evaluators Engine added, leaving the 5 fromsetup.pyin place - Resets main to the
baselinetag — force-resets to remove Engine's merged PR, restoring the buggy agent state
After cleanup, the demo is ready to run again — no need to re-run setup.py.
For a full reset that also removes the LangSmith project (clearing all traces and Engine's per-project issue state):
python -m scripts.cleanup --full
python -m scripts.setup # recreates project, dataset, evaluators
python -m scripts.generate_tracesUse this when you want Engine to see a completely fresh project for the next demo — for example when a new presenter takes over and you don't want them to inherit any pre-flagged issues.