feat: Add OpenAI HealthBench (#2399) #2433
Conversation
- Implemented HealthBench benchmark for CAMEL-AI based on OpenAI's published rubric and dataset format
- Added support for multiple dataset variants (`test`, `hard`, `consensus`)
- Introduced dual-agent evaluation (ChatAgent for completion and grading) with rubric-based prompt injection and structured result aggregation
- Stored results in the CAMEL-compatible `self._results` format for downstream reporting
- Added pytest coverage for HealthBench using real dataset samples, mock agents, and write validation
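The "structured result aggregation" step above can be sketched roughly as follows. This is a minimal illustration, assuming rubric items carry point values (possibly negative) as in OpenAI's published HealthBench scoring, where the score is achieved points over total positive points, clipped to [0, 1]; the data shapes are illustrative, not CAMEL's actual classes.

```python
# Sketch of rubric-based score aggregation (illustrative data shapes).
# Scoring follows OpenAI's published HealthBench convention: achieved
# points divided by total positive points, clipped to [0, 1].

def aggregate_rubric_score(rubric_items, grader_verdicts):
    """rubric_items: list of {"criterion": str, "points": int}
    grader_verdicts: list of bools, one per rubric item (criterion met)."""
    achieved = sum(
        item["points"]
        for item, met in zip(rubric_items, grader_verdicts)
        if met
    )
    possible = sum(item["points"] for item in rubric_items if item["points"] > 0)
    if possible == 0:
        return 0.0
    # Negative-point criteria can drag the raw ratio below zero; clip it.
    return min(max(achieved / possible, 0.0), 1.0)
```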
Thanks @scyyh11 for the contribution! I left some initial comments. The pre-commit error (https://github.com/camel-ai/camel/actions/runs/15096351274/job/42431535227?pr=2433) and spelling-check error (https://github.com/camel-ai/camel/actions/runs/15096351278/job/42431535230?pr=2433) need to be fixed.
camel/benchmarks/healthbench.py
Outdated

```python
def run(
    self,
    agent: ChatAgent,
```
Could we also support testing `RolePlaying` and `Workforce`?
I checked the original code of OpenAI's HealthBench: when running evaluations with its script, each test case is evaluated independently using a new, stateless LLM prompt.
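The per-case, stateless pattern described above can be sketched like this; `grade_fn` is a hypothetical stand-in for a call to a grading LLM, and the case fields are illustrative.

```python
# Sketch of stateless per-case evaluation: each test case gets a freshly
# built prompt and no conversation state carries over between cases.
# `grade_fn` stands in for a grading-LLM call (hypothetical).

def evaluate_stateless(cases, grade_fn):
    results = []
    for case in cases:
        # A brand-new prompt per case; nothing is shared across iterations.
        prompt = f"Question: {case['question']}\nAnswer: {case['answer']}"
        results.append({"id": case["id"], "grade": grade_fn(prompt)})
    return results
```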
- Implemented a `WorkforceAgent` wrapper class using the CAMEL Workforce framework, enabling collaborative response generation via Proposer, Critic, and Finalizer agents.
- Updated the HealthBench evaluation pipeline to accept either a single `ChatAgent` or a `WorkforceAgent`, allowing plug-and-play benchmarking of both single- and multi-agent systems.
- Maintained compatibility with existing grading logic by ensuring the workforce agent exposes a `.step()` interface identical to `ChatAgent`'s.
- Added clear system prompts for each agent role and coordinated multi-step task execution using CAMEL's Workforce and Task abstractions.
- No changes to the grading or rubric evaluation components; results remain directly comparable between single-agent and workforce-agent runs.
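The adapter idea above, a multi-agent pipeline hidden behind a single-agent `.step()` surface, can be sketched with stub classes (these are stand-ins, not CAMEL's actual `ChatAgent`/`Workforce` API):

```python
# Sketch of the WorkforceAgent adapter pattern: a proposer -> critic ->
# finalizer pipeline exposed through the same .step() surface as a single
# agent, so benchmark code never needs to know which kind it received.
# StubAgent stands in for a real LLM-backed agent (assumption).

class StubAgent:
    def __init__(self, role):
        self.role = role

    def step(self, msg):
        # A real agent would call an LLM; here we just tag the message.
        return f"[{self.role}] {msg}"

class WorkforceAgentSketch:
    def __init__(self):
        self.proposer = StubAgent("proposer")
        self.critic = StubAgent("critic")
        self.finalizer = StubAgent("finalizer")

    def step(self, user_msg):
        draft = self.proposer.step(user_msg)
        feedback = self.critic.step(draft)
        # Finalizer integrates the draft and the critique into one answer.
        return self.finalizer.step(f"{draft}\n{feedback}")
```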
Thanks @scyyh11! Please resolve the conflicts first; let me know if you need any help.
camel/benchmarks/healthbench.py
Outdated

```python
class WorkforceAgent:
```
Not sure if this is the best place to implement this logic. There are two design choices:

- Allow `.run` to accept `Workforce` and add a `step` interface in `Workforce` to unify the interfaces of `Workforce` and `ChatAgent`. This would make it easier to evaluate `Workforce` in all the different benchmarks.
- Allow `.run` to accept `Workforce`, make `WorkforceAgent` a private class in `healthbench.py`, and move the logic that mimics `ChatAgent`'s `.step()` into `.run`. But this would require more implementation effort when building a new benchmark, since the workforce-handling logic would have to be reimplemented every time.
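The first design choice above amounts to making the benchmark depend only on a shared `step` surface. A minimal sketch with illustrative names (CAMEL's real classes differ):

```python
# Sketch of design choice 1: unify Workforce and ChatAgent behind a shared
# `step` protocol so a benchmark's .run can accept either. Steppable and
# run_benchmark are illustrative names, not CAMEL's API.

from typing import Protocol

class Steppable(Protocol):
    def step(self, message: str) -> str: ...

def run_benchmark(agent: Steppable, questions: list) -> list:
    # .run relies only on the shared step() surface; it never inspects
    # whether `agent` is a single agent or a whole workforce.
    return [agent.step(q) for q in questions]
```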
@Wendong-Fan for looking into this as well
So this `WorkforceAgent` was originally created as a helper class for HealthBench, to wrap a fixed proposer/critic/finalizer workforce and integrate it into the existing benchmarking API. It currently assumes a fixed set of roles and a specific workflow, but if it could easily serve as a template for a more general-purpose adapter, I'm happy to refactor it to support customizable role sets and workflows.
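The "customizable role sets" refactor mentioned above could look roughly like this: instead of a hard-coded trio, the adapter takes an ordered list of roles and chains them. Everything here is hypothetical; in a real version each transform would be an LLM-backed agent configured with a system prompt.

```python
# Sketch of a general-purpose adapter with a configurable role chain.
# `roles` is an ordered list of (role_name, transform_fn) pairs, where
# transform_fn stands in for "agent with this system prompt" (assumption).

class ChainedRolesAgent:
    def __init__(self, roles):
        self.roles = roles

    def step(self, user_msg):
        msg = user_msg
        # Each role transforms the running message in turn, generalizing
        # the fixed proposer -> critic -> finalizer workflow.
        for _name, transform in self.roles:
            msg = transform(msg)
        return msg
```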
camel/benchmarks/healthbench.py
Outdated
PROPOSER_PROMPT = ( | ||
"You are a diligent medical assistant (Proposer) whose job is to draft a complete, helpful, and safe answer to the user's medical question." | ||
) | ||
CRITIC_PROMPT = ( | ||
"You are a medical safety and accuracy reviewer (Critic). Review the Proposer's draft answer, pointing out any mistakes, dangerous advice, or missing important medical details." | ||
) | ||
FINALIZER_PROMPT = ( | ||
"You are a senior medical assistant (Finalizer). Carefully integrate the Critic's feedback to produce a final, clear, medically sound answer for the user." | ||
) |
Should we move this to the example code instead of putting it inside the library?
camel/benchmarks/healthbench.py
Outdated

```python
self.proposer = ChatAgent(system_message=PROPOSER_PROMPT)
self.critic = ChatAgent(system_message=CRITIC_PROMPT)
self.finalizer = ChatAgent(system_message=FINALIZER_PROMPT)

self.workforce = Workforce("HealthBench Judge Committee")
self.workforce.add_single_agent_worker(
    "Proposer (drafts answer to the medical question)", worker=self.proposer
).add_single_agent_worker(
    "Critic (reviews the draft for accuracy, completeness, and safety)", worker=self.critic
).add_single_agent_worker(
    "Finalizer (integrates critic feedback into a final, polished answer)", worker=self.finalizer
)

# Optionally, keep role names for reference.
self.role_names = ["Proposer", "Critic", "Finalizer"]
```
Should we move this to the example code instead of putting it inside the library?
Thanks @scyyh11 for your contribution!
- Moved HealthBench initialization code from `camel/benchmarks/healthbench.py` to `examples/benchmarks/healthbench.py`, so example setup is separated from core benchmark logic.
- Extracted `WorkforceAgent` into `camel/societies/workforce/workforce_agent.py` as an independent, reusable adapter for integrating a multi-agent Workforce into any benchmark.
- Updated `examples/benchmarks/healthbench.py` to use the new adaptive `WorkforceAgent` class, simplifying workforce evaluation and paving the way for reuse in other benchmark modules.
Thanks @scyyh11! We are also currently refactoring the benchmark module (see PR #2519), which may take some time. Let's continue advancing this feature after the refactoring is complete so that we can adhere to the unified interface for feature development; there's no rush for this benchmark integration for now.
- Adapted `workforce_agent.py` to the updated Workforce constructor and task-processing flow.
- Passed through `workforce_kwargs` (e.g., `coordinator_agent_kwargs`, `task_agent_kwargs`, `graceful_shutdown_timeout`).
- Combined `task_instruction` and user input into `Task.content`; moved user text out of `additional_info`.
- Reset workforce state after each call to ensure stateless benchmark runs.
- Updated `examples/benchmarks/healthbench.py` to use the new agent signature.
- Saved results to `results.jsonl` in the working directory.
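The "combine instruction with user input" and "reset state per call" behaviors described above can be sketched as follows. The `workforce_factory`/`process` names are assumptions standing in for CAMEL's actual constructor and task-processing entry point:

```python
# Sketch of a stateless per-call wrapper: the task instruction and the user
# input are joined into one task content string, and a fresh workforce is
# built for every call so no state leaks between benchmark items.
# `workforce_factory` and `process` are hypothetical stand-ins.

class StatelessWorkforceWrapper:
    def __init__(self, workforce_factory, task_instruction):
        self._factory = workforce_factory
        self.task_instruction = task_instruction

    def step(self, user_input):
        workforce = self._factory()  # fresh workforce per call (stateless)
        # Instruction and user text combined into the task content, rather
        # than stashing the user text in auxiliary metadata.
        content = f"{self.task_instruction}\n\n{user_input}"
        return workforce.process(content)
```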
- Introduced DocumentToolkit for extracting content from various file types:
  - Supports text, Office docs, PDFs, code, Excel, images, JSON/XML, ZIP archives, and webpages
  - Uses a pluggable loader interface with MarkItDown and UnstructuredIO
  - Includes a caching mechanism and specialized handlers per file type
- Added a standalone example script:
  - Demonstrates DocumentToolkit usage with a temp Markdown file
  - Integrates with ChatAgent and GPT-4o-mini for end-to-end document querying
- Designed for extensibility, model-agnostic use, and future toolkit integration
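The pluggable-loader-plus-cache design described above can be sketched like this; the registry class and loader callables are illustrative, not the toolkit's actual API:

```python
# Sketch of a pluggable loader interface with caching: loaders are
# registered per file extension, and extraction results are memoized by
# path. Names are illustrative, not DocumentToolkit's real classes.

import os

class DocumentLoaderRegistry:
    def __init__(self):
        self._loaders = {}  # extension -> loader callable
        self._cache = {}    # path -> extracted text

    def register(self, ext, loader):
        self._loaders[ext] = loader

    def extract(self, path):
        if path in self._cache:
            return self._cache[path]
        ext = os.path.splitext(path)[1].lower()
        if ext not in self._loaders:
            raise ValueError(f"no loader registered for {ext}")
        text = self._loaders[ext](path)
        self._cache[path] = text
        return text
```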
…m HTTP headers and content-type.
# Conflicts:
#   camel/toolkits/__init__.py
Description
#2399
This PR adds support for the OpenAI HealthBench benchmark into the CAMEL-AI evaluation framework.
It includes:
- A `HealthBenchmark` class with support for multiple variants (`test`, `hard`, `consensus`)
- An example (`examples/healthbench.py`)

Note: The grader prompt currently uses the original rubric prompt from OpenAI's HealthBench repo. Please advise if I should modify this.
Checklist

Go over all the following points, and put an `x` in all the boxes that apply.

- [ ] Linked this PR to an issue via `Fixes #issue-number` in the PR description (required)
- [ ] Checked whether any dependencies need to be added or updated in `pyproject.toml` and `uv lock`

If you are unsure about any of these, don't hesitate to ask. We are here to help!