feat: Add OpenAI HealthBench (#2399) #2433
Conversation
- Implemented HealthBench benchmark for CAMEL-AI based on OpenAI's published rubric and dataset format
- Added support for multiple dataset variants (`test`, `hard`, `consensus`)
- Introduced dual-agent evaluation (ChatAgent for completion and grading) with rubric-based prompt injection and structured result aggregation
- Stored results in the CAMEL-compatible `self._results` format for downstream reporting
- Added pytest coverage for HealthBench using real dataset samples, mock agents, and write validation
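The "structured result aggregation" step above can be sketched roughly as follows. This is a minimal illustration, assuming rubric items carry point values (possibly negative) as in OpenAI's published HealthBench scoring, where the score is achieved points over total positive points, clipped to [0, 1]; the data shapes are illustrative, not CAMEL's actual classes.

```python
# Sketch of rubric-based score aggregation (illustrative data shapes).
# Scoring follows OpenAI's published HealthBench convention: achieved
# points divided by total positive points, clipped to [0, 1].

def aggregate_rubric_score(rubric_items, grader_verdicts):
    """rubric_items: list of {"criterion": str, "points": int}
    grader_verdicts: list of bools, one per rubric item (criterion met)."""
    achieved = sum(
        item["points"]
        for item, met in zip(rubric_items, grader_verdicts)
        if met
    )
    possible = sum(item["points"] for item in rubric_items if item["points"] > 0)
    if possible == 0:
        return 0.0
    # Negative-point criteria can drag the raw ratio below zero; clip it.
    return min(max(achieved / possible, 0.0), 1.0)
```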
Thanks @scyyh11 for the contribution! I left some initial comments. The pre-commit error (https://github.com/camel-ai/camel/actions/runs/15096351274/job/42431535227?pr=2433) and spelling-check error (https://github.com/camel-ai/camel/actions/runs/15096351278/job/42431535230?pr=2433) need to be fixed.
camel/benchmarks/healthbench.py
Outdated

```python
def run(
    self,
    agent: ChatAgent,
```
Could we also support testing `RolePlaying` and `Workforce`?
I checked the original code of OpenAI's HealthBench: when running evaluations with its script, each test case is evaluated independently using a new, stateless LLM prompt.
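The per-case, stateless pattern described above can be sketched like this; `grade_fn` is a hypothetical stand-in for a call to a grading LLM, and the case fields are illustrative.

```python
# Sketch of stateless per-case evaluation: each test case gets a freshly
# built prompt and no conversation state carries over between cases.
# `grade_fn` stands in for a grading-LLM call (hypothetical).

def evaluate_stateless(cases, grade_fn):
    results = []
    for case in cases:
        # A brand-new prompt per case; nothing is shared across iterations.
        prompt = f"Question: {case['question']}\nAnswer: {case['answer']}"
        results.append({"id": case["id"], "grade": grade_fn(prompt)})
    return results
```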
- Implemented a `WorkforceAgent` wrapper class using the CAMEL Workforce framework, enabling collaborative response generation via Proposer, Critic, and Finalizer agents.
- Updated the HealthBench evaluation pipeline to accept either a single `ChatAgent` or a `WorkforceAgent`, allowing plug-and-play benchmarking of both single- and multi-agent systems.
- Maintained compatibility with existing grading logic by ensuring the workforce agent exposes a `.step()` interface identical to `ChatAgent`'s.
- Added clear system prompts for each agent role and coordinated multi-step task execution using CAMEL's Workforce and Task abstractions.
- No changes to the grading or rubric evaluation components; results remain directly comparable between single-agent and workforce-agent runs.
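The adapter idea above, a multi-agent pipeline hidden behind a single-agent `.step()` surface, can be sketched with stub classes (these are stand-ins, not CAMEL's actual `ChatAgent`/`Workforce` API):

```python
# Sketch of the WorkforceAgent adapter pattern: a proposer -> critic ->
# finalizer pipeline exposed through the same .step() surface as a single
# agent, so benchmark code never needs to know which kind it received.
# StubAgent stands in for a real LLM-backed agent (assumption).

class StubAgent:
    def __init__(self, role):
        self.role = role

    def step(self, msg):
        # A real agent would call an LLM; here we just tag the message.
        return f"[{self.role}] {msg}"

class WorkforceAgentSketch:
    def __init__(self):
        self.proposer = StubAgent("proposer")
        self.critic = StubAgent("critic")
        self.finalizer = StubAgent("finalizer")

    def step(self, user_msg):
        draft = self.proposer.step(user_msg)
        feedback = self.critic.step(draft)
        # Finalizer integrates the draft and the critique into one answer.
        return self.finalizer.step(f"{draft}\n{feedback}")
```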
Thanks @scyyh11! Please resolve the conflicts first; let me know if you need any help.
camel/benchmarks/healthbench.py
Outdated

```python
class WorkforceAgent:
```
Not sure if this is the best place to implement this logic. There are two design choices:

- Allow `.run` to accept `Workforce` and add a `step` interface in `Workforce` to unify the interfaces of `Workforce` and `ChatAgent`. This would make it easier to evaluate `Workforce` in all the different benchmarks.
- Allow `.run` to accept `Workforce`, make `WorkforceAgent` a private class in `healthbench.py`, and move the logic that mimics `ChatAgent`'s `.step()` into `.run`. But this would require more implementation effort when building a new benchmark, since the workforce-handling logic would have to be reimplemented every time.
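The first design choice above amounts to making the benchmark depend only on a shared `step` surface. A minimal sketch with illustrative names (CAMEL's real classes differ):

```python
# Sketch of design choice 1: unify Workforce and ChatAgent behind a shared
# `step` protocol so a benchmark's .run can accept either. Steppable and
# run_benchmark are illustrative names, not CAMEL's API.

from typing import Protocol

class Steppable(Protocol):
    def step(self, message: str) -> str: ...

def run_benchmark(agent: Steppable, questions: list) -> list:
    # .run relies only on the shared step() surface; it never inspects
    # whether `agent` is a single agent or a whole workforce.
    return [agent.step(q) for q in questions]
```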
@Wendong-Fan for looking into this as well
So this `WorkforceAgent` was originally created as a helper class for HealthBench, to wrap a fixed proposer/critic/finalizer workforce and integrate it into the existing benchmarking API. It currently assumes a fixed set of roles and a specific workflow, but if it could easily serve as a template for a more general-purpose adapter, I'm happy to refactor it to support customizable role sets and workflows.
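The "customizable role sets" refactor mentioned above could look roughly like this: instead of a hard-coded trio, the adapter takes an ordered list of roles and chains them. Everything here is hypothetical; in a real version each transform would be an LLM-backed agent configured with a system prompt.

```python
# Sketch of a general-purpose adapter with a configurable role chain.
# `roles` is an ordered list of (role_name, transform_fn) pairs, where
# transform_fn stands in for "agent with this system prompt" (assumption).

class ChainedRolesAgent:
    def __init__(self, roles):
        self.roles = roles

    def step(self, user_msg):
        msg = user_msg
        # Each role transforms the running message in turn, generalizing
        # the fixed proposer -> critic -> finalizer workflow.
        for _name, transform in self.roles:
            msg = transform(msg)
        return msg
```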
camel/benchmarks/healthbench.py
Outdated
PROPOSER_PROMPT = ( | ||
"You are a diligent medical assistant (Proposer) whose job is to draft a complete, helpful, and safe answer to the user's medical question." | ||
) | ||
CRITIC_PROMPT = ( | ||
"You are a medical safety and accuracy reviewer (Critic). Review the Proposer's draft answer, pointing out any mistakes, dangerous advice, or missing important medical details." | ||
) | ||
FINALIZER_PROMPT = ( | ||
"You are a senior medical assistant (Finalizer). Carefully integrate the Critic's feedback to produce a final, clear, medically sound answer for the user." | ||
) |
Should we move this to the example code instead of putting it inside the library?
camel/benchmarks/healthbench.py
Outdated

```python
self.proposer = ChatAgent(system_message=PROPOSER_PROMPT)
self.critic = ChatAgent(system_message=CRITIC_PROMPT)
self.finalizer = ChatAgent(system_message=FINALIZER_PROMPT)

self.workforce = Workforce("HealthBench Judge Committee")
self.workforce.add_single_agent_worker(
    "Proposer (drafts answer to the medical question)", worker=self.proposer
).add_single_agent_worker(
    "Critic (reviews the draft for accuracy, completeness, and safety)", worker=self.critic
).add_single_agent_worker(
    "Finalizer (integrates critic feedback into a final, polished answer)", worker=self.finalizer
)

# Optionally, keep role names for reference.
self.role_names = ["Proposer", "Critic", "Finalizer"]
```
Should we move this to the example code instead of putting it inside the library?
Thanks @scyyh11 for your contribution!
- Moved HealthBench initialization code from `camel/benchmarks/healthbench.py` to `examples/benchmarks/healthbench.py`, so example setup is separated from core benchmark logic.
- Extracted `WorkforceAgent` into `camel/societies/workforce/workforce_agent.py` as an independent, reusable adapter for integrating a multi-agent Workforce into any benchmark.
- Updated `examples/benchmarks/healthbench.py` to use the new adaptive `WorkforceAgent` class, simplifying workforce evaluation and paving the way for reuse in other benchmark modules.
Thanks @scyyh11! We are also currently refactoring the benchmark module (see PR #2519), which may take some time. Let's continue advancing this feature after the refactoring is complete so that we can adhere to the unified interface for feature development; there's no rush for this benchmark integration for now.
- Adapted `workforce_agent.py` to the updated Workforce constructor and task-processing flow.
- Passed through `workforce_kwargs` (e.g., `coordinator_agent_kwargs`, `task_agent_kwargs`, `graceful_shutdown_timeout`).
- Combined `task_instruction` and user input into `Task.content`; moved user text out of `additional_info`.
- Reset workforce state after each call to ensure stateless benchmark runs.
- Updated `examples/benchmarks/healthbench.py` to use the new agent signature.
- Saved results to `results.jsonl` in the working directory.
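The "combine instruction with user input" and "reset state per call" behaviors described above can be sketched as follows. The `workforce_factory`/`process` names are assumptions standing in for CAMEL's actual constructor and task-processing entry point:

```python
# Sketch of a stateless per-call wrapper: the task instruction and the user
# input are joined into one task content string, and a fresh workforce is
# built for every call so no state leaks between benchmark items.
# `workforce_factory` and `process` are hypothetical stand-ins.

class StatelessWorkforceWrapper:
    def __init__(self, workforce_factory, task_instruction):
        self._factory = workforce_factory
        self.task_instruction = task_instruction

    def step(self, user_input):
        workforce = self._factory()  # fresh workforce per call (stateless)
        # Instruction and user text combined into the task content, rather
        # than stashing the user text in auxiliary metadata.
        content = f"{self.task_instruction}\n\n{user_input}"
        return workforce.process(content)
```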
- Introduced DocumentToolkit for extracting content from various file types:
  - Supports text, Office docs, PDFs, code, Excel, images, JSON/XML, ZIP archives, and webpages
  - Uses a pluggable loader interface with MarkItDown and UnstructuredIO
  - Includes a caching mechanism and specialized handlers per file type
- Added a standalone example script:
  - Demonstrates DocumentToolkit usage with a temp Markdown file
  - Integrates with ChatAgent and GPT-4o-mini for end-to-end document querying
- Designed for extensibility, model-agnostic use, and future toolkit integration
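The pluggable-loader-plus-cache design described above can be sketched like this; the registry class and loader callables are illustrative, not the toolkit's actual API:

```python
# Sketch of a pluggable loader interface with caching: loaders are
# registered per file extension, and extraction results are memoized by
# path. Names are illustrative, not DocumentToolkit's real classes.

import os

class DocumentLoaderRegistry:
    def __init__(self):
        self._loaders = {}  # extension -> loader callable
        self._cache = {}    # path -> extracted text

    def register(self, ext, loader):
        self._loaders[ext] = loader

    def extract(self, path):
        if path in self._cache:
            return self._cache[path]
        ext = os.path.splitext(path)[1].lower()
        if ext not in self._loaders:
            raise ValueError(f"no loader registered for {ext}")
        text = self._loaders[ext](path)
        self._cache[path] = text
        return text
```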
…m HTTP headers and content-type.
# Conflicts:
#   camel/toolkits/__init__.py
Description
#2399
This PR adds support for the OpenAI HealthBench benchmark into the CAMEL-AI evaluation framework.
It includes:
- A `HealthBenchmark` class with support for multiple variants (`test`, `hard`, `consensus`)
- An example (`examples/healthbench.py`)

Note: The grader prompt currently uses the original rubric prompt from OpenAI's HealthBench repo. Please advise if I should modify this.
Checklist

Go over all the following points, and put an `x` in all the boxes that apply.

- [ ] Linked this PR to an issue via `Fixes #issue-number` in the PR description (required)
- [ ] Checked whether any dependencies need to be added or updated in `pyproject.toml` and `uv lock`

If you are unsure about any of these, don't hesitate to ask. We are here to help!