
feat: Add OpenAI HealthBench (#2399) #2433


Closed · wants to merge 22 commits

Conversation


@scyyh11 scyyh11 commented May 18, 2025

Description

#2399

This PR adds support for the OpenAI HealthBench benchmark into the CAMEL-AI evaluation framework.

It includes:

  • A new HealthBenchmark class with support for multiple variants (test, hard, consensus)
  • A working example script (examples/healthbench.py)
  • Unit tests using mocks to simulate completions and grader behavior

Note: The grader prompt currently uses the original rubric prompt from OpenAI's HealthBench repo. Please advise if I should modify this.
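For context on the rubric prompt mentioned above: HealthBench scores each completion by having a grader judge every rubric criterion, then aggregating the points of the met criteria over the total positive points, clipped to [0, 1]. A minimal sketch of that aggregation (the function name and sample rubric are illustrative, not code from this PR):

```python
# Sketch of HealthBench-style rubric scoring (illustrative, not the PR's code).
# Each example carries rubric criteria with point values (negatives penalize);
# a grader marks each criterion met or not, and the example score is the
# points achieved over the total positive points, clipped to [0, 1].

def score_example(rubric, met):
    """rubric: list of (criterion, points); met: set of criteria judged met."""
    achieved = sum(pts for crit, pts in rubric if crit in met)
    possible = sum(pts for _, pts in rubric if pts > 0)
    if possible == 0:
        return 0.0
    return min(max(achieved / possible, 0.0), 1.0)

rubric = [
    ("mentions seeking emergency care", 5),
    ("explains warning signs", 3),
    ("gives a specific drug dosage without caveats", -4),
]
print(score_example(rubric, {"mentions seeking emergency care"}))  # 0.625
```

Negative-point criteria act as penalties; the clipping keeps a heavily penalized answer from scoring below zero.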

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have read the CONTRIBUTION guide (required)
  • I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • I have updated the tests accordingly (required for a bug fix or a new feature)
  • I have updated the documentation if needed:
  • I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

scyyh11 and others added 4 commits May 18, 2025 08:49
- Implemented HealthBench benchmark for CAMEL-AI based on OpenAI's published rubric and dataset format
- Added support for multiple dataset variants (`test`, `hard`, `consensus`)
- Introduced dual-agent evaluation (ChatAgent for completion and grading) with rubric-based prompt injection and structured result aggregation
- Stored results in CAMEL-compatible self._results format for downstream reporting
- Added pytest coverage for HealthBench using real dataset samples, mock agents, and write validation
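The mock-agent test pattern these commits describe can be sketched without CAMEL installed: a stub records the prompts it receives and returns a canned response through a ChatAgent-like `.step()` call. The `.msg.content` response shape is an assumption about the interface made for this sketch:

```python
# Illustrative stub of the mock-agent test pattern described above.
# StubAgent stands in for CAMEL's ChatAgent; .step() returning an object
# with .msg.content mirrors the interface the benchmark is assumed to call.

class _Msg:
    def __init__(self, content):
        self.content = content

class _Response:
    def __init__(self, content):
        self.msg = _Msg(content)

class StubAgent:
    def __init__(self, canned_reply):
        self.canned_reply = canned_reply
        self.calls = []  # record prompts for later assertions

    def step(self, prompt):
        self.calls.append(prompt)
        return _Response(self.canned_reply)

completer = StubAgent("Drink fluids and rest; see a doctor if fever persists.")
grader = StubAgent('{"criteria_met": ["advises rest"]}')

answer = completer.step("I have a mild fever, what should I do?").msg.content
grade = grader.step(f"Grade this answer: {answer}").msg.content
print(grade)  # {"criteria_met": ["advises rest"]}
```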
@Wendong-Fan Wendong-Fan added this to the Sprint 29 milestone May 18, 2025
@Wendong-Fan Wendong-Fan requested review from zjrwtx and Zhangzeyu97 May 18, 2025 17:33
@Wendong-Fan Wendong-Fan linked an issue May 18, 2025 that may be closed by this pull request
2 tasks
Member

@Wendong-Fan Wendong-Fan left a comment

Thanks for @scyyh11's contribution. I left some initial comments; the pre-commit error https://github.com/camel-ai/camel/actions/runs/15096351274/job/42431535227?pr=2433 and the spelling-check error https://github.com/camel-ai/camel/actions/runs/15096351278/job/42431535230?pr=2433 need to be fixed.


def run(
self,
agent: ChatAgent,
Member

could we also support testing RolePlaying and Workforce?

Author

I checked the original code of OpenAI's HealthBench: when running evaluations with that script, each test case is evaluated independently using a new, stateless LLM prompt.
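That stateless behavior can be mirrored in the benchmark loop: each case builds a fresh, self-contained prompt, so no conversation history leaks between cases. A rough sketch (the `evaluate` helper and the prompt template are illustrative, not the actual code):

```python
# Sketch of the stateless per-case evaluation the comment describes:
# each test case gets a fresh prompt with no shared conversation state.

def evaluate(cases, complete_fn):
    results = []
    for case in cases:
        # A new, self-contained prompt per case: no history carries over.
        prompt = f"Answer the patient question:\n{case}"
        results.append(complete_fn(prompt))
    return results

replies = evaluate(
    ["Is ibuprofen safe with aspirin?", "What is a normal pulse?"],
    lambda p: f"stubbed reply to: {p.splitlines()[-1]}",
)
print(len(replies))  # 2
```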

scyyh11 and others added 7 commits May 18, 2025 18:14
Implemented a WorkforceAgent wrapper class using the CAMEL Workforce framework, enabling collaborative response generation via Proposer, Critic, and Finalizer agents.

Updated the HealthBench evaluation pipeline to accept either a single ChatAgent or a WorkforceAgent, allowing plug-and-play benchmarking of both single and multi-agent systems.

Maintained compatibility with existing grading logic by ensuring the workforce agent exposes a .step() interface identical to ChatAgent.

Added clear system prompts for each agent role and coordinated multi-step task execution using CAMEL's Workforce and Task abstractions.

No changes to the grading or rubric evaluation components; results remain directly comparable between single-agent and workforce agent runs.
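The ".step() interface identical to ChatAgent" idea above boils down to a thin adapter: run the multi-agent pipeline, then repackage its output in the response shape the grading code already consumes. A sketch with a stand-in pipeline (class and attribute names are assumptions, not CAMEL's actual API):

```python
# Sketch of the adapter idea: wrap a multi-agent pipeline behind the same
# .step() surface ChatAgent exposes, so the benchmark can treat both alike.

class StepResponse:
    def __init__(self, content):
        # Mimic the assumed ChatAgent response shape: response.msg.content
        self.msg = type("Msg", (), {"content": content})()

class WorkforceAgentAdapter:
    def __init__(self, pipeline):
        self.pipeline = pipeline  # any callable: prompt -> final answer

    def step(self, prompt):
        # Delegate to the pipeline, then repackage as a ChatAgent-like response.
        return StepResponse(self.pipeline(prompt))

adapter = WorkforceAgentAdapter(lambda p: p.upper())
print(adapter.step("final answer").msg.content)  # FINAL ANSWER
```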
@zjrwtx
Collaborator

zjrwtx commented May 29, 2025

Thanks @scyyh11, please resolve the conflicts first; let me know if you need any help.

@scyyh11
Author

scyyh11 commented May 30, 2025

> thanks @scyyh11, please resolve conflicts first, let me know if you need any help

@zjrwtx conflict has been solved.

)


class WorkforceAgent:
Member

Not sure if this is the best place to implement this logic. There are two design choices:

  1. Allow .run to accept Workforce and add a step interface to Workforce, unifying the interfaces of Workforce and ChatAgent. This would make it easier to evaluate Workforce across all the different benchmarks.
  2. Allow .run to accept Workforce, keep WorkforceAgent as a private class in healthbench.py, and move the logic mimicking ChatAgent's .step() into .run. But this would require more implementation effort when building a new benchmark, since the workforce-handling logic would have to be reimplemented every time.
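Design choice 1 amounts to duck typing the benchmark entry point: `.run` cares only that the agent exposes `.step()`, so a ChatAgent and a step-capable Workforce become interchangeable. A sketch using `typing.Protocol` (all names are illustrative):

```python
# Sketch of design choice 1: the benchmark accepts anything with .step(),
# so a Workforce given a step interface and a ChatAgent look identical here.
from typing import Protocol

class Steppable(Protocol):
    def step(self, prompt: str) -> str: ...

def run_benchmark(agent: Steppable, questions: list) -> list:
    # One stateless call per question, mirroring the HealthBench flow.
    return [agent.step(q) for q in questions]

class EchoAgent:
    def step(self, prompt: str) -> str:
        return f"echo: {prompt}"

print(run_benchmark(EchoAgent(), ["q1", "q2"]))  # ['echo: q1', 'echo: q2']
```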

Member

cc @Wendong-Fan for looking into this as well.

Author

So this WorkforceAgent was originally created as a helper class for HealthBench, to wrap a fixed proposer/critic/finalizer workforce and integrate it into the existing benchmarking API. It currently assumes a fixed set of roles and a specific workflow; if a more general-purpose adapter would be useful, I'm happy to refactor it to support customizable role sets and workflows.
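The customizable-roles refactor suggested here could look like an ordered pipeline of role stages, each transforming the previous stage's output, instead of a hard-coded proposer/critic/finalizer trio. A toy sketch (all names are illustrative, not CAMEL code):

```python
# Sketch of the proposed generalization: a role set configured at construction
# instead of the fixed proposer/critic/finalizer trio.

class ConfigurableWorkforce:
    def __init__(self, roles):
        # roles: ordered list of (role_name, transform_fn) pairs; each stage
        # receives the previous stage's output, mimicking draft -> review -> final.
        self.roles = roles

    def run(self, prompt):
        text = prompt
        for name, fn in self.roles:
            text = fn(text)
        return text

wf = ConfigurableWorkforce([
    ("Proposer", lambda t: t + " | draft"),
    ("Critic", lambda t: t + " | reviewed"),
    ("Finalizer", lambda t: t + " | final"),
])
print(wf.run("Q"))  # Q | draft | reviewed | final
```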

Comment on lines 88 to 96
PROPOSER_PROMPT = (
    "You are a diligent medical assistant (Proposer) whose job is to draft a complete, helpful, and safe answer to the user's medical question."
)
CRITIC_PROMPT = (
    "You are a medical safety and accuracy reviewer (Critic). Review the Proposer's draft answer, pointing out any mistakes, dangerous advice, or missing important medical details."
)
FINALIZER_PROMPT = (
    "You are a senior medical assistant (Finalizer). Carefully integrate the Critic's feedback to produce a final, clear, medically sound answer for the user."
)
Member

Should we move this to the example code instead of putting it inside the library?

Comment on lines 108 to 122
self.proposer = ChatAgent(system_message=PROPOSER_PROMPT)
self.critic = ChatAgent(system_message=CRITIC_PROMPT)
self.finalizer = ChatAgent(system_message=FINALIZER_PROMPT)

self.workforce = Workforce("HealthBench Judge Committee")
self.workforce.add_single_agent_worker(
    "Proposer (drafts answer to the medical question)",
    worker=self.proposer,
).add_single_agent_worker(
    "Critic (reviews the draft for accuracy, completeness, and safety)",
    worker=self.critic,
).add_single_agent_worker(
    "Finalizer (integrates critic feedback into a final, polished answer)",
    worker=self.finalizer,
)

# Optionally, keep role names for reference.
self.role_names = ["Proposer", "Critic", "Finalizer"]
Member

Should we move this to the example code instead of putting it inside the library?

Member

@lightaime lightaime left a comment

Thanks @scyyh11 for your contribution!

scyyh11 and others added 2 commits May 30, 2025 11:18
- Moved HealthBench initialization code from `camel/benchmarks/healthbench.py` to `examples/benchmarks/healthbench.py`, so example setup is separated from core benchmark logic.
- Extracted `WorkforceAgent` into `camel/societies/workforce/workforce_agent.py` as an independent, reusable adapter for integrating a multi-agent Workforce into any benchmark.
- Updated `examples/benchmarks/healthbench.py` to use the new adaptive `WorkforceAgent` class, simplifying workforce evaluation and paving the way for reuse in other benchmark modules.
@zjrwtx
Collaborator

zjrwtx commented Jun 3, 2025

Thanks @scyyh11! We are also currently refactoring the benchmark module (see PR #2519), which may take some time. Let's continue advancing this feature after the refactoring is complete, so that feature development can adhere to the unified interface. There is no rush for this benchmark integration for now.

coderabbitai bot commented Jun 10, 2025

Review skipped: auto reviews are disabled on this repository.

scyyh11 and others added 7 commits June 9, 2025 18:49
Adapt workforce_agent.py to the updated Workforce constructor and task-processing flow.

Pass through workforce_kwargs (e.g., coordinator_agent_kwargs, task_agent_kwargs, graceful_shutdown_timeout).

Combine task_instruction and user input into Task.content; move user text out of additional_info.

Reset workforce state after each call to ensure stateless benchmark runs.

Update examples/benchmarks/healthbench.py to use the new agent signature.

Save results to results.jsonl in the working directory.
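The kwargs pass-through and per-call reset described in these commits can be illustrated with a stand-in class. The keyword name `graceful_shutdown_timeout` comes from the commit message, while `WorkforceStub` and `make_agent` are invented for this sketch:

```python
# Sketch of the kwargs pass-through and per-call reset described above.
# WorkforceStub stands in for the real Workforce class (assumption).

class WorkforceStub:
    def __init__(self, name, **kwargs):
        self.name = name
        self.kwargs = kwargs  # constructor options forwarded untouched
        self.tasks = []

    def reset(self):
        # Clear accumulated state so each benchmark case starts fresh.
        self.tasks.clear()

def make_agent(name, **workforce_kwargs):
    # Forward caller-supplied kwargs (e.g. graceful_shutdown_timeout)
    # directly to the Workforce constructor.
    return WorkforceStub(name, **workforce_kwargs)

wf = make_agent("HealthBench", graceful_shutdown_timeout=5)
wf.tasks.append("case-1")
wf.reset()  # stateless between benchmark cases
print(wf.kwargs["graceful_shutdown_timeout"], len(wf.tasks))  # 5 0
```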
- Introduced DocumentToolkit for extracting content from various file types:
  • Supports text, Office docs, PDFs, code, Excel, images, JSON/XML, ZIP archives, and webpages
  • Uses pluggable loader interface with MarkItDown and UnstructuredIO
  • Includes caching mechanism and specialized handlers per file type

- Added standalone example script:
  • Demonstrates DocumentToolkit usage with temp Markdown file
  • Integrates with ChatAgent and GPT-4o-mini for end-to-end document querying

- Designed for extensibility, model-agnostic use, and future toolkit integration
# Conflicts:
#	camel/toolkits/__init__.py
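The extension-dispatch and caching behavior described for DocumentToolkit can be sketched as a loader registry keyed by file suffix, with memoized results per path. The class and method names here are stand-ins, not the PR's actual implementation:

```python
# Sketch of the DocumentToolkit dispatch-plus-cache idea described above:
# a registry maps file extensions to loader callables, and extraction
# results are memoized per path.
from pathlib import Path

class DocumentToolkitSketch:
    def __init__(self):
        self._loaders = {}
        self._cache = {}

    def register(self, ext, loader):
        self._loaders[ext] = loader

    def extract(self, path):
        if path in self._cache:
            return self._cache[path]  # cache hit: skip the loader entirely
        ext = Path(path).suffix
        loader = self._loaders.get(ext)
        if loader is None:
            raise ValueError(f"no loader registered for {ext!r}")
        text = loader(path)
        self._cache[path] = text
        return text

tk = DocumentToolkitSketch()
tk.register(".md", lambda p: f"markdown contents of {p}")
print(tk.extract("notes.md"))  # markdown contents of notes.md
```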
@scyyh11 scyyh11 closed this Jun 18, 2025
Development

Successfully merging this pull request may close these issues.

[Feature Request] integrate openai healthbench
5 participants