-
Notifications
You must be signed in to change notification settings - Fork 1k
Optimize factual correctness metric runtime by 50% #2153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimize factual correctness metric runtime by 50% #2153
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR optimizes the FactualCorrectness
metric by implementing concurrent execution of independent operations, reducing evaluation time by approximately 50%. The factual correctness metric works by comparing claims between a response and reference in both directions: decomposing the response into claims and verifying them against the reference (reference_response
), and decomposing the reference into claims and verifying them against the response (response_reference
).
The key insight is that these two operations are completely independent - they don't need to wait for each other's results. Previously, they were executed sequentially using separate await
calls, which meant the total execution time was the sum of both operations. The optimization uses asyncio.gather()
to run these tasks concurrently, reducing the total time to the maximum of either operation rather than their sum.
The implementation adds a new helper method decompose_and_verify_claims()
that encapsulates the two-step process of claim decomposition and verification. For cases where only precision mode is used, a lightweight get_empty_response()
helper is provided since the response_reference
evaluation isn't needed. This change integrates seamlessly with the existing factual correctness evaluation pipeline and maintains all the existing functionality while significantly improving performance for the common case where both directions of evaluation are required.
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it only changes execution order without altering logic
- Score reflects a solid performance optimization with clean implementation and no breaking changes
- Pay close attention to the asyncio concurrency patterns and ensure proper error handling in concurrent tasks
1 file reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @vignesh14052002
Some thoughts have left for changes. Please also rebase to latest main.
response_claims = await self.decompose_claims(response, callbacks) | ||
reference_response = await self.verify_claims( | ||
premise=reference, hypothesis_list=response_claims, callbacks=callbacks | ||
async def get_empty_response(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to make it a class method for reusability
@@ -287,5 +293,13 @@ async def _single_turn_ascore( | |||
|
|||
return np.round(score, 2) | |||
|
|||
async def decompose_and_verify_claims( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we name this something more generic for clarity.
How about get_claim_verification_results
?
response_reference need not to wait for reference_response, so made it concurrent, it reduces the overall evaluation time by half