feat: support sciencemetabench #333
Conversation
Summary of Changes

Hello @shijinpjlab, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly expands the Dingo framework's evaluation capabilities by integrating the ScienceMetaBench dataset. It introduces a structured approach to assessing the quality of extracted metadata across various document types, offering flexible result storage and advanced reporting features. The changes aim to provide users with powerful tools for benchmarking and analyzing metadata extraction performance.
Code Review
This pull request adds support for the ScienceMetaBench benchmark, including new evaluation rules, data processing, and result-export functionality. The code is well structured and comes with thorough documentation and tests. My review focuses on correctness and maintainability. The main suggestions are: fix a bug in the result-saving logic, refactor duplicated code in the evaluation rules, and make the Excel export more robust. I have also pointed out a typo in the documentation and suggested extending the test suite to cover more edge cases.
```python
eval_details = dingo_result.get('eval_details', {})
default_details = eval_details.get('default', [])

# Get the similarity dictionary
similarity_dict = {}
if default_details and len(default_details) > 0:
    reason_list = default_details[0].get('reason', [])
    if reason_list and len(reason_list) > 0:
        similarity_dict = reason_list[0].get('similarity', {})
```
When extracting the similarity dictionary from `dingo_result`, the code hardcodes a lookup of the `'default'` key in `eval_details`. However, when `fields` are specified in the evaluation config, this key is a combination of the field names rather than `'default'`, so the similarity data cannot be extracted in those cases. Consider making this logic generic so it finds the similarity dictionary among the values of `eval_details`.
Suggested change:

```diff
 eval_details = dingo_result.get('eval_details', {})
-default_details = eval_details.get('default', [])
 # Get the similarity dictionary
 similarity_dict = {}
-if default_details and len(default_details) > 0:
-    reason_list = default_details[0].get('reason', [])
-    if reason_list and len(reason_list) > 0:
-        similarity_dict = reason_list[0].get('similarity', {})
+for details_list in eval_details.values():
+    if details_list and len(details_list) > 0:
+        reason_list = details_list[0].get('reason', [])
+        if reason_list and len(reason_list) > 0 and 'similarity' in reason_list[0]:
+            similarity_dict = reason_list[0].get('similarity', {})
+            break  # Exit the loop once the similarity dict is found
```
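For context, here is a minimal sketch of the two shapes of `eval_details` the generic lookup needs to handle. The non-`'default'` key name below is a hypothetical example; the actual key depends on the configured `fields`:

```python
# Shape 1: no fields configured, results keyed under 'default'.
eval_details_default = {
    "default": [
        {"metric": "RuleMetadataMatchPaper",
         "reason": [{"similarity": {"doi": 1.0, "title": 1.0}}]}
    ]
}

# Shape 2: fields configured, key is a combination of field names.
# 'benchmark,product' is a hypothetical example of such a key.
eval_details_fields = {
    "benchmark,product": [
        {"metric": "RuleMetadataMatchPaper",
         "reason": [{"similarity": {"doi": 1.0, "title": 1.0}}]}
    ]
}
```

Iterating over `eval_details.values()`, as in the suggestion above, handles both shapes without naming the key.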
```python
# if input_args.executor.result_save.raw:
#     str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
# else:
#     str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)
str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
```
In merge mode, the current code hardcodes `result_info.to_raw_dict()` and ignores the `input_args.executor.result_save.raw` setting. This is inconsistent with the behavior in non-merge mode. Consider restoring the commented-out logic so the `raw` flag is handled correctly.
Suggested change:

```diff
-# if input_args.executor.result_save.raw:
-#     str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
-# else:
-#     str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)
-str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
+if input_args.executor.result_save.raw:
+    str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
+else:
+    str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)
```
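As a minimal sketch of why this matters, assuming `result_info` exposes the two serializers as in the snippet above (the stub class below is hypothetical, for illustration only):

```python
import json

class _StubResultInfo:
    """Hypothetical stand-in for result_info, exposing both serializers."""
    def to_dict(self):
        return {"eval_status": True}

    def to_raw_dict(self):
        return {"eval_status": True, "raw_content": "..."}

def serialize(result_info, raw: bool) -> str:
    # Honor the raw flag in merge mode, matching the non-merge path.
    if raw:
        return json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
    return json.dumps(result_info.to_dict(), ensure_ascii=False)

assert "raw_content" in serialize(_StubResultInfo(), raw=True)
assert "raw_content" not in serialize(_StubResultInfo(), raw=False)
```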
```python
class RuleMetadataMatchTextbook(RuleMetadataMatchBase):
    """
    Check similarity matching for textbook metadata fields.
    Compares each sub-field of the benchmark and product fields, including:
    isbn, title, author, abstract, category, pub_time, publisher.
    The threshold is 0.6; the check passes only if every field's similarity
    reaches the threshold.
    """

    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchTextbook",
        "description": "检查教科书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }

    dynamic_config = EvaluatorRuleArgs(
        key_list=['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher'],
        threshold=0.6
    )
```
The implementations of `RuleMetadataMatchTextbook` and `RuleMetadataMatchEbook` are almost identical, in particular `dynamic_config`. To reduce code duplication and improve maintainability, consider introducing a common base class.

For example:
```python
class RuleMetadataMatchBookBase(RuleMetadataMatchBase):
    """Base class for ebook and textbook metadata matching"""
    dynamic_config = EvaluatorRuleArgs(
        key_list=['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher'],
        threshold=0.6
    )


@Model.rule_register("QUALITY_BAD_EFFECTIVENESS", ["sciencemetabench"])
class RuleMetadataMatchEbook(RuleMetadataMatchBookBase):
    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchEbook",
        "description": "检查电子书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }


@Model.rule_register("QUALITY_BAD_EFFECTIVENESS", ["sciencemetabench"])
class RuleMetadataMatchTextbook(RuleMetadataMatchBookBase):
    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchTextbook",
        "description": "检查教科书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }
```
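As a quick sanity check (assuming the classes above are defined and importable), the shared configuration can be verified to come from the common base:

```python
# Both subclasses reuse the single EvaluatorRuleArgs instance defined on the base.
assert RuleMetadataMatchEbook.dynamic_config is RuleMetadataMatchTextbook.dynamic_config
assert RuleMetadataMatchTextbook.dynamic_config.threshold == 0.6
```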
docs/sciencemetabench/README.md (Outdated):

- **`benchmark`**: Ground truth (standard answer)
- **Source**: Obtained from [ScienceMetaBench Dataset](https://huggingface.co/datasets/opendatalab/ScienceMetaBench)
- **Included Fields**:
```python
import json
import shutil
import tempfile
from pathlib import Path

import pandas as pd
import pytest

# write_similarity_to_excel is imported from the module under test
# (the import path is not shown in this excerpt).


class TestWriteSimilarityToExcel:
    """Tests for the write_similarity_to_excel function"""

    @pytest.fixture
    def temp_output_dir(self):
        """Create a temporary output directory"""
        temp_dir = tempfile.mkdtemp()
        yield temp_dir
        # Clean up
        shutil.rmtree(temp_dir, ignore_errors=True)

    @pytest.fixture
    def sample_paper_data(self, temp_output_dir):
        """Create sample paper data"""
        data = [
            {
                "sha256": "test001",
                "benchmark": {
                    "doi": "10.1234/test001",
                    "title": "Test Paper 1",
                    "author": "Author 1",
                    "keyword": "keyword1",
                    "abstract": "Abstract 1",
                    "pub_time": "2024"
                },
                "product": {
                    "doi": "10.1234/test001",
                    "title": "Test Paper 1",
                    "author": "Author 1",
                    "keyword": "keyword1",
                    "abstract": "Abstract 1",
                    "pub_time": "2024"
                },
                "dingo_result": {
                    "eval_status": True,
                    "eval_details": {
                        "default": [
                            {
                                "metric": "RuleMetadataMatchPaper",
                                "status": True,
                                "label": ["QUALITY_GOOD"],
                                "reason": [
                                    {
                                        "similarity": {
                                            "doi": 1.0,
                                            "title": 1.0,
                                            "author": 1.0,
                                            "keyword": 1.0,
                                            "abstract": 1.0,
                                            "pub_time": 1.0
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            {
                "sha256": "test002",
                "benchmark": {
                    "doi": "10.1234/test002",
                    "title": "Test Paper 2",
                    "author": "Author 2",
                    "keyword": "keyword2",
                    "abstract": "Abstract 2",
                    "pub_time": "2024"
                },
                "product": {
                    "doi": "",
                    "title": "Different Title",
                    "author": "Author 2",
                    "keyword": "keyword2",
                    "abstract": "Different Abstract",
                    "pub_time": "2024"
                },
                "dingo_result": {
                    "eval_status": True,
                    "eval_details": {
                        "default": [
                            {
                                "metric": "RuleMetadataMatchPaper",
                                "status": True,
                                "label": ["QUALITY_BAD_EFFECTIVENESS.RuleMetadataMatchPaper.doi"],
                                "reason": [
                                    {
                                        "similarity": {
                                            "doi": 0.0,
                                            "title": 0.5,
                                            "author": 1.0,
                                            "keyword": 1.0,
                                            "abstract": 0.45,
                                            "pub_time": 1.0
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            }
        ]

        # Write the jsonl file
        jsonl_file = Path(temp_output_dir) / "test_result.jsonl"
        with open(jsonl_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')

        return temp_output_dir

    def test_write_paper_excel(self, sample_paper_data):
        """Export an Excel file for the paper type"""
        output_filename = "test_paper.xlsx"

        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename=output_filename
        )

        # Verify the returned DataFrame
        assert df is not None
        assert len(df) == 2
        assert 'sha256' in df.columns

        # Verify that all paper fields are present
        for field in ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']:
            assert f'benchmark_{field}' in df.columns
            assert f'product_{field}' in df.columns
            assert f'similarity_{field}' in df.columns

        # Verify the Excel file was created
        excel_file = Path(sample_paper_data) / output_filename
        assert excel_file.exists()

        # Read the Excel file back and verify its contents
        df_from_excel = pd.read_excel(excel_file, sheet_name='相似度分析')
        assert len(df_from_excel) == 2

        # Verify the summary statistics sheet
        df_summary = pd.read_excel(excel_file, sheet_name='汇总统计')
        assert len(df_summary) == 7  # 6 fields + 1 overall accuracy row
        assert '字段' in df_summary.columns
        assert '平均相似度' in df_summary.columns
        assert df_summary.iloc[-1]['字段'] == '总体准确率'

    def test_invalid_type(self, temp_output_dir):
        """An invalid data type should raise"""
        with pytest.raises(ValueError, match="不支持的数据类型"):
            write_similarity_to_excel(
                type='invalid_type',
                output_dir=temp_output_dir
            )

    def test_nonexistent_directory(self):
        """A nonexistent directory should raise"""
        with pytest.raises(ValueError, match="输出目录不存在"):
            write_similarity_to_excel(
                type='paper',
                output_dir='/nonexistent/directory'
            )

    def test_no_jsonl_files(self, temp_output_dir):
        """A directory with no jsonl files should raise"""
        with pytest.raises(ValueError, match="未找到任何.jsonl文件"):
            write_similarity_to_excel(
                type='paper',
                output_dir=temp_output_dir
            )

    def test_default_filename(self, sample_paper_data):
        """Default output filename generation"""
        write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data
        )

        # Look for the generated file
        output_path = Path(sample_paper_data)
        excel_files = list(output_path.glob("similarity_paper_*.xlsx"))
        assert len(excel_files) > 0

    def test_data_sorting(self, sample_paper_data):
        """Rows should be sorted by sha256"""
        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename="test_sorted.xlsx"
        )

        # Verify the sort order
        sha256_list = df['sha256'].tolist()
        assert sha256_list == sorted(sha256_list)

    def test_all_string_type(self, sample_paper_data):
        """All columns should be string-typed"""
        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename="test_types.xlsx"
        )

        # Verify every column is string-typed
        for col in df.columns:
            assert df[col].dtype == 'object'  # strings show up as object dtype in pandas


if __name__ == '__main__':
    pytest.main([__file__, '-v', '--tb=short'])
```
The current test cases cover many scenarios, which is great. However, two potential issues found in the `write_similarity_to_excel` function are not covered by tests (sketches follow this list):

- **Recursive file search**: The tests do not verify that the function finds `.jsonl` files recursively in subdirectories. The current `sample_paper_data` fixture only creates files in the top-level directory. Consider adding a test case that creates a `.jsonl` file in a subdirectory and verifies that `rglob` finds it correctly.
- **`eval_details` key**: The key of `eval_details` in the test data is hardcoded to `'default'`. The `write_similarity_to_excel` function also hardcodes the lookup of `'default'`, which may be a bug. Consider adding a test case that uses a key other than `'default'` (for example, when `fields` is specified in the evaluation config) to make sure the function works in that case too.

This will help ensure the robustness of the fixed code.
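A minimal sketch of the two suggested cases, written as additional methods for `TestWriteSimilarityToExcel` above (reusing its `temp_output_dir` fixture and imports). The `_make_record` helper and the non-`'default'` key name are hypothetical:

```python
def _make_record(sha256, key="default"):
    """Hypothetical helper: build a minimal paper record with all six fields."""
    fields = ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']
    values = {f: f"{sha256}-{f}" for f in fields}
    return {
        "sha256": sha256,
        "benchmark": dict(values),
        "product": dict(values),
        "dingo_result": {
            "eval_status": True,
            "eval_details": {key: [{"reason": [{"similarity": {f: 1.0 for f in fields}}]}]},
        },
    }


def test_recursive_jsonl_search(self, temp_output_dir):
    """jsonl files in subdirectories should be found (via rglob)."""
    sub_dir = Path(temp_output_dir) / "nested"
    sub_dir.mkdir()
    with open(sub_dir / "nested_result.jsonl", 'w', encoding='utf-8') as f:
        f.write(json.dumps(_make_record("test003"), ensure_ascii=False) + '\n')

    df = write_similarity_to_excel(type='paper', output_dir=temp_output_dir)
    assert len(df) == 1


def test_non_default_eval_details_key(self, temp_output_dir):
    """A non-'default' eval_details key should still yield similarity columns."""
    # 'benchmark,product' is a hypothetical example of a fields-derived key.
    record = _make_record("test004", key="benchmark,product")
    with open(Path(temp_output_dir) / "fields_result.jsonl", 'w', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

    df = write_similarity_to_excel(type='paper', output_dir=temp_output_dir)
    assert len(df) == 1
    assert 'similarity_doi' in df.columns
```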