
Conversation

@shijinpjlab
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @shijinpjlab, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the Dingo framework's evaluation capabilities by integrating the ScienceMetaBench dataset. It introduces a structured approach to assess the quality of extracted metadata across various document types, offering flexible result storage and advanced reporting features. The changes aim to provide users with powerful tools for benchmarking and analyzing metadata extraction performance.

Highlights

  • ScienceMetaBench Integration: Introduced comprehensive support for evaluating metadata extraction accuracy using the ScienceMetaBench dataset, including specific rules for academic papers, ebooks, and textbooks.
  • Configurable Result Merging: Added a new merge option to result saving configurations, allowing all evaluation outputs to be consolidated into a single JSONL file for easier processing.
  • Excel Export Utility: Implemented a utility function to export detailed and summarized evaluation results to an Excel file, providing both field-level and overall accuracy statistics.
  • String Similarity Algorithm: Developed a robust string similarity function, string_similarity, which handles null values, compares case-insensitively, and uses SequenceMatcher for accurate comparisons (see the sketch after this list).
  • Documentation and Examples: Provided new English and Chinese documentation, along with an example script, to guide users on how to leverage the ScienceMetaBench evaluation capabilities.
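
For orientation, here is a minimal sketch of what a helper with those properties could look like. It is an illustration based on the description above, not necessarily the exact implementation in this PR.

from difflib import SequenceMatcher


def string_similarity(a, b) -> float:
    """Return a similarity score in [0, 1] for two metadata values."""
    # Treat None/missing values as empty strings; two empty values count as identical.
    a = "" if a is None else str(a).strip()
    b = "" if b is None else str(b).strip()
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    # Case-insensitive comparison via difflib's SequenceMatcher ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()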
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the ScienceMetaBench benchmark, including new evaluation rules, data processing, and result-export functionality. The code is well structured and comes with thorough documentation and tests. My review focuses on improving correctness and maintainability. The main suggestions are: fix a bug in the result-saving logic, refactor duplicated code in the evaluation rules, and make the Excel export more robust. I also point out a typographical error in the documentation and suggest improving the test suite to cover more edge cases.

Comment on lines 253 to 261
eval_details = dingo_result.get('eval_details', {})
default_details = eval_details.get('default', [])

# Get the similarity dictionary
similarity_dict = {}
if default_details and len(default_details) > 0:
    reason_list = default_details[0].get('reason', [])
    if reason_list and len(reason_list) > 0:
        similarity_dict = reason_list[0].get('similarity', {})

high

When extracting the similarity dictionary from dingo_result, the code hard-codes a lookup of the 'default' key in eval_details. However, when fields is specified in the evaluation configuration, that key is a combination of the field names rather than 'default', so the similarity data cannot be extracted in those cases. Consider changing this logic to search the values of eval_details generically for the similarity dictionary.

Suggested change
-eval_details = dingo_result.get('eval_details', {})
-default_details = eval_details.get('default', [])
-
-# Get the similarity dictionary
-similarity_dict = {}
-if default_details and len(default_details) > 0:
-    reason_list = default_details[0].get('reason', [])
-    if reason_list and len(reason_list) > 0:
-        similarity_dict = reason_list[0].get('similarity', {})
+eval_details = dingo_result.get('eval_details', {})
+# Get the similarity dictionary
+similarity_dict = {}
+for details_list in eval_details.values():
+    if details_list and len(details_list) > 0:
+        reason_list = details_list[0].get('reason', [])
+        if reason_list and len(reason_list) > 0 and 'similarity' in reason_list[0]:
+            similarity_dict = reason_list[0].get('similarity', {})
+            break  # Stop once the similarity dictionary has been found

Comment on lines +290 to +294
# if input_args.executor.result_save.raw:
#     str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
# else:
#     str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)
str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)

medium

In merge mode, the current code hard-codes result_info.to_raw_dict() and ignores the input_args.executor.result_save.raw setting. This is inconsistent with the behaviour in non-merge mode. Consider restoring the commented-out logic so that the raw flag is handled correctly.

Suggested change
-# if input_args.executor.result_save.raw:
-#     str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
-# else:
-#     str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)
-str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
+if input_args.executor.result_save.raw:
+    str_json = json.dumps(result_info.to_raw_dict(), ensure_ascii=False)
+else:
+    str_json = json.dumps(result_info.to_dict(), ensure_ascii=False)

Comment on lines 166 to 186
class RuleMetadataMatchTextbook(RuleMetadataMatchBase):
    """
    Check similarity matching of textbook metadata fields.
    Compares each sub-field of the benchmark and product fields, including:
    isbn, title, author, abstract, category, pub_time, publisher
    The threshold is 0.6; the check passes only if every field reaches the threshold.
    """

    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchTextbook",
        "description": "检查教科书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }

    dynamic_config = EvaluatorRuleArgs(
        key_list=['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher'],
        threshold=0.6
    )

medium

The implementations of RuleMetadataMatchTextbook and RuleMetadataMatchEbook are almost identical, in particular their dynamic_config. To reduce code duplication and improve maintainability, consider introducing a shared base class for them.

For example:

class RuleMetadataMatchBookBase(RuleMetadataMatchBase):
    """Base class for ebook and textbook metadata matching"""
    dynamic_config = EvaluatorRuleArgs(
        key_list=['isbn', 'title', 'author', 'abstract', 'category', 'pub_time', 'publisher'],
        threshold=0.6
    )

@Model.rule_register("QUALITY_BAD_EFFECTIVENESS", ["sciencemetabench"])
class RuleMetadataMatchEbook(RuleMetadataMatchBookBase):
    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchEbook",
        "description": "检查电子书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }

@Model.rule_register("QUALITY_BAD_EFFECTIVENESS", ["sciencemetabench"])
class RuleMetadataMatchTextbook(RuleMetadataMatchBookBase):
    _metric_info = {
        "category": "Rule-Based Metadata Quality Metrics",
        "quality_dimension": "EFFECTIVENESS",
        "metric_name": "RuleMetadataMatchTextbook",
        "description": "检查教科书元数据字段与基准数据的相似度匹配,阈值为0.6",
    }


- **`benchmark`**: Ground truth (standard answer)
- **Source**: Obtained from [ScienceMetaBench Dataset](https://huggingface.co/datasets/opendatalab/ScienceMetaBench)
- **Included Fields**：

medium

The colon here is a full-width character (：); in English documentation, a half-width colon (:) should be used.

Suggested change
-- **Included Fields**：
+- **Included Fields**:

Comment on lines 333 to 542
class TestWriteSimilarityToExcel:
    """Tests for the write_similarity_to_excel function"""

    @pytest.fixture
    def temp_output_dir(self):
        """Create a temporary output directory"""
        temp_dir = tempfile.mkdtemp()
        yield temp_dir
        # Clean up
        shutil.rmtree(temp_dir, ignore_errors=True)

    @pytest.fixture
    def sample_paper_data(self, temp_output_dir):
        """Create sample paper data"""
        data = [
            {
                "sha256": "test001",
                "benchmark": {
                    "doi": "10.1234/test001",
                    "title": "Test Paper 1",
                    "author": "Author 1",
                    "keyword": "keyword1",
                    "abstract": "Abstract 1",
                    "pub_time": "2024"
                },
                "product": {
                    "doi": "10.1234/test001",
                    "title": "Test Paper 1",
                    "author": "Author 1",
                    "keyword": "keyword1",
                    "abstract": "Abstract 1",
                    "pub_time": "2024"
                },
                "dingo_result": {
                    "eval_status": True,
                    "eval_details": {
                        "default": [
                            {
                                "metric": "RuleMetadataMatchPaper",
                                "status": True,
                                "label": ["QUALITY_GOOD"],
                                "reason": [
                                    {
                                        "similarity": {
                                            "doi": 1.0,
                                            "title": 1.0,
                                            "author": 1.0,
                                            "keyword": 1.0,
                                            "abstract": 1.0,
                                            "pub_time": 1.0
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
            {
                "sha256": "test002",
                "benchmark": {
                    "doi": "10.1234/test002",
                    "title": "Test Paper 2",
                    "author": "Author 2",
                    "keyword": "keyword2",
                    "abstract": "Abstract 2",
                    "pub_time": "2024"
                },
                "product": {
                    "doi": "",
                    "title": "Different Title",
                    "author": "Author 2",
                    "keyword": "keyword2",
                    "abstract": "Different Abstract",
                    "pub_time": "2024"
                },
                "dingo_result": {
                    "eval_status": True,
                    "eval_details": {
                        "default": [
                            {
                                "metric": "RuleMetadataMatchPaper",
                                "status": True,
                                "label": ["QUALITY_BAD_EFFECTIVENESS.RuleMetadataMatchPaper.doi"],
                                "reason": [
                                    {
                                        "similarity": {
                                            "doi": 0.0,
                                            "title": 0.5,
                                            "author": 1.0,
                                            "keyword": 1.0,
                                            "abstract": 0.45,
                                            "pub_time": 1.0
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            }
        ]

        # Write the records to a jsonl file
        jsonl_file = Path(temp_output_dir) / "test_result.jsonl"
        with open(jsonl_file, 'w', encoding='utf-8') as f:
            for item in data:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')

        return temp_output_dir

    def test_write_paper_excel(self, sample_paper_data):
        """Export an Excel file for the paper type"""
        output_filename = "test_paper.xlsx"

        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename=output_filename
        )

        # Check the returned DataFrame
        assert df is not None
        assert len(df) == 2
        assert 'sha256' in df.columns

        # Check that all paper fields are present
        for field in ['doi', 'title', 'author', 'keyword', 'abstract', 'pub_time']:
            assert f'benchmark_{field}' in df.columns
            assert f'product_{field}' in df.columns
            assert f'similarity_{field}' in df.columns

        # Check that the Excel file was created
        excel_file = Path(sample_paper_data) / output_filename
        assert excel_file.exists()

        # Read the Excel file back and check its contents
        df_from_excel = pd.read_excel(excel_file, sheet_name='相似度分析')
        assert len(df_from_excel) == 2

        # Check the summary statistics sheet
        df_summary = pd.read_excel(excel_file, sheet_name='汇总统计')
        assert len(df_summary) == 7  # 6 fields + 1 overall-accuracy row
        assert '字段' in df_summary.columns
        assert '平均相似度' in df_summary.columns
        assert df_summary.iloc[-1]['字段'] == '总体准确率'

    def test_invalid_type(self, temp_output_dir):
        """An unsupported data type raises ValueError"""
        with pytest.raises(ValueError, match="不支持的数据类型"):
            write_similarity_to_excel(
                type='invalid_type',
                output_dir=temp_output_dir
            )

    def test_nonexistent_directory(self):
        """A non-existent directory raises ValueError"""
        with pytest.raises(ValueError, match="输出目录不存在"):
            write_similarity_to_excel(
                type='paper',
                output_dir='/nonexistent/directory'
            )

    def test_no_jsonl_files(self, temp_output_dir):
        """A directory without jsonl files raises ValueError"""
        with pytest.raises(ValueError, match="未找到任何.jsonl文件"):
            write_similarity_to_excel(
                type='paper',
                output_dir=temp_output_dir
            )

    def test_default_filename(self, sample_paper_data):
        """A default filename is generated"""
        write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data
        )

        # Look for the generated file
        output_path = Path(sample_paper_data)
        excel_files = list(output_path.glob("similarity_paper_*.xlsx"))
        assert len(excel_files) > 0

    def test_data_sorting(self, sample_paper_data):
        """Rows are sorted by sha256"""
        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename="test_sorted.xlsx"
        )

        # Check the ordering
        sha256_list = df['sha256'].tolist()
        assert sha256_list == sorted(sha256_list)

    def test_all_string_type(self, sample_paper_data):
        """All columns are string-typed"""
        df = write_similarity_to_excel(
            type='paper',
            output_dir=sample_paper_data,
            output_filename="test_types.xlsx"
        )

        # All columns should be string-typed
        for col in df.columns:
            assert df[col].dtype == 'object'  # pandas stores strings as the object dtype


if __name__ == '__main__':
    pytest.main([__file__, '-v', '--tb=short'])

medium

The current test cases cover a lot of scenarios, which is great. However, two potential issues found in the write_similarity_to_excel function are not covered by the tests (a sketch of such a test follows below):

  1. Recursive file search: the tests do not verify that the function finds .jsonl files recursively in sub-directories. The current sample_paper_data fixture only creates files in the top-level directory. Consider adding a test case that creates a .jsonl file in a sub-directory and verifies that rglob finds it correctly.
  2. eval_details keys: the key of eval_details in the test data is hard-coded to 'default', and write_similarity_to_excel also hard-codes the lookup of 'default', which may be a bug. Consider adding a test case that uses a key other than 'default' (for example, the one produced when fields is specified in the evaluation configuration) to make sure the function also works in that case.

This will help ensure the robustness of the fixed code.
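
For illustration, here is a minimal sketch of such an additional case. It assumes the hard-coded 'default' lookup has been generalized as suggested above; the combined eval_details key and file layout are made-up placeholders, and write_similarity_to_excel is assumed to be imported at the top of the test module as in the existing tests.

import json


def test_recursive_search_and_non_default_key(tmp_path):
    """A .jsonl file in a sub-directory, keyed by a non-'default' eval_details key, is still picked up."""
    sub_dir = tmp_path / "nested"
    sub_dir.mkdir()

    record = {
        "sha256": "nested001",
        "benchmark": {"doi": "10.1234/nested", "title": "T", "author": "A",
                      "keyword": "k", "abstract": "ab", "pub_time": "2024"},
        "product": {"doi": "10.1234/nested", "title": "T", "author": "A",
                    "keyword": "k", "abstract": "ab", "pub_time": "2024"},
        "dingo_result": {
            "eval_status": True,
            "eval_details": {
                # Deliberately not 'default': mimics the key produced when `fields` is configured.
                "doi_title_author_keyword_abstract_pub_time": [
                    {
                        "metric": "RuleMetadataMatchPaper",
                        "status": True,
                        "label": ["QUALITY_GOOD"],
                        "reason": [{"similarity": {"doi": 1.0, "title": 1.0, "author": 1.0,
                                                   "keyword": 1.0, "abstract": 1.0, "pub_time": 1.0}}],
                    }
                ]
            },
        },
    }
    with open(sub_dir / "nested_result.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # The file lives one level below output_dir, so it is only found if the search is recursive.
    df = write_similarity_to_excel(type='paper', output_dir=str(tmp_path),
                                   output_filename="nested.xlsx")
    assert "nested001" in df["sha256"].tolist()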

@shijinpjlab merged commit 320b44f into MigoXLab:dev on Jan 15, 2026
2 checks passed