[fix]: Fix bad cases in the simplify stage and improve its robustness. #541

Closed
ideaflow wants to merge 5 commits into ccprocessor:dev from ideaflow:dev_simplify

Conversation

@ideaflow
Collaborator

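  1. Table
  • Removed `th`, `thead`, `tfoot`, and `summary` from the data-table detection logic.
  • Changed the existing check for the `caption`, `colgroup`, and `col` tags and for the `headers` attribute on cells: instead of looking at every element under the table node, it now looks only at nodes under the current table node, excluding nodes inside its nested tables.
  • If a table is a data table, the whole table is now a single item. The old version made its inner `tr`/`caption` etc. an item instead.
  2. List
    Added logic to determine the list type:
  • If the list's direct children include non-list elements, it is a layout list.
  • If any direct child of the list contains a block-level element, it is also a layout list.
    A content list gets one item_id for the whole list. A layout list is subdivided further into items inside the list.
  3. Changed the block-level element condition: an element is treated as block-level if it, or any element inside it, is block-level. Concretely, in the `process_node` function, when iterating over the current node's children, `is_block_element(child)` becomes `is_block_element(child) or has_block_descendants(child)`.
  4. Changed how comments are removed: instead of regex matching, rely on the `remove_comments=True` argument of `lxml.html.HTMLParser`. (Some pages contain malformed comments, for example with only an opening marker, which makes the regex match the wrong range and delete real content.)
  5. Changed the deletion logic for `header` and `footer` tags: only `header` and `footer` tags that are direct child elements are now deleted.
  6. Changed the deletion logic for elements whose class or id is `header`, `footer`, or `nav`: an element is deleted only if its class or id is exactly `header` or `footer` and it is a direct child element.
  7. Added logic: when a wrapper element is created, if its parent node has `cc-select=true`, the wrapper also gets `cc-select=true`.
  8. Added logic: if a block_element contains an element with `cc-select=true`, that block_element also gets the `cc-select=true` attribute.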
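
The nested-table rule from the Table item could look roughly like the sketch below. It is not the code in this PR: it uses the standard-library `xml.etree.ElementTree` on well-formed markup as a stand-in for lxml, and `iter_own_descendants` and `is_data_table` are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Tags that mark a data table according to the description above.
DATA_TABLE_TAGS = {"caption", "colgroup", "col"}


def iter_own_descendants(table):
    """Yield descendants of `table` without entering nested <table> elements."""
    stack = list(table)
    while stack:
        node = stack.pop()
        if node.tag == "table":
            continue  # nested table: skip it and everything inside it
        yield node
        stack.extend(node)


def is_data_table(table):
    """Hypothetical check: the markers must belong to this table itself."""
    for node in iter_own_descendants(table):
        if node.tag in DATA_TABLE_TAGS:
            return True
        if node.tag in ("td", "th") and "headers" in node.attrib:
            return True
    return False


outer = ET.fromstring(
    "<table><tr><td>"
    "<table><caption>Prices</caption><tr><td>1</td></tr></table>"
    "</td></tr></table>"
)
inner = outer.find(".//table")

print(is_data_table(outer))  # False: the caption belongs to the nested table
print(is_data_table(inner))  # True
```

With the old "all elements under the table" behaviour, the outer layout table here would have been picked up as a data table because of the caption inside its nested table.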
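
A minimal sketch of the list classification and the widened block-level check, under some assumptions: `is_block_element` and `has_block_descendants` are named in the description, but their bodies here are guesses, `BLOCK_TAGS` is an abbreviated illustrative set, `classify_list` and `treat_as_block` are hypothetical names, and `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET

# Assumed, abbreviated set of block-level tags (the real set is likely longer).
BLOCK_TAGS = {"div", "p", "table", "section", "blockquote"}
LIST_ITEM_TAGS = {"li", "dt", "dd"}


def is_block_element(node):
    return node.tag in BLOCK_TAGS


def has_block_descendants(node):
    # Element.iter() yields the node itself first, so exclude it explicitly.
    return any(is_block_element(d) for d in node.iter() if d is not node)


def treat_as_block(child):
    """The widened condition applied to each child while walking a node."""
    return is_block_element(child) or has_block_descendants(child)


def classify_list(list_node):
    """Return "layout" or "content" following the two rules described above."""
    for child in list_node:
        if child.tag not in LIST_ITEM_TAGS:
            return "layout"  # a direct child that is not a list item
        if has_block_descendants(child):
            return "layout"  # a list item that contains a block-level element
    return "content"


content = ET.fromstring("<ul><li>one</li><li><a href='#'>two</a></li></ul>")
cards = ET.fromstring("<ul><li><div><p>card</p></div></li><li>x</li></ul>")
mixed = ET.fromstring("<ul><li>a</li><div>b</div></ul>")
span = ET.fromstring("<span><div>x</div></span>")

print(classify_list(content))  # content
print(classify_list(cards))    # layout
print(classify_list(mixed))    # layout
print(is_block_element(span), treat_as_block(span))  # False True
```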
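
The narrowed header/footer deletion might be sketched as follows. The description does not name the parent whose direct children are checked, so using `<body>` here is an assumption, as is the helper name `remove_top_level_chrome`. `xml.etree.ElementTree` again stands in for lxml.

```python
import xml.etree.ElementTree as ET

STRUCTURAL_NAMES = {"header", "footer"}


def remove_top_level_chrome(parent):
    """Remove direct children of `parent` that are header/footer elements.

    A child matches if its tag is header/footer, or if its class or id is
    exactly "header" or "footer". Deeper descendants are left untouched.
    """
    for child in list(parent):  # copy the child list: we mutate while iterating
        exact_name = (
            child.get("id") in STRUCTURAL_NAMES
            or child.get("class") in STRUCTURAL_NAMES
        )
        if child.tag in STRUCTURAL_NAMES or exact_name:
            parent.remove(child)


# Assumption: <body> is the parent in question.
body = ET.fromstring(
    "<body>"
    "<header>site nav</header>"
    "<div id='footer'>legal</div>"
    "<div class='header-banner'>promo</div>"
    "<article><header>Post title</header><p>text</p></article>"
    "</body>"
)
remove_top_level_chrome(body)

print([child.tag for child in body])            # ['div', 'article']
print(body.find("article/header") is not None)  # True
```

The `header` inside `article` survives because it is not a direct child, and `header-banner` survives because the class is not an exact match.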
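
The two `cc-select` rules can be illustrated with a short sketch. The function names `make_wrapper` and `propagate_cc_select` and the default wrapper tag are hypothetical, and `xml.etree.ElementTree` is used in place of lxml.

```python
import xml.etree.ElementTree as ET


def make_wrapper(parent, tag="div"):
    """Create a wrapper under `parent`, inheriting cc-select="true" if present."""
    wrapper = ET.SubElement(parent, tag)
    if parent.get("cc-select") == "true":
        wrapper.set("cc-select", "true")
    return wrapper


def propagate_cc_select(block):
    """Mark `block` as selected if any descendant already carries the attribute."""
    if any(d.get("cc-select") == "true" for d in block.iter() if d is not block):
        block.set("cc-select", "true")


block = ET.fromstring('<div><p><span cc-select="true">main text</span></p></div>')
propagate_cc_select(block)
print(block.get("cc-select"))  # true

selected = ET.fromstring('<section cc-select="true"/>')
plain = ET.fromstring("<section/>")
print(make_wrapper(selected).get("cc-select"))  # true
print(make_wrapper(plain).get("cc-select"))     # None
```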

@codecov

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 96.00000% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...it/main_html_parser/simplify_html/simplify_html.py | 95.95% | 4 Missing ⚠️ |

@@            Coverage Diff             @@
##              dev     #541      +/-   ##
==========================================
- Coverage   90.97%   90.93%   -0.05%     
==========================================
  Files         102      106       +4     
  Lines        8890     9159     +269     
==========================================
+ Hits         8088     8329     +241     
- Misses        802      830      +28     
| Files with missing lines | Coverage Δ |
|---|---|
| ..._web_kit/main_html_parser/parser/tag_simplifier.py | 75.00% <100.00%> (-2.78%) ⬇️ |
| ...it/main_html_parser/simplify_html/simplify_html.py | 82.02% <95.95%> (-0.99%) ⬇️ |

... and 7 files with indirect coverage changes

ningwenchang added 2 commits September 8, 2025 06:47
@ideaflow ideaflow closed this Sep 9, 2025