Text and numbers in tables and charts all disappear after PDF conversion #635

zhEdward · 2025-02-16T15:26:19Z

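Before asking...

I have searched the existing issues
I spent at least 5 minutes thinking about and preparing this question
I have carefully and completely read the wiki
I have carefully checked that the problem is not related to my network environment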

Environment

- OS: macOS 11.7.10
- Python: 3.10.16
- pdf2zh: 1.9.0

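Describe your problem

pdf2zh is run inside a virtual environment created with pyenv virtualenv 3.10.16 env-31016.

The test PDF is an instrument test report that is not entirely in English (it contains some Chinese characters).
When I translate the PDF locally with the command-line tool, the English text and numbers in tables and charts all disappear after conversion (the log appears to show errors).
Uploading the file to the free service listed in the README gives the same result.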

How to reproduce

Run pdf2zh ~/Downloads/New_USM_1_30Jul24_1.pdf -s google -li en -lo zh

Expected behavior

No response

Relevant logs

Namespace(files=['/Users/edward/Downloads/New_USM_1_30Jul24_1.pdf'], debug=False, pages=None, vfont='', vchar='', lang_in='en', lang_out='zh', service='google', output='', thread=4, interactive=False, share=False, flask=False, celery=False, authorized=None, prompt=None, compatible=False, onnx=None, serverport=None, dir=False, config=None, yadt=False)
100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.15s/it]
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:276:exception_info(): exception_info:
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:277:exception_info(): Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/utils.py", line 5832, in build_subset
    fts.main(args)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3786, in main
    font = load_font(
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3628, in load_font
    f = font["post"]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 472, in _readTable
    data = self.reader[tag]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/sfnt.py", line 109, in __getitem__
    entry = self.tables[Tag(tag)]
KeyError: 'post'

../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:276:exception_info(): exception_info:
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:277:exception_info(): Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/utils.py", line 5832, in build_subset
    fts.main(args)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3786, in main
    font = load_font(
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3628, in load_font
    f = font["post"]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 472, in _readTable
    data = self.reader[tag]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/sfnt.py", line 109, in __getitem__
    entry = self.tables[Tag(tag)]
KeyError: 'post'
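
The KeyError above is raised when fontTools looks up a `post` table that the font's table directory does not list; fonts embedded in PDFs often omit tables that a standalone font file would carry. As a minimal sketch (not part of pdf2zh, PyMuPDF, or fontTools; the function names are hypothetical), the table directory of a raw SFNT font can be read with the standard library alone to confirm whether `post` is present:

```python
import struct


def sfnt_table_tags(font_bytes: bytes) -> list:
    """Return the table tags listed in an SFNT (TrueType/OpenType) table directory."""
    # 12-byte header: sfntVersion (4 bytes), numTables, searchRange,
    # entrySelector, rangeShift (2 bytes each, big-endian).
    if len(font_bytes) < 12:
        raise ValueError("too short to be an SFNT font")
    _version, num_tables = struct.unpack(">4sH", font_bytes[:6])

    tags = []
    for i in range(num_tables):
        # Each 16-byte record holds: tag, checksum, offset, length.
        start = 12 + 16 * i
        record = font_bytes[start:start + 16]
        if len(record) < 16:
            raise ValueError("table directory is truncated")
        tag, _checksum, _offset, _length = struct.unpack(">4sIII", record)
        tags.append(tag.decode("latin-1"))
    return tags


def has_post_table(font_bytes: bytes) -> bool:
    """True if the font declares a 'post' table, which the subsetter reads on load."""
    return "post" in sfnt_table_tags(font_bytes)
```

The font bytes would have to come from the PDF itself, for example from the buffer returned by PyMuPDF's `Document.extract_font(xref)`; that step is left out here so the sketch stays dependency-free.
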

Original PDF file

New_USM_1_30Jul24_1.pdf

Anything else?

env-31016 ~/Desktop/github-PDFMathTranslate
pip list
Package                   Version
------------------------- -----------
aiofiles                  23.2.1
annotated-types           0.7.0
anyio                     4.8.0
argostranslate            1.9.6
azure-ai-translation-text 1.0.1
azure-core                1.32.0
bitarray                  3.0.0
bitstring                 4.3.0
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
click-default-group       1.2.4
coloredlogs               15.0.1
ConfigArgParse            1.7
cryptography              44.0.1
ctranslate2               4.3.1
deepl                     1.21.0
Deprecated                1.2.18
distro                    1.9.0
docformatter              1.7.5
exceptiongroup            1.2.2
fastapi                   0.115.8
ffmpy                     0.5.0
filelock                  3.17.0
flatbuffers               25.2.10
fonttools                 4.56.0
fsspec                    2025.2.0
gradio                    5.16.0
gradio_client             1.7.0
gradio_pdf                0.0.22
h11                       0.14.0
httpcore                  1.0.7
httpx                     0.28.1
huggingface-hub           0.28.1
humanfriendly             10.0
idna                      3.10
isodate                   0.7.2
Jinja2                    3.1.5
jiter                     0.8.2
joblib                    1.4.2
lxml                      5.3.1
markdown-it-py            3.0.0
MarkupSafe                2.1.5
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.4.2
numpy                     1.26.4
ollama                    0.4.7
onnx                      1.17.0
onnxruntime               1.19.2
openai                    1.63.0
opencv-python             4.11.0.86
opencv-python-headless    4.11.0.86
orjson                    3.10.15
packaging                 24.2
pandas                    2.2.3
pdf2zh                    1.9.0
pdfminer.six              20240706
peewee                    3.17.9
pikepdf                   9.5.2
pillow                    11.1.0
pip                       25.0.1
protobuf                  5.29.3
pycparser                 2.22
pydantic                  2.10.6
pydantic_core             2.27.2
pydub                     0.25.1
Pygments                  2.19.1
PyMuPDF                   1.25.3
python-dateutil           2.9.0.post0
python-multipart          0.0.20
pytz                      2025.1
PyYAML                    6.0.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
ruff                      0.9.6
sacremoses                0.0.53
safehttpx                 0.1.6
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                65.5.0
shellingham               1.5.4
six                       1.17.0
sniffio                   1.3.1
stanza                    1.1.1
starlette                 0.45.3
sympy                     1.13.3
tenacity                  9.0.0
tencentcloud-sdk-python   3.0.1319
toml                      0.10.2
tomlkit                   0.13.2
toposort                  1.10
torch                     2.2.2
tqdm                      4.67.1
typer                     0.15.1
typing_extensions         4.12.2
tzdata                    2025.1
untokenize                0.1.1
urllib3                   2.3.0
uvicorn                   0.34.0
websockets                14.2
wrapt                     1.17.2
xinference-client         1.2.2
xsdata                    24.12
yadt                      0.0.1a28

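Notes
After I downgraded numpy to 1.26.4, pip printed the message below. The errors from running pdf2zh in that state are the ones pasted in the logs section above.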

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yadt 0.0.1a28 requires numpy>=2.0.2, but you have numpy 1.26.4 which is incompatible.

If I upgrade to numpy 2.2.3 and run pdf2zh again, the log below appears and the output PDF is also blank.

pdf2zh ~/Downloads/New_USM_1_30Jul24_1.pdf -s google -li en -lo zh

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/bin/pdf2zh", line 5, in <module>
    from pdf2zh.pdf2zh import main
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/__init__.py", line 2, in <module>
    from pdf2zh.high_level import translate, translate_stream
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/high_level.py", line 24, in <module>
    from pdf2zh.converter import TranslateConverter
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/converter.py", line 22, in <module>
    from pdf2zh.translator import (
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/translator.py", line 20, in <module>
    import argostranslate.translate
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/argostranslate/translate.py", line 5, in <module>
    import ctranslate2
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/__init__.py", line 55, in <module>
    from ctranslate2 import converters, models, specs
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/converters/__init__.py", line 1, in <module>
    from ctranslate2.converters.converter import Converter
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 8, in <module>
    from ctranslate2.specs.model_spec import ACCEPTED_MODEL_TYPES, ModelSpec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/__init__.py", line 1, in <module>
    from ctranslate2.specs.attention_spec import RotaryScalingType
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/attention_spec.py", line 5, in <module>
    from ctranslate2.specs import common_spec, model_spec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/common_spec.py", line 3, in <module>
    from ctranslate2.specs import model_spec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 18, in <module>
    import torch
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/__init__.py", line 1477, in <module>
    from .functional import *  # noqa: F403
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
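
Taken together, the two logs suggest conflicting NumPy requirements in one environment: pip reports that yadt 0.0.1a28 requires numpy>=2.0.2, while the warning raised during `import torch` says a module was compiled against NumPy 1.x. A minimal sketch for checking what is installed, using only the standard library (the helper names are hypothetical, and the version parsing is deliberately simplified to leading numeric components, so pre-release suffixes such as `a28` are ignored):

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version: str) -> tuple:
    """Parse leading numeric components, e.g. '1.26.4' -> (1, 26, 4)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version: str, minimum: str) -> bool:
    """True if `version` is at least `minimum` under the simplified comparison."""
    return release_tuple(version) >= release_tuple(minimum)


if __name__ == "__main__":
    # Package names taken from the pip list and error messages in this issue.
    for name in ("numpy", "torch", "ctranslate2", "yadt"):
        print(name, installed_version(name))
```

With the versions reported here, `meets_minimum("1.26.4", "2.0.2")` is false and `meets_minimum("2.2.3", "2.0.2")` is true, which matches the pip error in one state and the NumPy 1.x/2.x warning in the other.
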

Incorrectly converted documents
New_USM_1_30Jul24_1-dual.pdf
New_USM_1_30Jul24_1-mono.pdf


awwaawwa · 2025-02-16T15:28:58Z

The new backend also has this issue. Will analyze when there's time.

duofengzhiling · 2025-02-25T15:35:06Z

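With 1.9.1 the output is completely blank after translation, while 1.8.8 still works. I suspect the missing characters are caused by a font problem.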

duofengzhiling · 2025-02-25T15:39:04Z

  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\pymupdf\utils.py", line 5698, in build_subset
    fts.main(args)
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\misc\loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\subset\__init__.py", line 3786, in main
    font = load_font(
           ^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\misc\loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\subset\__init__.py", line 3628, in load_font
    f = font["post"]
        ~~~~^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
            ^^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\ttFont.py", line 472, in _readTable
    data = self.reader[tag]
           ~~~~~~~~~~~^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\sfnt.py", line 110, in __getitem__
    data = entry.loadData(self.file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\sfnt.py", line 508, in loadData
    assert len(data) == self.length
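
This traceback fails at a different point from the original report: the `post` table is listed, but the bytes read for it do not match the length declared in the table directory. One plausible cause, stated here as an assumption, is an embedded font whose declared table range runs past the end of the font data. A rough standard-library sketch (the function name is hypothetical) that flags such tables:

```python
import struct


def truncated_tables(font_bytes: bytes) -> list:
    """Return tags of tables whose declared byte range extends past the data."""
    if len(font_bytes) < 12:
        raise ValueError("too short to be an SFNT font")
    # numTables is the 2-byte big-endian field after the 4-byte sfntVersion.
    (num_tables,) = struct.unpack(">H", font_bytes[4:6])

    bad = []
    for i in range(num_tables):
        start = 12 + 16 * i
        record = font_bytes[start:start + 16]
        if len(record) < 16:
            raise ValueError("table directory is truncated")
        # Record layout: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack(">4sIII", record)
        if offset + length > len(font_bytes):
            bad.append(tag.decode("latin-1"))
    return bad
```

An empty result only rules out this particular kind of truncation; it does not show that the font is otherwise valid.
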

awwaawwa · 2025-02-25T15:42:33Z

@duofengzhiling See #678; please wait for the official release. That option fixes the font subsetting problem.

zhEdward added the bug Something isn't working label Feb 16, 2025

awwaawwa added the Normal priority label Feb 16, 2025

awwaawwa mentioned this issue Feb 16, 2025

Bug: Bad Case from GitHub downstream project funstory-ai/BabelDOC#23

Open
