All text and numbers in tables and figures disappear after PDF conversion #635

Open
4 tasks done
zhEdward opened this issue Feb 16, 2025 · 4 comments
Labels
bug Something isn't working Normal priority

Comments

@zhEdward

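Before asking...

  • I have searched the existing issues
  • I spent at least 5 minutes thinking about and preparing this question
  • I have carefully read the wiki in full
  • I have carefully checked that the problem is not related to my network environment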

Environment

- OS: macOS 11.7.10
- Python: 3.10.16
- pdf2zh: 1.9.0

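Describe the problem

pdf2zh is run inside a virtual environment created with pyenv virtualenv 3.10.16 env-31016.

  1. The test PDF is an instrument test report and is not entirely in English (it contains some Chinese characters)
  2. When translating the PDF locally with the command-line tool, all English text and numbers in the tables and figures disappear from the converted output (the log appears to show an error)
  3. Uploading the same file to the free service listed in the README gives the same result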

How to reproduce

Run pdf2zh ~/Downloads/New_USM_1_30Jul24_1.pdf -s google -li en -lo zh

Expected behavior

No response

Relevant logs

Namespace(files=['/Users/edward/Downloads/New_USM_1_30Jul24_1.pdf'], debug=False, pages=None, vfont='', vchar='', lang_in='en', lang_out='zh', service='google', output='', thread=4, interactive=False, share=False, flask=False, celery=False, authorized=None, prompt=None, compatible=False, onnx=None, serverport=None, dir=False, config=None, yadt=False)
100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.15s/it]
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:276:exception_info(): exception_info:
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:277:exception_info(): Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/utils.py", line 5832, in build_subset
    fts.main(args)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3786, in main
    font = load_font(
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3628, in load_font
    f = font["post"]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 472, in _readTable
    data = self.reader[tag]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/sfnt.py", line 109, in __getitem__
    entry = self.tables[Tag(tag)]
KeyError: 'post'

../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:276:exception_info(): exception_info:
../../.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/__init__.py:277:exception_info(): Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pymupdf/utils.py", line 5832, in build_subset
    fts.main(args)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3786, in main
    font = load_font(
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/subset/__init__.py", line 3628, in load_font
    f = font["post"]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 472, in _readTable
    data = self.reader[tag]
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/fontTools/ttLib/sfnt.py", line 109, in __getitem__
    entry = self.tables[Tag(tag)]
KeyError: 'post'
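
Both tracebacks end in KeyError: 'post' raised while fontTools loads the font for subsetting, which means the font's table directory does not list a post table. The following is only a minimal, stdlib-only sketch for checking that on raw font bytes; the function names are hypothetical, it assumes a plain SFNT (TrueType/OpenType) container rather than a collection or a Type 1 font, and it is not how pdf2zh or PyMuPDF handle fonts.

```python
import struct


def sfnt_table_tags(font_bytes: bytes) -> list[str]:
    """Return the table tags listed in an SFNT table directory."""
    if len(font_bytes) < 12:
        raise ValueError("too short to be an SFNT font")
    # Header: 4-byte sfnt version, then a big-endian uint16 table count.
    _version, num_tables = struct.unpack(">4sH", font_bytes[:6])
    tags = []
    for i in range(num_tables):
        # Table records are 16 bytes each and start right after the 12-byte header.
        start = 12 + 16 * i
        record = font_bytes[start:start + 16]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        tag, _checksum, _offset, _length = struct.unpack(">4sIII", record)
        tags.append(tag.decode("latin-1"))
    return tags


def missing_tables(font_bytes: bytes, required=("post",)) -> list[str]:
    """Return the required tags that are absent from the table directory."""
    present = set(sfnt_table_tags(font_bytes))
    return [tag for tag in required if tag not in present]
```

The font bytes could come, for example, from PyMuPDF's Document.extract_font, but fonts embedded in a PDF are not always SFNT files, so a result from this check is only a hint.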

Original PDF file

New_USM_1_30Jul24_1.pdf

Anything else?

env-31016 ~/Desktop/github-PDFMathTranslate
pip list
Package                   Version
------------------------- -----------
aiofiles                  23.2.1
annotated-types           0.7.0
anyio                     4.8.0
argostranslate            1.9.6
azure-ai-translation-text 1.0.1
azure-core                1.32.0
bitarray                  3.0.0
bitstring                 4.3.0
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
click-default-group       1.2.4
coloredlogs               15.0.1
ConfigArgParse            1.7
cryptography              44.0.1
ctranslate2               4.3.1
deepl                     1.21.0
Deprecated                1.2.18
distro                    1.9.0
docformatter              1.7.5
exceptiongroup            1.2.2
fastapi                   0.115.8
ffmpy                     0.5.0
filelock                  3.17.0
flatbuffers               25.2.10
fonttools                 4.56.0
fsspec                    2025.2.0
gradio                    5.16.0
gradio_client             1.7.0
gradio_pdf                0.0.22
h11                       0.14.0
httpcore                  1.0.7
httpx                     0.28.1
huggingface-hub           0.28.1
humanfriendly             10.0
idna                      3.10
isodate                   0.7.2
Jinja2                    3.1.5
jiter                     0.8.2
joblib                    1.4.2
lxml                      5.3.1
markdown-it-py            3.0.0
MarkupSafe                2.1.5
mdurl                     0.1.2
mpmath                    1.3.0
networkx                  3.4.2
numpy                     1.26.4
ollama                    0.4.7
onnx                      1.17.0
onnxruntime               1.19.2
openai                    1.63.0
opencv-python             4.11.0.86
opencv-python-headless    4.11.0.86
orjson                    3.10.15
packaging                 24.2
pandas                    2.2.3
pdf2zh                    1.9.0
pdfminer.six              20240706
peewee                    3.17.9
pikepdf                   9.5.2
pillow                    11.1.0
pip                       25.0.1
protobuf                  5.29.3
pycparser                 2.22
pydantic                  2.10.6
pydantic_core             2.27.2
pydub                     0.25.1
Pygments                  2.19.1
PyMuPDF                   1.25.3
python-dateutil           2.9.0.post0
python-multipart          0.0.20
pytz                      2025.1
PyYAML                    6.0.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
ruff                      0.9.6
sacremoses                0.0.53
safehttpx                 0.1.6
semantic-version          2.10.0
sentencepiece             0.2.0
setuptools                65.5.0
shellingham               1.5.4
six                       1.17.0
sniffio                   1.3.1
stanza                    1.1.1
starlette                 0.45.3
sympy                     1.13.3
tenacity                  9.0.0
tencentcloud-sdk-python   3.0.1319
toml                      0.10.2
tomlkit                   0.13.2
toposort                  1.10
torch                     2.2.2
tqdm                      4.67.1
typer                     0.15.1
typing_extensions         4.12.2
tzdata                    2025.1
untokenize                0.1.1
urllib3                   2.3.0
uvicorn                   0.34.0
websockets                14.2
wrapt                     1.17.2
xinference-client         1.2.2
xsdata                    24.12
yadt                      0.0.1a28

Notes
If I downgrade numpy to 1.26.4, pip prints the message below. The error from running pdf2zh is the one already pasted in the logs section above.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yadt 0.0.1a28 requires numpy>=2.0.2, but you have numpy 1.26.4 which is incompatible.

If I upgrade to numpy 2.2.3 and run pdf2zh again, the log below shows an error and the output PDF is blank as well.

pdf2zh ~/Downloads/New_USM_1_30Jul24_1.pdf -s google -li en -lo zh

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/Users/edward/.pyenv/versions/env-31016/bin/pdf2zh", line 5, in <module>
    from pdf2zh.pdf2zh import main
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/__init__.py", line 2, in <module>
    from pdf2zh.high_level import translate, translate_stream
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/high_level.py", line 24, in <module>
    from pdf2zh.converter import TranslateConverter
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/converter.py", line 22, in <module>
    from pdf2zh.translator import (
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/pdf2zh/translator.py", line 20, in <module>
    import argostranslate.translate
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/argostranslate/translate.py", line 5, in <module>
    import ctranslate2
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/__init__.py", line 55, in <module>
    from ctranslate2 import converters, models, specs
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/converters/__init__.py", line 1, in <module>
    from ctranslate2.converters.converter import Converter
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 8, in <module>
    from ctranslate2.specs.model_spec import ACCEPTED_MODEL_TYPES, ModelSpec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/__init__.py", line 1, in <module>
    from ctranslate2.specs.attention_spec import RotaryScalingType
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/attention_spec.py", line 5, in <module>
    from ctranslate2.specs import common_spec, model_spec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/common_spec.py", line 3, in <module>
    from ctranslate2.specs import model_spec
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 18, in <module>
    import torch
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/__init__.py", line 1477, in <module>
    from .functional import *  # noqa: F403
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/Users/edward/.pyenv/versions/env-31016/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
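
The two messages above pull in opposite directions: pip reports that yadt 0.0.1a28 requires numpy>=2.0.2, while the NumPy warning (triggered here during the torch import) says modules compiled against NumPy 1.x need numpy<2. A minimal sketch of why neither 1.26.4 nor 2.2.3 satisfies both at once; the helper names are hypothetical, the version parser only handles plain dotted numbers, and the two constraints are simply copied from the messages quoted in this issue.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.2.3' into (2, 2, 3); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split("."))


# Constraints as reported in this issue (not read from package metadata).
CONSTRAINTS = [
    ("yadt 0.0.1a28", ">=", "2.0.2"),
    ("binary modules built against NumPy 1.x", "<", "2"),
]

_OPS = {
    ">=": lambda have, bound: have >= bound,
    "<": lambda have, bound: have < bound,
}


def unmet_constraints(numpy_version: str, constraints=CONSTRAINTS) -> list[str]:
    """Return the names of the constraints that this numpy version violates."""
    have = parse_version(numpy_version)
    return [
        name
        for name, op, bound in constraints
        if not _OPS[op](have, parse_version(bound))
    ]


if __name__ == "__main__":
    for candidate in ("1.26.4", "2.2.3"):
        print(candidate, "violates:", unmet_constraints(candidate))
```

Under these two constraints no numpy version passes, so one of the packages involved would have to change before the environment can be consistent.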

Incorrectly converted documents
New_USM_1_30Jul24_1-dual.pdf
New_USM_1_30Jul24_1-mono.pdf

@zhEdward zhEdward added the bug Something isn't working label Feb 16, 2025
@awwaawwa
Collaborator

The new backend also has this issue. Will analyze when there's time.

@duofengzhiling

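With 1.9.1 the translated output is completely blank, while 1.8.8 still works. I suspect the missing characters are caused by a font problem.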

@duofengzhiling

  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\pymupdf\utils.py", line 5698, in build_subset
    fts.main(args)
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\misc\loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\subset\__init__.py", line 3786, in main
    font = load_font(
           ^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\misc\loggingTools.py", line 375, in wrapper
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\subset\__init__.py", line 3628, in load_font
    f = font["post"]
        ~~~~^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\ttFont.py", line 465, in __getitem__
    table = self._readTable(tag)
            ^^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\ttFont.py", line 472, in _readTable
    data = self.reader[tag]
           ~~~~~~~~~~~^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\sfnt.py", line 110, in __getitem__
    data = entry.loadData(self.file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "I:\pdf_en2cn\pdf2zhcmd.venv\Lib\site-packages\fontTools\ttLib\sfnt.py", line 508, in loadData
    assert len(data) == self.length
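
This traceback fails differently from the KeyError in the original report: the post table is listed, but the assertion in loadData fires when fewer bytes can be read than the table directory declares, which suggests the font data is shorter than its directory claims. A minimal stdlib-only sketch for spotting such tables in raw font bytes; the function name is hypothetical and a plain SFNT (TrueType/OpenType) container is assumed.

```python
import struct


def truncated_tables(font_bytes: bytes) -> list[str]:
    """Return tags of tables whose declared byte range runs past the end of the data."""
    if len(font_bytes) < 12:
        raise ValueError("too short to be an SFNT font")
    # Header: 4-byte sfnt version, then a big-endian uint16 table count.
    _version, num_tables = struct.unpack(">4sH", font_bytes[:6])
    bad = []
    for i in range(num_tables):
        start = 12 + 16 * i
        record = font_bytes[start:start + 16]
        if len(record) < 16:
            raise ValueError("truncated table directory")
        tag, _checksum, offset, length = struct.unpack(">4sIII", record)
        # A table is unreadable in full if it extends beyond the available bytes.
        if offset + length > len(font_bytes):
            bad.append(tag.decode("latin-1"))
    return bad
```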

@awwaawwa
Collaborator

@duofengzhiling See #678; please wait for the official release. That option fixes the font subsetting problem.
