Support translation with reasoning models #650

Open
highkay opened this issue Feb 19, 2025 · 7 comments · May be fixed by #653
Labels
enhancement New feature or request

Comments

@highkay
Contributor

highkay commented Feb 19, 2025

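In what scenario do you need the requested feature?

Reasoning models produce noticeably higher translation quality than the original (non-reasoning) models.

Solution

The main reason is that groq offers free distilled reasoning models, chiefly deepseek-r1-distill-qwen-32b.

I am working on this feature; the code is as follows

# Strip the content inside <think> tags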
if "<think>" in content and "</think>" in content:
    content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

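I added this to the do_translate method of OpenAITranslator and ran it locally with python pdf2zh.pdf2zh -i -d. The problem is that it has no effect: the translated output still contains the tags. I also wrote a separate unittest, and that runs fine: the tags are removed.

Please help me take a look. I will send a PR once this is solved.

Additional context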
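For readers who want to try the filtering step in isolation, here is a minimal sketch of it as a standalone function. The name `strip_think` and the sample reply are hypothetical and not part of pdf2zh; the trailing `.strip()` is an addition that the snippet above does not have. The explicit `in` checks are omitted because `re.sub` leaves the string unchanged when nothing matches.

```python
import re

# Hypothetical helper, not part of pdf2zh: same pattern as the snippet above.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(content: str) -> str:
    """Remove every <think>...</think> block, then trim surrounding whitespace."""
    return THINK_BLOCK.sub("", content).strip()


# Made-up reply in the shape a distilled reasoning model might return.
raw = "<think>\nThe user wants Chinese.\n</think>\n\n你好,世界"
cleaned = strip_think(raw)
print(cleaned)
```

Running the function directly on a captured model reply is one way to confirm that the pattern itself works, separately from whether the modified method is the one actually being executed.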

No response

@highkay highkay added the enhancement New feature or request label Feb 19, 2025
@awwaawwa
Collaborator

#637 did this for ollama; take a look at it for reference.

@awwaawwa
Collaborator

Is groq perhaps a separate class in the code?

@highkay
Contributor Author

highkay commented Feb 20, 2025

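#637 did this for ollama; take a look at it for reference.

His core code is the same as mine, but the condition is different: he restricts the filter to specific models. In fact, distilled models also output think tags. Nobody would use a full-size reasoning model to translate articles anyway: it is too expensive and the improvement is limited (my impression is that the context window extracted per request is too short, so the ceiling is low). The free distilled models are a very good fit. I also compared the results: they are clearly much better than glm4-flash (roughly equivalent to chatglm4-9b), and much faster.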

@highkay
Contributor Author

highkay commented Feb 20, 2025

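Is groq perhaps a separate class in the code?

groq inherits from OpenAITranslator and does not override do_translate, so I should modify do_translate in the parent class OpenAITranslator, right? My unittest is also built on the Groq translator, and it runs fine: the output has no think tags.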

import unittest
from pdf2zh.translator import GroqTranslator
from pdf2zh import cache

class TestGroqTranslator(unittest.TestCase):
    def setUp(self):
        self.test_db = cache.init_test_db()
        # Mock environment variables and config 
        self.test_env = {
            "GROQ_API_KEY": "xxxxxx",
            "GROQ_MODEL": "deepseek-r1-distill-qwen-32b"
        }
        
    def tearDown(self):
        cache.clean_test_db(self.test_db)
        
    def test_do_translate_success(self):

        # Create translator instance
        translator = GroqTranslator(
            lang_in="en",
            lang_out="zh",
            model=None, 
            envs=self.test_env
        )
        
        text = """Get personalized book picks and up-to-date news about this author."""

        # Test translation
        result = translator.do_translate(text)

        # The reasoning block should have been filtered out
        self.assertNotIn("<think>", result)
        print(result)


if __name__ == "__main__":
    unittest.main()
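The inheritance reasoning in this comment can be checked without any network call. The sketch below uses two hypothetical stand-in classes (`ParentTranslator` and `ChildTranslator`, which are not the real pdf2zh classes) and a canned reply in place of the API request. It only illustrates that a subclass which does not override do_translate picks up a filter placed in the parent's method.

```python
import re


class ParentTranslator:
    """Hypothetical stand-in for a parent class such as OpenAITranslator."""

    def _call_model(self, text: str) -> str:
        # Stand-in for the real API call: returns a canned reasoning-style reply.
        return f"<think>thinking about: {text}</think>translated({text})"

    def do_translate(self, text: str) -> str:
        content = self._call_model(text)
        # The filter lives in the parent's method.
        return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()


class ChildTranslator(ParentTranslator):
    """Hypothetical stand-in for a subclass that does not override do_translate."""


# Attribute lookup falls through to the parent, so both names refer to one function.
assert ChildTranslator.do_translate is ParentTranslator.do_translate

result = ChildTranslator().do_translate("hello")
print(result)
```

If the same check passes against the real classes but the application output still contains the tags, the difference presumably lies outside the method itself, for example in which copy of the code is being run.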

@awwaawwa
Collaborator

Please open a draft PR so that everyone can see your code.

@awwaawwa
Collaborator

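Also, re.sub(r"<think>.*?</think>" will wipe out matching content that is not at the start of the response as well. For the regular expression, refer to the regex in the other PR.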
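To illustrate the concern, the sketch below compares the unanchored pattern with one anchored to the start of the response. The anchored pattern is an assumption written for this example and is not necessarily the regex used in the other PR; the sample reply is made up.

```python
import re

# Made-up reply: a leading reasoning block, plus a literal <think> pair in the answer.
reply = "<think>plan the translation</think>The tag <think>x</think> marks reasoning."

# Unanchored: removes every block, including the one that belongs to the answer.
unanchored = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)

# Anchored (assumed pattern): without re.MULTILINE, ^ matches only at the very
# start of the string, so only a leading block is removed.
anchored = re.sub(r"^\s*<think>.*?</think>\s*", "", reply, count=1, flags=re.DOTALL)

print(repr(unanchored))
print(repr(anchored))
```

The anchored form keeps any later `<think>` text that is part of the translated document itself, which matters when the source PDF discusses such tags.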

@highkay
Contributor Author

highkay commented Feb 20, 2025

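Also, re.sub(r"<think>.*?</think>" will wipe out matching content that is not at the start of the response as well. For the regular expression, refer to the regex in the other PR.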

#653

@awwaawwa awwaawwa linked a pull request Feb 20, 2025 that will close this issue