+
English | 中文
+
+ 所有的代码和文档完全由 OpenAI 的 GPT-4 模型生成 +
+ +## 介绍 + +OpenAI 翻译器是一个使用 AI 技术将英文 PDF 书籍翻译成中文的工具。这个工具使用了大型语言模型 (LLMs),如 ChatGLM 和 OpenAI 的 GPT-3 以及 GPT-3.5 Turbo 来进行翻译。它是用 Python 构建的,并且具有灵活、模块化和面向对象的设计。 + +## 为什么做这个项目 + +在现今的环境中,缺乏非商业而且有效的 PDF 翻译工具。很多用户有包含敏感数据的 PDF 文件,他们更倾向于不将其上传到公共商业服务网站,以保护隐私。这个项目就是为了解决这个问题,为需要翻译他们的 PDF 文件同时又要保护数据隐私的用户提供解决方案。 + +## 示例结果 + +OpenAI 翻译器目前还处于早期开发阶段,我正在积极地添加更多功能和改进其性能。我们非常欢迎任何反馈或贡献! + + + ++ "老人与海" +
+ +## 特性 + +- [X] 使用大型语言模型 (LLMs) 将英文 PDF 书籍翻译成中文。 +- [X] 支持 ChatGLM 和 OpenAI 模型。 +- [X] 通过 YAML 文件或命令行参数灵活配置。 +- [X] 对健壮的翻译操作进行超时和错误处理。 +- [X] 模块化和面向对象的设计,易于定制和扩展。 +- [x] 添加对其他语言和翻译方向的支持。 +- [ ] 实现图形用户界面 (GUI) 以便更易于使用。 +- [ ] 创建一个网络服务或 API,以便在网络应用中使用。 +- [ ] 添加对多个 PDF 文件的批处理支持。 +- [ ] 添加对保留源 PDF 的原始布局和格式的支持。 +- [ ] 通过使用自定义训练的翻译模型来提高翻译质量。 + + +## 开始使用 + +### 环境准备 + +1.克隆仓库 `git clone git@github.com:DjangoPeng/openai-translator.git`。 + +2.OpenAI-翻译器 需要 Python 3.10 或更高版本。使用 `pip install -r requirements.txt` 安装依赖项。 + +3.设置您的 OpenAI API 密钥(`$OPENAI_API_KEY`)。您可以将其添加到环境变量中,或者在 config.yaml 文件中指定。 + +### 使用示例 + +您可以通过指定配置文件或提供命令行参数来使用 OpenAI-Translator 工具。 + +#### 使用配置文件 + +根据您的设置调整 `config.yaml` 文件: + +```yaml +model_name: "gpt-3.5-turbo" +input_file: "tests/test.pdf" +output_file_format: "markdown" +source_language: "English" +target_language: "Chinese" +``` + +然后命令行直接运行: + +```bash +python ai_translator/main.py +``` + + + +#### 使用命令行参数 + +您也可以直接在命令行上指定设置。这是使用 OpenAI 模型的例子: + +```bash +# 将您的 api_key 设置为环境变量 +export OPENAI_API_KEY="sk-xxx" +python ai_translator/main.py --model_name "gpt-3.5-turbo" --input_file "your_input.pdf" --output_file_format "markdown" --source_language "English" --target_language "Chinese" +``` + +## 许可证 + +该项目采用 GPL-3.0 许可证。有关详细信息,请查看 [LICENSE](LICENSE) 文件。 + + + + diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/README.md b/langchain/jupyter/homework/panxizhi/openai-translator/README.md new file mode 100644 index 00000000..db73ec4d --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/README.md @@ -0,0 +1,90 @@ +# OpenAI-Translator + +
+
English | 中文
+
+ All the code and documentation are entirely generated by OpenAI's GPT-4 Model +
+ + +## Introduction + +OpenAI Translator is an AI-powered translation tool designed to translate English PDF books to Chinese. The tool leverages large language models (LLMs) like ChatGLM and OpenAI's GPT-3 and GPT-3.5 Turbo for translation. It's built in Python and has a flexible, modular, and object-oriented design. + +## Why this project + +In the current landscape, there's a lack of non-commercial yet efficient PDF translation tools. Many users have PDF documents with sensitive data that they prefer not to upload to public commercial service websites due to privacy concerns. This project was developed to address this gap, providing a solution for users who need to translate their PDFs while maintaining data privacy. + +### Sample Results + +The OpenAI Translator is still in its early stages of development, and I'm actively working on adding more features and improving its performance. We appreciate any feedback or contributions! + + + ++ "The Old Man and the Sea" +
+
+## Features
+
+- [X] Translation of English PDF books to Chinese using LLMs.
+- [X] Support for both [ChatGLM](https://github.com/THUDM/ChatGLM-6B) and [OpenAI](https://platform.openai.com/docs/models) models.
+- [X] Flexible configuration through a YAML file or command-line arguments.
+- [X] Timeouts and error handling for robust translation operations.
+- [X] Modular and object-oriented design for easy customization and extension.
+- [x] Add support for other languages and translation directions.
+- [ ] Implement a graphical user interface (GUI) for easier use.
+- [ ] Create a web service or API to enable usage in web applications.
+- [ ] Add support for batch processing of multiple PDF files.
+- [ ] Add support for preserving the original layout and formatting of the source PDF.
+- [ ] Improve translation quality by using custom-trained translation models.
+
+## Getting Started
+
+### Environment Setup
+
+1. Clone the repository: `git clone git@github.com:DjangoPeng/openai-translator.git`.
+
+2. `OpenAI-Translator` requires Python 3.10 or later. Install the dependencies with `pip install -r requirements.txt`.
+
+3. Set up your OpenAI API key (`$OPENAI_API_KEY`). You can either add it to your environment variables or specify it in the `config.yaml` file.
+
+### Usage
+
+You can use OpenAI-Translator either by specifying a configuration file or by providing command-line arguments.
+
+#### Using a configuration file:
+
+Adapt the `config.yaml` file to your settings:
+
+```yaml
+model_name: "gpt-3.5-turbo"
+input_file: "tests/test.pdf"
+output_file_format: "markdown"
+source_language: "English"
+target_language: "Chinese"
+```
+
+Then run the tool:
+
+```bash
+python ai_translator/main.py
+```
+
+#### Using command-line arguments:
+
+You can also specify the settings directly on the command line.
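When both a configuration file and command-line flags are supplied, the flags take precedence over the matching `config.yaml` entries. A minimal, self-contained sketch of that override rule (the dict below stands in for the parsed YAML file; the flag names mirror the options above):

```python
import argparse

# Stand-in for the parsed contents of config.yaml (assumed defaults).
config = {
    "model_name": "gpt-3.5-turbo",
    "input_file": "tests/test.pdf",
    "output_file_format": "markdown",
}

parser = argparse.ArgumentParser()
for key in config:
    parser.add_argument(f"--{key}", type=str)

# Simulate passing `--output_file_format pdf` on the command line.
args = parser.parse_args(["--output_file_format", "pdf"])

# Flags that were actually provided override the file's defaults;
# unset flags (None) leave the YAML values untouched.
config.update({k: v for k, v in vars(args).items() if v is not None})
print(config["output_file_format"])  # pdf
print(config["model_name"])          # gpt-3.5-turbo
```

This is the same merge rule the tool's `TranslationConfig` applies internally.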
Here's an example of how to use the OpenAI model: + +```bash +# Set your api_key as an env variable +export OPENAI_API_KEY="sk-xxx" +python ai_translator/main.py --model_name "gpt-3.5-turbo" --input_file "your_input.pdf" --output_file_format "markdown" --source_language "English" --target_language "Chinese" +``` + +## License + +This project is licensed under the GPL-3.0 License. See the [LICENSE](LICENSE) file for details. \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/__init__.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/__init__.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/__init__.py new file mode 100644 index 00000000..5b688799 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/__init__.py @@ -0,0 +1,3 @@ +from .book import Book +from .page import Page +from .content import ContentType, Content, TableContent \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/book.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/book.py new file mode 100644 index 00000000..b079357b --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/book.py @@ -0,0 +1,9 @@ +from .page import Page + +class Book: + def __init__(self, pdf_file_path): + self.pdf_file_path = pdf_file_path + self.pages = [] + + def add_page(self, page: Page): + self.pages.append(page) \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/content.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/content.py new file mode 100644 index 00000000..901c2a07 --- /dev/null +++ 
b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/content.py @@ -0,0 +1,85 @@ +import pandas as pd + +from enum import Enum, auto +from PIL import Image as PILImage +from utils import LOG +from io import StringIO + +class ContentType(Enum): + TEXT = auto() + TABLE = auto() + IMAGE = auto() + +class Content: + def __init__(self, content_type, original, translation=None): + self.content_type = content_type + self.original = original + self.translation = translation + self.status = False + + def set_translation(self, translation, status): + if not self.check_translation_type(translation): + raise ValueError(f"Invalid translation type. Expected {self.content_type}, but got {type(translation)}") + self.translation = translation + self.status = status + + def check_translation_type(self, translation): + if self.content_type == ContentType.TEXT and isinstance(translation, str): + return True + elif self.content_type == ContentType.TABLE and isinstance(translation, list): + return True + elif self.content_type == ContentType.IMAGE and isinstance(translation, PILImage.Image): + return True + return False + + def __str__(self): + return self.original + + +class TableContent(Content): + def __init__(self, data, translation=None): + df = pd.DataFrame(data) + + # Verify if the number of rows and columns in the data and DataFrame object match + if len(data) != len(df) or len(data[0]) != len(df.columns): + raise ValueError("The number of rows and columns in the extracted table data and DataFrame object do not match.") + + super().__init__(ContentType.TABLE, df) + + def set_translation(self, translation, status): + try: + if not isinstance(translation, str): + raise ValueError(f"Invalid translation type. 
Expected str, but got {type(translation)}") + + LOG.debug(f"[translation]\n{translation}") + # Extract column names from the first set of brackets + header = translation.split(']')[0][1:].split(', ') + # Extract data rows from the remaining brackets + data_rows = translation.split('] ')[1:] + # Replace Chinese punctuation and split each row into a list of values + data_rows = [row[1:-1].split(', ') for row in data_rows] + # Create a DataFrame using the extracted header and data + translated_df = pd.DataFrame(data_rows, columns=header) + LOG.debug(f"[translated_df]\n{translated_df}") + self.translation = translated_df + self.status = status + except Exception as e: + LOG.error(f"An error occurred during table translation: {e}") + self.translation = None + self.status = False + + def __str__(self): + return self.original.to_string(header=False, index=False) + + def iter_items(self, translated=False): + target_df = self.translation if translated else self.original + for row_idx, row in target_df.iterrows(): + for col_idx, item in enumerate(row): + yield (row_idx, col_idx, item) + + def update_item(self, row_idx, col_idx, new_value, translated=False): + target_df = self.translation if translated else self.original + target_df.at[row_idx, col_idx] = new_value + + def get_original_as_str(self): + return self.original.to_string(header=False, index=False) \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/page.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/page.py new file mode 100644 index 00000000..df12e772 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/book/page.py @@ -0,0 +1,8 @@ +from .content import Content + +class Page: + def __init__(self): + self.contents = [] + + def add_content(self, content: Content): + self.contents.append(content) diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/flask_server.py 
b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/flask_server.py new file mode 100644 index 00000000..7b5bed03 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/flask_server.py @@ -0,0 +1,71 @@ +import sys +import os + +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from flask import Flask, request, send_file, jsonify +from translator import PDFTranslator, TranslationConfig +from utils import ArgumentParser, LOG + +app = Flask(__name__) + +TEMP_FILE_DIR = "flask_temps/" + +@app.route('/translation', methods=['POST']) +def translation(): + try: + input_file = request.files['input_file'] + source_language = request.form.get('source_language', 'English') + target_language = request.form.get('target_language', 'Chinese') + + LOG.debug(f"[input_file]\n{input_file}") + LOG.debug(f"[input_file.filename]\n{input_file.filename}") + + if input_file and input_file.filename: + # # 创建临时文件 + input_file_path = TEMP_FILE_DIR+input_file.filename + LOG.debug(f"[input_file_path]\n{input_file_path}") + + input_file.save(input_file_path) + + # 调用翻译函数 + output_file_path = Translator.translate_pdf( + input_file=input_file_path, + source_language=source_language, + target_language=target_language) + + # 移除临时文件 + # os.remove(input_file_path) + + # 构造完整的文件路径 + output_file_path = os.getcwd() + "/" + output_file_path + LOG.debug(output_file_path) + + # 返回翻译后的文件 + return send_file(output_file_path, as_attachment=True) + except Exception as e: + response = { + 'status': 'error', + 'message': str(e) + } + return jsonify(response), 400 + + +def initialize_translator(): + # 解析命令行 + argument_parser = ArgumentParser() + args = argument_parser.parse_arguments() + + # 初始化配置单例 + config = TranslationConfig() + config.initialize(args) + # 实例化 PDFTranslator 类,并调用 translate_pdf() 方法 + global Translator + Translator = PDFTranslator(config.model_name) + + +if __name__ == "__main__": + # 初始化 translator + initialize_translator() + # 启动 
Flask Web Server
+    app.run(host="0.0.0.0", port=5000, debug=True)
\ No newline at end of file
diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/gradio_server.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/gradio_server.py
new file mode 100644
index 00000000..67d4e846
--- /dev/null
+++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/gradio_server.py
@@ -0,0 +1,80 @@
+import sys
+import os
+import gradio as gr
+
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+from utils import ArgumentParser, LOG
+from translator import PDFTranslator, TranslationConfig
+
+styleMap = {
+    "小说": "Novel",
+    "新闻稿": "News",
+    "李白风格": "LiBai-Style",
+}
+
+last_model_name = ""
+translators = {}
+
+def translation(input_file, style_name, file_format, model_name, source_language, target_language):
+    if input_file is None or model_name is None or style_name is None or file_format is None or source_language is None or target_language is None:
+        LOG.debug("缺少必要的翻译参数")
+        return None
+
+    LOG.debug(f"[翻译任务]\n风格:{style_name}\n输出格式:{file_format}\n源文件: {input_file.name}\n源语言: {source_language}\n目标语言: {target_language}")
+
+    # 按模型名缓存 PDFTranslator 实例,避免重复初始化
+    global Translator, last_model_name, translators
+    if last_model_name != model_name:
+        last_model_name = model_name
+        if translators.get(model_name) is None:
+            translators[model_name] = PDFTranslator(model_name)
+        Translator = translators[model_name]
+    else:
+        Translator = translators[model_name]
+
+    style_id = styleMap[style_name]
+    output_file_path = Translator.translate_pdf(
+        input_file.name, translate_style=style_id, output_file_format=file_format, source_language=source_language, target_language=target_language)
+
+    return output_file_path
+
+def launch_gradio():
+
+    iface = gr.Interface(
+        fn=translation,
+        title="OpenAI-Translator v2.0(PDF 电子书翻译工具)",
+        inputs=[
+            gr.File(label="上传PDF文件"),
+            gr.Dropdown(label="翻译风格", choices=["小说", "新闻稿","李白风格"], value="小说"),
+            gr.Dropdown(label="输出格式",
choices=["PDF","Markdown"], value="PDF"), + gr.Dropdown(label="模型", choices=["gpt-3.5-turbo", "chatglm6b"], value="gpt-3.5-turbo"), + gr.Textbox(label="源语言(默认:英文)", placeholder="English", value="English"), + gr.Textbox(label="目标语言(默认:中文)", placeholder="Chinese", value="Chinese") + ], + outputs=[ + gr.File(label="下载翻译文件") + ], + allow_flagging="never", + live=True + ) + + iface.launch(server_name="0.0.0.0") + +def initialize_translator(): + # 解析命令行 + argument_parser = ArgumentParser() + args = argument_parser.parse_arguments() + + # 初始化配置单例 + config = TranslationConfig() + config.initialize(args) + # 实例化 PDFTranslator 类,并调用 translate_pdf() 方法 + global Translator + Translator = PDFTranslator(config.model_name) + + +if __name__ == "__main__": + # 初始化 translator + initialize_translator() + # 启动 Gradio 服务 + launch_gradio() diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/main.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/main.py new file mode 100644 index 00000000..33ae7282 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/main.py @@ -0,0 +1,20 @@ +import sys +import os + +sys.path.append(os.path.dirname(os.path.abspath(__file__))) + +from utils import ArgumentParser, LOG +from translator import PDFTranslator, TranslationConfig + +if __name__ == "__main__": + # 解析命令行 + argument_parser = ArgumentParser() + args = argument_parser.parse_arguments() + + # 初始化配置单例 + config = TranslationConfig() + config.initialize(args) + + # 实例化 PDFTranslator 类,并调用 translate_pdf() 方法 + translator = PDFTranslator(config.model_name) + translator.translate_pdf(config.input_file, config.output_file_format, pages=None) diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/__init__.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/__init__.py new file mode 100644 index 00000000..0e3fdcca --- /dev/null +++ 
b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/__init__.py @@ -0,0 +1,2 @@ +from .pdf_translator import PDFTranslator +from .translation_config import TranslationConfig \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/exceptions.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/exceptions.py new file mode 100644 index 00000000..4f4c23c1 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/exceptions.py @@ -0,0 +1,5 @@ +class PageOutOfRangeException(Exception): + def __init__(self, book_pages, requested_pages): + self.book_pages = book_pages + self.requested_pages = requested_pages + super().__init__(f"Page out of range: Book has {book_pages} pages, but {requested_pages} pages were requested.") diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_parser.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_parser.py new file mode 100644 index 00000000..6f2f9bc3 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_parser.py @@ -0,0 +1,58 @@ +import pdfplumber +from typing import Optional +from book import Book, Page, Content, ContentType, TableContent +from translator.exceptions import PageOutOfRangeException +from utils import LOG + + +class PDFParser: + def __init__(self): + pass + + def parse_pdf(self, pdf_file_path: str, pages: Optional[int] = None) -> Book: + book = Book(pdf_file_path) + + with pdfplumber.open(pdf_file_path) as pdf: + if pages is not None and pages > len(pdf.pages): + raise PageOutOfRangeException(len(pdf.pages), pages) + + if pages is None: + pages_to_parse = pdf.pages + else: + pages_to_parse = pdf.pages[:pages] + + for pdf_page in pages_to_parse: + page = Page() + + # Store the original text content + raw_text = pdf_page.extract_text() 
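The parser pairs `extract_text()` with `extract_tables()`; because pdfplumber also reports table cells inside the page text, the loop that follows strips each cell string out of the raw text so table content is not translated twice. That removal step can be sketched in isolation (with stand-in strings, not real pdfplumber output):

```python
# Stand-ins for pdfplumber's outputs: extract_text() gives the page text,
# extract_tables() gives each table as a list of rows of cell strings.
raw_text = "Intro paragraph.\nname age\nAlice 30\nClosing paragraph."
tables = [[["name", "age"], ["Alice", "30"]]]

# Remove each cell's content from the running text (first match only),
# mirroring the loop in PDFParser.parse_pdf.
for table_data in tables:
    for row in table_data:
        for cell in row:
            raw_text = raw_text.replace(cell, "", 1)

# Drop the now-empty lines and stray whitespace, as the parser does next.
cleaned = "\n".join(line.strip() for line in raw_text.splitlines() if line.strip())
print(cleaned)  # Intro paragraph.\nClosing paragraph.
```

Note that `str.replace(cell, "", 1)` removes the first occurrence only, so a cell value that also appears earlier in the body text would be stripped from the wrong place; the real parser inherits that caveat.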
+ tables = pdf_page.extract_tables() + + # Remove each cell's content from the original text + for table_data in tables: + for row in table_data: + for cell in row: + raw_text = raw_text.replace(cell, "", 1) + + # Handling text + if raw_text: + # Remove empty lines and leading/trailing whitespaces + raw_text_lines = raw_text.splitlines() + cleaned_raw_text_lines = [line.strip() for line in raw_text_lines if line.strip()] + cleaned_raw_text = "\n".join(cleaned_raw_text_lines) + + text_content = Content(content_type=ContentType.TEXT, original=cleaned_raw_text) + page.add_content(text_content) + LOG.debug(f"[raw_text]\n {cleaned_raw_text}") + + + + # Handling tables + if tables: + table = TableContent(tables) + page.add_content(table) + LOG.debug(f"[table]\n{table}") + + book.add_page(page) + + return book diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_translator.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_translator.py new file mode 100644 index 00000000..a4fe9f12 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/pdf_translator.py @@ -0,0 +1,30 @@ +from typing import Optional +from translator.pdf_parser import PDFParser +from translator.writer import Writer +from translator.translation_chain import TranslationChain +from utils import LOG + +class PDFTranslator: + def __init__(self, model_name: str): + self.translate_chain = TranslationChain(model_name) + self.pdf_parser = PDFParser() + self.writer = Writer() + + def translate_pdf(self, + input_file: str, + translate_style: str = None, + output_file_format: str = 'markdown', + source_language: str = "English", + target_language: str = 'Chinese', + pages: Optional[int] = None): + + self.book = self.pdf_parser.parse_pdf(input_file, pages) + + for page_idx, page in enumerate(self.book.pages): + for content_idx, content in enumerate(page.contents): + # Translate content.original + 
translation, status = self.translate_chain.run(content, source_language, target_language, translation_style=translate_style) + # Update the content in self.book.pages directly + self.book.pages[page_idx].contents[content_idx].set_translation(translation, status) + + return self.writer.save_translated_book(self.book, output_file_format) diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_chain.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_chain.py new file mode 100644 index 00000000..f3886ec4 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_chain.py @@ -0,0 +1,62 @@ +from langchain.chat_models import ChatOpenAI +from langchain.llms import ChatGLM +from langchain.chains import LLMChain + +from langchain.prompts.chat import ( + ChatPromptTemplate, + SystemMessagePromptTemplate, + HumanMessagePromptTemplate, +) + +from utils import LOG + +def createLLM(model_name: str = "gpt-3.5-turbo", verbose: bool = True): + if model_name == 'chatglm6b' : + chat = ChatGLM(endpoint_url='http://127.0.0.1:8000', + max_token=80000, + history=[], + top_p=0.9, + model_kwargs={"sample_model_args": False }, verbose=verbose) + else: + chat = ChatOpenAI(model_name=model_name, temperature=0, verbose=verbose) + + return chat + +class TranslationChain: + def __init__(self, model_name: str = "gpt-3.5-turbo", verbose: bool = True): + + # 翻译任务指令始终由 System 角色承担 + template = ( + """You are a translation expert, proficient in various languages. 
You're good at the language style of {style_name}.\n
+            Translate {source_language} to {target_language}."""
+        )
+        system_message_prompt = SystemMessagePromptTemplate.from_template(template)
+
+        # 待翻译文本由 Human 角色输入
+        human_template = "{text}"
+        human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
+
+        # 使用 System 和 Human 角色的提示模板构造 ChatPromptTemplate
+        chat_prompt_template = ChatPromptTemplate.from_messages(
+            [system_message_prompt, human_message_prompt]
+        )
+
+        # 创建 LLM(createLLM 将 OpenAI 模型的 temperature 设为 0,以保证翻译结果的稳定性)
+        chat = createLLM(model_name, verbose)
+
+        self.chain = LLMChain(llm=chat, prompt=chat_prompt_template, verbose=verbose)
+
+    def run(self, text: str, source_language: str, target_language: str, translation_style: str = "CNN Writer") -> (str, bool):
+        result = ""
+        try:
+            result = self.chain.run({
+                "text": text,
+                "style_name": translation_style,
+                "source_language": source_language,
+                "target_language": target_language,
+            })
+        except Exception as e:
+            LOG.error(f"An error occurred during translation: {e}")
+            return result, False
+
+        return result, True
\ No newline at end of file
diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_config.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_config.py
new file mode 100644
index 00000000..783823ae
--- /dev/null
+++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/translation_config.py
@@ -0,0 +1,29 @@
+import yaml
+
+class TranslationConfig:
+    _instance = None
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super(TranslationConfig, cls).__new__(cls)
+            cls._instance._config = None
+        return cls._instance
+
+    def initialize(self, args):
+        with open(args.config_file, "r") as f:
+            config = yaml.safe_load(f)
+
+        # Use the argparse Namespace to update the configuration
+        overridden_values = {
+            key: value for key, value in vars(args).items() if key in config and
value is not None
+        }
+        config.update(overridden_values)
+
+        # Store the merged config dictionary
+        self._instance._config = config
+
+    def __getattr__(self, name):
+        # Try to get attribute from _config
+        if self._instance._config and name in self._instance._config:
+            return self._instance._config[name]
+        raise AttributeError(f"'TranslationConfig' object has no attribute '{name}'")
\ No newline at end of file
diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/writer.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/writer.py
new file mode 100644
index 00000000..90b51ed5
--- /dev/null
+++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/translator/writer.py
@@ -0,0 +1,114 @@
+import os
+from reportlab.lib import colors, pagesizes, units
+from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
+from reportlab.pdfbase import pdfmetrics
+from reportlab.pdfbase.ttfonts import TTFont
+from reportlab.platypus import (
+    SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
+)
+
+from book import Book, ContentType
+from utils import LOG
+
+class Writer:
+    def __init__(self):
+        pass
+
+    def save_translated_book(self, book: Book, output_file_format: str):
+        LOG.debug(output_file_format)
+
+        if output_file_format.lower() == "pdf":
+            output_file_path = self._save_translated_book_pdf(book)
+        elif output_file_format.lower() == "markdown":
+            output_file_path = self._save_translated_book_markdown(book)
+        else:
+            LOG.error(f"不支持的文件类型: {output_file_format}")
+            return ""
+
+        LOG.info(f"翻译完成,文件保存至: {output_file_path}")
+
+        return output_file_path
+
+    def _save_translated_book_pdf(self, book: Book, output_file_path: str = None):
+        output_file_path = book.pdf_file_path.replace('.pdf', '_translated.pdf')
+
+        LOG.info(f"开始导出: {output_file_path}")
+
+        # Register Chinese font
+        font_path = "../fonts/simsun.ttc"  # 请将此路径替换为您的字体文件路径
pdfmetrics.registerFont(TTFont("SimSun", font_path)) + + # Create a new ParagraphStyle with the SimSun font + simsun_style = ParagraphStyle('SimSun', fontName='SimSun', fontSize=12, leading=14) + + # Create a PDF document + doc = SimpleDocTemplate(output_file_path, pagesize=pagesizes.letter) + styles = getSampleStyleSheet() + story = [] + + # Iterate over the pages and contents + for page in book.pages: + for content in page.contents: + if content.status: + if content.content_type == ContentType.TEXT: + # Add translated text to the PDF + text = content.translation + para = Paragraph(text, simsun_style) + story.append(para) + + elif content.content_type == ContentType.TABLE: + # Add table to the PDF + table = content.translation + table_style = TableStyle([ + ('BACKGROUND', (0, 0), (-1, 0), colors.grey), + ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke), + ('ALIGN', (0, 0), (-1, -1), 'CENTER'), + ('FONTNAME', (0, 0), (-1, 0), 'SimSun'), # 更改表头字体为 "SimSun" + ('FONTSIZE', (0, 0), (-1, 0), 14), + ('BOTTOMPADDING', (0, 0), (-1, 0), 12), + ('BACKGROUND', (0, 1), (-1, -1), colors.beige), + ('FONTNAME', (0, 1), (-1, -1), 'SimSun'), # 更改表格中的字体为 "SimSun" + ('GRID', (0, 0), (-1, -1), 1, colors.black) + ]) + pdf_table = Table(table.values.tolist()) + pdf_table.setStyle(table_style) + story.append(pdf_table) + # Add a page break after each page except the last one + if page != book.pages[-1]: + story.append(PageBreak()) + + # Save the translated book as a new PDF file + doc.build(story) + return output_file_path + + + def _save_translated_book_markdown(self, book: Book, output_file_path: str = None): + output_file_path = book.pdf_file_path.replace('.pdf', f'_translated.md') + + LOG.info(f"开始导出: {output_file_path}") + with open(output_file_path, 'w', encoding='utf-8') as output_file: + # Iterate over the pages and contents + for page in book.pages: + for content in page.contents: + if content.status: + if content.content_type == ContentType.TEXT: + # Add translated text to the 
Markdown file + text = content.translation + output_file.write(text + '\n\n') + + elif content.content_type == ContentType.TABLE: + # Add table to the Markdown file + table = content.translation + header = '| ' + ' | '.join(str(column) for column in table.columns) + ' |' + '\n' + separator = '| ' + ' | '.join(['---'] * len(table.columns)) + ' |' + '\n' + # body = '\n'.join(['| ' + ' | '.join(row) + ' |' for row in table.values.tolist()]) + '\n\n' + body = '\n'.join(['| ' + ' | '.join(str(cell) for cell in row) + ' |' for row in table.values.tolist()]) + '\n\n' + output_file.write(header + separator + body) + + # Add a page break (horizontal rule) after each page except the last one + if page != book.pages[-1]: + output_file.write('---\n\n') + + return output_file_path \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/__init__.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/__init__.py new file mode 100644 index 00000000..09b16931 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/__init__.py @@ -0,0 +1,2 @@ +from .argument_parser import ArgumentParser +from .logger import LOG \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/argument_parser.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/argument_parser.py new file mode 100644 index 00000000..57684d86 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/argument_parser.py @@ -0,0 +1,15 @@ +import argparse + +class ArgumentParser: + def __init__(self): + self.parser = argparse.ArgumentParser(description='A translation tool that supports translations in any language pair.') + self.parser.add_argument('--config_file', type=str, default='config.yaml', help='Configuration file with model and API settings.') + self.parser.add_argument('--model_name', 
type=str, help='Name of the Large Language Model.')
+        self.parser.add_argument('--input_file', type=str, help='PDF file to translate.')
+        self.parser.add_argument('--output_file_format', type=str, help='The file format of the translated book. Currently supports PDF and Markdown.')
+        self.parser.add_argument('--source_language', type=str, help='The language of the original book to be translated.')
+        self.parser.add_argument('--target_language', type=str, help='The target language for translating the original book.')
+
+    def parse_arguments(self):
+        args = self.parser.parse_args()
+        return args
diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/logger.py b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/logger.py
new file mode 100644
index 00000000..a252b50e
--- /dev/null
+++ b/langchain/jupyter/homework/panxizhi/openai-translator/ai_translator/utils/logger.py
@@ -0,0 +1,32 @@
+from loguru import logger
+import os
+import sys
+
+LOG_FILE = "translation.log"
+ROTATION_TIME = "02:00"
+
+class Logger:
+    def __init__(self, name="translation", log_dir="logs", debug=False):
+        if not os.path.exists(log_dir):
+            os.makedirs(log_dir)
+        log_file_path = os.path.join(log_dir, LOG_FILE)
+
+        # Remove default loguru handler
+        logger.remove()
+
+        # Add console handler with a specific log level
+        level = "DEBUG" if debug else "INFO"
+        logger.add(sys.stdout, level=level)
+        # Add file handler with a specific log level and timed rotation
+        logger.add(log_file_path, rotation=ROTATION_TIME, level="DEBUG")
+        self.logger = logger
+
+LOG = Logger(debug=True).logger
+
+if __name__ == "__main__":
+    log = Logger().logger
+
+    log.debug("This is a debug message.")
+    log.info("This is an info message.")
+    log.warning("This is a warning message.")
+    log.error("This is an error message.")
diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/config.yaml b/langchain/jupyter/homework/panxizhi/openai-translator/config.yaml
new file mode
100644 index 00000000..3ff6f1ec --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/config.yaml @@ -0,0 +1,5 @@ +model_name: "gpt-3.5-turbo" +input_file: "tests/test.pdf" +output_file_format: "markdown" +source_language: "English" +target_language: "Chinese" \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/fonts/simsun.ttc b/langchain/jupyter/homework/panxizhi/openai-translator/fonts/simsun.ttc new file mode 100644 index 00000000..e64e92ed Binary files /dev/null and b/langchain/jupyter/homework/panxizhi/openai-translator/fonts/simsun.ttc differ diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_0.png b/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_0.png new file mode 100644 index 00000000..9773db46 Binary files /dev/null and b/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_0.png differ diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_1.png b/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_1.png new file mode 100644 index 00000000..6af881ac Binary files /dev/null and b/langchain/jupyter/homework/panxizhi/openai-translator/images/sample_image_1.png differ diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/flask_client.ipynb b/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/flask_client.ipynb new file mode 100644 index 00000000..b9049398 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/flask_client.ipynb @@ -0,0 +1,117 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "a4e32dd8-6716-463e-8120-91682dd65830", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: requests in /root/miniconda3/envs/langchain/lib/python3.10/site-packages (2.31.0)\n", + "Requirement already satisfied: 
charset-normalizer<4,>=2 in /root/miniconda3/envs/langchain/lib/python3.10/site-packages (from requests) (3.1.0)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/envs/langchain/lib/python3.10/site-packages (from requests) (2.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/envs/langchain/lib/python3.10/site-packages (from requests) (2023.5.7)\n", + "Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/envs/langchain/lib/python3.10/site-packages (from requests) (3.4)\n", + "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install requests" + ] + }, + { + "cell_type": "markdown", + "id": "eab5fa3d-3e27-4337-a891-85eda96ddf50", + "metadata": {}, + "source": [ + "# 使用 requests 库请求翻译服务 API" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "596c6c45-8b95-4beb-ae6f-b1c49aa4020c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Translation completed. 
Translated file saved as translated_output.md.\n" + ] + } + ], + "source": [ + "import requests\n", + "\n", + "# Flask服务器的地址\n", + "FLASK_SERVER_URL = 'http://localhost:5000'\n", + "\n", + "# 翻译服务接口\n", + "translation_url = f'{FLASK_SERVER_URL}/translation'\n", + "\n", + "# 要上传的文件路径\n", + "file_path = '../tests/test.pdf' # 修改为你的文件路径\n", + "\n", + "# 构建请求参数\n", + "params = {\n", + " 'source_language': 'English', # 修改为你的源语言\n", + " 'target_language': 'Chinese' # 修改为你的目标语言\n", + "}\n", + "\n", + "# 发送POST请求\n", + "with open(file_path, 'rb') as file:\n", + " files = {'input_file': file}\n", + " response = requests.post(translation_url, files=files, data=params)\n", + "\n", + "\n", + "# 翻译后文件\n", + "output_filename = \"translated_output.md\"\n", + "\n", + "# 处理响应\n", + "if response.status_code == 200:\n", + " # 保存翻译后的文件\n", + " with open(output_filename, 'wb') as output_file:\n", + " output_file.write(response.content)\n", + " print(f\"Translation completed. Translated file saved as {output_filename}.\")\n", + "else:\n", + " print(f\"Translation failed. 
Status code: {response.status_code}\")\n", + " print(response.text)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2da6286a-d92c-4eaf-abec-9f74cb7af599", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/translation_chain.ipynb b/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/translation_chain.ipynb new file mode 100644 index 00000000..c405ab58 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/jupyter/translation_chain.ipynb @@ -0,0 +1,611 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "69cb775b-eb44-4e95-83c2-c087947d7ae4", + "metadata": {}, + "source": [ + "# LangChain 核心模块 - Chat Model 和 Chat Prompt Template\n", + "\n", + "希望通过此示例,让大家深入理解 LangChain 的聊天模型。简而言之:\n", + "- `Chat Model` 不止是一个用于聊天对话的模型抽象,更重要的是提供了`多角色`提示能力(System,AI,Human,Function)。\n", + "- `Chat Prompt Template` 则为开发者提供了便捷维护`不同角色`的`提示模板`与`消息记录`的接口。\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2219b200-bb76-45b7-bb88-0525667c1b31", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "8a376a3d-ce1f-4aab-9a2a-6224fce2edfd", + "metadata": {}, + "source": [ + "## 温故:LangChain Chat Model 使用方法和流程\n", + "\n", + "在最终调用 `Chat Model` 时,一定是直接传入`LangChain Schema Messages(消息记录)` \n", + "\n", + "```python\n", + "from langchain.chat_models import ChatOpenAI\n", + "chat_model = 
ChatOpenAI(model_name=\"gpt-3.5-turbo\")\n", + "\n", + "from langchain.schema import (\n", + " AIMessage,\n", + " HumanMessage,\n", + " SystemMessage\n", + ")\n", + "\n", + "messages = [SystemMessage(content=\"You are a helpful assistant.\"),\n", + " HumanMessage(content=\"Who won the world series in 2020?\"),\n", + " AIMessage(content=\"The Los Angeles Dodgers won the World Series in 2020.\"), \n", + " HumanMessage(content=\"Where was it played?\")]\n", + "\n", + "print(messages)\n", + "\n", + "chat_model(messages)\n", + "```\n", + "\n", + "打印 messages 输出结果:\n", + "```\n", + "[\n", + " SystemMessage(content=\"You are a helpful assistant.\", additional_kwargs={}),\n", + " HumanMessage(\n", + " content=\"Who won the world series in 2020?\", additional_kwargs={}, example=False\n", + " ),\n", + " AIMessage(\n", + " content=\"The Los Angeles Dodgers won the World Series in 2020.\",\n", + " additional_kwargs={},\n", + " example=False,\n", + " ),\n", + " HumanMessage(content=\"Where was it played?\", additional_kwargs={}, example=False),\n", + "]\n", + "```\n", + "\n", + "调用 chat_model(messages) 返回结果:\n", + "\n", + "```\n", + "AIMessage(\n", + " content=\"The 2020 World Series was played at Globe Life Field in Arlington, Texas.\",\n", + " additional_kwargs={},\n", + " example=False,\n", + ")\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "44337cac-457c-4971-b382-498dec3e3c24", + "metadata": {}, + "source": [ + "## 使用 System 和 Human 角色的提示模板构造 ChatPromptTemplate\n", + "\n", + "使用 `ChatPromptTemplate.from_messages` 方法,类似使用和维护`messages`的方式,构造 `chat_prompt_template`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "0e4d76a2-1ccb-421a-9be2-1a4a6faa3b8b", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.schema import AIMessage, HumanMessage, SystemMessage\n", + "# 导入 Chat Model 即将使用的 Prompt Templates\n", + "from langchain.prompts.chat import (\n", + " ChatPromptTemplate,\n", + " SystemMessagePromptTemplate,\n", + " 
AIMessagePromptTemplate,\n", + " HumanMessagePromptTemplate,\n", + ")\n", + "\n", + "# 翻译任务指令始终由 System 角色承担\n", + "template = (\n", + " \"\"\"You are a translation expert, proficient in various languages. \\n\n", + " Translates English to Chinese.\"\"\"\n", + ")\n", + "system_message_prompt = SystemMessagePromptTemplate.from_template(template)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "596d74c4-0fa5-4308-9920-c8002a3ee0c7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are a translation expert, proficient in various languages. \\n\\n Translates English to Chinese.', template_format='f-string', validate_template=True) additional_kwargs={}\n" + ] + } + ], + "source": [ + "print(system_message_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d2a8fddd-31d7-493f-94a1-4c1ec4608de6", + "metadata": {}, + "outputs": [], + "source": [ + "# 待翻译文本由 Human 角色输入\n", + "human_template = \"{text}\"\n", + "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "8922d062-00a9-48dd-b1e2-ed2d2f61a9ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "prompt=PromptTemplate(input_variables=['text'], output_parser=None, partial_variables={}, template='{text}', template_format='f-string', validate_template=True) additional_kwargs={}\n" + ] + } + ], + "source": [ + "print(human_message_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "125cef74-88a0-4775-a513-06692592d801", + "metadata": {}, + "outputs": [], + "source": [ + "# 使用 System 和 Human 角色的提示模板构造 ChatPromptTemplate\n", + "chat_prompt_template = ChatPromptTemplate.from_messages(\n", + " [system_message_prompt, human_message_prompt]\n", + ")" + ] + }, + { + 
"cell_type": "code", + "execution_count": 6, + "id": "d16f01ad-b7c9-4a0a-af8f-698f55d4762e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "input_variables=['text'] output_parser=None partial_variables={} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='You are a translation expert, proficient in various languages. \\n\\n Translates English to Chinese.', template_format='f-string', validate_template=True), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['text'], output_parser=None, partial_variables={}, template='{text}', template_format='f-string', validate_template=True), additional_kwargs={})]\n" + ] + } + ], + "source": [ + "print(chat_prompt_template)" + ] + }, + { + "cell_type": "markdown", + "id": "406a2804-96f5-4470-9464-a96572e31149", + "metadata": {}, + "source": [ + "### 规范化 Python 复杂对象\n", + "\n", + "- 使用在线工具 [Python Formatter](https://codebeautify.org/python-formatter-beautifier) \n", + "- 规范化 `chat_prompt_template` 后再查看\n", + "- 注意:不要同时输入多个复杂对象\n", + "\n", + "```python\n", + "messages = [\n", + " SystemMessagePromptTemplate(\n", + " prompt=PromptTemplate(\n", + " input_variables=[],\n", + " output_parser=None,\n", + " partial_variables={},\n", + " template=\"You are a translation expert, proficient in various languages. 
\\n\\n Translates English to Chinese.\",\n", + " template_format=\"f-string\",\n", + " validate_template=True,\n", + " ),\n", + " additional_kwargs={},\n", + " ),\n", + " HumanMessagePromptTemplate(\n", + " prompt=PromptTemplate(\n", + " input_variables=[\"text\"],\n", + " output_parser=None,\n", + " partial_variables={},\n", + " template=\"{text}\",\n", + " template_format=\"f-string\",\n", + " validate_template=True,\n", + " ),\n", + " additional_kwargs={},\n", + " ),\n", + "]\n", + "\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b5cf9238-9497-46c3-961f-860e797fcece", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ChatPromptValue(messages=[SystemMessage(content='You are a translation expert, proficient in various languages. \\n\\n Translates English to Chinese.', additional_kwargs={}), HumanMessage(content='I love programming.', additional_kwargs={}, example=False)])" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 生成用于翻译的 Chat Prompt\n", + "chat_prompt_template.format_prompt(text=\"I love programming.\")" + ] + }, + { + "cell_type": "markdown", + "id": "d3a39ff5-2720-40db-80f4-913c8c622bfa", + "metadata": {}, + "source": [ + "## 使用 chat_prompt_template.to_messages 方法生成 Messages" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "a21508c5-4b13-4d0e-aa67-c56d554d2626", + "metadata": {}, + "outputs": [], + "source": [ + "# 生成聊天模型真正可用的消息记录 Messages\n", + "chat_prompt = chat_prompt_template.format_prompt(text=\"I love programming.\").to_messages()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a158a289-e738-45e8-b440-3f857a708e57", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[SystemMessage(content='You are a translation expert, proficient in various languages. 
\\n\\n Translates English to Chinese.', additional_kwargs={}),\n", + " HumanMessage(content='I love programming.', additional_kwargs={}, example=False)]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chat_prompt" + ] + }, + { + "cell_type": "markdown", + "id": "56e2fa43-ba3c-401f-bba0-6bb27050a363", + "metadata": {}, + "source": [ + "## 使用 Chat Model(GPT-3.5-turbo)实际执行翻译任务\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5d4e3bd9-fac3-4b67-9285-74e1793f9e74", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "# 为了翻译结果的稳定性,将 temperature 设置为 0\n", + "translation_model = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "c46d710c-9892-4ce2-8dcf-c3450dae8d83", + "metadata": {}, + "outputs": [], + "source": [ + "translation_result = translation_model(chat_prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "63d7cadf-515a-41cd-82d7-1498a3d6a35f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content='我喜欢编程。', additional_kwargs={}, example=False)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "translation_result" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d56c7a65-bc6c-4e19-8176-ddc457696d0b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "我喜欢编程。\n" + ] + } + ], + "source": [ + "# 查看翻译结果\n", + "print(translation_result.content)" + ] + }, + { + "cell_type": "markdown", + "id": "75f3e72a-90e0-40dc-afa8-0ff96c2b7d72", + "metadata": {}, + "source": [ + "## 使用 LLMChain 简化重复构造 ChatPrompt" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "4a57ee8c-dad1-429c-a791-2797a7b0c4ef", + "metadata": {}, + "outputs": [], + "source": [ + "from 
langchain.chains import LLMChain\n", + "\n", + "# 无需再每次都使用 to_messages 方法构造 Chat Prompt\n", + "translation_chain = LLMChain(llm=translation_model, prompt=chat_prompt_template)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "723cebfa-a3bd-434f-9992-9ec25833acdf", + "metadata": {}, + "outputs": [], + "source": [ + "# 等价于 translation_result.content (字符串类型)\n", + "chain_result = translation_chain.run({'text': \"I love programming.\"})" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "21e1d186-88de-4ca8-ad32-c233d99145c4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "我喜欢编程。\n" + ] + } + ], + "source": [ + "print(chain_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "faddd232-0308-45ce-b2a6-899f261abe57", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'我喜欢人工智能和大型语言模型。'" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "translation_chain.run({'text': \"I love AI and Large Language Model.\"})" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "4f3dcb51-7e3f-4787-84aa-60f45def661e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'[水果, 颜色, 价格(美元)] [苹果, 红色, 1.20] [香蕉, 黄色, 0.50] [橙子, 橙色, 0.80] [草莓, 红色, 2.50] [蓝莓, 蓝色, 3.00] [猕猴桃, 绿色, 1.00] [芒果, 橙色, 1.50] [葡萄, 紫色, 2.00]'" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "translation_chain.run({'text': \"[Fruit, Color, Price (USD)] [Apple, Red, 1.20] [Banana, Yellow, 0.50] [Orange, Orange, 0.80] [Strawberry, Red, 2.50] [Blueberry, Blue, 3.00] [Kiwi, Green, 1.00] [Mango, Orange, 1.50] [Grape, Purple, 2.00]\"})\n" + ] + }, + { + "cell_type": "markdown", + "id": "a47a1bfd-2beb-41eb-a1c1-c6443e4ec129", + "metadata": {}, + "source": [ + "## 扩展:支持多语言对翻译" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": 
"bb9b37f6-4bcb-450c-b3c6-e476a8f15bc7", + "metadata": {}, + "outputs": [], + "source": [ + "# System 增加 source_language 和 target_language\n", + "template = (\n", + " \"\"\"You are a translation expert, proficient in various languages. \\n\n", + " Translates {source_language} to {target_language}.\"\"\"\n", + ")\n", + "system_message_prompt = SystemMessagePromptTemplate.from_template(template)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2fccc32f-2b3f-4666-82be-607c452893b7", + "metadata": {}, + "outputs": [], + "source": [ + "# 待翻译文本由 Human 角色输入\n", + "human_template = \"{text}\"\n", + "human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "fd937e81-c54f-49d9-9422-c43670d8e5db", + "metadata": {}, + "outputs": [], + "source": [ + "# 使用 System 和 Human 角色的提示模板构造 ChatPromptTemplate\n", + "m_chat_prompt_template = ChatPromptTemplate.from_messages(\n", + " [system_message_prompt, human_message_prompt]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "ab237375-59ab-40ac-986b-7254989b3418", + "metadata": {}, + "outputs": [], + "source": [ + "m_translation_chain = LLMChain(llm=translation_model, prompt=m_chat_prompt_template) " + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "e728621c-008e-4c2b-bb57-483f1ee4c1e8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'I enjoy studying large language models, as it is easy, simple, and enjoyable.'" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "m_translation_chain.run({\n", + " \"source_language\": \"Chinese\",\n", + " \"target_language\": \"English\",\n", + " \"text\": \"我喜欢学习大语言模型,轻松简单又愉快\",\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "183455b3-9cf1-42b4-be91-9c699248266a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ 
+ "'私は大規模言語モデルの学習が好きです。それは簡単で楽しいです。'" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "m_translation_chain.run({\n", + " \"source_language\": \"Chinese\",\n", + " \"target_language\": \"Japanese\",\n", + " \"text\": \"我喜欢学习大语言模型,轻松简单又愉快\",\n", + "})" + ] + }, + { + "cell_type": "markdown", + "id": "d42bf232-d2ef-47d8-9029-97940d7f784a", + "metadata": {}, + "source": [ + "## Homework\n", + "- 尝试不同的 System Prompt 和 Chat Model,对比翻译效果。\n", + "- 根据翻译任务的使用场景,是否可以在初次传入 source_language 和 target_language 后不再更新?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38ee618a-bbaa-476f-80e6-8d22a81fddf6", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/langchain/jupyter/homework/panxizhi/openai-translator/requirements.txt b/langchain/jupyter/homework/panxizhi/openai-translator/requirements.txt new file mode 100644 index 00000000..a331ebea --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/openai-translator/requirements.txt @@ -0,0 +1,12 @@ +pdfplumber +simplejson +requests +PyYAML +pillow +reportlab +pandas +loguru +openai +langchain +gradio +flask \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/sales_chatbot/create_sales_vector_dbs.py b/langchain/jupyter/homework/panxizhi/sales_chatbot/create_sales_vector_dbs.py new file mode 100644 index 00000000..f3d38423 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/sales_chatbot/create_sales_vector_dbs.py @@ -0,0 +1,26 @@ +from langchain.embeddings.openai import 
OpenAIEmbeddings +from langchain.text_splitter import CharacterTextSplitter +from langchain.vectorstores import FAISS + +from langchain.text_splitter import CharacterTextSplitter + +def load_data_to_vectordb(file, db_name): + with open(file, encoding='utf8') as f: + text = f.read() + + text_splitter = CharacterTextSplitter( + separator = r'\d+\.\n', + chunk_size = 100, + chunk_overlap = 0, + length_function = len, + is_separator_regex = True, + ) + docs = text_splitter.create_documents([text]) + + db = FAISS.from_documents(docs, OpenAIEmbeddings()) + + db.save_local(db_name) + +if __name__ == "__main__": + load_data_to_vectordb("data/sales_eletronic_stuffs.txt", "sales_eletronic_stuffs") + load_data_to_vectordb("data/real_estate_sales_data.txt", "real_estate_sales_data") \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/sales_chatbot/data/real_estate_sales_data.txt b/langchain/jupyter/homework/panxizhi/sales_chatbot/data/real_estate_sales_data.txt new file mode 100644 index 00000000..453fb6a8 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/sales_chatbot/data/real_estate_sales_data.txt @@ -0,0 +1,279 @@ +1. +[客户问题] 这个小区交通便利吗? +[销售回答] 当然了,这个小区距离地铁站只有几分钟的步行距离,而且附近有多条公交线路,非常方便。 + +2. +[客户问题] 我担心楼下太吵。 +[销售回答] 这个小区特别注重居住体验,我们有良好的隔音设计,并且小区内部规划了绿化区域,可以有效降低噪音。 + +3. +[客户问题] 我看房价还在涨,这个投资回报怎么样? +[销售回答] 这个区域未来有大量的商业和基础设施建设,所以从长期来看,投资回报非常有保证。 + +4. +[客户问题] 有没有学校? +[销售回答] 附近有多所优质的学校,非常适合有孩子的家庭。 + +5. +[客户问题] 物业管理怎么样? +[销售回答] 我们的物业管理得到了业主一致好评,服务非常到位。 + +6. +[客户问题] 我想要南向的房子。 +[销售回答] 很好,我们确实有一些朝南的单位,它们的采光特别好。 + +7. +[客户问题] 这个小区安全吗? +[销售回答] 当然,我们24小时安保巡逻,还有先进的监控系统。 + +8. +[客户问题] 预计什么时候交房? +[销售回答] 根据目前的进度,我们预计将在明年底交房。 + +9. +[客户问题] 我不想要一楼的房子。 +[销售回答] 我理解您的顾虑,我们还有多个楼层的房源可以选择。 + +10. +[客户问题] 有优惠吗? +[销售回答] 当然,如果您现在下订,我们可以给您一些优惠。 + +11. +[客户问题] 你们是否提供按揭服务? +[销售回答] 是的,我们与多家银行合作,可以帮助您快速办理按揭。 + +12. +[客户问题] 税费怎么算? +[销售回答] 我们可以提供详细的税费咨询服务,确保您清楚所有费用。 + +13. +[客户问题] 附近有医院吗? +[销售回答] 是的,附近有多家大型医院,医疗资源非常丰富。 + +14. 
+[客户问题] 我担心小区会很拥挤。 +[销售回答] 这个小区总体规划非常合理,保证了每个单元之间有足够的空间。 + +15. +[客户问题] 这个小区有游泳池和健身房吗? +[销售回答] 当然,我们提供全方位的生活设施,包括游泳池和健身房。 + +16. +[客户问题] 我需要两个停车位,怎么办? +[销售回答] 我们的小区有充足的停车位,并且可以额外购买,确保您的停车需求得到满足。 + +17. +[客户问题] 小区里有公园吗? +[销售回答] 是的,我们有专门设计的绿化区和儿童游乐园,确保您和家人能享受到高质量的生活。 + +18. +[客户问题] 我看到附近有个建筑工地。 +[销售回答] 那个工地预计很快就会完工,而且我们的窗户有很好的隔音效果。 + +19. +[客户问题] 房子是精装还是毛坯? +[销售回答] 我们提供多种选择,包括精装和毛坯,以满足不同客户的需求。 + +20. +[客户问题] 这里有商业区吗? +[销售回答] 是的,小区附近有大型购物中心和各种餐厅,非常方便。 + +21. +[客户问题] 我需要大阳台。 +[销售回答] 我们有一些户型设计带有大阳台,您可以在那里种花、晒衣或者放一个小茶几,非常宜人。 + +22. +[客户问题] 我担心电梯不够用。 +[销售回答] 每栋楼都配有多部高速电梯,基本上不用等待。 + +23. +[客户问题] 垃圾处理怎么样? +[销售回答] 我们有专门的垃圾分类和回收系统,非常环保。 + +24. +[客户问题] 我担心交房延期。 +[销售回答] 我们是知名的房地产开发商,一直严格按照计划执行,您完全可以放心。 + +25. +[客户问题] 我看这里有很多空置房。 +[销售回答] 这些房源都是新推出的,我们预计在近期内会有更多的家庭搬入。 + +26. +[客户问题] 楼下有便利店吗? +[销售回答] 当然,小区内就有便利店,满足您日常生活的基本需求。 + +27. +[客户问题] 我需要分期付款。 +[销售回答] 我们有多种付款方案,包括分期付款,以减轻您的经济压力。 + +28. +[客户问题] 我家有宠物,可以养吗? +[销售回答] 当然,我们是一个宠物友好的社区,但需要遵守一些基本规则。 + +29. +[客户问题] 我想看下水电气等设施。 +[销售回答] 没问题,我们使用的都是品牌供应商,质量非常可靠。 + +30. +[客户问题] 你们会提供家具吗? +[销售回答] 我们的精装房会提供基础家具和家电,让您拎包入住。 + +31. +[客户问题] 都有哪些户型? +[销售回答] 我们有从一室到四室不等的多种户型,定能满足您不同的居住需求。 + +32. +[客户问题] 我听说这个区域交通常常堵塞。 +[销售回答] 我们小区的位置实际上是绕开了主要交通瓶颈区,所以相对来说会流畅很多。 + +33. +[客户问题] 我很关注环境保护,你们有做到吗? +[销售回答] 非常好的问题,我们的建筑采用环保材料,并且在小区规划中也有绿化区域和垃圾分类系统。 + +34. +[客户问题] 楼下会有商铺吗? +[销售回答] 是的,楼下将设有商铺,但都是服务性质的,不会影响您的居住体验。 + +35. +[客户问题] 社区活动丰富吗? +[销售回答] 我们有专门的社区活动中心,常常组织各种活动,非常丰富。 + +36. +[客户问题] 你们有提供家政服务吗? +[销售回答] 我们与多家家政服务公司有合作,可以方便您的日常生活。 + +37. +[客户问题] 我担心物业费会很高。 +[销售回答] 我们的物业费是根据市场和服务水平来设定的,绝对物有所值。 + +38. +[客户问题] 附近有地铁站吗? +[销售回答] 附近就有地铁站,而且有多条公交线路经过,出行非常方便。 + +39. +[客户问题] 这里网络信号怎么样? +[销售回答] 我们有完善的网络基础设施,信号覆盖非常好。 + +40. +[客户问题] 我想先租赁,观察一下再买。 +[销售回答] 这也是一个不错的选择,我们也提供租赁服务,您可以先体验一下。 + +41. +[客户问题] 有没有燃气管道? +[销售回答] 当然有,我们的所有单位都连接了燃气管道,非常安全和便利。 + +42. +[客户问题] 我是投资购房,租金回报如何? +[销售回答] 这个区域的租赁市场非常火爆,您投资购房后,租金回报率相当不错。 + +43. +[客户问题] 我喜欢安静,这里噪音大吗? +[销售回答] 我们特意进行了隔音设计,并且小区内部也有绿化带,整体非常安静。 + +44. +[客户问题] 我看这个小区还在建设中。 +[销售回答] 是的,目前正在建设中,但已经接近尾声,很快就会完工。 + +45. 
+[客户问题] 有哪些休闲设施? +[销售回答] 我们有游泳池、健身房、图书馆以及儿童游乐场等多种休闲设施。 + +46. +[客户问题] 我担心暖气不足。 +[销售回答] 我们使用的是中央供暖系统,温度可以保证。 + +47. +[客户问题] 需要交多少定金? +[销售回答] 定金金额取决于您选择的房型和付款方式,我们可以详细为您解释。 + +48. +[客户问题] 哪些银行可以贷款? +[销售回答] 我们与多家知名银行合作,贷款流程非常方便。 + +49. +[客户问题] 我可以改变房间的布局吗? +[销售回答] 对于毛坯房,您有更多的自由度来改变房间布局。 + +50. +[客户问题] 你们怎么解决停车问题? +[销售回答] 我们有地下车库和多个停车位,确保每个家庭都有地方停车。 + +51. +[客户问题] 这里安全吗? +[销售回答] 非常安全。我们有24小时的安保人员和先进的监控系统。 + +52. +[客户问题] 我需要比较大的储藏空间。 +[销售回答] 我们的设计包括了充足的储藏空间,以满足您的需求。 + +53. +[客户问题] 这里的电梯需要等很久吗? +[销售回答] 不会的,我们的电梯系统高效可靠,响应速度很快。 + +54. +[客户问题] 你们接受定制吗? +[销售回答] 当然,对于特定的户型,我们提供一定程度的定制服务。 + +55. +[客户问题] 我看周围没有学校。 +[销售回答] 其实附近就有几所知名的学校,并且我们也在考虑未来在社区内建立教育设施。 + +56. +[客户问题] 我有老人需要照顾,有便利吗? +[销售回答] 我们小区有无障碍通道和专门的老年活动中心,非常适合老年人。 + +57. +[客户问题] 这里会不会很吵? +[销售回答] 我们有良好的隔音设计和规划,内部环境非常宁静。 + +58. +[客户问题] 这个价位对我来说有点高。 +[销售回答] 我们有不同户型和付款方案,一定有适合您预算的。 + +59. +[客户问题] 物业管理怎么样? +[销售回答] 我们的物业管理得到业主一致好评,服务周到且高效。 + +60. +[客户问题] 如果我想转卖这个房子呢? +[销售回答] 我们的房子位于黄金地段,升值潜力很大,转卖起来不会有问题。 + +61. +[客户问题] 这个小区规模太大了。 +[销售回答] 实际上,大规模的社区能提供更多的设施和更完善的服务。 + +62. +[客户问题] 交通如何? +[销售回答] 交通非常便利,不仅靠近地铁站,还有多条公交线路。 + +63. +[客户问题] 附近有医院吗? +[销售回答] 有的,距离我们小区不远就有几家大型综合医院。 + +64. +[客户问题] 我担心窗户太小,采光不好。 +[销售回答] 我们所有的户型都经过精心设计,以确保最佳的采光和通风。 + +65. +[客户问题] 这个房子看起来太小了。 +[销售回答] 虽然面积可能不大,但我们的设计使得每一寸空间都得到了充分利用。 + +66. +[客户问题] 我看附近都是老房子。 +[销售回答] 这个区域正在快速发展,未来的升值潜力非常大。 + +67. +[客户问题] 有没有健身房? +[销售回答] 当然,我们的小区内有设备齐全的健身房。 + +68. +[客户问题] 我需要考虑一下。 +[销售回答] 当然,这是一个重要的决定。但我建议您不要等太久,毕竟好房不等人。 + +69. +[客户问题] 我担心楼下的商业活动会很吵。 +[销售回答] 我们在规划时就已经考虑到这一点,商业区和居住区有一定的距离和隔音设计。 + +70. +[客户问题] 我担心新房会有甲醛。 +[销售回答] 我们使用环保材料建造,并且所有的新房在交付前都会进行甲醛检测。 \ No newline at end of file diff --git a/langchain/jupyter/homework/panxizhi/sales_chatbot/data/sales_eletronic_stuffs.txt b/langchain/jupyter/homework/panxizhi/sales_chatbot/data/sales_eletronic_stuffs.txt new file mode 100644 index 00000000..3bdf8190 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/sales_chatbot/data/sales_eletronic_stuffs.txt @@ -0,0 +1,185 @@ + +1. +顾客:你好,我想购买一台新的冰箱。你能推荐一款吗? 
+销售员:当然可以!您需要多大的容量?还有其他特殊需求吗? +顾客:我需要一个容量适中的冰箱,最好带有冷冻功能。 +销售员:我推荐我们新款的ABC型冰箱,容量适中,配备了先进的冷冻技术,可以满足您的需求。 + +2. +顾客:我需要购买一台新的洗衣机,但我对不同型号和功能不太了解。 +销售员:没问题,我可以帮您解决疑惑。您对洗衣机的容量有任何要求吗? +顾客:我需要一台大容量的洗衣机,最好具备快速洗涤和节能功能。 +销售员:我们的XYZ型洗衣机是一款高容量的洗衣机,具备快速洗涤选项,并且符合节能标准。 + +3. +顾客:我想买一台新的电视,但我对最新的技术不太熟悉。有什么推荐吗? +销售员:当然!您对屏幕尺寸有任何要求吗?还有您对高清图像或智能功能有兴趣吗? +顾客:我需要一台大屏幕的电视,最好支持高清图像和智能功能。 +销售员:我们的EFG型电视是一款大屏幕电视,具备高清图像和智能功能,您会对其表现非常满意。 + +4. +顾客:我正在寻找一台新的空调,怎样才能选择合适的型号? +销售员:选择空调时,您需要考虑房间的大小和您对温度控制的需求。您对能效有任何要求吗? +顾客:我需要一台适合中等大小房间的空调,最好具备节能功能。 +销售员:我们的HIJ型空调是一款适合中等大小房间的空调,具备优异的节能性能,可以帮助您降低能源消耗。 + +5. +顾客:我想购买一台新的咖啡机,但我对不同类型和品牌不太了解。 +销售员:没问题,咖啡机有很多选择,我可以帮您找到适合您的。您更喜欢滴漏咖啡机还是胶囊咖啡机? +顾客:我更喜欢胶囊咖啡机,方便简单。 +销售员:我们的KLM型胶囊咖啡机是一款简便易用的咖啡机,配备了多种口味的咖啡胶囊,适合您的需求。 + +6. +顾客:我需要购买一个新的微波炉,我应该选择哪种类型? +销售员:微波炉有各种类型,您更注重烹饪功能还是外观设计? +顾客:我更关注烹饪功能,我希望能有多种预设菜单和解冻功能。 +销售员:我们的NOP型微波炉具备多种预设菜单和解冻功能,可以满足您的需求。 + +7. +顾客:我想购买一个新的音响系统,我应该选择有线还是无线? +销售员:选择音响系统时,您需要考虑您的音源设备和使用环境。您更喜欢方便携带的无线音响还是更稳定的有线连接? +顾客:我更喜欢无线音响,因为我可以在不同房间中移动。 +销售员:我们的QRS型无线音响系统具备稳定的无线连接和高质量音效,适合您的需求。 + +8. +顾客:我需要购买一台新的吸尘器,我应该选择立式还是手持式? +销售员:您打算用吸尘器清理哪些区域?如果需要清理大面积地板,立式吸尘器可能更方便。手持吸尘器适合清理狭小区域和车辆内部。 +顾客:我主要需要清理地板,我会选择立式吸尘器。 +销售员:我们的TUV型立式吸尘器具备强大的吸力和易于操作的设计,适合清理地板和地毯。 + +9. +顾客:我想购买一个新的电饭煲,有什么推荐吗? +销售员:电饭煲有各种型号和功能,您对容量有任何要求吗?还有您是否需要多种烹饪模式? +顾客:我需要一个中等容量的电饭煲,并且希望具备煮粥和蒸煮功能。 +销售员:我们的WXY型电饭煲容量适中,具备多种烹饪模式,包括煮粥和蒸煮功能,符合您的需求。 + +10. +顾客:我正在寻找一台新的笔记本电脑,有什么建议吗? +销售员:笔记本电脑有不同的品牌和配置,您对使用场景和性能有任何要求吗?例如,处理器速度、内存容量和存储空间等。 +顾客:我需要一台适合办公和娱乐的笔记本电脑,性能稳定且存储空间充足。 +销售员:我们的YZX型笔记本电脑是一款适合办公和娱乐的选择,具备稳定的性能和大容量存储空间,可以满足您的需求。 + + +11. +顾客:我需要购买一台新的烤箱,有什么特别的功能可以推荐吗? +销售员:烤箱有各种特色功能,您对烘烤、烤制还是烧烤有更多兴趣? +顾客:我对烘烤和烤制食物很感兴趣。 +销售员:我们的ABC型烤箱具备多种预设烘烤程序和温度控制,可以满足您的烘烤和烤制需求。 + +12. +顾客:我正在考虑购买一个新的电动牙刷,有什么品牌和型号可以推荐吗? +销售员:电动牙刷有多个品牌和型号,您对清洁效果、震动频率还是附加功能有任何要求? +顾客:我希望能有一个高效的清洁效果,还可以有按摩牙龈的功能。 +销售员:我们的DEF型电动牙刷具备高效的清洁效果和牙龈按摩功能,可以满足您的需求。 + +13. +顾客:我想购买一个新的电动剃须刀,什么型号适合我? +销售员:电动剃须刀有旋转式和网状式两种类型,您对剃须方式有任何偏好吗? +顾客:我更喜欢旋转式剃须刀。 +销售员:我们的GHI型旋转式电动剃须刀具备高速旋转刀头和舒适的剃须体验,适合您的需求。 + +14. +顾客:我需要购买一个新的搅拌机,有什么功能可以推荐吗? 
+销售员:搅拌机有多种功能,您对打碎冰块、制作奶昔还是搅拌食材有任何特殊需求? +顾客:我希望搅拌机可以打碎冰块和制作奶昔。 +销售员:我们的JKL型搅拌机具备强大的破冰功能和多速度调节,可以满足您的需求。 + +15. +顾客:我正在寻找一台新的投影仪,有什么特点可以推荐吗? +销售员:投影仪有不同的分辨率、亮度和连接选项,您对使用场景和画质有任何特殊要求吗? +顾客:我需要一个高分辨率和亮度适中的投影仪,可以用于家庭影院。 +销售员:我们的MNO型投影仪具备高清晰度和适中的亮度,适合家庭影院和日常使用。 + +16. +顾客:我想购买一台新的洗衣机,有什么类型可以推荐吗? +销售员:洗衣机有滚筒式和波轮式两种类型,您对洗涤方式和洗衣容量有任何特殊要求吗? +顾客:我需要一个大容量的洗衣机,并且希望能有多种洗涤程序。 +销售员:我们的PQR型滚筒式洗衣机具备大容量和多种洗涤程序,可以满足您的需求。 + +17. +顾客:我正在考虑购买一个新的冰箱,有什么品牌和型号可以推荐吗? +销售员:冰箱有不同的容量、冷藏方式和附加功能,您对冷藏空间和节能性能有任何特殊要求吗? +顾客:我需要一个中等容量的冰箱,并且希望具备节能功能。 +销售员:我们的STU型冰箱具备中等容量和节能性能,可以满足您的需求。 + +18. +顾客:我想购买一个新的空调,有什么特点可以推荐吗? +销售员:空调有不同的制冷能力、能效等级和附加功能,您对房间面积和能效要求有任何特殊要求吗? +顾客:我需要一个适合中等大小房间的空调,并且具备高能效。 +销售员:我们的VWX型空调具备适中的制冷能力和高能效等级,适合中等大小房间使用。 + +19. +顾客:我正在寻找一个新的电视,有什么特点可以推荐吗? +销售员:电视有不同的屏幕尺寸、分辨率和智能功能,您对画质和连接选项有任何特殊要求吗? +顾客:我需要一个大屏幕的电视,并且希望具备高清晰度和智能功能。 +销售员:我们的YZA型电视具备大屏幕、高分辨率和智能功能,可以满足您的需求。 + +20. +顾客:我想购买一个新的电动扫地机器人,有什么品牌和型号可以推荐吗? +销售员:电动扫地机器人有不同的品牌和功能,您对清扫效果和智能导航有任何特殊要求吗? +顾客:我希望电动扫地机器人能够高效清扫并具备智能导航功能。 +销售员:我们的BCD型电动扫地机器人具备高效清扫和智能导航功能,可以满足您的需求。 + + +21. +顾客:我需要购买一个新的咖啡机,有什么类型可以推荐吗? +销售员:咖啡机有滴滤式、胶囊式和半自动式等不同类型,您对咖啡制作的便捷性和口味要求有任何特殊要求吗? +顾客:我希望咖啡机操作简单,而且能制作出浓郁的咖啡。 +销售员:我们的EFG型咖啡机具备简单的操作界面和浓郁的咖啡口味,可以满足您的需求。 + +22. +顾客:我正在考虑购买一个新的微波炉,有什么品牌和型号可以推荐吗? +销售员:微波炉有不同的容量、加热方式和预设程序,您对加热功率和多功能设置有任何特殊要求吗? +顾客:我需要一个中等容量的微波炉,并且希望具备多种预设程序。 +销售员:我们的HIJ型微波炉具备中等容量和多种预设程序,可以满足您的需求。 + +23. +顾客:我想购买一个新的电饭煲,有什么特点可以推荐吗? +销售员:电饭煲有不同的容量、煮饭方式和保温功能,您对煮饭量和保温时间有任何特殊要求吗? +顾客:我需要一个大容量的电饭煲,并且希望能有长时间的保温功能。 +销售员:我们的KLM型电饭煲具备大容量和长时间保温功能,可以满足您的需求。 + +24. +顾客:我正在寻找一个新的吸尘器,有什么特点可以推荐吗? +销售员:吸尘器有直立式、手持式和无线式等不同类型,您对吸力强度和清洁功能有任何特殊要求吗? +顾客:我希望吸尘器具备强大的吸力和多功能清洁附件。 +销售员:我们的NOP型吸尘器具备强大的吸力和多种清洁附件,可以满足您的需求。 + +25. +顾客:我想购买一个新的电动搬运车,有什么品牌和型号可以推荐吗? +销售员:电动搬运车有不同的承载能力和行驶方式,您对搬运重量和操控方式有任何特殊要求吗? +顾客:我需要一个能够承载较重物品的电动搬运车,并且希望操控简便。 +销售员:我们的QRS型电动搬运车具备较高的承载能力和简便的操控方式,可以满足您的需求。 + + +26. +顾客:我想购买一个新的电动剃须刀,有什么品牌和型号可以推荐吗? +销售员:电动剃须刀有旋转式和往复式两种类型,您对刮胡方式和剃须舒适度有任何特殊要求吗? +顾客:我希望电动剃须刀能够提供干湿两种剃须方式,并且剃须时舒适不刺激皮肤。 +销售员:我们的TUV型电动剃须刀支持干湿两种剃须方式,并且具备舒适的剃须体验,可以满足您的需求。 + +27. 
+顾客:我正在考虑购买一个新的音响系统,有什么品牌和型号可以推荐吗? +销售员:音响系统有立体声、环绕声和多房间音响等不同类型,您对音质表现和连接选项有任何特殊要求吗? +顾客:我希望音响系统能够提供优质的音质表现,并且支持多种音频输入选项。 +销售员:我们的WXY型音响系统具备优质的音质表现和多种音频输入选项,可以满足您的需求。 + +28. +顾客:我想购买一个新的电动牙刷,有什么特点可以推荐吗? +销售员:电动牙刷有旋转式和声波式两种类型,您对刷牙效果和刷头种类有任何特殊要求吗? +顾客:我希望电动牙刷能够提供彻底的清洁效果,并且有多种刷头可以选择。 +销售员:我们的XYZ型电动牙刷具备彻底的清洁效果和多种刷头选项,可以满足您的需求。 + +29. +顾客:我正在寻找一个新的电动按摩椅,有什么特点可以推荐吗? +销售员:电动按摩椅有不同的按摩方式和强度调节,您对按摩部位和按摩程序有任何特殊要求吗? +顾客:我希望电动按摩椅能够提供全身的按摩功能,并且有多种按摩程序可选。 +销售员:我们的ABC型电动按摩椅具备全身按摩功能和多种按摩程序,可以满足您的需求。 + +30. +顾客:我想购买一个新的电动剪发器,有什么品牌和型号可以推荐吗? +销售员:电动剪发器有不同的剪发长度和附件选项,您对剪发效果和剪发方式有任何特殊要求吗? +顾客:我希望电动剪发器能够提供精准的剪发效果,并且具备多种剪发方式可选。 +销售员:我们的DEF型电动剪发器具备精准的剪发效果和多种剪发方式,可以满足您的需求。 + + diff --git a/langchain/jupyter/homework/panxizhi/sales_chatbot/sales.ipynb b/langchain/jupyter/homework/panxizhi/sales_chatbot/sales.ipynb new file mode 100644 index 00000000..c0da1e52 --- /dev/null +++ b/langchain/jupyter/homework/panxizhi/sales_chatbot/sales.ipynb @@ -0,0 +1,914 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "69032b4b-b031-4020-beb0-e8dfd22b660e", + "metadata": {}, + "source": [ + "# LangChain 实战:房产销售聊天机器人" + ] + }, + { + "cell_type": "markdown", + "id": "0f3e2d5a-c2ff-4faf-b153-d1751c25fae0", + "metadata": {}, + "source": [ + "## 使用 GPT-4 构造销售话术数据\n", + "\n", + "ChatGPT 分享链接:https://chat.openai.com/share/f3e4b9b0-95fb-4c6a-a3c7-f901dd194c91\n", + "\n", + "\n", + "使用 ChatGPT 构造销售数据的 Prompt 示例:\n", + "\n", + "```\n", + "你是中国顶级的房地产销售,现在培训职场新人,请给出100条实用的销售话术。\n", + "\n", + "每条销售话术以如下格式给出:\n", + "[客户问题]\n", + "[销售回答]\n", + "\n", + "```\n", + "\n", + "GPT-4 回复结果:\n", + "\n", + "```\n", + "在房地产销售行业中,精心准备的销售话术可以极大地提高成交率。以下是一些适用于中国房地产销售的实用话术:\n", + "\n", + "1. \n", + "[客户问题] 这个小区交通便利吗?\n", + "[销售回答] 当然了,这个小区距离地铁站只有几分钟的步行距离,而且附近有多条公交线路,非常方便。\n", + "\n", + "2. \n", + "[客户问题] 我担心楼下太吵。\n", + "[销售回答] 这个小区特别注重居住体验,我们有良好的隔音设计,并且小区内部规划了绿化区域,可以有效降低噪音。\n", + "\n", + "3. \n", + "[客户问题] 我看房价还在涨,这个投资回报怎么样?\n", + "[销售回答] 这个区域未来有大量的商业和基础设施建设,所以从长期来看,投资回报非常有保证。\n", + "\n", + "4. 
\n", + "[客户问题] 有没有学校?\n", + "[销售回答] 附近有多所优质的学校,非常适合有孩子的家庭。\n", + "\n", + "5. \n", + "[客户问题] 物业管理怎么样?\n", + "[销售回答] 我们的物业管理得到了业主一致好评,服务非常到位。\n", + "\n", + "6. \n", + "[客户问题] 我想要南向的房子。\n", + "[销售回答] 很好,我们确实有一些朝南的单位,它们的采光特别好。\n", + "\n", + "7. \n", + "[客户问题] 这个小区安全吗?\n", + "[销售回答] 当然,我们24小时安保巡逻,还有先进的监控系统。\n", + "\n", + "8. \n", + "[客户问题] 预计什么时候交房?\n", + "[销售回答] 根据目前的进度,我们预计将在明年底交房。\n", + "\n", + "9. \n", + "[客户问题] 我不想要一楼的房子。\n", + "[销售回答] 我理解您的顾虑,我们还有多个楼层的房源可以选择。\n", + "\n", + "10. \n", + "[客户问题] 有优惠吗?\n", + "[销售回答] 当然,如果您现在下订,我们可以给您一些优惠。\n", + "```\n", + "\n", + "\n", + "## 使用 Document Transformers 模块来处理原始数据\n", + "\n", + "\n", + "将 ChatGPT 生成的结果保存到 [real_estate_sales_data.txt](real_estate_sales_data.txt) 文件中" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "efcee928-b9f4-4e4a-b9fb-f6cbee21b642", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"real_estate_sales_data.txt\") as f:\n", + " real_estate_sales = f.read()" + ] + }, + { + "cell_type": "markdown", + "id": "a9119556-d2e3-4edf-be16-15837da156f1", + "metadata": {}, + "source": [ + "### 使用 CharacterTextSplitter 来进行文本分割\n", + "\n", + "- 基于单字符来进行文本分割(separator)\n", + "- 基于字符数来决定文本块长度(chunk_size)\n", + "\n", + "参考示例:\n", + "\n", + "```python\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "text_splitter = CharacterTextSplitter( \n", + " separator = \"\\n\\n\",\n", + " chunk_size = 1000,\n", + " chunk_overlap = 200,\n", + " length_function = len,\n", + " is_separator_regex = False,\n", + ")\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "dbe535b8-2bd8-4c31-91d7-eba8aa61b3ef", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import CharacterTextSplitter" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "a4f30d9b-9fc6-4860-a497-0aa8220b9284", + "metadata": {}, + "outputs": [], + "source": [ + "text_splitter = CharacterTextSplitter( \n", + " separator = r'\\d+\\.',\n", + " 
chunk_size = 100,\n", + " chunk_overlap = 0,\n", + " length_function = len,\n", + " is_separator_regex = True,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "530f3172-17a5-4924-ad17-284fbca422ae", + "metadata": {}, + "outputs": [], + "source": [ + "docs = text_splitter.create_documents([real_estate_sales])" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "3956ec36-706e-4f43-88fc-d704f2984290", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(page_content='[客户问题] 这个小区交通便利吗?\\n[销售回答] 当然了,这个小区距离地铁站只有几分钟的步行距离,而且附近有多条公交线路,非常方便。', metadata={})" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "73b2d47a-27ae-4cd4-a72a-3809b4c22bff", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "70" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(docs)" + ] + }, + { + "cell_type": "markdown", + "id": "e4617858-ad9c-42eb-a4c3-6bfb7108cdd2", + "metadata": {}, + "source": [ + "### 使用 Faiss 作为向量数据库,持久化存储房产销售 问答对(QA-Pair)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "508c2ab2-e187-424f-a43c-ca3d3b1a0550", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings.openai import OpenAIEmbeddings\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "from langchain.vectorstores import FAISS\n", + "\n", + "db = FAISS.from_documents(docs, OpenAIEmbeddings())" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "250e0873-95e2-4372-98a5-68c0feb304bb", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"小区吵不吵\"" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "0d18fcc6-c2be-434b-b8de-9648ac6f83f4", + "metadata": {}, + "outputs": [], + "source": [ + "answer_list = db.similarity_search(query)" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e689667e-7a45-40d6-9eae-4e52cbc7daab", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[客户问题] 我担心楼下太吵。\n", + "[销售回答] 这个小区特别注重居住体验,我们有良好的隔音设计,并且小区内部规划了绿化区域,可以有效降低噪音。\n", + "\n", + "[客户问题] 我担心楼下的商业活动会很吵。\n", + "[销售回答] 我们在规划时就已经考虑到这一点,商业区和居住区有一定的距离和隔音设计。\n", + "\n", + "[客户问题] 我喜欢安静,这里噪音大吗?\n", + "[销售回答] 我们特意进行了隔音设计,并且小区内部也有绿化带,整体非常安静。\n", + "\n", + "[客户问题] 我担心小区会很拥挤。\n", + "[销售回答] 这个小区总体规划非常合理,保证了每个单元之间有足够的空间。\n", + "\n" + ] + } + ], + "source": [ + "for ans in answer_list:\n", + " print(ans.page_content + \"\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a6cc47bf-1566-472f-84d2-a46bd634907c", + "metadata": {}, + "outputs": [], + "source": [ + "db.save_local(\"real_estates_sale\")" + ] + }, + { + "cell_type": "markdown", + "id": "0d5323e5-03fa-4c1c-b896-3ccc28af1857", + "metadata": {}, + "source": [ + "### 使用 retriever 从向量数据库中获取结果\n", + "\n", + "#### 使用参数 `k` 指定返回结果数量\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "17f051c9-8676-48d1-a9a4-e69b52630faf", + "metadata": {}, + "outputs": [], + "source": [ + "topK_retriever = db.as_retriever(search_kwargs={\"k\": 3})" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "e33024db-ca7b-4d68-9b0e-c31c15ded57d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "VectorStoreRetriever(tags=['FAISS'], metadata=None, vectorstore=