git commit -m '<feat>: add single parse api' #544
Merged
18 commits
45fa4c3 feat: add html parse api (papayalove)
ea8133b feat: add html parse api (papayalove)
9717f0f feat: add html parse api (papayalove)
6458e2f Merge branch 'ccprocessor:dev' into dev-feat-api (papayalove)
0f90ca0 feat: api module (papayalove)
3c0941f Merge branch 'dev-feat-api' of https://github.com/papayalove/llm-webk… (papayalove)
f26d10f feat: add html parse api with some model params changed for test (papayalove)
7654eff Merge branch 'ccprocessor:dev' into dev-feat-api (papayalove)
c6e78e3 <feat>: add single parse api (papayalove)
7cd0e2f Merge branch 'dev-feat-api' of https://github.com/papayalove/llm-webk… (papayalove)
3081168 Merge branch 'ccprocessor:dev' into dev-feat-api (papayalove)
c7b0f5a <fix>: fix match failure if there are too many same ids in one html, … (papayalove)
4362931 feat: add html parse api (papayalove)
acb0bc9 feat: add html parse api (papayalove)
4a97137 feat: add html parse api (papayalove)
d4382a4 feat: add html parse api (papayalove)
1363708 Merge branch 'dev-feat-api' of https://github.com/papayalove/llm-webk… (papayalove)
82886cc feat: add html parse api (papayalove)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| # LLM Web Kit API | ||
|
|
||
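| A FastAPI-based LLM Web Kit API service that provides HTML parsing. | ||
|
|
||
| ## Features | ||
|
|
||
| - 🚀 High-performance Web API built on FastAPI | ||
| - 📄 HTML content parsing with structured output | ||
| - 🔗 Accepts URL and HTML string input | ||
| - 📁 Supports HTML file upload | ||
| - 📚 Auto-generated API documentation | ||
| - 🔧 Configurable parsing options | ||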
|
|
||
| ## Quick Start | ||
|
|
||
| Configure the environment variable: | ||
|
|
||
| ```bash | ||
| export MODEL_PATH="" | ||
| ``` | ||
|
|
||
| Alternatively, add `model_path` to the `.llm-web-kit.jsonc` configuration file. | ||
|
|
||
| Install the dependencies and start the server: | ||
|
|
||
| ```bash | ||
| pip install -r llm_web_kit/api/requirements.txt | ||
| python llm_web_kit/api/run_server.py | ||
| ``` | ||
|
|
||
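| Once the server is running, the interactive API documentation is available at: | ||
|
|
||
| - Swagger UI: http://127.0.0.1:8000/docs | ||
| - ReDoc: http://127.0.0.1:8000/redoc | ||
|
|
||
| ## API Endpoints | ||
|
|
||
| ### HTML Parsing | ||
|
|
||
| POST /api/v1/html/parse | ||
|
|
||
| Example request: | ||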
|
|
||
| ```bash | ||
| curl -s -X POST "http://127.0.0.1:8000/api/v1/html/parse" \ | ||
|   -H "Content-Type: application/json" \ | ||
|   -d '{ | ||
|     "html_content": "<html><body><h1>Hello World</h1></body></html>", | ||
|     "url": "https://helloworld.com/hello", | ||
|     "options": { | ||
|       "clean_html": true | ||
|     } | ||
|   }' | ||
| ``` | ||
|
|
||
| Or send the following JSON directly as the request body: | ||
|
|
||
| ```json | ||
| { | ||
|   "html_content": "<html><body><h1>Hello World</h1></body></html>", | ||
|   "options": { | ||
|     "clean_html": true | ||
|   } | ||
| } | ||
| ``` | ||
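For callers working from Python rather than curl, the same request can be sketched with only the standard library. The helper names `build_parse_request` and `send` are illustrative and not part of the project, and the base URL assumes the default local server from the quick start.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: default host/port from the quick start


def build_parse_request(html_content, url=None, options=None, base_url=BASE_URL):
    """Build a POST request for /api/v1/html/parse (hypothetical helper)."""
    payload = {"html_content": html_content, "options": options or {}}
    if url is not None:
        payload["url"] = url
    return urllib.request.Request(
        f"{base_url}/api/v1/html/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_parse_request(
    "<html><body><h1>Hello World</h1></body></html>",
    url="https://helloworld.com/hello",
    options={"clean_html": True},
)
# result = send(req)  # uncomment when a server is running
```

Separating request construction from sending keeps the payload easy to inspect without a live server.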
|
|
||
| ### File Upload Parsing | ||
|
|
||
| POST /api/v1/html/upload | ||
|
|
||
| ```bash | ||
| curl -s -X POST "http://127.0.0.1:8000/api/v1/html/upload" \ | ||
|   -F "file=@/path/to/file.html" | ||
| ``` | ||
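The `-F "file=@..."` flag sends the file as a `multipart/form-data` field named `file`. A rough stdlib-only sketch of an equivalent request follows; `build_upload_request` is a hypothetical helper, and the `text/html` part type is an assumption rather than something the endpoint is known to require.

```python
import urllib.request
import uuid


def build_upload_request(filename, content, base_url="http://127.0.0.1:8000"):
    """Build a multipart/form-data POST for /api/v1/html/upload (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    # One form part named "file", followed by the closing boundary.
    body = b"\r\n".join([
        f"--{boundary}".encode("ascii"),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode("utf-8"),
        b"Content-Type: text/html",
        b"",
        content,
        f"--{boundary}--".encode("ascii"),
        b"",
    ])
    return urllib.request.Request(
        f"{base_url}/api/v1/html/upload",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


upload_req = build_upload_request(
    "file.html", b"<html><body><h1>Hello World</h1></body></html>"
)
# urllib.request.urlopen(upload_req)  # uncomment when a server is running
```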
|
|
||
| ### Service Status | ||
|
|
||
| GET /api/v1/html/status | ||
|
|
||
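| ## Response Structure Example (successful responses from /api/v1/html/parse and /api/v1/html/upload) | ||
|
|
||
| The following example shows the unified response structure returned when HTML parsing succeeds: | ||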
|
|
||
| ```json | ||
| { | ||
|   "success": true, | ||
|   "message": "HTML parsed successfully", | ||
|   "timestamp": "2025-08-26T16:45:43.140638", | ||
|   "data": { | ||
|     "layout_file_list": [], | ||
|     "typical_raw_html": "<html><body><h1>Hello World</h1></body></html>", | ||
|     "typical_raw_tag_html": "<html><body><h1 _item_id=\"1\">Hello World</h1><h2 _item_id=\"2\">not main content</h2></body></html>\n", | ||
|     "llm_response": { | ||
|       "item_id 1": 0, | ||
|       "item_id 2": 1 | ||
|     }, | ||
|     "typical_main_html": "<html><body><h1 _item_id=\"1\">Hello World</h1></body></html>", | ||
|     "html_target_list": ["Hello World"] | ||
|   }, | ||
|   "metadata": null | ||
| } | ||
| ``` | ||
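A client usually wants only the extracted main content from this envelope. The sketch below shows one way to unpack it, using a trimmed copy of the example response; `extract_main_content` is a hypothetical helper, and treating a false `success` flag as an error is an assumption about how callers might handle failures.

```python
# Trimmed copy of the example response above.
sample = {
    "success": True,
    "data": {
        "layout_file_list": [],
        "llm_response": {"item_id 1": 0, "item_id 2": 1},
        "typical_main_html": '<html><body><h1 _item_id="1">Hello World</h1></body></html>',
        "html_target_list": ["Hello World"],
    },
    "metadata": None,
}


def extract_main_content(response):
    """Return (main_html, target_list) from a successful response (hypothetical helper)."""
    if not response.get("success"):
        # Surface the server-provided message when the call did not succeed.
        raise ValueError(response.get("message", "HTML parsing failed"))
    data = response.get("data") or {}
    return data.get("typical_main_html", ""), data.get("html_target_list", [])


main_html, targets = extract_main_content(sample)
```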
|
|
||
| ## FAQ | ||
|
|
||
| - 422 error: make sure the request header is `Content-Type: application/json` and that the request body is valid JSON. | ||
| - Missing dependencies: run `pip install -r llm_web_kit/api/requirements.txt`. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| """LLM Web Kit API module. | ||
|
|
||
| Provides a FastAPI-based web API for HTML parsing and content extraction. | ||
| """ | ||
|
|
||
| __version__ = "1.0.0" | ||
| __author__ = "LLM Web Kit Team" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| """API dependency management. | ||
|
|
||
| Contains the FastAPI application's dependencies, configuration management, and shared services. | ||
| """ | ||
|
|
||
| import logging | ||
| from functools import lru_cache | ||
| from typing import Optional | ||
|
|
||
| from pydantic_settings import BaseSettings, SettingsConfigDict | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class Settings(BaseSettings): | ||
|     """Application configuration settings.""" | ||
|
|
||
|     # API configuration | ||
|     api_title: str = "LLM Web Kit API" | ||
|     api_version: str = "1.0.0" | ||
|     api_description: str = "LLM-based web content parsing and extraction API service" | ||
|
|
||
|     # Server configuration | ||
|     host: str = "0.0.0.0" | ||
|     port: int = 8000 | ||
|     debug: bool = False | ||
|
|
||
|     # Logging configuration | ||
|     log_level: str = "INFO" | ||
|
|
||
|     # Model configuration | ||
|     model_path: Optional[str] = None | ||
|     max_content_length: int = 10 * 1024 * 1024  # 10 MB | ||
|
|
||
|     # Cache configuration | ||
|     cache_ttl: int = 3600  # 1 hour | ||
|
|
||
|     # pydantic v2 configuration style | ||
|     model_config = SettingsConfigDict( | ||
|         env_file=".env", | ||
|         case_sensitive=False | ||
|     ) | ||
|
|
||
|
|
||
| @lru_cache() | ||
| def get_settings() -> Settings: | ||
|     """Return the application settings singleton.""" | ||
|     return Settings() | ||
|
|
||
|
|
||
| def get_logger(name: str = __name__) -> logging.Logger: | ||
|     """Return a configured logger.""" | ||
|     logger = logging.getLogger(name) | ||
|     if not logger.handlers: | ||
|         handler = logging.StreamHandler() | ||
|         formatter = logging.Formatter( | ||
|             '%(asctime)s - %(name)s - %(levelname)s - %(message)s' | ||
|         ) | ||
|         handler.setFormatter(formatter) | ||
|         logger.addHandler(handler) | ||
|         logger.setLevel(get_settings().log_level) | ||
|     return logger | ||
|
|
||
|
|
||
| # Global dependencies | ||
| settings = get_settings() | ||
|
|
||
| # InferenceService singleton | ||
| _inference_service_singleton = None | ||
|
|
||
|
|
||
| def get_inference_service(): | ||
|     """Return the InferenceService singleton.""" | ||
|     global _inference_service_singleton | ||
|     if _inference_service_singleton is None: | ||
|         from .services.inference_service import InferenceService | ||
|         _inference_service_singleton = InferenceService() | ||
|     return _inference_service_singleton |
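The `@lru_cache()` decorator is what turns `get_settings` into a singleton: the first call builds the object and later calls return the cached instance. A minimal stdlib-only sketch of that pattern follows. `DemoSettings` and `get_demo_settings` are hypothetical stand-ins for the pydantic `Settings` class, and reading `LOG_LEVEL` and `PORT` from `os.environ` only imitates what `BaseSettings` does.

```python
import os
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    # Stand-in for the pydantic Settings class; environment variables override defaults.
    log_level: str = field(default_factory=lambda: os.environ.get("LOG_LEVEL", "INFO"))
    port: int = field(default_factory=lambda: int(os.environ.get("PORT", "8000")))


@lru_cache()
def get_demo_settings() -> DemoSettings:
    """Build the settings once; every later call returns the same object."""
    return DemoSettings()


first = get_demo_settings()
os.environ["PORT"] = "9000"  # changes made after the first call are not picked up
second = get_demo_settings()
assert first is second

get_demo_settings.cache_clear()  # force a rebuild, for example between tests
third = get_demo_settings()
```

One consequence worth knowing: because the value is cached, environment changes made after the first call have no effect until `cache_clear()` is called.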
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| """FastAPI application entry point. | ||
|
|
||
| Provides the LLM Web Kit web API service, including HTML parsing and content extraction. | ||
| """ | ||
|
|
||
| import uvicorn | ||
| from fastapi import FastAPI | ||
| from fastapi.middleware.cors import CORSMiddleware | ||
| from fastapi.responses import JSONResponse | ||
|
|
||
| from .dependencies import get_inference_service, get_logger, get_settings | ||
| from .routers import htmls | ||
|
|
||
| settings = get_settings() | ||
| logger = get_logger(__name__) | ||
|
|
||
|
|
||
| # Create the FastAPI application instance (metadata is read from Settings) | ||
| app = FastAPI( | ||
|     title=settings.api_title, | ||
|     description=settings.api_description, | ||
|     version=settings.api_version, | ||
|     docs_url="/docs", | ||
|     redoc_url="/redoc" | ||
| ) | ||
|
|
||
| # Add the CORS middleware | ||
| app.add_middleware( | ||
|     CORSMiddleware, | ||
|     allow_origins=["*"],  # restrict to specific domains in production | ||
|     allow_credentials=True, | ||
|     allow_methods=["*"], | ||
|     allow_headers=["*"], | ||
| ) | ||
|
|
||
| # Register routes | ||
| app.include_router(htmls.router, prefix="/api/v1", tags=["HTML Processing"]) | ||
|
|
||
|
|
||
| @app.get("/") | ||
| async def root(): | ||
|     """Root path; returns service status information.""" | ||
|     return { | ||
|         "message": "LLM Web Kit API service is running", | ||
|         "version": settings.api_version, | ||
|         "status": "healthy" | ||
|     } | ||
|
|
||
|
|
||
| @app.get("/health") | ||
| async def health_check(): | ||
|     """Health check endpoint.""" | ||
|     return {"status": "healthy", "service": "llm-web-kit-api"} | ||
|
|
||
|
|
||
| @app.on_event("startup") | ||
| async def app_startup(): | ||
|     """Warm up the model at application startup to avoid cold-start latency on the first request.""" | ||
|     try: | ||
|         service = get_inference_service() | ||
|         await service.warmup() | ||
|         logger.info("InferenceService model warmup complete") | ||
|     except Exception as e: | ||
|         logger.warning(f"InferenceService warmup failed (the service still runs and will initialize on the first request): {e}") | ||
|
|
||
|
|
||
| @app.exception_handler(Exception) | ||
| async def global_exception_handler(request, exc): | ||
|     """Global exception handler.""" | ||
|     logger.error(f"Unhandled exception: {exc}") | ||
|     return JSONResponse( | ||
|         status_code=500, | ||
|         content={"detail": "Internal server error", "error": str(exc)} | ||
|     ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
|     # Run in the development environment | ||
|     uvicorn.run( | ||
|         "llm_web_kit.api.main:app", | ||
|         host=settings.host, | ||
|         port=settings.port, | ||
|         reload=True, | ||
|         log_level=(settings.log_level or "INFO").lower() | ||
|     ) |
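Recent FastAPI releases deprecate `@app.on_event("startup")` in favour of a `lifespan` context manager passed to `FastAPI(lifespan=...)`. The sketch below expresses the same warm-up-with-fallback logic in that shape using only the standard library, so it runs without FastAPI installed. `FakeInferenceService`, `make_lifespan`, and `run_once` are hypothetical names; the `warmup()` coroutine is assumed from the startup hook in this file.

```python
import asyncio
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger("lifespan-sketch")


class FakeInferenceService:
    """Hypothetical stand-in exposing the warmup() coroutine used at startup."""

    def __init__(self, fail=False):
        self.fail = fail
        self.warmed = False

    async def warmup(self):
        if self.fail:
            raise RuntimeError("model not available")
        self.warmed = True


def make_lifespan(service):
    @asynccontextmanager
    async def lifespan(app):
        # Startup: try to warm the model, but keep serving if that fails.
        try:
            await service.warmup()
            logger.info("warmup finished")
        except Exception as exc:
            logger.warning("warmup failed, deferring init to first request: %s", exc)
        yield
        # Shutdown: release resources here if the service holds any.

    return lifespan


async def run_once(service):
    # With FastAPI this would be: app = FastAPI(lifespan=make_lifespan(service))
    async with make_lifespan(service)(app=None):
        return service.warmed


ok = asyncio.run(run_once(FakeInferenceService()))
degraded = asyncio.run(run_once(FakeInferenceService(fail=True)))
```

In both cases the application reaches the serving phase; only the warm state differs, which matches the intent of the original startup hook.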
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| """Pydantic models module. | ||
|
|
||
| Contains the data model definitions for all API requests and responses. | ||
| """ | ||
|
|
||
| from .request import HTMLParseRequest | ||
| from .response import ErrorResponse, HTMLParseResponse | ||
|
|
||
| __all__ = [ | ||
|     "HTMLParseRequest", | ||
|     "HTMLParseResponse", | ||
|     "ErrorResponse" | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| """Request data models. | ||
|
|
||
| Defines the data structures and validation rules for API requests. | ||
| """ | ||
|
|
||
| from typing import Any, Dict, Optional | ||
|
|
||
| from pydantic import BaseModel, ConfigDict, Field | ||
|
|
||
|
|
||
| class HTMLParseRequest(BaseModel): | ||
|     """HTML parse request model.""" | ||
|
|
||
|     html_content: Optional[str] = Field( | ||
|         None, | ||
|         description="HTML content string", | ||
|         max_length=10485760  # 10 MB | ||
|     ) | ||
|
|
||
|     url: Optional[str] = Field( | ||
|         None, | ||
|         description="URL address", | ||
|         max_length=10485760  # 10 MB | ||
|     ) | ||
|
|
||
|     options: Optional[Dict[str, Any]] = Field( | ||
|         default_factory=dict, | ||
|         description="Parsing option configuration" | ||
|     ) | ||
|
|
||
|     model_config = ConfigDict( | ||
|         json_schema_extra={ | ||
|             "example": { | ||
|                 "html_content": "<html><body><h1>Hello World</h1></body></html>", | ||
|                 "url": "https://helloworld.com/hello", | ||
|                 "options": { | ||
|                     "clean_html": True | ||
|                 } | ||
|             } | ||
|         } | ||
|     ) |
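Every field on `HTMLParseRequest` is optional, so the model by itself accepts an empty body. As a rough illustration of the constraints it does enforce, plus one extra check a caller might want before sending, here is a stdlib-only sketch. `validate_parse_payload` and the at-least-one-input rule are assumptions for illustration, not behavior of the API.

```python
MAX_LENGTH = 10 * 1024 * 1024  # mirrors max_length=10485760 on the model


def validate_parse_payload(payload):
    """Return a list of problems; an empty list means the payload looks acceptable."""
    problems = []
    for name in ("html_content", "url"):
        value = payload.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif isinstance(value, str) and len(value) > MAX_LENGTH:
            problems.append(f"{name} exceeds {MAX_LENGTH} characters")
    options = payload.get("options", {})
    if options is not None and not isinstance(options, dict):
        problems.append("options must be an object")
    # Assumption: a useful request carries at least one input source.
    if not payload.get("html_content") and not payload.get("url"):
        problems.append("provide html_content or url")
    return problems
```

A pre-flight check like this can give a clearer message than the generic 422 the server returns for malformed bodies.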