Add doc-parsing connectors for PDF, DOCX, PPTX, HTML, and Markdown #196
Closed
9 commits
268fca5 Added docs-parser (PredictiveManish)
2ccc10e Added docs-parser (PredictiveManish)
0a38e90 Merge branch 'usemoss:main' into doc-parsing (PredictiveManish)
7f97a4a Removed unnecessary file (PredictiveManish)
effdb8e Update packages/moss-doc-parser/src/moss_doc_parser/parsers/markdown.py (PredictiveManish)
3d15e07 removed unnecessary imports (PredictiveManish)
078128a removed unnecessary imports (PredictiveManish)
c91e1c8 Added tests (PredictiveManish)
38ceda6 added module initialized and proper instruction added (PredictiveManish)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| [build-system] | ||
| # license-files under [project] (PEP 639) needs a recent setuptools (77+). | ||
| requires = ["setuptools>=77.0.3", "wheel"] | ||
| build-backend = "setuptools.build_meta" | ||
|
|
||
| [project] | ||
| name = "moss-doc-parser" | ||
| version = "0.1.0" | ||
| description = "Document parsing utilities for Moss semantic search" | ||
| readme = "README.md" | ||
| license-files = ["LICENSE"] | ||
| authors = [ | ||
| { name = "InferEdge Inc.", email = "[email protected]" } | ||
| ] | ||
| keywords = ["search", "semantic", "document", "parser", "moss"] | ||
| classifiers = [ | ||
| "Development Status :: 3 - Alpha", | ||
| "Intended Audience :: Developers", | ||
| "License :: OSI Approved :: BSD License", | ||
| "Programming Language :: Python :: 3", | ||
| "Programming Language :: Python :: 3.10", | ||
| "Programming Language :: Python :: 3.11", | ||
| "Programming Language :: Python :: 3.12", | ||
| "Programming Language :: Python :: 3.13", | ||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||
| ] | ||
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| "pypdf>=3.0", | ||
| "python-docx>=1.0", | ||
| "python-pptx>=0.6", | ||
| "beautifulsoup4>=4.12", | ||
| "python-magic>=0.4", | ||
| "markdown>=3.0", | ||
| "typing-extensions>=4.0.0", | ||
| ] | ||
|
|
||
| [project.optional-dependencies] | ||
| dev = [ | ||
| "pytest>=8.0.0", | ||
| "black>=24.0.0", | ||
| "isort>=5.0.0", | ||
| "flake8>=7.0.0", | ||
| "mypy>=1.0.0", | ||
| "build>=1.0.0", | ||
| "twine>=5.0.0", | ||
| ] | ||
|
|
||
| [tool.setuptools.packages.find] | ||
| where = ["src"] | ||
|
|
||
| [tool.setuptools.package-dir] | ||
| "" = "src" | ||
|
|
||
| [tool.black] | ||
| line-length = 88 | ||
| target-version = ['py310'] | ||
|
|
||
| [tool.isort] | ||
| profile = "black" | ||
| line_length = 88 | ||
|
|
||
| [tool.mypy] | ||
| python_version = "3.10" | ||
| warn_return_any = false | ||
| warn_unused_configs = true | ||
| disallow_untyped_defs = true | ||
| ignore_missing_imports = true |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| """Moss document parser package.""" | ||
|
|
||
| from .detector import FileTypeDetector | ||
| from .base import BaseParser | ||
| from .types import MossDocument, ParseResult | ||
|
|
||
| __all__ = [ | ||
| "FileTypeDetector", | ||
| "BaseParser", | ||
| "MossDocument", | ||
| "ParseResult", | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| """Abstract base class for document parsers.""" | ||
|
|
||
| from abc import ABC, abstractmethod | ||
| from typing import List | ||
|
|
||
| from .types import ParseResult | ||
|
|
||
|
|
||
| class BaseParser(ABC): | ||
| """Abstract base class for all document parsers.""" | ||
|
|
||
| @abstractmethod | ||
| def parse(self, file_path: str) -> ParseResult: | ||
| """Parse a file and return a ParseResult. | ||
|
|
||
| Args: | ||
| file_path: Path to the file to parse. | ||
|
|
||
| Returns: | ||
| ParseResult containing the parsed documents and metadata. | ||
| """ | ||
| pass | ||
|
|
||
| @abstractmethod | ||
| def supported_extensions(self) -> List[str]: | ||
| """Return a list of file extensions this parser supports. | ||
|
|
||
| Returns: | ||
| List of file extensions (without the dot, e.g., ['pdf', 'docx']). | ||
| """ | ||
| pass |
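As a rough illustration of how a new format could plug into the `BaseParser` interface above, here is a minimal sketch. The `PlainTextParser` class is hypothetical and not part of this PR. The `MossDocument` and `ParseResult` dataclasses are stand-ins: the real `types` module is not shown in this diff, so their fields are assumed from how `DocxParser` constructs them.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Stand-ins for the package's types module (assumed shape, not the real definitions).
@dataclass
class MossDocument:
    id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ParseResult:
    documents: List[MossDocument]
    source_path: str
    parse_time_ms: float


class BaseParser(ABC):
    """Same two-method contract as the base class in the diff."""

    @abstractmethod
    def parse(self, file_path: str) -> ParseResult: ...

    @abstractmethod
    def supported_extensions(self) -> List[str]: ...


class PlainTextParser(BaseParser):
    """Hypothetical parser: one document per non-empty line."""

    def parse(self, file_path: str) -> ParseResult:
        start = time.perf_counter()
        with open(file_path, encoding="utf-8") as handle:
            lines = [line.strip() for line in handle]
        documents = [
            MossDocument(
                id=f"{file_path}_line_{number}",
                text=text,
                metadata={"source_file": file_path, "line_number": number},
            )
            for number, text in enumerate(lines, start=1)
            if text  # skip blank lines but keep the original line numbers
        ]
        return ParseResult(
            documents=documents,
            source_path=file_path,
            parse_time_ms=(time.perf_counter() - start) * 1000,
        )

    def supported_extensions(self) -> List[str]:
        return ["txt"]
```

A subclass that leaves out either abstract method cannot be instantiated, which is the main guarantee the ABC provides here.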
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| """File type detector for document parsers.""" | ||
|
|
||
| from typing import Dict, List, Type | ||
|
|
||
| from .base import BaseParser | ||
| from .parsers.html import HTMLParser | ||
| from .parsers.docx import DocxParser | ||
| from .parsers.markdown import MarkdownParser | ||
| from .parsers.pdf import PDFParser | ||
| from .parsers.pptx import PPTXParser | ||
|
|
||
|
|
||
|
|
||
| class FileTypeDetector: | ||
| """Detects file type and returns appropriate parser.""" | ||
|
|
||
| def __init__(self) -> None: | ||
| self._parsers: Dict[str, Type[BaseParser]] = { | ||
| "pdf": PDFParser, | ||
| "docx": DocxParser, | ||
| "pptx": PPTXParser, | ||
| "html": HTMLParser, | ||
| "htm": HTMLParser, | ||
| "md": MarkdownParser, | ||
| "markdown": MarkdownParser, | ||
| } | ||
| # Try to initialize python-magic, but make it optional | ||
| self._magic_available = False | ||
| self._magic = None | ||
| try: | ||
| import magic | ||
|
|
||
| self._magic = magic.Magic(mime=True) | ||
| self._magic_available = True | ||
| except ImportError: | ||
| pass # magic not available, we'll rely on extension-based detection | ||
|
|
||
| def get_parser_for_file(self, file_path: str) -> BaseParser: | ||
| """Get the appropriate parser for a file, by extension first and then by content type. | ||
|
|
||
| Args: | ||
| file_path: Path to the file to analyze. | ||
|
|
||
| Returns: | ||
| An instance of the appropriate parser class. | ||
|
|
||
| Raises: | ||
| ValueError: If no parser is available for the file type. | ||
| """ | ||
| # First try extension-based detection | ||
| extension = file_path.lower().split(".")[-1] if "." in file_path else "" | ||
| if extension in self._parsers: | ||
| return self._parsers[extension]() | ||
|
|
||
| # Fallback to magic byte detection if available | ||
| if self._magic_available: | ||
| try: | ||
| mime_type = self._magic.from_file(file_path) | ||
| mime_to_extension = { | ||
| "application/pdf": "pdf", | ||
| "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx", | ||
| "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx", | ||
| "text/html": "html", | ||
| "text/plain": "md", # Assume markdown for plain text | ||
|
|
||
| # Note: extension-based detection only matches the registered extensions above, so any | ||
| # other file that libmagic reports as text/plain (no extension, .txt, and so on) is parsed as Markdown. | ||
| # This is an acceptable trade-off as the markdown parser gracefully handles plain text. | ||
| } | ||
|
|
||
| extension = mime_to_extension.get(mime_type) | ||
| if extension and extension in self._parsers: | ||
| return self._parsers[extension]() | ||
| except Exception: | ||
| pass # Content-type detection failed; fall through to the ValueError below | ||
|
|
||
| raise ValueError(f"No parser available for file: {file_path}") | ||
|
|
||
| def get_supported_extensions(self) -> List[str]: | ||
| """Get list of all supported file extensions. | ||
|
|
||
| Returns: | ||
| List of supported file extensions (without the dot). | ||
| """ | ||
| return list(self._parsers.keys()) | ||
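One edge case in the extension check above: splitting the whole path on `.` also picks up dots in directory names and in dotfiles. The sketch below compares that expression with a `pathlib`-based alternative. The helper names `extension_of` and `split_based` are made up for this illustration and do not appear in the PR.

```python
from pathlib import PurePath


def extension_of(file_path: str) -> str:
    """Return the lowercase final extension without the dot, or "" if there is none."""
    # PurePath.suffix only looks at the last path component.
    return PurePath(file_path).suffix.lower().lstrip(".")


def split_based(file_path: str) -> str:
    """The expression used in get_parser_for_file, reproduced for comparison."""
    return file_path.lower().split(".")[-1] if "." in file_path else ""


for candidate in ["Report.PDF", "reports.v2/README", "docs/archive.tar.gz"]:
    print(candidate, repr(split_based(candidate)), repr(extension_of(candidate)))
```

For `reports.v2/README` the split-based version yields `v2/readme`, which is not a registered extension, so the outcome is the same today (fallback or `ValueError`). The difference would start to matter if a directory suffix ever happened to match a registered extension.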
Empty file.
61 changes: 61 additions & 0 deletions
packages/moss-doc-parser/src/moss_doc_parser/parsers/docx.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| """DOCX document parser.""" | ||
|
|
||
| import time | ||
| from typing import List | ||
|
|
||
| from docx import Document as DocxDocument | ||
|
|
||
| from ..base import BaseParser | ||
| from ..types import MossDocument, ParseResult | ||
|
|
||
|
|
||
| class DocxParser(BaseParser): | ||
| """Parser for DOCX files.""" | ||
|
|
||
| def parse(self, file_path: str) -> ParseResult: | ||
| """Parse a DOCX file and extract text from paragraphs. | ||
|
|
||
| Args: | ||
| file_path: Path to the DOCX file. | ||
|
|
||
| Returns: | ||
| ParseResult containing one document per non-empty paragraph. | ||
| """ | ||
| start_time = time.time() | ||
|
|
||
| documents = [] | ||
| doc = DocxDocument(file_path) | ||
| # Count non-empty paragraphs once instead of once per paragraph. | ||
| total_paragraphs = sum(1 for p in doc.paragraphs if p.text.strip()) | ||
|
|
||
| for para_num, paragraph in enumerate(doc.paragraphs): | ||
| text = paragraph.text | ||
| if text.strip(): # Only add non-empty paragraphs | ||
| doc_id = f"{file_path}_para_{para_num}" | ||
| metadata = { | ||
| "source_file": file_path, | ||
| "paragraph_number": para_num + 1, # 1-indexed for humans | ||
| "total_paragraphs": total_paragraphs, | ||
| } | ||
| documents.append( | ||
| MossDocument( | ||
| id=doc_id, | ||
| text=text.strip(), | ||
| metadata=metadata, | ||
| ) | ||
| ) | ||
|
|
||
| parse_time_ms = (time.time() - start_time) * 1000 | ||
| return ParseResult( | ||
| documents=documents, | ||
| source_path=file_path, | ||
| parse_time_ms=parse_time_ms, | ||
| ) | ||
|
|
||
| def supported_extensions(self) -> List[str]: | ||
| """Return a list of file extensions this parser supports. | ||
|
|
||
| Returns: | ||
| List of file extensions (without the dot). | ||
| """ | ||
| return ["docx"] |
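One thing a reviewer might check in the DOCX parser: `paragraph_number` counts every paragraph, including empty ones, while `total_paragraphs` counts only the non-empty ones, so a document can report a paragraph number larger than its total. The sketch below shows one possible way to number only the non-empty paragraphs. It works on plain strings rather than python-docx objects, and the `paragraph_metadata` helper is hypothetical. Whether the PR intends this numbering is not stated in the diff.

```python
from typing import Any, Dict, List


def paragraph_metadata(source_file: str, paragraphs: List[str]) -> List[Dict[str, Any]]:
    """Build per-paragraph records, numbering only the non-empty paragraphs."""
    # Keep the original index for the id, as the parser in the diff does.
    non_empty = [
        (index, text.strip())
        for index, text in enumerate(paragraphs)
        if text.strip()
    ]
    total = len(non_empty)
    return [
        {
            "id": f"{source_file}_para_{index}",
            "text": text,
            "paragraph_number": position,  # 1-indexed among non-empty paragraphs
            "total_paragraphs": total,
        }
        for position, (index, text) in enumerate(non_empty, start=1)
    ]


rows = paragraph_metadata("a.docx", ["Intro", "", "   ", "Body "])
print([(row["paragraph_number"], row["total_paragraphs"]) for row in rows])
```

With this numbering, `paragraph_number` never exceeds `total_paragraphs`, while the id still points back at the paragraph's original position in the file.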