This project provides an API endpoint for loading various document types (e.g., PDF, DOCX, HTML, TXT), splitting their textual content into manageable chunks, and returning these chunks along with metadata. The service focuses on the Load and Split stages: its output is designed for preprocessing data for downstream tasks, particularly Retrieval-Augmented Generation (RAG) pipelines, in which the chunks would typically be fed into an embedding model.

The service is built as a Python FastAPI application. It leverages the Unstructured library for robust document parsing and content extraction, and Langchain for text splitting (specifically `RecursiveCharacterTextSplitter`).
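To illustrate what the Split stage does (a minimal sketch, not this service's actual code), here is how Langchain's `RecursiveCharacterTextSplitter` produces overlapping chunks; the sample text is invented:

```python
# Minimal sketch of the Split stage; not this service's actual code.
# Depending on your Langchain version, the splitter may instead be
# imported from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target chunk length in characters (cf. CHUNK_SIZE)
    chunk_overlap=20,  # characters shared between consecutive chunks (cf. CHUNK_OVERLAP)
)

text = "Text extracted from a document by the Unstructured library. " * 60
chunks = splitter.split_text(text)
print(len(chunks), "chunks;", len(chunks[0]), "characters in the first")
```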
The application is designed to be:
- 📦 Containerized: Using Docker for consistent environments and deployment.
- ☁️ Serverless-ready: Deployable as an AWS Lambda function, managed via the Serverless Framework.
The application uses several environment variables for configuration, managed through a `.env` file and a `config.py` file.
| Variable | Purpose | Default (in `config.py`) | File(s) Used In |
|---|---|---|---|
| `DELETE_TEMP_FILE` | If `1`, temporary files created during processing are deleted. | `True` | `config.py`, `split.py` |
| `NLTK_DATA` | Path to the NLTK data directory, needed for tokenizers used by `unstructured`. | `/tmp/nltk_data` | `config.py`, `split.py` |
| `MAX_FILE_SIZE_IN_MB` | Maximum allowed file size for uploads, in megabytes. | `10.0` | `config.py`, `split.py` |
| `SUPPORTED_FILE_TYPES` | Comma-separated string of allowed MIME types for uploaded files. | See `config.py` for the full list | `config.py`, `split.py` |
| `CHUNK_SIZE` | Target size for text chunks, in characters. | `500` | `config.py`, `split.py` |
| `CHUNK_OVERLAP` | Number of characters to overlap between consecutive chunks. | `20` | `config.py`, `split.py` |
| `HOST` | Host address for the Uvicorn server in local development. | `0.0.0.0` | `config.py` |
| `PORT` | Port for the Uvicorn server in local development. | `8000` | `config.py` |
| `RUNTIME` | Indicates the running environment, e.g., `"aws-lambda"`. | `None` | `config.py`, `split.py` |
| `HF_HOME` | Path to the Hugging Face cache directory. Relevant if `unstructured` uses models from the Hugging Face Hub. | `/tmp/hf_home` | `config.py` |
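For orientation, here is a plain-`os.getenv` sketch of how such variables are typically consumed; the project's actual `config.py` is not reproduced here and may use a different mechanism (e.g., pydantic settings):

```python
# Hypothetical sketch of reading these variables; the project's actual
# config.py may differ.
import os

DELETE_TEMP_FILE = os.getenv("DELETE_TEMP_FILE", "1") == "1"
NLTK_DATA = os.getenv("NLTK_DATA", "/tmp/nltk_data")
MAX_FILE_SIZE_IN_MB = float(os.getenv("MAX_FILE_SIZE_IN_MB", "10.0"))
SUPPORTED_FILE_TYPES = os.getenv("SUPPORTED_FILE_TYPES", "text/plain,application/pdf").split(",")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "20"))
RUNTIME = os.getenv("RUNTIME")  # e.g., "aws-lambda"; None when unset
HF_HOME = os.getenv("HF_HOME", "/tmp/hf_home")
```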
Prerequisites:

- Python 3.11+
- Docker
- Node.js (for the Serverless Framework)
1. Clone the repository.

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. NLTK Data: The `unstructured` library requires NLTK data packages. The application is configured to look for them in the path specified by the `NLTK_DATA` environment variable. (A one-off download sketch is shown after this list.)

4. Create a `.env` file: Copy the contents of the example below into a `.env` file in the project root to configure the application for local development.

   ```env
   HOST=0.0.0.0
   PORT=8000
   DELETE_TEMP_FILE=1
   NLTK_DATA=/tmp/nltk_data
   MAX_FILE_SIZE_IN_MB=10
   SUPPORTED_FILE_TYPES=text/plain,application/pdf,text/html,text/markdown,application/vnd.ms-powerpoint,application/vnd.openxmlformats-officedocument.presentationml.presentation,application/msword,application/vnd.openxmlformats-officedocument.wordprocessingml.document,application/epub+zip,message/rfc822,application/gzip
   CHUNK_SIZE=500
   CHUNK_OVERLAP=20
   HF_HOME=/tmp/hf_home
   ```

5. Run the application locally: Use the provided shell script to start the server with Uvicorn:

   ```bash
   ./start_server.sh
   ```

   Alternatively, run the `split.py` script directly:

   ```bash
   python split.py
   ```

6. Run with Docker:

   - Build the Docker image:

     ```bash
     ./docker-build.sh
     ```

   - Run the Docker container:

     ```bash
     ./docker-run.sh
     ```

   The `docker-compose.yaml` file is also available for running the service with Docker Compose.
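For step 3 above, here is a hypothetical one-off script for fetching NLTK data into the directory `NLTK_DATA` points at; the exact packages `unstructured` needs can vary by version, so check its documentation:

```python
# Hypothetical helper for downloading NLTK data packages into NLTK_DATA.
# punkt and averaged_perceptron_tagger are commonly required, but the
# exact set depends on the unstructured version in use.
import os
import nltk

nltk_data_dir = os.getenv("NLTK_DATA", "/tmp/nltk_data")
os.makedirs(nltk_data_dir, exist_ok=True)

for package in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(package, download_dir=nltk_data_dir)
```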
The service exposes the following endpoints.

`POST /split`: Uploads a document, splits its textual content, and returns the chunks.

- Request:
  - Method: `POST`
  - Content-Type: `multipart/form-data`
  - Body: Must include a `file` field containing the document.
- Query Parameters:
  - `q_chunk_size` (integer, optional): Desired chunk size. Defaults to `CHUNK_SIZE`.
  - `q_chunk_overlap` (integer, optional): Desired chunk overlap. Defaults to `CHUNK_OVERLAP`.
- Response (200 OK): A JSON object with the following structure (the `metadata` object may contain additional fields beyond `source` and `id`):

  ```json
  {
    "content": "string or null",
    "mime_type": "string",
    "items": [
      {
        "content": "string",
        "metadata": {
          "source": "string",
          "id": "string"
        }
      }
    ]
  }
  ```

- `curl` example:

  ```bash
  curl -X POST -F "file=@/path/to/your/document.pdf" "http://localhost:8000/split?q_chunk_size=1000&q_chunk_overlap=100"
  ```
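The same request can be made from Python with the `requests` library; this is a hypothetical client sketch, and the file path and server URL are placeholders:

```python
# Hypothetical client for the POST /split endpoint.
import requests

with open("/path/to/your/document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/split",
        files={"file": f},  # the required multipart "file" field
        params={"q_chunk_size": 1000, "q_chunk_overlap": 100},
    )

response.raise_for_status()
for item in response.json()["items"]:
    print(item["metadata"]["id"], len(item["content"]))
```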
`GET /split/config`: Returns the current operational configuration of the service.

- Response (200 OK): A JSON object detailing the service's settings (the `supported_file_types` list is abbreviated here; the response contains the full list):

  ```json
  {
    "delete_temp_file": true,
    "nltk_data": "/tmp/nltk_data",
    "max_file_size_in_mb": 10.0,
    "supported_file_types": [
      "text/plain",
      "application/pdf"
    ],
    "chunk_size": 500,
    "chunk_overlap": 50
  }
  ```

- `curl` example:

  ```bash
  curl http://localhost:8000/split/config
  ```
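And the equivalent call from Python, again with the `requests` library (the server URL is a placeholder):

```python
# Fetch and print the service's current chunking configuration.
import requests

config = requests.get("http://localhost:8000/split/config").json()
print(config["chunk_size"], config["chunk_overlap"])
```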
The service is designed for serverless deployment on AWS Lambda using the Serverless Framework. The `serverless.yml` file configures the Lambda function, API Gateway trigger, and environment variables. `Dockerfile-AwsLambda` is used to build the container image for deployment.

The `.github/workflows/dev.yml` file contains a GitHub Actions workflow for deploying to a development environment on AWS. A second workflow, `.github/workflows/deploy-vps.yml`, deploys the application to a Virtual Private Server (VPS).
The project uses multiple requirements files for different environments:

- `requirements.txt`: For local development and testing.
- `deploy-requirements.txt`: Production dependencies for the full-featured AWS Lambda deployment.
- `requirements-text-only.txt`: A minimal set of dependencies for a text-only version of the service.
Regularly review and update dependencies, and use tools like GitHub Dependabot, Snyk, or Trivy for vulnerability scanning.
This project is licensed under the MIT License.