The project aims to develop an automated system capable of grading GitHub repositories and transforming various data types into actionable insights. The system leverages modern stream processing frameworks, microservices, and local large language models (LLMs) to ensure scalability, efficiency, and cost-effectiveness.
## Table of Contents

- Features
- Project Structure
- Prerequisites
- Installation
- Usage
- Configuration
- Running the Dataflows Manually
- Testing
## Features

- Fetch commits from GitHub repositories
- Process commit messages to generate summaries
- Store results in a vector database
- Kafka integration for message streaming
- Configurable to use different chat models and providers (OpenAI, Fake model, local model)
- Kafka KRaft instance for broker management
- Kafka UI for cluster management
- REST Proxy for interacting with Kafka topics via REST API
- Schema Registry for managing Kafka message schemas
- FastAPI application for user and job CRUD operations
- Debezium for change data capture from PostgreSQL to Kafka
- Streamlit frontend for job submission and document viewing
## Project Structure

```
gutenberg/
├── config/
│   ├── __init__.py
│   └── config_setting.py
├── dataflow_connectors/
│   └── fastapi_connector.py
├── dataflows/
│   ├── __init__.py
│   ├── add_qdrant_service.py
│   ├── commit_summary_service.py
│   ├── gateway_service.py
│   ├── github_commit_processing.py
│   └── pdfProcessing.py
├── debezium-setup/
│   ├── connector-config.json
│   └── init-connector-config.sh
├── kui/
│   └── config.yml
├── logging_config/
│   └── __init__.py
├── models/
│   ├── __init__.py
│   ├── commit.py
│   ├── document.py
│   └── gateway.py
├── services/
│   ├── __init__.py
│   ├── github_service.py
│   ├── message_processing_service.py
│   ├── pdf_processing_service.py
│   └── vectordb_service.py
├── tests/
│   ├── __init__.py
│   ├── conftest.py
│   ├── intergration_tests/
│   │   ├── __init__.py
│   │   ├── test_debezium.py
│   │   ├── test_kafka_github.py
│   │   └── test_kafka_pdf.py
│   └── unit_tests/
│       ├── __init__.py
│       ├── test_fast_api_connector.py
│       ├── test_gateway_service.py
│       ├── test_github_service.py
│       ├── test_llm_service.py
│       ├── test_pdf_service.py
│       └── test_qdrant_services.py
├── utils/
│   ├── __init__.py
│   ├── dataflow_processing_utils.py
│   ├── get_qdrant.py
│   ├── kafka_setup.py
│   ├── kafka_utils.py
│   ├── langchain_callback_logger.py
│   └── model_utils.py
├── docker-compose.yml
├── init.sql
├── main.py
├── pytest.ini
├── README.md
└── requirements.txt
```
## Prerequisites

- Python
- Git
- Docker
## Installation

1. Clone the repository and switch to this branch (`feat-custom-fastapi-sink`):

   ```shell
   git clone https://github.com/jgwentworth92/GutenbergV2.git
   cd GutenbergV2
   git checkout -b feat-custom-fastapi-sink origin/feat-custom-fastapi-sink
   ```
2. Create and activate a virtual environment according to your operating system:

   - On Linux:

     ```shell
     python -m venv venv
     source venv/bin/activate
     ```

   - On Windows:

     ```shell
     python -m venv venv
     venv\Scripts\activate
     ```
3. Install the dependencies:

   ```shell
   pip install -r requirements.txt
   pre-commit install
   ```
4. Set up your environment variables: create a `.env` file in the root directory and add the required environment variables:

   ```ini
   GITHUB_TOKEN=[your_github_token]
   BROKERS="kafka_b:9094"
   INPUT_TOPIC=repos-topic
   OUTPUT_TOPIC=github-commits-out
   PROCESSED_TOPIC=addtovectordb
   CONSUMER_CONFIG={"bootstrap.servers": "kafka_b:9094","auto.offset.reset": "earliest","group.id": "consumer_group","enable.auto.commit": "True"}
   PRODUCER_CONFIG={"bootstrap.servers": "kafka_b:9094"}
   OPENAI_API_KEY=your_openai_api_key
   TEMPLATE="You are an assistant whose job is to create detailed descriptions of what the provided code files do. Please review the code below and explain its functionality in detail. Code: {text}"
   VECTORDB_TOPIC_NAME="QdrantOutput"
   POSTGRES_HOSTNAME=postgres
   POSTGRES_PORT=5432
   POSTGRES_USER=[user]
   POSTGRES_DB=myappdb
   POSTGRES_PASSWORD=[password]
   RESOURCE_TOPIC=resource_topic
   PDF_INPUT=pdfInput
   MODEL_PROVIDER="fake"
   LOCAL_LLM_URL=http://[your_ip_address]:1234/v1
   smtp_username=[username]
   smtp_password=[password]
   GITHUB_TOPIC=github_topic
   MAILTRAP_USERNAME=[your_mailtrap_username]
   MAILTRAP_PASSWORD=[your_mailtrap_password]
   ```
5. Add the Streamlit frontend to `docker-compose.yml`: add the following service to your `docker-compose.yml` file:

   ```yaml
   streamlit:
     image: jgcapworh92/gutenberg-streamlit-frontend:latest
     container_name: streamlit_frontend
     ports:
       - "8501:8501"
     environment:
       - API_URL=http://fastapi:8000
     depends_on:
       - fastapi
     restart: always
   ```
Note: If the recovery partitions were not created manually, the system will create them automatically when the project starts.
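The `CONSUMER_CONFIG` and `PRODUCER_CONFIG` values in the `.env` example above are JSON objects that deserialize directly into Kafka client configuration dicts. The snippet below is an illustrative sketch of that idea, not the project's actual loading code; it sets the variables inline so it runs standalone:

```python
import json
import os

# Normally these come from the .env file; set inline here so the sketch runs standalone.
os.environ["CONSUMER_CONFIG"] = (
    '{"bootstrap.servers": "kafka_b:9094","auto.offset.reset": "earliest",'
    '"group.id": "consumer_group","enable.auto.commit": "True"}'
)
os.environ["PRODUCER_CONFIG"] = '{"bootstrap.servers": "kafka_b:9094"}'

# The JSON strings parse straight into the dicts a Kafka client constructor expects.
consumer_config = json.loads(os.environ["CONSUMER_CONFIG"])
producer_config = json.loads(os.environ["PRODUCER_CONFIG"])

print(consumer_config["group.id"])           # consumer_group
print(producer_config["bootstrap.servers"])  # kafka_b:9094
```

Because the values are parsed as JSON, any additional Kafka client option can be added to the `.env` entry without code changes.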
### Using OpenAI

- Set up the OpenAI API key: create an account on OpenAI and get an API key. Add the key to the `.env` file.
- Set the model provider to OpenAI: set the `MODEL_PROVIDER` environment variable to `openai` in the `.env` file.
### Using a Local Model

The system is known to work with LMStudio, but it should work with any OpenAI-API-compatible server.

- Set up the local model: install LMStudio (or another backend) and start its server. Recommended model: `lmstudio-community/Mistral-7B-Instruct-v0.3-GGUF`.
- Set the model provider to Local: set the `MODEL_PROVIDER` environment variable to `lmstudio` in the `.env` file.
- Set the local model URL: set the `LOCAL_LLM_URL` environment variable to the URL of the local model server in the `.env` file. For LMStudio, the URL is `http://[your_ip_address]:1234/v1`. Be sure to use your machine's actual IP address, as Docker containers cannot reach `localhost`.
### Using the Fake Model

The fake model lets the system bypass a real API entirely. It is a straightforward and fast option, which makes it particularly useful for testing.

- Set the model provider to Fake: set the `MODEL_PROVIDER` environment variable to `fake` in the `.env` file.
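Taken together, the three provider options amount to a simple dispatch on `MODEL_PROVIDER`. The sketch below is a hypothetical simplification of that logic; the function name and the returned settings dicts are invented for illustration (the real selection lives in the project's model utilities and wires into LangChain chat models):

```python
import os
from typing import Optional


def resolve_model_settings(provider: Optional[str] = None) -> dict:
    """Map MODEL_PROVIDER to illustrative chat-model settings (hypothetical helper)."""
    provider = provider or os.getenv("MODEL_PROVIDER", "fake")
    if provider == "openai":
        return {"backend": "openai", "api_key": os.getenv("OPENAI_API_KEY")}
    if provider == "lmstudio":
        # Local OpenAI-compatible server; Docker containers need a real IP, not localhost.
        return {"backend": "lmstudio", "base_url": os.getenv("LOCAL_LLM_URL")}
    if provider == "fake":
        # No network calls at all -- ideal for tests.
        return {"backend": "fake"}
    raise ValueError(f"Unknown MODEL_PROVIDER: {provider}")


print(resolve_model_settings("fake"))  # {'backend': 'fake'}
```

Injecting the provider this way keeps the dataflows identical regardless of which model actually answers the prompts.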
## Usage

To start all services, navigate to the root directory of the project, where the `docker-compose.yml` file is located, and run:

```shell
docker-compose up --build
```

This will build and start all the required services, including running the Alembic migration scripts automatically.
- Kafka UI: access at http://localhost:8080/
- Gutenberg Ingestion API: access the FastAPI Swagger UI at http://localhost:8000/docs
  - This API provides routes for user and job CRUD operations.
  - Users can submit jobs to start event-driven microservices.
  - The FastAPI app adds entries to the PostgreSQL database.
  - Debezium watches the resource table and produces changes to a Kafka topic.
  - For more detailed information about the Gutenberg Ingestion API, please visit: https://github.com/jgwentworth92/Gutenberg-Ingestion-API
- Streamlit Frontend: access at http://localhost:8501
  - Use this frontend to submit and view documents in the system.
- Qdrant Web UI: access at http://localhost:6333/dashboard#/collections
  - Use this to view generated summaries.
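The Debezium step above emits change events in the standard Debezium envelope (`before`/`after`/`op`). The snippet below is a hedged sketch of unwrapping such an event to recover the newly inserted row; the column names inside `after` are invented for illustration and are not the project's real resource-table schema:

```python
import json
from typing import Optional

# A Debezium-style insert event ("op": "c") as it might appear on the Kafka topic.
# The columns inside "after" are illustrative only.
raw_event = json.dumps({
    "payload": {
        "before": None,
        "after": {"id": 1, "owner": "octocat", "repo_name": "Hello-World"},
        "op": "c",
    }
})


def extract_new_row(message: str) -> Optional[dict]:
    """Return the inserted/updated/snapshot row from a Debezium change event."""
    payload = json.loads(message)["payload"]
    return payload["after"] if payload["op"] in ("c", "u", "r") else None


row = extract_new_row(raw_event)
print(row["repo_name"])  # Hello-World
```

A downstream consumer only needs the `after` image for inserts, which is why the microservices can treat the topic as a stream of new jobs.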
1. Register a new user through the FastAPI or Streamlit interface.
2. You will receive a verification email in your Mailtrap.io inbox.
3. Log in to Mailtrap.io and manually verify the user by clicking the verification link in the test email.
4. After verification, you can log in to the system and submit jobs.
1. Open the Kafka Web UI at http://localhost:8080/
2. Select "Topics" from the left-hand menu. If the menu is hidden, click the hamburger icon at the top left.
3. Click on "repos-topic" in the list of topics.
4. Click the "Produce Message" button at the top right.
5. Enter the GitHub repo owner and repository name in the "Value" field, in this format:

   ```json
   { "owner": "octocat", "repo_name": "Hello-World" }
   ```

   Make sure it is a public repo, or a repo you currently have access to via the GitHub token in the `.env` file. Leave all other values at their defaults.
6. Click the "Produce Message" button at the bottom of the dialog to add the repo to the topic.

The system will automatically process the repo and generate summaries using the LLM, via the provider specified in the `.env` file.
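Since the stack also ships a Kafka REST Proxy, the same message can be produced programmatically. The sketch below builds the standard Confluent REST Proxy v2 request body; the proxy address and port (`localhost:8082`) are an assumption, so check your `docker-compose.yml` before uncommenting the POST:

```python
import json


def build_repo_record(owner: str, repo_name: str) -> dict:
    """Wrap a repo submission in the Confluent REST Proxy v2 produce envelope."""
    return {"records": [{"value": {"owner": owner, "repo_name": repo_name}}]}


body = build_repo_record("octocat", "Hello-World")
print(json.dumps(body))

# Assumed proxy address -- verify the exposed port in docker-compose.yml first:
# import requests
# requests.post(
#     "http://localhost:8082/topics/repos-topic",
#     headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
#     data=json.dumps(body),
# )
```

The `application/vnd.kafka.json.v2+json` content type tells the proxy to treat each record's `value` as JSON, matching what the dataflows expect on `repos-topic`.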
## Configuration

The application configuration is managed using Pydantic settings. Modify the `config/config_setting.py` file to update the configuration settings.
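As a rough illustration of the settings pattern: each field reads an environment variable and falls back to a default. The project's real `config/config_setting.py` uses Pydantic; this stdlib stand-in only sketches the idea, and the field names are borrowed from the `.env` example above:

```python
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    """Stdlib stand-in for a Pydantic settings class (illustrative only)."""
    brokers: str = field(default_factory=lambda: os.getenv("BROKERS", "kafka_b:9094"))
    input_topic: str = field(default_factory=lambda: os.getenv("INPUT_TOPIC", "repos-topic"))
    model_provider: str = field(default_factory=lambda: os.getenv("MODEL_PROVIDER", "fake"))


settings = Settings()
print(settings.brokers, settings.input_topic)
```

Pydantic adds what this sketch lacks: automatic `.env` loading, type coercion, and validation errors at startup instead of at first use.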
## Running the Dataflows Manually

The system runs the dataflows automatically. To run a dataflow manually, use the following command format, replacing `(filename)` with the actual filename of the dataflow script without the `.py` extension:

```shell
python -m bytewax.run -w3 dataflows.(filename)
```

For example, to run the GitHub commit processing dataflow:

```shell
python -m bytewax.run -w3 dataflows.github_commit_processing
```

To run the commit summary service dataflow:

```shell
python -m bytewax.run -w3 dataflows.commit_summary_service
```

And to run the add-to-Qdrant service dataflow:

```shell
python -m bytewax.run -w3 dataflows.add_qdrant_service
```

## Testing

To run the tests, use the pytest framework:

```shell
pytest .
```

The tests are located in the `tests/` directory and cover the GitHub service, message processing service, and dataflows.
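The fake model provider is what keeps these tests cheap: a unit test can exercise the summarization path with no API at all. A minimal sketch in pytest style follows; the `summarize` function and the canned response are invented for illustration and do not mirror the real tests in `tests/unit_tests/`:

```python
# Illustrative only -- the project's real tests live in tests/unit_tests/.

def summarize(text: str, llm) -> str:
    """Toy stand-in for the commit-summary step: delegate to the injected model."""
    return llm(f"Summarize: {text}")


def fake_llm(prompt: str) -> str:
    """Fake model: deterministic canned output, no network."""
    return "fake summary"


def test_summarize_with_fake_model():
    assert summarize("def add(a, b): return a + b", fake_llm) == "fake summary"


test_summarize_with_fake_model()
```

Injecting the model as a plain callable is the same dependency-inversion trick `MODEL_PROVIDER=fake` applies system-wide.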