kbol - Knowledge Base from Technical Books

Overview

A focused RAG (Retrieval-Augmented Generation) system for programming books: index your technical library, then query and explore it in natural language.

flowchart LR
    PDFs[Technical Books] --> Chunker[Text Chunker]
    Chunker --> Embedder[Ollama Embedder]
    Embedder --> VectorDB[(Vector DB)]
    Query[User Query] --> Semantic[Semantic Search]
    VectorDB --> Semantic
    Semantic --> LLM[Ollama LLM]
    LLM --> Response[Response]
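
The query half of the flow above (user query, semantic search, LLM) can be sketched in plain Python. This is an illustration under stated assumptions, not kbol's actual code: `cosine_similarity`, `top_k`, and `build_prompt` are hypothetical names, the three-number vectors stand in for real embeddings, and the prompt wording is invented for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (text, vector) pairs most similar to query_vec."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Join retrieved chunk texts into a context block ahead of the question."""
    context = "\n\n".join(text for text, _ in retrieved)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Hand-written toy vectors; real embeddings would come from the embedder.
chunks = [
    ("Monads sequence computations.", [1.0, 0.0, 0.0]),
    ("REST uses HTTP verbs.", [0.0, 1.0, 0.0]),
    ("Functors map over containers.", [0.9, 0.1, 0.0]),
]
best = top_k([1.0, 0.0, 0.0], chunks, k=2)
prompt = build_prompt("What is a monad?", best)
```

The prompt string is what would be handed to the LLM step; only the two closest chunks end up in it.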

System Architecture

Component Overview

graph TD
    CLI[Command Line Interface] --> Core[Core Engine]
    Core --> Indexer[Document Indexer]
    Core --> Search[Search Engine]
    Core --> Embedder[Embedding Service]
    Core --> LLM[LLM Service]
    
    Indexer --> Chunker[Text Chunker]
    Indexer --> DocTracker[Document Tracker]
    
    Search --> DB[(Vector Database)]
    
    subgraph Processing Pipeline
        Chunker --> Embedder
        Embedder --> DB
    end
    
    subgraph Query Pipeline
        Search --> LLM
        DB --> Search
    end
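
For the Embedding Service component, one plausible shape is a thin wrapper over Ollama's local HTTP API. The sketch below assumes kbol talks to Ollama over HTTP on the default port and uses the `/api/embeddings` endpoint; the model name `all-minilm` is a placeholder, since the README does not say which model is used, and `build_embedding_request` and `embed` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address
EMBED_MODEL = "all-minilm"  # assumption: the README does not name the model


def build_embedding_request(
    text: str, model: str = EMBED_MODEL, base_url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Build the POST request for one piece of text without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list[float]:
    """Send the request to a running Ollama server and return the vector."""
    with urllib.request.urlopen(build_embedding_request(text)) as resp:
        return json.loads(resp.read())["embedding"]


# Building the request needs no server; calling embed() does.
request = build_embedding_request("What is a monad?")
```

Keeping request construction separate from the network call makes the payload easy to inspect and test without Ollama running.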

Usage

Processing Books

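import asyncio
from pathlib import Path

from kbol.indexer import BookIndexer

async def main() -> None:
    indexer = BookIndexer()
    books_dir = Path("data/books")
    # await is only valid inside a coroutine, so wrap the call and run it with asyncio.run
    results = await indexer.process_books(books_dir)
    print(f"Processed {len(results)} chunks")

asyncio.run(main())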

Example Queries

The example queries below draw on the technical and theoretical material in the book collection:

Technical Architecture Queries

# Technical Architecture Patterns
poetry run kbol query "Compare event-driven, flow-based, and microservice architectural patterns"
poetry run kbol query "What are the key metrics for evaluating microservices architecture success?"
poetry run kbol query "How does Domain-Driven Design evolve when moving from monoliths to microservices?"

Functional Programming Deep Dives

# Functional Programming Concepts
poetry run kbol query "Explain monads using examples from Clojure and Scala"
poetry run kbol query "Compare approaches to immutability and state management in Python vs Clojure"
poetry run kbol query "Show patterns for combining functional and reactive programming"

Cross-Disciplinary Insights

# Cross-Domain Knowledge Synthesis
poetry run kbol query "How do behavioral science insights inform better API design?"
poetry run kbol query "What principles from critical thinking apply to software architecture?"
poetry run kbol query "How do concepts from systems thinking apply across microservices and macroeconomics?"

Practical Development Examples

# Practical Implementation Patterns
poetry run kbol query "Show me Clojure examples of map and reduce with real-world use cases"
poetry run kbol query "What are the best practices for testing event-driven microservices?"
poetry run kbol query "Compare Python and Clojure approaches to handling concurrency"

API and Integration Patterns

# API Design and Evolution
poetry run kbol query "What patterns emerge for managing API evolution in microservices?"
poetry run kbol query "How do micro-frontends impact API design and management?"
poetry run kbol query "Compare REST, GraphQL, and event-driven API patterns"

These queries demonstrate the system’s ability to synthesize knowledge across:

  • Software architecture and systems design
  • Functional and reactive programming paradigms
  • Cross-disciplinary insights from philosophy and economics
  • Practical development patterns and best practices
  • Modern API and integration approaches

In Emacs Org mode, C-c C-v t (org-babel-tangle) tangles these examples to scripts/example_queries.sh, producing a ready-to-use script of example queries.

Example Knowledge Base Queries

ML/AI Foundations

# Check ML Engineering fundamentals
time poetry run kbol query "What does the book 'Machine Learning Engineering with Python' cover about MLOps and production deployment?" | tee data/answers/foundations-mlp.md | head 

# Verify feature engineering coverage
poetry run kbol query "What practical examples are shown in 'Python Feature Engineering Cookbook'?" | tee data/answers/foundations-fe.md | head 
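
The `| tee <file> | head` pattern above saves the full answer and previews its first lines. A rough Python equivalent is sketched below, for cases where the queries are driven from a script. `kbol_query_command` and `run_and_save` are hypothetical helpers, not part of kbol, and the demo uses a stand-in command so it runs without kbol installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def kbol_query_command(question: str) -> list[str]:
    """Argument list for the CLI invocation used throughout this README."""
    return ["poetry", "run", "kbol", "query", question]


def run_and_save(cmd: list[str], out_path: Path, head_lines: int = 10) -> list[str]:
    """Run cmd, write its full stdout to out_path, return the first lines."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.stdout, encoding="utf-8")
    return result.stdout.splitlines()[:head_lines]


# Stand-in command for the demo; swap in kbol_query_command("...") for real use.
demo_cmd = [sys.executable, "-c", "print('first'); print('second'); print('third')"]
with tempfile.TemporaryDirectory() as tmp:
    answer_file = Path(tmp) / "answers" / "demo.md"
    preview = run_and_save(demo_cmd, answer_file, head_lines=2)
    saved = answer_file.read_text(encoding="utf-8")
```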

LLM and NLP Content

# Compare LLM books
poetry run kbol query "How do 'Building LLM Powered Applications' and the 'LLM Engineer's Handbook' differ in their coverage of LLM implementation?"
# Explore NLP progression
poetry run kbol query "How does 'Mastering NLP from Foundations to LLMs' structure the learning path from basic NLP to advanced LLMs?"

Specialized ML Techniques

# XGBoost applications
poetry run kbol query "What are the key concepts covered in 'XGBoost for Regression Predictive Modeling' regarding time series analysis?"
# Genetic algorithms
poetry run kbol query "What practical Python examples are provided in 'Hands-On Genetic Algorithms with Python'?"

Deep Learning and PyTorch

# PyTorch mastery
poetry run kbol query "What advanced PyTorch concepts are covered in 'Mastering PyTorch' versus basic implementations?"
# Compare implementations
poetry run kbol query "How does 'Machine Learning with PyTorch and Scikit-Learn' approach model development differently from 'Mastering PyTorch'?"

RAG and Vector Systems

# RAG implementation
poetry run kbol query "How does 'RAG-Driven Generative AI' approach vector databases and embedding strategies?"
# Architecture considerations
poetry run kbol query "What are the main architectural patterns discussed in 'RAG-Driven Generative AI' for building production RAG systems?"

Causal Inference

# Compare approaches
poetry run kbol query "Compare the approaches to causal inference between 'Causal Inference and Discovery in Python' and 'Causal Inference in R'"
# Implementation details
poetry run kbol query "What practical examples are provided in 'Causal Inference and Discovery in Python' for causal discovery?"

AI Security

# Security applications
poetry run kbol query "What cybersecurity use cases are covered in 'Artificial Intelligence for Cybersecurity'?"
# Implementation patterns
poetry run kbol query "What are the main security patterns and frameworks discussed in 'Artificial Intelligence for Cybersecurity'?"

Mathematical Foundations

# Math concepts
poetry run kbol query "What statistical concepts from '15 Math Concepts Every Data Scientist Should Know' are applied in 'Bayesian Analysis with Python'?"
# Bayesian applications
poetry run kbol query "How does 'Bayesian Analysis with Python' implement MCMC sampling in practice?"

Time Series Analysis

# Forecasting techniques
poetry run kbol query "How does 'Modern Time Series Forecasting with Python' handle different forecasting techniques?"
# Implementation patterns
poetry run kbol query "What are the main architectural patterns for time series prediction discussed in 'Modern Time Series Forecasting with Python'?"

Reinforcement Learning

# Implementation examples
poetry run kbol query "What practical implementations are covered in 'Deep Reinforcement Learning Hands-On'?"
# Advanced concepts
poetry run kbol query "How does 'Deep Reinforcement Learning Hands-On' approach advanced topics like multi-agent systems?"

Implementation Details

Vector Database Schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS book_chunks (
    id SERIAL PRIMARY KEY,
    book_title TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(384),
    page_number INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS book_chunks_embedding_idx ON book_chunks 
USING ivfflat (embedding vector_cosine_ops);
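
A nearest-neighbour lookup against this table might look like the sketch below. It only builds the SQL text and parameters, so no database is needed to run it. The `%(name)s` placeholders assume a psycopg-style driver, and `to_vector_literal` and `search_params` are hypothetical helpers; the README does not show kbol's actual query code.

```python
EMBEDDING_DIM = 384  # matches vector(384) in the schema


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats as a pgvector text literal such as '[0.1,0.2]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# <=> is pgvector's cosine-distance operator (smaller means more similar),
# which pairs with the vector_cosine_ops index defined above.
SEARCH_SQL = """
SELECT book_title, page_number, content
FROM book_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""


def search_params(query_embedding: list[float], limit: int = 5) -> dict:
    """Parameters to pass alongside SEARCH_SQL."""
    return {"query": to_vector_literal(query_embedding), "limit": limit}


params = search_params([0.0] * EMBEDDING_DIM, limit=3)
```

With a psycopg cursor this would be run as `cursor.execute(SEARCH_SQL, params)`. Checking the dimension up front gives a clearer error than letting the database reject a mismatched vector.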

Processing Pipeline

sequenceDiagram
    participant PDF as PDF Books
    participant Chunker as Text Chunker
    participant Embedder as Ollama Embedder
    participant DB as Vector DB
    
    PDF->>Chunker: Raw Text
    Chunker->>Chunker: Split into Chunks
    loop Each Chunk
        Chunker->>Embedder: Text Chunk
        Embedder->>Embedder: Generate Embedding
        Embedder->>DB: Store Chunk + Embedding
    end
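
The loop in the sequence diagram can be sketched as follows. The README does not say how kbol splits text, so fixed-size character windows with overlap are an assumption here. `chunk_text`, `index_book`, and `fake_embed` are hypothetical names, and a plain list stands in for the vector database.

```python
from typing import Callable


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def index_book(
    title: str,
    text: str,
    embed: Callable[[str], list[float]],
    store: list[dict],
    size: int = 200,
    overlap: int = 50,
) -> int:
    """Embed each chunk and append a row shaped like the book_chunks table."""
    count = 0
    for chunk in chunk_text(text, size, overlap):
        store.append(
            {"book_title": title, "content": chunk, "embedding": embed(chunk)}
        )
        count += 1
    return count


def fake_embed(chunk: str) -> list[float]:
    # Placeholder for the Ollama call: two numbers, not a real embedding.
    return [float(len(chunk)), float(chunk.count(" "))]


rows: list[dict] = []
stored = index_book("Demo Book", "abcdefghij", fake_embed, rows, size=4, overlap=2)
```

Passing the embedder in as a callable keeps the loop testable without a running model; the overlap keeps sentences that straddle a boundary visible in both neighbouring chunks.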

Quick Start

  1. Set up your environment:
    make setup
        
  2. Run the complete demo with a sample book:
    make demo
        
  3. Try some example queries:
    # Query about specific topics
    poetry run kbol query "Explain monads from the functional programming books"
    
    # Find code examples
    poetry run kbol query "Show me Clojure examples of map and reduce"
    
    # Compare concepts
    poetry run kbol query "Compare Python and Clojure approaches to immutability"
        

Development Commands

| Command | Description |
|---|---|
| make setup | Initial setup of development environment |
| make demo | Run complete demo pipeline |
| make load-books | Link books from your collection |
| make process-books | Process books into chunks with embeddings |
| make stats | Show statistics about processed books |
| make clean | Clean generated files and directories |

Vector Database Schema

The system stores chunks in a PostgreSQL database and uses the pgvector extension for vector similarity search:

CREATE TABLE book_chunks (
    id SERIAL PRIMARY KEY,
    book_title TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(384),
    page_number INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

License

MIT

Author

Jason Walsh (https://wal.sh)