diff --git a/docs/concepts/README.md b/docs/concepts/README.md new file mode 100644 index 000000000..3d2059d8e --- /dev/null +++ b/docs/concepts/README.md @@ -0,0 +1,47 @@ +# What Is This? + +You give it documents. It finds the ideas, connects them, and remembers what contradicts what. + +## The 30-Second Version + +Feed the system your documents - research papers, notes, articles, reports. It reads them and extracts the key ideas. Then it connects those ideas: this concept *supports* that one, this claim *contradicts* that one, this cause *leads to* that effect. + +Unlike a search engine that just finds keywords, this system understands meaning. Ask "what causes inflation?" and it finds concepts related to inflation's causes - even if those exact words don't appear in your documents. + +Unlike a chatbot that makes things up, every idea traces back to its source. You can always ask "where did this come from?" and get a real answer. + +## The Real Point + +This isn't just a search tool for humans. It's infrastructure for AI that can reason about what it knows. + +Most AI "memory" is just similarity search - find things that look like what you asked for. This system tracks: + +- **Grounding**: How well-supported is this idea? One source or twenty? +- **Contradiction**: Do sources disagree? Which ones? +- **Provenance**: Where exactly did this idea come from? + +That's the foundation for AI that doesn't just retrieve information but *reasons about how reliable it is*. + +Current state: AI assistants can query the system via standard protocols (MCP). +Future state: The knowledge graph becomes part of how AI thinks, not just something it queries. + +## What Can You Do With It? 
+ +**As a human:** +- Search your documents by meaning, not just keywords +- See how ideas connect across different sources +- Find where your sources contradict each other +- Trace any claim back to its origin + +**As an AI agent:** +- Query persistent memory that survives across sessions +- Get grounded answers with confidence levels +- Reason about contradictions and uncertainty +- Build knowledge incrementally over time + +## Next Steps + +- [How It Works](how-it-works.md) - The conceptual model (still no code) +- [Glossary](glossary.md) - Terms explained in plain language +- [Using the System](../using/README.md) - Getting started as a user +- [Operating the System](../operating/README.md) - Deploying and maintaining diff --git a/docs/concepts/glossary.md b/docs/concepts/glossary.md new file mode 100644 index 000000000..18fd99ef4 --- /dev/null +++ b/docs/concepts/glossary.md @@ -0,0 +1,172 @@ +# Glossary + +Terms used in this system, explained in plain language. + +--- + +## Concept + +An idea extracted from a document. Not a keyword - a meaningful unit of thought. + +Examples: +- "Climate change increases extreme weather events" +- "The mitochondria is the powerhouse of the cell" +- "Napoleon was defeated at Waterloo in 1815" + +Concepts can be claims, definitions, events, entities, or other types. Each concept has a grounding score indicating how well-supported it is. + +--- + +## Relationship + +A connection between two concepts. The system discovers how ideas relate to each other. + +Common relationship types: +- **Supports**: One concept provides evidence for another +- **Contradicts**: Two concepts are in tension or conflict +- **Implies**: If one is true, the other follows +- **Causes**: One concept leads to another +- **Is Part Of**: One concept belongs to a larger whole +- **Is Example Of**: One concept illustrates another + +--- + +## Grounding + +A measure of how well-supported a concept is. High grounding means many sources confirm the idea. 
Low grounding means few sources mention it. + +Grounding considers: +- Number of sources mentioning the concept +- Whether sources agree or disagree +- Strength of the evidence in each source + +A grounding score ranges from -1.0 (strongly contradicted) to +1.0 (strongly supported). Near zero means mixed or insufficient evidence. + +--- + +## Source + +A chunk of original text from a document. Sources are the evidence - they're what concepts are extracted from. + +Each source preserves: +- The actual text +- Which document it came from +- Location information (for highlighting and reference) + +When you want to verify a concept, you trace it back to its sources. + +--- + +## Evidence + +The link between a concept and a source. Evidence shows *which specific text* led to *which concept*. + +Multiple sources can provide evidence for the same concept. When they do, the concept's grounding increases. + +--- + +## Provenance + +The chain of origin for any piece of knowledge. Provenance answers "where did this come from?" + +For a concept, provenance traces: +Document → Chunk → Extraction → Concept + +This matters because claims without provenance can't be verified. + +--- + +## Ontology + +A collection of related knowledge. Think of it as a named knowledge base. + +You might create separate ontologies for: +- "Research Papers" +- "Company Documentation" +- "Meeting Notes" + +Ontologies can be queried separately or together. They help organize knowledge into meaningful collections. + +--- + +## Epistemic Status + +The reliability classification of knowledge. Describes whether something is well-established, contested, or uncertain. 
+ +Possible statuses: +- **Affirmative**: Well-supported, high confidence +- **Contested**: Sources disagree +- **Contradictory**: Strong evidence against +- **Insufficient Data**: Not enough sources to judge +- **Historical**: Considered accurate for its time period + +--- + +## Semantic Search + +Finding concepts by meaning, not just matching keywords. + +Search for "economic downturn" and find concepts about recessions, market crashes, and financial crises - even if none use those exact words. + +This works because concepts are compared by what they mean, not just what words they contain. + +--- + +## Contradiction + +When sources disagree. The system tracks contradictions rather than hiding them. + +Example: One paper says "coffee prevents heart disease" while another says "coffee increases heart disease risk." Both concepts are stored with their sources, and the contradiction is noted. + +This lets you (or an AI) reason about disagreements rather than pretending they don't exist. + +--- + +## Ingestion + +The process of adding documents to the system. During ingestion: +1. Documents are stored +2. Text is split into chunks +3. Concepts are extracted from each chunk +4. Relationships are discovered +5. Grounding is calculated + +--- + +## MCP (Model Context Protocol) + +A standard way for AI assistants to use external tools. This system provides MCP tools so AI agents like Claude can: +- Search concepts +- Explore relationships +- Query grounding +- Ingest new documents + +This is how AI assistants gain persistent memory. + +--- + +## Chunk + +A portion of a document, roughly page-sized. Documents are split into chunks for processing. + +Chunks preserve context - they overlap slightly so ideas that span a page break aren't lost. + +--- + +## Instance + +A specific occurrence of a concept in a source. If the same concept appears in three documents, there are three instances but one concept. + +Instances are the individual sightings. 
The concept is the aggregated understanding. + +--- + +## Diversity Score + +A measure of how broadly connected a concept is. High diversity means the concept connects to many different topics. Low diversity means it's narrowly focused. + +Useful for finding concepts that bridge different domains. + +--- + +Next: [Using the System](../using/README.md) - Getting started as a user diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md new file mode 100644 index 000000000..ef9d56f81 --- /dev/null +++ b/docs/concepts/how-it-works.md @@ -0,0 +1,132 @@ +# How It Works + +A conceptual overview. No code, no implementation details - just the model. + +## The Flow + +``` +Documents → Extraction → Connection → Grounding +``` + +### 1. Documents Go In + +You provide documents: PDFs, text files, markdown, web pages. The system stores the original text so you can always go back to the source. + +Documents are split into manageable chunks - roughly page-sized pieces that can be processed individually while preserving context. + +### 2. Ideas Come Out + +Each chunk is analyzed to extract the key ideas. Not keywords - *concepts*. + +A concept is a meaningful unit of thought: "inflation reduces purchasing power" or "sleep deprivation impairs memory" or "the French Revolution began in 1789." + +The extraction finds: +- What the concept is (the idea itself) +- What type it is (claim, definition, event, entity, etc.) +- How it relates to other concepts in the same chunk + +### 3. Connections Form + +Concepts don't exist in isolation. The system discovers relationships: + +| Relationship | Meaning | +|--------------|---------| +| **Supports** | This concept provides evidence for that one | +| **Contradicts** | These concepts are in tension | +| **Implies** | If this is true, that follows | +| **Causes** | This leads to that | +| **Part of** | This belongs to a larger whole | + +When a new concept matches one that already exists, they're merged. 
The connection grows stronger. When they conflict, both views are preserved with their sources. + +### 4. Grounding Accumulates + +As more documents come in, concepts gain *grounding* - a measure of how well-supported they are. + +- A concept mentioned in one source has low grounding +- The same concept confirmed across many sources has high grounding +- A concept that some sources support and others contradict has mixed grounding + +Grounding isn't just a count. It considers: +- How many sources mention the concept +- Whether sources agree or disagree +- The strength of the supporting evidence + +## What Gets Remembered + +The system maintains five types of information: + +### Concepts +The ideas themselves. Each concept has: +- A name or description +- A type (claim, entity, event, etc.) +- Grounding score (how well-supported) + +### Relationships +How concepts connect. Each relationship has: +- Source concept and target concept +- Type (supports, contradicts, implies, etc.) +- Evidence for why this connection exists + +### Sources +The original text chunks. Each source has: +- The actual text +- Which document it came from +- Where in the document (for highlighting) + +### Evidence +The link between concepts and sources. Shows exactly which text led to which concept. + +### Ontologies +Collections of related knowledge. You might have one ontology for "climate research" and another for "company policies." They can be queried separately or together. + +## How Queries Work + +When you search, you're not matching keywords. You're finding concepts similar in *meaning* to what you're looking for. + +Ask about "economic downturn" and you'll find concepts about recessions, market crashes, and financial crises - even if none of them use the exact phrase "economic downturn." 
+ +Results include: +- The matching concepts +- Their grounding scores (how reliable) +- The sources they came from (where to verify) +- Related concepts (what else connects) + +## How Contradiction Works + +Traditional databases assume consistency - if two things conflict, one is wrong. This system assumes **reality is messy**. + +When sources disagree, the system: +1. Keeps both viewpoints +2. Records which sources support which view +3. Notes that a contradiction exists +4. Lets you (or an AI) reason about the disagreement + +This is crucial for: +- Research where experts disagree +- Historical documents with conflicting accounts +- Evolving knowledge where old information conflicts with new + +## The Epistemic Layer + +*Epistemic* means "relating to knowledge." This system has an epistemic layer that most databases lack. + +It doesn't just store *what* is claimed. It tracks: +- **Confidence**: How well-supported is this claim? +- **Controversy**: Do sources agree or disagree? +- **Provenance**: Where did this claim originate? +- **Freshness**: When was this last confirmed? + +This matters because knowledge isn't certain. An AI using this system can say "this is well-established" vs "this is contested" vs "this comes from a single source and should be verified." + +## What This Enables + +For humans: Search that understands meaning. Sources that trace back. Contradictions made visible. + +For AI agents: Memory that persists. Confidence that's grounded. Uncertainty that's explicit. + +For both: Knowledge that accumulates over time without losing track of where it came from. + +--- + +Next: [Glossary](glossary.md) - Terms explained in plain language diff --git a/docs/operating/README.md b/docs/operating/README.md new file mode 100644 index 000000000..4d17ebc36 --- /dev/null +++ b/docs/operating/README.md @@ -0,0 +1,97 @@ +# Operating the Knowledge Graph System + +Guides for deploying, configuring, and maintaining the platform. 
+ +## Choose Your Path + +| I want to... | Go to | +|--------------|-------| +| Try it locally in 5 minutes | [Quick Start](quick-start.md) | +| Deploy for real use | [Production Deployment](production.md) | +| Understand all the settings | [Configuration Reference](configuration.md) | +| Upgrade to a new version | [Upgrading](upgrading.md) | +| Back up my data | [Backup & Restore](backup-restore.md) | +| Fix something broken | [Troubleshooting](troubleshooting.md) | + +## Architecture Overview + +The system runs as Docker containers: + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Docker Containers │ +├──────────────┬──────────────┬──────────────┬────────────────┤ +│ postgres │ garage │ api │ web │ +│ (database) │ (storage) │ (backend) │ (frontend) │ +└──────────────┴──────────────┴──────────────┴────────────────┘ +``` + +- **postgres**: Apache AGE (PostgreSQL with graph extensions) - stores concepts, relationships, metadata +- **garage**: S3-compatible object storage - stores original documents +- **api**: Python FastAPI server - handles extraction, queries, authentication +- **web**: React frontend - visual exploration interface + +An **operator** container manages setup, migrations, and maintenance tasks. + +## Deployment Modes + +### Interactive (Default) +Run `./operator.sh init` and follow the prompts. Good for first-time setup on a single machine. + +### Headless +Run `./operator.sh init --headless` with command-line flags. Good for: +- Automated deployments +- CI/CD pipelines +- Multi-machine setups +- Scripted configuration + +### Development +Run `./operator.sh init` with dev mode enabled. Adds: +- Hot reload for code changes +- Simple default passwords +- Local image builds + +### Production +Deploy with GHCR images, HTTPS, and proper secrets. Covered in [Production Deployment](production.md). 
+ +## What You'll Need + +**Required:** +- Docker and Docker Compose +- 8GB RAM minimum (16GB+ recommended for GPU acceleration) +- 20GB disk space for system, more for documents + +**Optional:** +- NVIDIA GPU for faster extraction +- Domain name for HTTPS access +- AI provider API key (OpenAI or Anthropic) + +## Quick Commands + +```bash +# First-time setup +./operator.sh init + +# Daily start +./operator.sh start + +# Stop everything +./operator.sh stop + +# Check status +./operator.sh status + +# View logs +./operator.sh logs api # API logs +./operator.sh logs # All logs + +# Maintenance +./operator.sh upgrade # Pull updates and migrate +./operator.sh backup # Backup database +./operator.sh shell # Enter operator shell for admin tasks +``` + +## Next Steps + +- [Quick Start](quick-start.md) - Get running in 5 minutes +- [Production Deployment](production.md) - Full deployment guide diff --git a/docs/operating/backup-restore.md b/docs/operating/backup-restore.md new file mode 100644 index 000000000..3b415e46e --- /dev/null +++ b/docs/operating/backup-restore.md @@ -0,0 +1,147 @@ +# Backup & Restore + +Protecting your knowledge graph data. + +## What Gets Backed Up + +| Component | Contains | Backup Method | +|-----------|----------|---------------| +| PostgreSQL | Concepts, relationships, users, config | `pg_dump` | +| Garage | Original documents | S3-compatible tools | + +## Quick Backup + +```bash +./operator.sh backup +``` + +Creates a timestamped SQL dump in the backups directory. 
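Because the timestamp in each dump's filename is zero-padded, lexicographic order matches chronological order, which makes "pick the newest backup" trivial in a restore script. A minimal sketch (the demo directory and filenames below are fabricated to match the naming pattern):

```bash
# Fabricated demo files following the knowledge_graph_<timestamp>.sql pattern
mkdir -p /tmp/demo-backups
touch /tmp/demo-backups/knowledge_graph_2026-01-17_120000.sql \
      /tmp/demo-backups/knowledge_graph_2026-01-18_120000.sql

# Zero-padded timestamps sort lexicographically, which is also chronologically
latest=$(ls -1 /tmp/demo-backups/knowledge_graph_*.sql | sort | tail -n 1)
echo "latest backup: ${latest}"
```

The same one-liner works against the real `./backups/` directory once you have more than one dump.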
+ +## Backup Locations + +Default backup location: `./backups/` + +```bash +ls -la backups/ +# knowledge_graph_2026-01-18_120000.sql +``` + +## Manual Database Backup + +```bash +# Backup to file +docker exec kg-postgres pg_dump -U admin -d knowledge_graph > backup.sql + +# Compressed +docker exec kg-postgres pg_dump -U admin -d knowledge_graph | gzip > backup.sql.gz +``` + +## Restore + +```bash +./operator.sh restore /path/to/backup.sql +``` + +Or manually: + +```bash +# Drop and recreate database +docker exec -i kg-postgres psql -U admin -c "DROP DATABASE IF EXISTS knowledge_graph;" +docker exec -i kg-postgres psql -U admin -c "CREATE DATABASE knowledge_graph;" + +# Restore +docker exec -i kg-postgres psql -U admin -d knowledge_graph < backup.sql +``` + +## Garage (Document Storage) Backup + +Garage uses S3-compatible storage. Back up using standard S3 tools: + +```bash +# Using AWS CLI (configured for Garage) +aws --endpoint-url http://localhost:3900 s3 sync s3://kg-storage ./garage-backup/ + +# Using rclone +rclone sync garage:kg-storage ./garage-backup/ +``` + +Or copy the Garage data directory: +```bash +cp -r /srv/docker/data/knowledge-graph/garage ./garage-backup/ +``` + +## Automated Backups + +Set up a cron job for regular backups: + +```bash +# Edit crontab +crontab -e + +# Add daily backup at 2 AM +0 2 * * * cd /path/to/knowledge-graph-system && ./operator.sh backup +``` + +## Backup Retention + +Clean up old backups periodically: + +```bash +# Keep last 7 days +find ./backups -name "*.sql" -mtime +7 -delete +``` + +## Disaster Recovery + +Full recovery procedure: + +1. **Fresh installation:** + ```bash + git clone https://github.com/aaronsb/knowledge-graph-system.git + cd knowledge-graph-system + ./operator.sh init --headless ... + ``` + +2. **Stop services:** + ```bash + ./operator.sh stop + ``` + +3. **Restore database:** + ```bash + ./operator.sh restore /path/to/backup.sql + ``` + +4. 
**Restore Garage data:** + ```bash + cp -r /path/to/garage-backup/* /srv/docker/data/knowledge-graph/garage/ + ``` + +5. **Start services:** + ```bash + ./operator.sh start + ``` + +6. **Verify:** + ```bash + ./operator.sh status + kg health + ``` + +## Testing Backups + +Regularly verify backups work: + +```bash +# Create test environment +docker run -d --name backup-test -e POSTGRES_PASSWORD=test postgres:16 + +# Restore to test +docker exec -i backup-test psql -U postgres < backup.sql + +# Verify +docker exec backup-test psql -U postgres -d knowledge_graph -c "SELECT COUNT(*) FROM concepts;" + +# Cleanup +docker rm -f backup-test +``` diff --git a/docs/operating/configuration.md b/docs/operating/configuration.md new file mode 100644 index 000000000..cbc6d64d2 --- /dev/null +++ b/docs/operating/configuration.md @@ -0,0 +1,202 @@ +# Configuration Reference + +All configuration options for the knowledge graph system. + +## Configuration Files + +| File | Purpose | +|------|---------| +| `.env` | Environment variables (secrets, database, AI provider) | +| `.operator.conf` | Operator settings (container names, compose files) | +| `docker/nginx.prod.conf` | Nginx configuration (for HTTPS) | + +## Environment Variables (.env) + +### Core Secrets + +Generated during initialization. 
**Never edit manually.** + +| Variable | Purpose | +|----------|---------| +| `ENCRYPTION_KEY` | Fernet key for encrypting API keys at rest | +| `OAUTH_SIGNING_KEY` | Signs JWT access tokens | +| `INTERNAL_KEY_SERVICE_SECRET` | Service-to-service authentication | + +### Database + +| Variable | Default | Description | +|----------|---------|-------------| +| `POSTGRES_HOST` | `localhost` | Database host (use `postgres` in containers) | +| `POSTGRES_PORT` | `5432` | Database port | +| `POSTGRES_DB` | `knowledge_graph` | Database name | +| `POSTGRES_USER` | `admin` | Database user | +| `POSTGRES_PASSWORD` | (generated) | Database password | + +### Web Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `WEB_HOSTNAME` | `localhost:3000` | Public hostname for web access | + +Used for: +- OAuth redirect URIs (`https://{WEB_HOSTNAME}/callback`) +- API URL in frontend (`https://{WEB_HOSTNAME}/api`) +- OAuth client registration + +### AI Provider + +These settings only apply if `DEVELOPMENT_MODE=true`. Otherwise, configuration is loaded from the database. 
+ +| Variable | Default | Description | +|----------|---------|-------------| +| `DEVELOPMENT_MODE` | `false` | Load config from .env (true) or database (false) | +| `AI_PROVIDER` | `openai` | `openai`, `anthropic`, or `mock` | +| `OPENAI_API_KEY` | - | OpenAI API key | +| `ANTHROPIC_API_KEY` | - | Anthropic API key | + +Model configuration: + +| Variable | Default | Description | +|----------|---------|-------------| +| `OPENAI_EXTRACTION_MODEL` | `gpt-4o` | Model for concept extraction | +| `OPENAI_EMBEDDING_MODEL` | `text-embedding-3-small` | Model for embeddings | +| `ANTHROPIC_EXTRACTION_MODEL` | `claude-sonnet-4-20250514` | Anthropic extraction model | + +### Object Storage (Garage) + +| Variable | Default | Description | +|----------|---------|-------------| +| `GARAGE_S3_ENDPOINT` | `http://garage:3900` | Garage S3 endpoint | +| `GARAGE_REGION` | `garage` | Garage region name | +| `GARAGE_BUCKET` | `kg-storage` | Default bucket name | +| `GARAGE_RPC_SECRET` | (generated) | Cluster coordination secret | + +### Job Scheduler + +| Variable | Default | Description | +|----------|---------|-------------| +| `JOB_CLEANUP_INTERVAL` | `3600` | Cleanup interval (seconds) | +| `JOB_APPROVAL_TIMEOUT` | `24` | Cancel unapproved jobs after (hours) | +| `JOB_COMPLETED_RETENTION` | `48` | Delete completed jobs after (hours) | +| `JOB_FAILED_RETENTION` | `168` | Delete failed jobs after (hours) | +| `MAX_CONCURRENT_JOBS` | `4` | Maximum parallel ingestion jobs | + +### OAuth Settings + +| Variable | Default | Description | +|----------|---------|-------------| +| `ACCESS_TOKEN_EXPIRE_MINUTES` | `60` | Token validity period | + +### AMD GPU (Optional) + +Only set if needed for AMD GPU detection: + +| Variable | Description | +|----------|-------------| +| `HSA_OVERRIDE_GFX_VERSION` | Override GPU architecture (e.g., `10.3.0`) | +| `ROCR_VISIBLE_DEVICES` | Limit visible GPUs (e.g., `0`) | +| `ROCM_VERSION` | ROCm wheel version (`rocm60`, `rocm61`) | + +## Operator 
Configuration (.operator.conf) + +Created during initialization. Controls operator behavior. + +| Variable | Default | Description | +|----------|---------|-------------| +| `CONTAINER_PREFIX` | `knowledge-graph` | Container name prefix | +| `CONTAINER_SUFFIX` | - | Container name suffix (e.g., `-dev`) | +| `COMPOSE_FILE` | `docker-compose.yml` | Base compose file | +| `IMAGE_SOURCE` | `local` | `local` or `ghcr` | +| `GPU_MODE` | `auto` | GPU mode | + +### Container Naming + +Container names follow these patterns: + +| Service | Development | Production | +|---------|-------------|------------| +| PostgreSQL | `knowledge-graph-postgres` | `kg-postgres` | +| Garage | `knowledge-graph-garage` | `kg-garage` | +| API | `kg-api-dev` | `kg-api` | +| Web | `kg-web-dev` | `kg-web` | +| Operator | `kg-operator` | `kg-operator` | + +The `--container-prefix=kg` flag gives production naming. + +## Compose File Selection + +The operator automatically selects compose files based on configuration: + +| Configuration | Compose Files Used | +|---------------|-------------------| +| Default | `docker-compose.yml` | +| GHCR images | `docker-compose.yml` + `docker-compose.ghcr.yml` | +| Production | `docker-compose.prod.yml` | +| NVIDIA GPU | + `docker-compose.gpu-nvidia.yml` | +| AMD GPU | + `docker-compose.gpu-amd.yml` | +| Dev mode | + `docker-compose.dev.yml` | + +## Runtime Configuration + +Some settings are configured at runtime via the operator shell: + +```bash +./operator.sh shell +``` + +### AI Provider Configuration + +```bash +# Set extraction provider +configure.py ai-provider --provider anthropic --model claude-sonnet-4 + +# Store API key (encrypted in database) +configure.py api-key anthropic --key "sk-ant-..." 
+ +# View current configuration +configure.py show +``` + +### User Management + +```bash +# Create user +configure.py create-user --username alice --email alice@example.com + +# Reset password +configure.py reset-password --username admin + +# List users +configure.py list-users +``` + +## Nginx Configuration + +For HTTPS deployments, edit `docker/nginx.prod.conf`: + +```nginx +server { + listen 443 ssl http2; + server_name your-hostname.example.com; + + # SSL certificates + ssl_certificate /etc/nginx/certs/your-hostname.fullchain.cer; + ssl_certificate_key /etc/nginx/certs/your-hostname.key; + + # API proxy + location /api/ { + proxy_pass http://api:8000/; + # ... proxy settings + } + + # SPA routing + location / { + try_files $uri $uri/ /index.html; + } +} +``` + +## Next Steps + +- [Production Deployment](production.md) - Full deployment guide +- [Troubleshooting](troubleshooting.md) - Common issues diff --git a/docs/operating/production.md b/docs/operating/production.md new file mode 100644 index 000000000..324488494 --- /dev/null +++ b/docs/operating/production.md @@ -0,0 +1,270 @@ +# Production Deployment + +Full guide for deploying the knowledge graph system in production. 
+ +## Overview + +Production deployment differs from quick start: +- Pre-built container images from GitHub Container Registry (GHCR) +- Headless (non-interactive) initialization +- HTTPS with real certificates +- Proper hostname configuration for OAuth +- GPU acceleration configured explicitly + +## Prerequisites + +- Docker and Docker Compose +- A server with 16GB+ RAM (8GB minimum) +- NVIDIA GPU recommended for faster extraction +- A domain name (for HTTPS) +- DNS control (for certificate issuance) + +## Headless Initialization + +For automated or scripted deployments, use headless mode: + +```bash +./operator.sh init --headless \ + --container-prefix=kg \ + --image-source=ghcr \ + --gpu=nvidia \ + --web-hostname=kg.example.com \ + --ai-provider=anthropic \ + --ai-model=claude-sonnet-4 \ + --ai-key="$ANTHROPIC_API_KEY" +``` + +### Required Parameters + +| Parameter | Description | +|-----------|-------------| +| `--headless` | Enable non-interactive mode | + +### Infrastructure Parameters + +| Parameter | Values | Default | Description | +|-----------|--------|---------|-------------| +| `--image-source` | `local`, `ghcr` | `local` | Where to get container images | +| `--gpu` | `auto`, `nvidia`, `amd`, `amd-host`, `mac`, `cpu` | `auto` | GPU acceleration mode | +| `--container-prefix` | `kg`, `knowledge-graph` | `knowledge-graph` | Container name prefix | +| `--compose-file` | path | `docker-compose.yml` | Base compose file | + +### Web Configuration + +| Parameter | Description | +|-----------|-------------| +| `--web-hostname` | Public hostname for web access (e.g., `kg.example.com`) | + +The hostname is used for: +- OAuth redirect URIs +- API URL in frontend configuration +- SSL certificate common name + +### AI Configuration + +| Parameter | Description | +|-----------|-------------| +| `--ai-provider` | `openai`, `anthropic`, or `openrouter` | +| `--ai-model` | Model name (e.g., `gpt-4o`, `claude-sonnet-4`) | +| `--ai-key` | API key for the provider | +| 
`--skip-ai-config` | Skip AI configuration entirely | + +### Other Options + +| Parameter | Description | +|-----------|-------------| +| `--password-mode` | `random` (secure) or `simple` (dev defaults) | +| `--container-mode` | `regular` or `dev` (hot reload) | +| `--skip-cli` | Skip CLI installation | + +## GPU Configuration + +### NVIDIA GPU + +```bash +./operator.sh init --headless --gpu=nvidia ... +``` + +Requires NVIDIA Container Toolkit installed on the host. + +### AMD GPU (ROCm) + +```bash +./operator.sh init --headless --gpu=amd ... +``` + +Uses ROCm PyTorch wheels inside the container. + +### AMD GPU (Host ROCm) + +```bash +./operator.sh init --headless --gpu=amd-host ... +``` + +Uses ROCm installed on the host system. + +### CPU Only + +```bash +./operator.sh init --headless --gpu=cpu ... +``` + +No GPU acceleration. Slower but works everywhere. + +## HTTPS Configuration + +### Using Let's Encrypt with DNS Validation + +1. **Install acme.sh on the host:** + ```bash + curl https://get.acme.sh | sh + ``` + +2. **Configure your DNS provider** (example: Porkbun): + ```bash + export PORKBUN_API_KEY="your-api-key" + export PORKBUN_SECRET_API_KEY="your-secret-key" + ``` + +3. **Issue the certificate:** + ```bash + ~/.acme.sh/acme.sh --issue \ + --dns dns_porkbun \ + -d kg.example.com + ``` + +4. **Install to a location the container can access:** + ```bash + mkdir -p /srv/docker/data/knowledge-graph/certs + ~/.acme.sh/acme.sh --install-cert -d kg.example.com \ + --key-file /srv/docker/data/knowledge-graph/certs/kg.example.com.key \ + --fullchain-file /srv/docker/data/knowledge-graph/certs/kg.example.com.fullchain.cer + ``` + +5. **Configure nginx** - edit `docker/nginx.prod.conf`: + ```nginx + server { + listen 443 ssl http2; + server_name kg.example.com; + + ssl_certificate /etc/nginx/certs/kg.example.com.fullchain.cer; + ssl_certificate_key /etc/nginx/certs/kg.example.com.key; + + # ... rest of config + } + ``` + +6. 
**Mount certificates in compose** - the `docker-compose.prod.yml` mounts: + ```yaml + volumes: + - /srv/docker/data/knowledge-graph/certs:/etc/nginx/certs:ro + ``` + +### Certificate Renewal + +acme.sh sets up automatic renewal via cron. After renewal, reload nginx: + +```bash +docker exec kg-web nginx -s reload +``` + +## Example: Full Production Deployment + +```bash +# On your production server +cd ~/knowledge-graph-system + +# Set environment variables +export ANTHROPIC_API_KEY="sk-ant-..." + +# Initialize with all production settings +./operator.sh init --headless \ + --container-prefix=kg \ + --image-source=ghcr \ + --gpu=nvidia \ + --web-hostname=kg.example.com \ + --ai-provider=anthropic \ + --ai-model=claude-sonnet-4 \ + --ai-key="$ANTHROPIC_API_KEY" + +# Verify everything is running +./operator.sh status + +# Check the web interface +curl -I https://kg.example.com +``` + +## Secrets and Security + +### Generated Secrets + +During initialization, these are generated and stored in `.env`: + +| Secret | Purpose | +|--------|---------| +| `ENCRYPTION_KEY` | Encrypts API keys at rest | +| `OAUTH_SIGNING_KEY` | Signs JWT tokens | +| `POSTGRES_PASSWORD` | Database password | +| `INTERNAL_KEY_SERVICE_SECRET` | Service-to-service auth | +| `GARAGE_RPC_SECRET` | Storage cluster coordination | + +**Never commit `.env` to version control.** + +### AI Provider Keys + +AI provider API keys (OpenAI, Anthropic) are stored encrypted in the database, not in `.env`. They're configured via: + +```bash +./operator.sh shell +configure.py api-key anthropic --key "sk-ant-..." +``` + +Or via the `--ai-key` flag during headless init. 
+ +## Data Locations + +Default data paths (can be customized in compose files): + +| Data | Location | +|------|----------| +| PostgreSQL database | `/srv/docker/data/knowledge-graph/postgres` | +| Garage object storage | `/srv/docker/data/knowledge-graph/garage` | +| Model cache | `/srv/docker/data/knowledge-graph/hf_cache` | +| SSL certificates | `/srv/docker/data/knowledge-graph/certs` | + +## Upgrading + +See [Upgrading](upgrading.md) for version upgrade procedures. + +```bash +# Quick upgrade +./operator.sh upgrade + +# See what would change first +./operator.sh upgrade --dry-run +``` + +## Monitoring + +### Container Health + +```bash +./operator.sh status # Quick status +docker ps # Detailed container info +./operator.sh logs api # API logs +./operator.sh logs --follow # Tail all logs +``` + +### API Health Check + +```bash +curl http://localhost:8000/health +# Or via HTTPS: +curl https://kg.example.com/api/health +``` + +## Next Steps + +- [Configuration Reference](configuration.md) - All settings explained +- [Backup & Restore](backup-restore.md) - Protect your data +- [Troubleshooting](troubleshooting.md) - When things go wrong diff --git a/docs/operating/quick-start.md b/docs/operating/quick-start.md new file mode 100644 index 000000000..ac32781c2 --- /dev/null +++ b/docs/operating/quick-start.md @@ -0,0 +1,103 @@ +# Quick Start + +Get the knowledge graph running locally in 5 minutes. + +## Prerequisites + +- Docker and Docker Compose installed +- 8GB RAM available +- Git + +## Steps + +### 1. Clone the Repository + +```bash +git clone https://github.com/aaronsb/knowledge-graph-system.git +cd knowledge-graph-system +``` + +### 2. Initialize + +```bash +./operator.sh init +``` + +This starts an interactive wizard that: +- Generates secure secrets +- Detects your GPU (if any) +- Starts the containers +- Creates an admin user + +Follow the prompts. For defaults, just press Enter. + +### 3. 
Access the System + +Once initialization completes: + +- **Web interface**: http://localhost:3000 +- **API**: http://localhost:8000 +- **Login**: Use the admin credentials you set during init + +### 4. Verify It Works + +```bash +./operator.sh status +``` + +You should see all containers running and healthy. + +## Optional: Install the CLI + +The `kg` CLI provides command-line access to the knowledge graph: + +```bash +cd cli +npm install +npm run build +./install.sh +``` + +Then test it: + +```bash +kg health +kg search "test query" +``` + +## What's Next? + +**Try ingesting a document:** +```bash +kg ingest /path/to/document.pdf +``` + +**Or via the web interface:** +1. Go to http://localhost:3000 +2. Log in with your admin credentials +3. Navigate to Ingest +4. Upload a document + +**Learn more:** +- [Production Deployment](production.md) - For real use +- [Configuration](configuration.md) - All the settings +- [Using the System](../using/README.md) - How to use it + +## Troubleshooting + +**Containers won't start:** +```bash +./operator.sh logs # Check for errors +docker ps -a # See container status +``` + +**Port already in use:** +Edit `.env` and change the port mappings, or stop whatever's using ports 3000/8000. + +**Out of memory:** +The API container needs memory for ML models. Ensure 8GB+ available. + +**GPU not detected:** +Run `./operator.sh init` again and check GPU detection, or manually set `GPU_MODE=cpu` in `.operator.conf`. + +See [Troubleshooting](troubleshooting.md) for more. diff --git a/docs/operating/troubleshooting.md b/docs/operating/troubleshooting.md new file mode 100644 index 000000000..afa48d12c --- /dev/null +++ b/docs/operating/troubleshooting.md @@ -0,0 +1,341 @@ +# Troubleshooting + +Common issues and how to fix them. 
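
A note before the specific fixes below: many apparent failures right after startup are just slow initialization (model downloads, database migrations), so it often pays to poll for a while before debugging. A sketch of the retry pattern — the `probe` argument (for example, an HTTP GET against `http://localhost:8000/health`) is left to the caller:

```python
import time

def wait_until_healthy(probe, attempts=10, delay=2.0):
    """Call `probe` (a zero-argument function returning True once the
    service is up) until it succeeds or `attempts` runs out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

If the service still isn't healthy after a couple of minutes, move on to the sections below.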
+ +## Container Issues + +### Containers Won't Start + +**Check logs:** +```bash +./operator.sh logs +docker logs kg-api +docker logs kg-postgres +``` + +**Check container status:** +```bash +docker ps -a +``` + +Look for containers in "Exited" state and check their logs. + +### Port Already in Use + +``` +Error: bind: address already in use +``` + +Something else is using port 3000 (web) or 8000 (api). + +**Find what's using the port:** +```bash +lsof -i :3000 +lsof -i :8000 +``` + +**Options:** +1. Stop the conflicting service +2. Change ports in `.env` or compose file + +### Out of Memory + +``` +Killed +``` +or +``` +Container exited with code 137 +``` + +The API container needs memory for ML models. + +**Check memory:** +```bash +docker stats +free -h +``` + +**Solutions:** +- Ensure 8GB+ RAM available +- Reduce `MAX_CONCURRENT_JOBS` in `.env` +- Use CPU mode if GPU memory is limited + +### Container Health Check Failing + +```bash +./operator.sh status +# Shows containers as "unhealthy" +``` + +**Check health status:** +```bash +docker inspect kg-api | grep -A 10 Health +``` + +**Common causes:** +- Database not ready yet (wait longer) +- API startup still in progress (models loading) +- Configuration error (check logs) + +## Database Issues + +### Database Connection Failed + +``` +Connection refused +``` + +**Check PostgreSQL is running:** +```bash +docker ps | grep postgres +docker logs kg-postgres +``` + +**Check network:** +```bash +docker network ls +docker network inspect kg-internal +``` + +### Migration Errors + +``` +Migration failed +``` + +**Check migration logs:** +```bash +./operator.sh logs operator +``` + +**Run migrations manually:** +```bash +./operator.sh shell +python -m api.database.migrate +``` + +### Database Corrupted + +If PostgreSQL won't start due to corruption: + +1. **Try recovery mode:** + ```bash + docker exec -it kg-postgres psql -U admin -d knowledge_graph + # If this works, backup immediately + ``` + +2. 
**Restore from backup:** + ```bash + ./operator.sh restore /path/to/backup.sql + ``` + +3. **Last resort - reinitialize:** + ```bash + ./operator.sh teardown # WARNING: Destroys data + ./operator.sh init + ``` + +## Authentication Issues + +### Can't Log In + +**Check OAuth client exists:** +```bash +docker exec kg-postgres psql -U admin -d knowledge_graph -c \ + "SELECT client_id, redirect_uris FROM kg_auth.oauth_clients;" +``` + +**Ensure redirect URI matches:** +The registered redirect URI must match your `WEB_HOSTNAME`. + +**Reset admin password:** +```bash +./operator.sh shell +configure.py reset-password --username admin +``` + +### 500 Error on Login + +Check API logs: +```bash +./operator.sh logs api +``` + +**Common causes:** +- OAuth client missing `scopes` column (fixed in recent versions) +- Database connection issue +- Secret key mismatch + +### Token Expired + +Tokens expire after `ACCESS_TOKEN_EXPIRE_MINUTES` (default 60). + +**Solution:** Log out and log in again, or increase the timeout in `.env`. + +## HTTPS/SSL Issues + +### Certificate Not Found + +``` +SSL_CTX_use_certificate_chain_file failed +``` + +**Check certificate paths:** +```bash +ls -la /srv/docker/data/knowledge-graph/certs/ +``` + +**Check nginx config matches:** +```bash +cat docker/nginx.prod.conf | grep ssl_certificate +``` + +### Certificate Expired + +**Renew manually:** +```bash +~/.acme.sh/acme.sh --renew -d kg.example.com +~/.acme.sh/acme.sh --install-cert -d kg.example.com \ + --key-file /srv/docker/data/knowledge-graph/certs/kg.example.com.key \ + --fullchain-file /srv/docker/data/knowledge-graph/certs/kg.example.com.fullchain.cer +docker exec kg-web nginx -s reload +``` + +### Mixed Content Warnings + +Browser blocks HTTP requests from HTTPS page. + +**Check frontend config:** +```bash +docker exec kg-web cat /usr/share/nginx/html/config.js +``` + +The `apiUrl` should use `https://` if your site uses HTTPS. 
+
+**Fix:** Set correct `WEB_HOSTNAME` and restart:
+```bash
+docker compose -f docker/docker-compose.prod.yml up -d web
+```
+
+## GPU Issues
+
+### GPU Not Detected
+
+**Check NVIDIA runtime:**
+```bash
+docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
+```
+
+**Check container GPU access:**
+```bash
+docker exec kg-api nvidia-smi
+```
+
+**Reinstall NVIDIA Container Toolkit:**
+```bash
+# Ubuntu/Debian
+sudo apt-get install -y nvidia-container-toolkit
+sudo systemctl restart docker
+```
+
+### CUDA Out of Memory
+
+```
+CUDA out of memory
+```
+
+**Reduce batch size** by setting an environment variable or reducing concurrent jobs.
+
+**Use CPU fallback:**
+Set `GPU_MODE=cpu` in `.operator.conf` and restart.
+
+## Ingestion Issues
+
+### Job Stuck in Pending
+
+Jobs require approval by default.
+
+**Check job status:**
+```bash
+kg job list
+```
+
+**Approve pending jobs:**
+```bash
+kg job approve
+```
+
+**Enable auto-approval** in the ingestion request:
+```bash
+kg ingest --auto-approve document.pdf
+```
+
+### Extraction Failing
+
+**Check API logs:**
+```bash
+./operator.sh logs api | grep -i error
+```
+
+**Common causes:**
+- AI provider API key invalid or expired
+- Rate limited by provider
+- Document format not supported
+
+### Large Document Timeout
+
+**Increase timeout:**
+Edit nginx configuration or API timeout settings.
+
+**Split document:**
+Break into smaller files before ingestion.
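
For the splitting approach, a rough sketch of a word-count cut (a real split should prefer paragraph or section boundaries so ideas aren't severed mid-thought):

```python
def split_document(path, max_words=1000):
    """Split a plain-text file into numbered part files of roughly
    `max_words` words each; returns the list of paths written."""
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    part_paths = []
    for i in range(0, len(words), max_words):
        part_path = f"{path}.part{i // max_words + 1}.txt"
        with open(part_path, "w", encoding="utf-8") as out:
            out.write(" ".join(words[i:i + max_words]))
        part_paths.append(part_path)
    return part_paths
```

Each part can then be ingested separately into the same ontology.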
+ +## CLI Issues + +### kg Command Not Found + +**Reinstall CLI:** +```bash +cd cli +npm run build +./install.sh +``` + +**Check PATH:** +```bash +echo $PATH +which kg +``` + +### CLI Authentication Failed + +**Re-authenticate:** +```bash +kg auth login +``` + +**Check API URL:** +```bash +kg config show +``` + +## Getting More Help + +### Collect Diagnostic Info + +```bash +./operator.sh status +./operator.sh logs > logs.txt 2>&1 +docker ps -a +docker version +cat .env | grep -v KEY | grep -v SECRET | grep -v PASSWORD +``` + +### Report an Issue + +Open an issue at https://github.com/aaronsb/knowledge-graph-system/issues with: +- What you were trying to do +- What happened instead +- Relevant logs (sanitize secrets!) +- System info (OS, Docker version, GPU) diff --git a/docs/operating/upgrading.md b/docs/operating/upgrading.md new file mode 100644 index 000000000..c3145bb2f --- /dev/null +++ b/docs/operating/upgrading.md @@ -0,0 +1,107 @@ +# Upgrading + +How to upgrade the knowledge graph system to new versions. + +## Quick Upgrade + +```bash +./operator.sh upgrade +``` + +This command: +1. Pulls latest images (if using GHCR) +2. Backs up the database +3. Stops application containers +4. Runs database migrations +5. Starts containers with new images +6. Verifies health + +## Upgrade Options + +```bash +# See what would change without doing it +./operator.sh upgrade --dry-run + +# Skip backup (faster, but risky) +./operator.sh upgrade --no-backup + +# Upgrade to specific version +./operator.sh upgrade --version 0.5.0 +``` + +## Before Upgrading + +1. **Check the changelog** for breaking changes +2. **Backup your data:** + ```bash + ./operator.sh backup + ``` +3. **Note your current version:** + ```bash + cat VERSION + ``` + +## After Upgrading + +1. **Verify health:** + ```bash + ./operator.sh status + ``` + +2. **Check logs for errors:** + ```bash + ./operator.sh logs api + ``` + +3. 
**Test core functionality:** + ```bash + kg health + kg search "test" + ``` + +## Rolling Back + +If something goes wrong: + +1. **Stop containers:** + ```bash + ./operator.sh stop + ``` + +2. **Restore database:** + ```bash + ./operator.sh restore /path/to/backup.sql + ``` + +3. **Use previous image version:** + Edit compose file to pin previous version tag, then: + ```bash + ./operator.sh start + ``` + +## Version Pinning + +By default, GHCR deployments use `:latest`. To pin a specific version: + +Edit `docker-compose.prod.yml`: +```yaml +api: + image: ghcr.io/aaronsb/knowledge-graph-system/kg-api:0.5.0 +``` + +## Migration Notes + +Database migrations run automatically during upgrade. If a migration fails: + +1. Check logs: + ```bash + ./operator.sh logs operator + ``` + +2. Run manually: + ```bash + ./operator.sh shell + python -m api.database.migrate + ``` + +3. If stuck, restore from backup and report the issue. diff --git a/docs/using/README.md b/docs/using/README.md new file mode 100644 index 000000000..da43f6ba1 --- /dev/null +++ b/docs/using/README.md @@ -0,0 +1,70 @@ +# Using the Knowledge Graph + +Guides for working with the knowledge graph system as a user. + +## Getting Started + +Once the system is [running](../operating/quick-start.md): + +1. **Log in** to the web interface at http://localhost:3000 (or your configured hostname) +2. **Ingest documents** to build your knowledge base +3. **Search and explore** to discover connections +4. 
**Query via CLI or MCP** for programmatic access + +## Guides + +| Guide | Description | +|-------|-------------| +| [Ingesting Documents](ingesting.md) | Adding documents to the system | +| [Exploring Knowledge](exploring.md) | Finding and navigating concepts | +| [Querying](querying.md) | CLI, API, and MCP access | +| [Understanding Grounding](understanding-grounding.md) | Confidence and contradiction | + +## Quick Examples + +### Ingest a Document + +**Via CLI:** +```bash +kg ingest /path/to/document.pdf --ontology research +``` + +**Via Web:** +Navigate to Ingest → Upload file → Approve job + +### Search for Concepts + +**Via CLI:** +```bash +kg search "climate change effects" +``` + +**Via Web:** +Use the search bar in the top navigation + +### Explore Connections + +**Via CLI:** +```bash +kg search details +kg search related +kg search connect "concept A" "concept B" +``` + +**Via Web:** +Click any concept to see its relationships and sources + +## For AI Assistants + +If you're an AI agent using this system via MCP: + +- Use `search` to find concepts by meaning +- Use `concept` with `action: "details"` for full evidence +- Use `concept` with `action: "connect"` to find paths between ideas +- Check `grounding_strength` to assess reliability + +See [Querying](querying.md) for full MCP tool documentation. + +--- + +See [Concepts](../concepts/README.md) for the conceptual foundation behind the system. diff --git a/docs/using/exploring.md b/docs/using/exploring.md new file mode 100644 index 000000000..253c9aeea --- /dev/null +++ b/docs/using/exploring.md @@ -0,0 +1,252 @@ +# Exploring Knowledge + +How to find and navigate concepts in your knowledge graph. + +## Overview + +After ingesting documents, you have a graph of interconnected concepts. 
Exploring means: +- **Searching** - Finding concepts by meaning +- **Navigating** - Following relationships between concepts +- **Connecting** - Discovering paths between ideas +- **Verifying** - Tracing concepts back to sources + +## Semantic Search + +Search finds concepts by meaning, not just keywords. + +### CLI + +```bash +kg search "climate change effects" +``` + +Returns concepts semantically similar to your query, even if they use different words. + +**With options:** +```bash +# Limit results +kg search --limit 20 "economic policy" + +# Filter by ontology +kg search --ontology "research" "neural networks" + +# Show more detail +kg search --verbose "machine learning" +``` + +### Web Interface + +1. Use the **search bar** at the top +2. Results show concepts ranked by similarity +3. Click any concept to see details + +### Understanding Results + +Each result shows: +- **Concept name** - The extracted idea +- **Similarity score** - How close to your query (0-1) +- **Grounding** - How well-supported (-1 to +1) +- **Source count** - How many documents mention it + +## Viewing Concept Details + +### CLI + +```bash +kg search details +``` + +Shows: +- Full concept information +- All evidence (source text that led to this concept) +- Relationships to other concepts +- Grounding breakdown + +### Web Interface + +Click any concept to open its detail view: +- **Evidence panel** - Original text excerpts +- **Relationships panel** - Connected concepts +- **Sources panel** - Documents where it appears + +## Navigating Relationships + +Concepts connect to other concepts. 
Explore these connections: + +### Find Related Concepts + +```bash +kg search related +``` + +Returns concepts directly connected, grouped by relationship type: +- Supports +- Contradicts +- Implies +- Causes +- Part of + +### Filter by Relationship Type + +```bash +kg search related --type SUPPORTS +kg search related --type CONTRADICTS +``` + +### Explore Deeper + +```bash +# Go 2 hops out +kg search related --depth 2 + +# Go 3 hops +kg search related --depth 3 +``` + +More hops = more concepts, but further from the original. + +## Finding Connections + +Discover how two concepts relate: + +### CLI + +```bash +kg search connect "concept A" "concept B" +``` + +Finds paths between concepts, showing how ideas chain together. + +**Options:** +```bash +# Limit path length +kg search connect "X" "Y" --max-hops 3 + +# Use concept IDs for precision +kg search connect abc123 def456 +``` + +### What Paths Show + +A path might look like: +``` +Climate Change + ──[CAUSES]──> Sea Level Rise + ──[AFFECTS]──> Coastal Cities + ──[IMPLIES]──> Migration Patterns +``` + +This reveals the chain of reasoning connecting distant ideas. + +## Exploring Contradictions + +Find where sources disagree: + +### Search for Contested Concepts + +```bash +kg search "vaccination effects" +``` + +Look for concepts with mixed grounding (scores near 0) in the results. + +### View Both Sides + +When you find a contested concept: +```bash +kg search details +``` + +The evidence section shows which sources support and which contradict. + +### Filter by Epistemic Status + +```bash +kg vocabulary list --status CONTESTED +``` + +Shows relationship types that have mixed evidence across the graph. + +## Exploring by Source + +Start from a document and see what was extracted: + +### List Sources in an Ontology + +```bash +kg ontology files +``` + +### View Document's Concepts + +In the web interface: +1. Navigate to **Documents** +2. Select a document +3. 
See all concepts extracted from it + +## Visual Exploration (Web) + +The web interface provides visual navigation: + +### Graph View +- Concepts as nodes +- Relationships as edges +- Click to focus +- Drag to rearrange +- Zoom to explore + +### Filters +- By ontology +- By grounding threshold +- By relationship type +- By date range + +### Highlighting +- Hover to see connections +- Click to lock focus +- Double-click to expand neighborhood + +## Exploration Strategies + +### Start Broad, Narrow Down +1. Search for a general topic +2. Find a relevant concept +3. Explore its relationships +4. Follow promising connections + +### Follow Contradictions +1. Look for low-grounding concepts +2. Check which sources disagree +3. Understand both perspectives +4. Form your own view + +### Map a Topic +1. Search for the central concept +2. Get all related concepts (depth 2-3) +3. Look for clusters and bridges +4. Identify key relationships + +### Verify Claims +1. Find the concept +2. Check its grounding score +3. Read the source evidence +4. Trace to original documents + +## Tips + +### Use Specific Queries +"Effects of sleep deprivation on memory" finds more relevant concepts than "sleep". + +### Check Grounding Before Trusting +High grounding (> 0.7) means many sources agree. Low or negative means contested or contradicted. + +### Explore Neighborhoods +The most interesting insights often come from concepts 2-3 hops away from your starting point. + +### Compare Ontologies +If you have separate knowledge bases, search both to see how different document sets treat the same topics. 
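
Conceptually, the `connect` command used above is a shortest-path search over the concept graph. A toy illustration — breadth-first search over an adjacency list; the real engine presumably also weighs relationship types and grounding:

```python
from collections import deque

def find_path(graph, start, goal, max_hops=4):
    """Shortest path between two concepts via breadth-first search.
    `graph` maps a concept name to a list of (relationship, neighbor)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_hops:  # hop limit, like --max-hops
            continue
        for _rel, neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path within the hop limit
```

Because BFS explores outward one hop at a time, the first path it finds is also the shortest — which is why `connect` results tend to show the tightest chain of reasoning first.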
+ +## Next Steps + +- [Understanding Grounding](understanding-grounding.md) - Interpret confidence scores +- [Querying](querying.md) - Programmatic access via CLI, API, and MCP diff --git a/docs/using/ingesting.md b/docs/using/ingesting.md new file mode 100644 index 000000000..5ae67eb80 --- /dev/null +++ b/docs/using/ingesting.md @@ -0,0 +1,198 @@ +# Ingesting Documents + +How to add documents to your knowledge graph. + +## Overview + +Ingestion is how documents become knowledge. When you ingest a document: + +1. The system stores the original text +2. Splits it into manageable chunks +3. Extracts concepts from each chunk +4. Discovers relationships between concepts +5. Calculates grounding based on evidence + +## Supported Formats + +| Format | Extension | Notes | +|--------|-----------|-------| +| Plain text | `.txt` | Direct processing | +| Markdown | `.md` | Preserves structure | +| PDF | `.pdf` | Text extraction | +| Word | `.docx` | Text extraction | +| Web pages | URL | Fetches and processes | + +## Using the CLI + +### Basic Ingestion + +```bash +kg ingest /path/to/document.pdf +``` + +This creates a job that requires approval (to confirm cost estimate). + +### Auto-Approve + +Skip the approval step: + +```bash +kg ingest --auto-approve /path/to/document.pdf +``` + +### Specify an Ontology + +Organize documents into collections: + +```bash +kg ingest --ontology "research-papers" /path/to/paper.pdf +``` + +If the ontology doesn't exist, it's created automatically. + +### Ingest Multiple Files + +```bash +kg ingest --ontology "project-docs" doc1.md doc2.md doc3.pdf +``` + +### Ingest a Directory + +```bash +kg ingest --ontology "codebase" --recursive /path/to/docs/ +``` + +### Check Job Status + +```bash +kg job list +kg job status +``` + +## Using the Web Interface + +1. **Navigate to Ingest** in the top menu +2. **Upload file(s)** using the file picker or drag-and-drop +3. **Select ontology** (or create new) +4. 
**Review cost estimate** - shows expected tokens and cost +5. **Approve** to start processing +6. **Monitor progress** in the Jobs view + +## Using the API + +```bash +curl -X POST "http://localhost:8000/ingest" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: multipart/form-data" \ + -F "file=@document.pdf" \ + -F "ontology=research" +``` + +Response includes a job ID for tracking. + +## What Happens During Ingestion + +### 1. Document Storage + +The original document is stored in Garage (object storage). You can always retrieve it later. + +### 2. Chunking + +Documents are split into chunks of roughly 1000 words, with overlap to preserve context across boundaries. This ensures: +- Each chunk is small enough to process +- Ideas that span page breaks aren't lost + +### 3. Concept Extraction + +Each chunk is analyzed by an AI model (GPT-4, Claude, etc.) to extract: +- **Concepts**: The key ideas +- **Types**: What kind of idea (claim, definition, entity, etc.) +- **Relationships**: How concepts in this chunk relate + +### 4. Matching + +New concepts are compared to existing ones. If a concept already exists: +- They're merged (same idea, more evidence) +- Grounding increases (more sources confirm it) + +If concepts conflict: +- Both are kept +- Contradiction is noted +- Sources are preserved for both views + +### 5. Grounding Calculation + +After extraction, grounding scores are calculated: +- How many sources mention this concept? +- Do they agree or disagree? +- How strong is the evidence? + +## Cost Estimation + +Before processing, the system estimates: +- Number of chunks +- Expected tokens (input + output) +- Approximate cost (based on your AI provider's pricing) + +This is why jobs require approval by default - so you can review before incurring costs. + +## Ontologies + +Ontologies are collections of related knowledge. 
Use them to: +- Separate different topics (research vs meeting-notes) +- Query specific domains +- Control who can access what + +```bash +# Create by ingesting with a new name +kg ingest --ontology "climate-research" paper1.pdf + +# List ontologies +kg ontology list + +# Query specific ontology +kg search --ontology "climate-research" "temperature effects" +``` + +## Tips + +### Start Small +Ingest a few documents first to understand the output before processing large collections. + +### Use Meaningful Ontology Names +You'll query by ontology later. Names like "research-2024" are clearer than "stuff". + +### Review Extractions +After ingesting, search for concepts and verify they match what you expected. This helps you understand how the system interprets your documents. + +### Re-ingest if Needed +If extraction quality improves (new models, updated prompts), you can re-ingest documents. The system deduplicates based on content hashes. + +## Troubleshooting + +### Job Stuck in Pending +Approve it: +```bash +kg job approve +``` + +Or use `--auto-approve` when ingesting. + +### Extraction Seems Wrong +Check which AI provider you're using: +```bash +kg config show +``` + +Different models have different extraction quality. + +### Out of Memory +Large documents with many chunks can exhaust memory. Try: +- Splitting into smaller files +- Reducing `MAX_CONCURRENT_JOBS` in configuration +- Using a machine with more RAM + +## Next Steps + +- [Exploring Knowledge](exploring.md) - Navigate what you've ingested +- [Understanding Grounding](understanding-grounding.md) - Interpret confidence scores diff --git a/docs/using/querying.md b/docs/using/querying.md new file mode 100644 index 000000000..57cca6d20 --- /dev/null +++ b/docs/using/querying.md @@ -0,0 +1,329 @@ +# Querying the Knowledge Graph + +Programmatic access via CLI, REST API, and MCP. 
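
For a feel of what programmatic access looks like, here is a minimal Python sketch against the search endpoint (the path, parameters, and response shape match the REST examples later on this page; this is an illustration, not an official client):

```python
import json
import urllib.parse
import urllib.request

def build_search_request(query, token, base_url="http://localhost:8000", limit=10):
    """Build an authenticated GET request for the concept search endpoint."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    return urllib.request.Request(
        f"{base_url}/concepts/search?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )

def search_concepts(query, token, **kwargs):
    """Execute the search and return the parsed `results` list."""
    with urllib.request.urlopen(build_search_request(query, token, **kwargs)) as resp:
        return json.load(resp)["results"]
```

The same pattern extends to the other endpoints — build the URL, attach the Bearer token, parse the JSON body.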
+ +## Overview + +Three ways to query: + +| Method | Best For | Authentication | +|--------|----------|----------------| +| **CLI** | Interactive use, scripts | OAuth (browser login) | +| **REST API** | Custom applications | OAuth tokens | +| **MCP** | AI assistants (Claude, etc.) | Configured per-assistant | + +## CLI Queries + +The `kg` command-line interface provides full access. + +### Authentication + +```bash +# Login (opens browser for OAuth) +kg login + +# Check configuration and auth status +kg config show + +# Logout +kg logout +``` + +### Search + +```bash +# Basic search +kg search "your query" + +# With options +kg search --limit 20 --ontology "research" "machine learning" + +# Output formats +kg search --format json "query" +kg search --format table "query" +``` + +### Concept Operations + +```bash +# Get full details +kg search details + +# Find related concepts +kg search related +kg search related --depth 2 --type SUPPORTS + +# Find paths between concepts +kg search connect "concept A" "concept B" +kg search connect --max-hops 4 +``` + +### Ontology Operations + +```bash +# List ontologies +kg ontology list + +# Get ontology info +kg ontology info + +# List files in ontology +kg ontology files +``` + +### Job Management + +```bash +# List jobs +kg job list +kg job list --status pending + +# Check job status +kg job status + +# Approve pending job +kg job approve + +# Cancel job +kg job cancel +``` + +## REST API + +Direct HTTP access for custom applications. + +### Authentication + +Get an OAuth token first: + +```bash +# Using the CLI token +TOKEN=$(kg auth token) + +# Or via OAuth flow +curl -X POST "http://localhost:8000/auth/oauth/token" \ + -d "grant_type=authorization_code&code=..." 
+``` + +### Search Endpoint + +```bash +curl "http://localhost:8000/concepts/search?query=climate+change&limit=10" \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** +```json +{ + "results": [ + { + "concept_id": "abc123", + "name": "Climate change increases extreme weather", + "similarity": 0.89, + "grounding_strength": 0.72, + "source_count": 5 + } + ] +} +``` + +### Concept Details + +```bash +curl "http://localhost:8000/concepts/abc123" \ + -H "Authorization: Bearer $TOKEN" +``` + +**Response:** +```json +{ + "concept_id": "abc123", + "name": "Climate change increases extreme weather", + "type": "claim", + "grounding_strength": 0.72, + "evidence": [ + { + "source_id": "src456", + "text": "Studies show that climate change...", + "document": "IPCC Report 2023" + } + ], + "relationships": [ + { + "type": "CAUSES", + "target_id": "def789", + "target_name": "Increased flooding" + } + ] +} +``` + +### Ingest Document + +```bash +curl -X POST "http://localhost:8000/ingest" \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: multipart/form-data" \ + -F "file=@document.pdf" \ + -F "ontology=research" \ + -F "auto_approve=true" +``` + +### Full API Reference + +See [API Reference](../reference/api/README.md) for complete endpoint documentation. + +## MCP (Model Context Protocol) + +For AI assistants like Claude Desktop. + +### Setup + +Add to your Claude Desktop config (`~/.config/claude/claude_desktop_config.json`): + +```json +{ + "mcpServers": { + "knowledge-graph": { + "command": "node", + "args": ["/path/to/knowledge-graph-system/mcp/dist/index.js"], + "env": { + "KG_API_URL": "http://localhost:8000" + } + } + } +} +``` + +### Available Tools + +Once configured, Claude can use these tools: + +#### search +Find concepts by semantic similarity. 
+ +``` +Use search tool with query "climate effects on agriculture" +``` + +**Parameters:** +- `query` (required): Search text +- `limit`: Max results (default 10) +- `min_similarity`: Threshold 0-1 (default 0.7) +- `ontology`: Filter by ontology name + +#### concept +Work with specific concepts. + +**Actions:** +- `details`: Get full concept with evidence +- `related`: Find connected concepts +- `connect`: Find paths between concepts + +``` +Use concept tool with action "details" and concept_id "abc123" +``` + +``` +Use concept tool with action "connect", from_query "inflation", to_query "unemployment" +``` + +#### ingest +Add documents to the knowledge graph. + +``` +Use ingest tool with action "text", text "...", ontology "notes" +``` + +#### ontology +Manage knowledge collections. + +``` +Use ontology tool with action "list" +``` + +#### source +Retrieve original source text. + +``` +Use source tool with source_id "src456" +``` + +### MCP for AI Reasoning + +When an AI uses MCP, it can: + +1. **Query for context**: Before answering, search for relevant concepts +2. **Check grounding**: Verify claims have evidence +3. **Find contradictions**: Identify where sources disagree +4. **Trace sources**: Link answers to original documents +5. **Build knowledge**: Ingest new information during conversation + +**Example AI workflow:** +``` +User: "What are the effects of sleep deprivation?" + +AI thinks: Let me check the knowledge graph... +[Uses search tool: "sleep deprivation effects"] + +AI: Based on the knowledge graph, sleep deprivation has several documented effects: +- Memory impairment (grounding: 0.85, 12 sources) +- Reduced cognitive function (grounding: 0.78, 8 sources) +- Increased accident risk (grounding: 0.65, 5 sources) + +Sources include: Smith et al. 2023, Sleep Research Journal... +``` + +### Full MCP Reference + +See [MCP Reference](../reference/mcp/README.md) for complete tool documentation. + +## Query Patterns + +### Get Grounded Answers + +1. 
Search for the topic +2. Check grounding scores +3. For high-grounding concepts, trust the answer +4. For low-grounding, caveat with uncertainty + +### Explore a Topic + +1. Search for central concept +2. Get related concepts (depth 2) +3. Look for clusters of connected ideas +4. Follow interesting paths + +### Verify a Claim + +1. Search for the specific claim +2. Check if concept exists +3. If yes, check grounding and sources +4. If no, the claim isn't in your knowledge base + +### Find Disagreements + +1. Search for contested topic +2. Look for concepts with grounding near 0 +3. Get details to see conflicting sources +4. Use `connect` to find the conflict structure + +## Tips + +### Use Semantic Queries +"What causes inflation" works better than "inflation causes" because the system matches by meaning. + +### Check Multiple Phrasings +If initial search misses, try synonyms or related terms. + +### Follow the Evidence +Always check source text for important claims. Grounding tells you confidence, sources tell you why. + +### Combine Methods +Use CLI for exploration, API for automation, MCP for AI-assisted work. + +## Next Steps + +- [Understanding Grounding](understanding-grounding.md) - Interpret confidence scores +- [API Reference](../reference/api/README.md) - Complete endpoint docs +- [MCP Reference](../reference/mcp/README.md) - All MCP tools diff --git a/docs/using/understanding-grounding.md b/docs/using/understanding-grounding.md new file mode 100644 index 000000000..4c519925e --- /dev/null +++ b/docs/using/understanding-grounding.md @@ -0,0 +1,251 @@ +# Understanding Grounding + +How to interpret confidence, contradiction, and epistemic status. + +## What Is Grounding? + +Grounding measures how well-supported a concept is across your sources. It answers: "How much should I trust this idea?" + +Unlike a simple count ("mentioned 5 times"), grounding considers: +- **Agreement**: Do sources confirm each other? +- **Contradiction**: Do sources disagree? 
+- **Evidence strength**: How directly does source text support the concept? + +## The Grounding Scale + +Grounding scores range from **-1.0 to +1.0**: + +| Score Range | Meaning | Interpretation | +|-------------|---------|----------------| +| **0.8 to 1.0** | Strongly supported | Multiple sources agree strongly | +| **0.5 to 0.8** | Well supported | Good evidence, some sources confirm | +| **0.2 to 0.5** | Moderately supported | Some evidence, room for uncertainty | +| **-0.2 to 0.2** | Mixed or insufficient | Sources disagree, or too few sources | +| **-0.5 to -0.2** | Contested | More contradiction than support | +| **-1.0 to -0.5** | Contradicted | Strong evidence against | + +## Reading Grounding in Practice + +### High Grounding (> 0.7) + +``` +Concept: "Sleep deprivation impairs memory consolidation" +Grounding: 0.85 +Sources: 12 +``` + +**What this means:** +- 12 sources mention this concept +- They largely agree +- You can cite this with confidence + +**Still verify:** Check the actual sources if making important decisions. + +### Moderate Grounding (0.3 - 0.7) + +``` +Concept: "Coffee consumption prevents heart disease" +Grounding: 0.45 +Sources: 8 +``` + +**What this means:** +- Some sources support this +- Evidence is mixed or qualified +- Treat as "possibly true" rather than "established" + +**Action:** Look at the evidence to understand nuances. + +### Low or Negative Grounding (< 0.3) + +``` +Concept: "Vitamin C cures the common cold" +Grounding: -0.15 +Sources: 6 +``` + +**What this means:** +- Sources disagree significantly +- Some support, some contradict +- This is a contested claim + +**Action:** Examine both sides before drawing conclusions. + +## How Grounding Is Calculated + +### Evidence Accumulation + +Each time a concept appears in a source, evidence accumulates: + +``` +Document 1: "Studies confirm X..." → +evidence +Document 2: "X is well-established..." → +evidence +Document 3: "X has been demonstrated..." 
→ +evidence +``` + +More confirming sources = higher grounding. + +### Contradiction Detection + +When sources disagree: + +``` +Document 1: "X causes Y" +Document 2: "X does not cause Y" +``` + +Both are recorded. Grounding reflects the balance: +- More support than contradiction → positive grounding +- More contradiction than support → negative grounding +- Equal → near-zero grounding + +### Relationship Strength + +Not all mentions are equal. The system considers: +- Direct claims vs passing mentions +- Central thesis vs tangential reference +- Explicit statements vs implied connections + +## Epistemic Status + +Beyond grounding scores, concepts and relationships have epistemic status: + +| Status | Meaning | +|--------|---------| +| **Affirmative** | High grounding, well-established | +| **Contested** | Significant disagreement between sources | +| **Contradictory** | Strong evidence against | +| **Historical** | Was accurate in its time period | +| **Insufficient Data** | Too few sources to judge | + +### Checking Epistemic Status + +```bash +# See status for relationship types +kg vocabulary list --status CONTESTED + +# Filter concepts by status +kg search "topic" --status AFFIRMATIVE +``` + +## Working with Contradictions + +Contradictions are features, not bugs. They reveal: +- Where experts disagree +- Evolving knowledge over time +- Different perspectives or contexts + +### Finding Contradictions + +Look for concepts with: +- Grounding near 0 +- Multiple sources with opposing views +- Relationships marked CONTRADICTS + +```bash +# Search and note low-grounding results +kg search "controversial topic" + +# Get details to see both sides +kg concept details +``` + +### Understanding Both Sides + +The evidence section shows which sources support and which contradict: + +``` +Evidence: + [+] "Research by Smith shows X is true..." + [+] "Jones et al. confirmed that X..." + [-] "However, Brown's study found X is false..." 
+ [-] "Recent work contradicts earlier findings on X..." +``` + +### Making Decisions with Contradictions + +1. **Count isn't everything** - One rigorous study may outweigh many weak ones +2. **Check recency** - Newer research may supersede older +3. **Consider context** - Different conditions may explain disagreement +4. **Acknowledge uncertainty** - Some questions don't have clear answers + +## Grounding vs. Truth + +**Grounding measures evidence in your knowledge base, not absolute truth.** + +A concept with high grounding means: +- ✅ Your sources agree on this +- ❌ Does NOT mean it's universally true + +A concept with low grounding means: +- ✅ Your sources disagree or lack evidence +- ❌ Does NOT mean it's false + +**The quality of grounding depends on the quality of your sources.** + +## Practical Guidelines + +### For Research + +- Use high-grounding concepts as established foundations +- Investigate low-grounding concepts as areas of uncertainty +- Document which sources you're relying on + +### For Decision-Making + +- Prefer high-grounding concepts for critical decisions +- For contested topics, understand both sides before deciding +- Be explicit about uncertainty when grounding is low + +### For AI Assistants + +When using MCP, the AI should: +- Check grounding before making claims +- Caveat low-grounding information appropriately +- Cite sources for important statements +- Acknowledge contradictions when they exist + +## Improving Grounding + +### Add More Sources + +Grounding improves with more evidence: +```bash +kg ingest additional-sources/*.pdf --ontology research +``` + +### Update with Recent Research + +Newer sources may resolve old contradictions: +```bash +kg ingest latest-study.pdf --ontology research +``` + +### Separate Domains + +Different ontologies can have different evidence bases: +```bash +# Medical research has high grounding for X +kg search --ontology medical "treatment X" + +# General news has low grounding for X +kg search 
--ontology news "treatment X" +``` + +## Summary + +| Question | Look At | +|----------|---------| +| "Can I trust this?" | Grounding score | +| "Where did this come from?" | Evidence section | +| "Do sources agree?" | Grounding sign (+/-) and evidence | +| "How established is this?" | Epistemic status | +| "What's the other side?" | CONTRADICTS relationships | + +Grounding gives you the tools to reason about knowledge quality, not just knowledge content. Use it to make informed decisions about what to trust and where to dig deeper. + +## Next Steps + +- [Exploring Knowledge](exploring.md) - Navigate the graph +- [Concepts: How It Works](../concepts/how-it-works.md) - Deeper understanding +- [Concepts: Glossary](../concepts/glossary.md) - Term definitions
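
## Appendix: A Sketch of the Balance

The balance described in "How Grounding Is Calculated" can be made concrete with a small sketch. This is a hypothetical illustration, not the system's actual formula: it scores the support/contradiction balance on the -1.0 to +1.0 scale, and maps scores to status labels using invented cutoffs (the real system weighs evidence strength and relationship type, not just counts).

```python
def grounding_score(support: int, contradiction: int) -> float:
    """Hypothetical grounding: the balance of supporting vs contradicting
    evidence, normalized to the -1.0 to +1.0 scale used above."""
    total = support + contradiction
    if total == 0:
        return 0.0  # no evidence at all
    return (support - contradiction) / total

def epistemic_status(score: float, sources: int) -> str:
    """Map a score to a status label; the cutoffs are for illustration."""
    if sources < 2:
        return "INSUFFICIENT_DATA"
    if score >= 0.5:
        return "AFFIRMATIVE"
    if score <= -0.5:
        return "CONTRADICTORY"
    return "CONTESTED"

# Twelve agreeing sources with one dissent: strongly supported.
print(round(grounding_score(12, 1), 2))   # 0.85
# Three supporting vs four contradicting sources: a contested claim.
print(round(grounding_score(3, 4), 2))    # -0.14
```

Note how even this toy version reproduces the key behaviors from the examples above: agreement pushes the score toward +1.0, an even split lands near zero, and a single source is never enough to call a concept established.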