diff --git a/lifewatch_batch_platform/terraform/environments/dev/readme.md b/lifewatch_batch_platform/terraform/environments/dev/readme.md index 2311f2f..1b8d541 100644 --- a/lifewatch_batch_platform/terraform/environments/dev/readme.md +++ b/lifewatch_batch_platform/terraform/environments/dev/readme.md @@ -10,7 +10,8 @@ API Gateway (REST) ├── POST /batch/jobs → lambda_batch_trigger ──► AWS Batch ├── GET /batch/jobs/{id} → lambda_job_status ──► AWS Batch ├── GET /batch/jobs/{id}/logs → lambda_job_logs ──► CloudWatch - └── GET /batch/jobs/{id}/results → lambda_job_results ──► S3 + ├── GET /batch/jobs/{id}/results → lambda_job_results ──► S3 + └── GET /batch/jobs/history_list → lambda_job_history_list ▲ Batch workers (Fargate or EC2) @@ -60,10 +61,11 @@ Each Lambda function expects a ZIP file to exist at the path configured in `terr | Variable | Default path | Handler | |---|---|---| -| `lambda_trigger_filename` | `lambda.zip` | `lambda_function.lambda_handler` | -| `lambda_status_filename` | `status_lambda.zip` | `status.lambda_handler` | -| `lambda_logs_filename` | `logs_lambda.zip` | `logs.lambda_handler` | -| `lambda_results_filename` | `results_lambda.zip` | `results.lambda_handler` | +| `lambda_trigger_filename` | `../../backend_lambda_artifacts/lambda.zip` | `lambda_function.lambda_handler` | +| `lambda_status_filename` | `../../backend_lambda_artifacts/status_lambda.zip` | `status.lambda_handler` | +| `lambda_logs_filename` | `../../backend_lambda_artifacts/logs_lambda.zip` | `logs.lambda_handler` | +| `lambda_results_filename` | `../../backend_lambda_artifacts/results_lambda.zip` | `results.lambda_handler` | +| `lambda_history_list_filename` | `../../backend_lambda_artifacts/history_list_lambda.zip` | `history_list.lambda_handler` | Paths are relative to the `environments/dev/` directory. If your ZIPs live elsewhere, update the paths in `terraform.tfvars` accordingly (e.g. `"../../lambdas/lambda.zip"`). 
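If you need to produce the artifact layout from the table above locally, a minimal packaging sketch follows. The handler source file names (`lambda_function.py`, `status.py`, and so on) are assumptions inferred from the Handler column; adjust them to wherever your handler modules actually live.

```shell
# Sketch: package each handler into the ZIP name the dev tfvars expect.
# Run from the directory containing the handler .py files.
set -eu
ARTIFACT_DIR=backend_lambda_artifacts
mkdir -p "$ARTIFACT_DIR"
for pair in \
    "lambda_function.py:lambda.zip" \
    "status.py:status_lambda.zip" \
    "logs.py:logs_lambda.zip" \
    "results.py:results_lambda.zip" \
    "history_list.py:history_list_lambda.zip"; do
  src=${pair%%:*}
  out="$ARTIFACT_DIR/${pair##*:}"
  if [ -f "$src" ]; then
    # stdlib zip CLI; plain `zip -j` works equally well
    python3 -m zipfile -c "$out" "$src"
    echo "packaged $out"
  else
    echo "skipping $out ($src not found)"
  fi
done
```

Sources that are missing are skipped rather than failing the run, so the same script works for partial rebuilds.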
@@ -81,7 +83,8 @@ vpc │ │ ├── lambda_batch_trigger │ │ ├── lambda_job_status │ │ ├── lambda_job_logs - │ │ └── lambda_job_results + │ │ ├── lambda_job_results + │ │ └── lambda_job_history_list │ ├── batch_job_definition_fargate │ └── batch_job_definition_ec2 │ @@ -100,6 +103,7 @@ api_gateway ──► lambda_batch_trigger ──► lambda_job_status ──► lambda_job_logs ──► lambda_job_results + ──► lambda_job_history_list api_key_usage_plan ──► api_gateway ``` @@ -125,6 +129,7 @@ api_key_usage_plan ──► api_gateway | `lambda_job_status` | `modules/lambda/lambda_job_status` | Handles GET /batch/jobs/{id} | | `lambda_job_logs` | `modules/lambda/lambda_job_logs` | Handles GET /batch/jobs/{id}/logs | | `lambda_job_results` | `modules/lambda/lambda_job_results` | Handles GET /batch/jobs/{id}/results | +| `lambda_job_history_list` | `modules/lambda/lambda_job_history_list` | Handles GET /batch/jobs/history_list | | `api_gateway` | `modules/api_gateway` | REST API with all routes, Lambda integrations, CORS, and auto-redeployment trigger | | `api_key_usage_plan` | `modules/api_key_usage_plan` | API key and usage plan with throttling and optional quota | diff --git a/lifewatch_batch_platform/terraform/modules/api_gateway/readme.md b/lifewatch_batch_platform/terraform/modules/api_gateway/readme.md index da23381..2393dfa 100644 --- a/lifewatch_batch_platform/terraform/modules/api_gateway/readme.md +++ b/lifewatch_batch_platform/terraform/modules/api_gateway/readme.md @@ -22,12 +22,14 @@ This module is primarily **OpenAPI-driven**: | Resource | Description | |---|---| -| `aws_api_gateway_rest_api` | REST API body imported from `openapi.yaml` (including functional methods and OPTIONS preflight methods) | -| `aws_api_gateway_gateway_response` | Gateway-level error responses with CORS headers (`DEFAULT_4XX`, `DEFAULT_5XX`, `MISSING_AUTHENTICATION_TOKEN`, `RESOURCE_NOT_FOUND`) | -| `aws_api_gateway_deployment` | Versioned deployment with OpenAPI hash trigger | -| `aws_api_gateway_stage` | 
Deploys the active API to `stage_name` | - ---- +| `aws_api_gateway_rest_api` | The REST API with multipart/form-data binary support | +| `aws_api_gateway_resource` | Routes: `/batch`, `/batch/jobs`, `/batch/jobs/{job_id}`, `/batch/jobs/{job_id}/logs`, `/batch/jobs/{job_id}/results`, `/batch/jobs/history_list` | +| `aws_api_gateway_method` | `POST /batch/jobs`, `GET` on status/logs/results/history_list, `OPTIONS` on all routes | +| `aws_api_gateway_integration` | `AWS_PROXY` Lambda integrations for all functional routes; `MOCK` integrations for CORS preflight | +| `aws_api_gateway_method_response` | `200` responses declaring CORS headers on all OPTIONS routes | +| `aws_api_gateway_integration_response` | Injects `Access-Control-Allow-*` headers into preflight responses | +| `aws_api_gateway_deployment` | Versioned deployment with sha1 change-detection trigger | +| `aws_api_gateway_stage` | Deploys the API to the configured stage name | ## Routes (from OpenAPI) diff --git a/readme.md b/readme.md index dca5a42..0142e7d 100644 --- a/readme.md +++ b/readme.md @@ -1,155 +1,175 @@ -# Notebook Platform - Local Setup Guide +# LifeWatch Notebook Platform DevOps + +Infrastructure, CI/CD, and validation assets for running notebooks through AWS Batch behind an API Gateway and Lambda control plane. + +## Components + +- Terraform infrastructure for the dev environment. +- Lambda handlers and API request client scripts. +- Worker container build and publish pipeline. +- End-to-end notebook validation on AWS. +- Frontend operator UI for job submission and history. +- Demo notebook fixtures, including lightweight examples. + +## Setup Decisions + +The dev environment is shared by teammates. + +- Notebook E2E workflows must not run terraform destroy. +- Cleanup is limited to transient AWS Batch EC2 instances created during execution. +- Terraform-managed infrastructure is intentionally preserved after E2E runs. 
## Prerequisites -- [Docker](https://docs.docker.com/get-docker/) -- [Docker Compose](https://docs.docker.com/compose/install/) +- Terraform 1.6 or newer. +- AWS CLI configured with credentials that can manage the target dev stack. +- Docker (for worker image build and local checks). +- Node.js 18+ and npm (for frontend local runs). +- Python 3.10+ (for helper scripts and selected workflows). -## Project Structure -``` +## Repository Layout + +```text dev-ops/ -├── server/ # Django web app + Celery worker config -│ ├── jobs/ # Core app (models, views, tasks) -│ ├── notebook_platform/ # Django project settings -│ ├── k8s-manifests/ # Kubernetes manifests (optional, not sure if we will use these w/ Terraform) -│ ├── docker-compose.yml -│ ├── Dockerfile -│ └── requirements.txt -├── worker/ # Notebook execution worker -│ ├── Dockerfile -│ ├── worker.py -│ └── inputs/ -│ └── environment.yaml # Base env for all runners, can be updated in runtime based on input files -└── .github/ - └── workflows/ - └── ci.yaml # Define actions +├── .github/workflows/ # CI/CD and E2E workflows +├── lifewatch_batch_platform/terraform/ # Terraform env + reusable modules + client scripts +├── frontend/ # React + TypeScript operator UI +├── worker/ # Batch worker image definition +├── demo_input/ # Notebook and payload fixtures for tests/manual runs +├── terraform-bootstrap/ # Remote-state bootstrap resources +└── job_profiles.json # Shared execution-profile catalogue ``` + +## Core Workflows + +| Workflow | File | Purpose | +|---|---|---| +| Smoke Test Worker Containerization | .github/workflows/smoke-test-worker-containerization.yml | Builds worker image and validates container startup. | +| Deploy Worker Image to ECR | .github/workflows/deploy-worker-ecr.yml | Publishes worker image tags to ECR. | +| Terraform CI | .github/workflows/test-terraform-plan.yml | Runs fmt, validate, plan, and optional terraform test. 
| +| Notebook E2E Deploy and Run | .github/workflows/e2e-notebook-deploy-and-run.yml | Applies infra, runs notebook E2E matrix, uploads artifacts, then performs EC2-only cleanup. | + +## Documentation Index + +- [Notebook E2E Runbook](.github/workflows/e2e-notebook-testing.md) +- [Lightweight Notebook Fixtures](demo_input/lightweight-notebooks/README.md) +- [Terraform dev Environment](lifewatch_batch_platform/terraform/environments/dev/readme.md) +- [Frontend Guide](frontend/README.md) +- [Terraform Bootstrap Guide](terraform-bootstrap/readme.md) + ## Quick Start -### 1. Build and Start All Services -You will need to do this every time you want to develop locally. -``` -cd server -docker compose up --build -d -``` -This will spin up the following server components: -- PostgreSQL Database # Stand-in for RDS, not used in production -- Redis Cache (Used by Celery) # Stand-in for ElastiCache, not used in production -- Minio # Stand-in for S3, not used in production -- Celery Service -- Django Server - -### 2. Run Database Migrations (First Time ONLY) -This is usually needed when running for the first time, or when making changes to the models. -In a separate terminal: -``` -docker compose exec web python manage.py migrate -``` -### 3. Create a Superuser (First Time ONLY) -``` -docker compose exec web python manage.py createsuperuser -``` -### 4. Start minikube -``` -minikube start --driver=docker -``` -### 5. Apply rbac.yaml (First Time ONLY) -``` -cd server -kubectl apply -f minikube-rbac.yaml -``` -### 6. Build Worker -This will build the image inside minikube. -``` -minikube image build -t r-notebook-worker:latest . -``` -### 7. Create Network Proxy (Windows Only) -``` -kubectl proxy --address='0.0.0.0' --port=8001 --accept-hosts='^.*' -``` +### Frontend -### 8. 
Access the Application -``` -http://localhost:8000 +```bash +cd frontend +npm install +npm run dev ``` -## Useful Commands -``` -# View Jobs (-w to keep in running) -kubectl get pods -w +### Terraform Dev Environment -# Get the logs of a specific job -kubectl logs +```bash +cd lifewatch_batch_platform/terraform/environments/dev +terraform init +terraform plan +``` -# Run commands inside the web container -docker compose exec web python +## Deployment Order (Recommended) -# Open a shell inside the web container -docker compose exec web bash +1. Initialize remote state bootstrap resources from terraform-bootstrap if not already provisioned. +2. Apply the dev environment Terraform stack. +3. Build and publish worker image to ECR. +4. Run notebook E2E workflows. -# Run tests inside docker -docker compose exec web python manage.py test -``` +## Onboarding Checklist -## CI/CD -A GitHub Actions workflow is located at `.github/workflows/ci.yaml` and runs automatically on push or can be triggered manually. -It builds everything as described above, uses a cache to speed up things, and runs a GET to the login screen to verify things are running. +Use this checklist for new contributors working on the dev environment. -## Pre-commit Hooks +1. Clone the repository and verify toolchain versions from Prerequisites. +2. Confirm AWS access by running aws sts get-caller-identity. +3. Ensure access to required GitHub repository secrets (see matrix below). +4. Run Terraform init and plan from lifewatch_batch_platform/terraform/environments/dev. +5. Run worker smoke test locally with docker build ./worker and a quick container run. +6. Start frontend locally (npm run dev) and verify API base URL configuration. +7. Review the E2E runbook before triggering shared-environment workflows. 
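Step 1 of the checklist can be sketched as a quick toolchain probe (tool names taken from the Prerequisites list; extend as needed). It reports each tool as ok/missing instead of aborting, so it is safe to run on a partially set-up machine.

```shell
# Report whether each prerequisite tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for tool in terraform aws docker node npm python3; do
  check_tool "$tool"
done
```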
-This repository includes a `.pre-commit-config.yaml` with fast checks for: -- file hygiene (whitespace, EOF, YAML/JSON/TOML) -- Python lint/format for worker and backend Lambda/client scripts -- Terraform format + validation on changed Terraform directories -- frontend ESLint for TypeScript/React sources -- notebook output stripping under `demo_input` +## Local Validation Commands -Install and enable hooks: ```bash -pip install pre-commit -pre-commit install -``` +# Terraform style and validation +terraform fmt -recursive +terraform -chdir=lifewatch_batch_platform/terraform/environments/dev validate -Run all hooks manually: -```bash -pre-commit run --all-files +# Module tests (if present) +terraform -chdir=lifewatch_batch_platform/terraform/modules/api_gateway test + +# Frontend checks +cd frontend && npm run build ``` -## AWS Deployment with Zappa +## Terraform Environment Notes -The Django server is deployed to AWS Lambda using Zappa. To keep secrets out of version control, we use a template for the Zappa configuration and inject environment variables during deployment. +For detailed infrastructure inputs, outputs, and module dependencies, use: -### AWS Prerequisites -Before deploying, ensure the following resources and credentials exist in your AWS account: -- **IAM User:** Credentials configured locally or in the CI/CD pipeline with permissions to manage Lambda, API Gateway, S3, and IAM roles. -- **S3 Deployment Bucket:** A bucket for Zappa to store its zip packages during deployment. Different that the Django storage bucket. -- **RDS PostgreSQL Instance:** Production database (replacing the local Docker Postgres). -- **S3 Storage Bucket:** A bucket for Django's static and media files (replacing the local Minio setup). +lifewatch_batch_platform/terraform/environments/dev/readme.md -### Environment -We use `zappa_settings.template.json` as our base. The secrets are provided and replaced through environment variables. 
If deploying locally, export these variables in the terminal or use an env file. In CI/CD, set these as pipeline secrets. +## Security and Secrets -These must be set: -``` -export AWS_REGION="eu-west-1" -export S3_ZAPPA_BUCKET_NAME="your-zappa-deploy-bucket" -export DJANGO_SECRET_KEY="your_secure_secret_key" -export DATABASE_URL="postgres://user:url_encoded_password@rds-host:5432/dbname" -export AWS_STORAGE_BUCKET_NAME="your-app-storage-bucket" -export ALLOWED_HOSTS="your-api-gateway-url.amazonaws.com" -export WORKER_CALLBACK_URL="https://your-api-gateway-url.amazonaws.com" -export WORKER_WEBHOOK_SECRET="your_secret" -``` +- Never commit API keys, AWS keys, or Terraform state. +- Required credentials must be injected via GitHub Actions secrets or secured local environment configuration. +- E2E API authentication is validated through both positive and negative tests. +- For local Terraform runs, var files may be used, for example terraform apply -var-file secrets.tfvars. +- Preferred CI pattern is environment variables and secret stores instead of long-lived local secret files. -### Build and Deploy -Generate the final configuration file and deploy: -``` -# Generate the active settings file by replacing placeholders -envsubst < zappa_settings.template.json > zappa_settings.json +## GitHub Secrets by Workflow -# Deploy for the first time -zappa deploy dev +The following repository secrets are required by CI/CD workflows. 
-# OR, update an existing deployment -zappa update dev -``` +| Workflow | Required secrets | +|---|---| +| Smoke Test Worker Containerization | None | +| Deploy Worker Image to ECR | AWS_ACCESS_KEY, AWS_SECRET_KEY | +| Terraform CI | TERRAFORM_ENV_DIR, AWS_ACCESS_KEY, AWS_SECRET_KEY, CONTAINER_IMAGE, BATCH_EXECUTION_ROLE_ARN | +| Notebook E2E Deploy and Run | TERRAFORM_ENV_DIR, AWS_ACCESS_KEY, AWS_SECRET_KEY, CONTAINER_IMAGE, BATCH_EXECUTION_ROLE_ARN | + +Notes: +- TERRAFORM_ENV_DIR should point to the active environment directory, for example lifewatch_batch_platform/terraform/environments/dev. +- Rotate AWS credentials regularly and prefer short-lived credentials where possible. + +## Troubleshooting + +### Terraform CI fails at plan + +- Confirm TERRAFORM_ENV_DIR points to a valid environment folder. +- Verify CONTAINER_IMAGE and BATCH_EXECUTION_ROLE_ARN are set in repository secrets. +- Re-run terraform init locally in the same environment directory to reproduce. + +### terraform test fails after module refactor + +- Check whether tests still reference removed resources after architecture changes. +- Update assertions to match source of truth (for example OpenAPI body imports versus explicit method resources). +- Run module tests locally before pushing. + +### E2E fails with API auth errors + +- Validate API key output from Terraform and ensure request header uses x-api-key. +- Run the negative API key check expectations: 401 or 403 should be treated as pass for invalid keys. +- Confirm gateway-level CORS responses are present for API Gateway-generated errors. + +### E2E completes but leaves compute instances + +- Verify EC2 cleanup step logs in the E2E workflow artifacts. +- Check for instances tagged with lifewatch-batch-ec2-* and terminate manually if needed. +- Keep terraform destroy disabled for shared dev environment policy compliance. 
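The negative API-key check described above can be scripted. Only the status-code classification below comes from this document (401 or 403 both count as a pass for an invalid key); the endpoint path and key value in the commented usage are placeholders.

```shell
# Classify the HTTP status returned for a request made with an INVALID key.
# API Gateway may answer 401 or 403 depending on configuration; both pass.
check_invalid_key_status() {
  case "$1" in
    401|403) echo "pass" ;;
    *)       echo "fail" ;;
  esac
}

# Hypothetical usage against the dev API (API_URL and the key are placeholders):
# status=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H 'x-api-key: not-a-real-key' "$API_URL/batch/jobs/history_list")
# check_invalid_key_status "$status"
```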
+ +### Worker image deploy does not trigger downstream E2E + +- Confirm Deploy Worker Image to ECR ran on main and completed successfully. +- Check workflow_run trigger conditions and branch filters in dependent workflows. +- Use manual workflow_dispatch as a controlled fallback. + +## Notes on Scope + +- This repository is focused on infrastructure and delivery workflows. +- Application-level backend behavior and frontend runtime details are documented in their respective subproject guides. diff --git a/terraform-bootstrap/readme.md b/terraform-bootstrap/readme.md index 714c8eb..74ec7d6 100644 --- a/terraform-bootstrap/readme.md +++ b/terraform-bootstrap/readme.md @@ -1,33 +1,47 @@ ---- - -# Terraform S3 Backend & DynamoDB Lock Setup +# Terraform Bootstrap -This repository contains the Terraform bootstrap configuration for setting up a remote backend for Terraform using S3 and DynamoDB locking. +Bootstrap stack for shared Terraform remote state resources. +This configuration allows us to store Terraform state on AWS S3, which in turn allows us to collaborate more easily, as there is a single point of truth for the state of the infrastructure. -It ensures that: ## Purpose -- Terraform state is stored remotely in a versioned and encrypted S3 bucket. -- Concurrent Terraform runs are prevented using a DynamoDB table for state locking. -- Only authorized IAM users can access and modify Terraform state. +This configuration creates: --- +- an S3 bucket for Terraform state storage (`lifewatch-terraform-state-eu-west-1`); +- a DynamoDB table for state locking (`lifewatch-terraform-locks`); +- an S3 bucket policy granting access to the IAM principals listed in `variables.tf`. -## Components +## Files -### 1.
S3 Bucket +- `provider.tf`: AWS provider configuration. +- `main.tf`: Terraform and provider version constraints. +- `bucket.tf`: S3 state bucket, versioning, and encryption. +- `bucket_policy.tf`: IAM principal access policy for the state bucket. +- `dynamodb.tf`: lock table definition. +- `variables.tf`: configurable values (region, bucket name, table name, authorised principals). -- Stores Terraform state files (`.tfstate`). -- Features enabled: - - Versioning – protects against accidental deletion. - - Server-side encryption (AES256) – encrypts state at rest. +## Usage -- Access is restricted via bucket policy to specific IAM users. +```bash +cd terraform-bootstrap +terraform init +terraform apply +``` -### 2. DynamoDB Table +## Backend Example For Environment Stacks -- Used for state locking, preventing multiple users or CI jobs from modifying the state simultaneously. -- Table has a single primary key (`LockID`). -- Terraform automatically creates and deletes lock entries during `apply`. +```hcl +terraform { + backend "s3" { + bucket = "lifewatch-terraform-state-eu-west-1" + key = "lifewatch-high-compute/dev/terraform.tfstate" + region = "eu-west-1" + dynamodb_table = "lifewatch-terraform-locks" + encrypt = true + } +} +``` ### 3. IAM Users & Permissions @@ -107,8 +121,7 @@ terraform apply --- -## Notes & Best Practices +## Operational Notes -- Do not share the bucket with untrusted users; Terraform state may contain sensitive information (passwords, secrets, ARNs). -- Use different keys per environment (`dev`, `staging`, `prod`) in the same bucket for separation. -- Enable DynamoDB locking to prevent simultaneous Terraform runs. +- Terraform state can contain sensitive values; treat bucket access as privileged. +- Use distinct backend `key` values per environment (for example `dev`, `staging`, `prod`). \ No newline at end of file