17 changes: 11 additions & 6 deletions lifewatch_batch_platform/terraform/environments/dev/readme.md
@@ -10,7 +10,8 @@ API Gateway (REST)
├── POST /batch/jobs → lambda_batch_trigger ──► AWS Batch
├── GET /batch/jobs/{id} → lambda_job_status ──► AWS Batch
├── GET /batch/jobs/{id}/logs → lambda_job_logs ──► CloudWatch
└── GET /batch/jobs/{id}/results → lambda_job_results ──► S3
├── GET /batch/jobs/{id}/results → lambda_job_results ──► S3
└── GET /batch/jobs/history_list → lambda_job_history_list
Batch workers
(Fargate or EC2)
@@ -60,10 +61,11 @@ Each Lambda function expects a ZIP file to exist at the path configured in `terraform.tfvars`

| Variable | Default path | Handler |
|---|---|---|
| `lambda_trigger_filename` | `lambda.zip` | `lambda_function.lambda_handler` |
| `lambda_status_filename` | `status_lambda.zip` | `status.lambda_handler` |
| `lambda_logs_filename` | `logs_lambda.zip` | `logs.lambda_handler` |
| `lambda_results_filename` | `results_lambda.zip` | `results.lambda_handler` |
| `lambda_trigger_filename` | `../../backend_lambda_artifacts/lambda.zip` | `lambda_function.lambda_handler` |
| `lambda_status_filename` | `../../backend_lambda_artifacts/status_lambda.zip` | `status.lambda_handler` |
| `lambda_logs_filename` | `../../backend_lambda_artifacts/logs_lambda.zip` | `logs.lambda_handler` |
| `lambda_results_filename` | `../../backend_lambda_artifacts/results_lambda.zip` | `results.lambda_handler` |
| `lambda_history_list_filename` | `../../backend_lambda_artifacts/history_list_lambda.zip` | `history_list.lambda_handler` |

Paths are relative to the `environments/dev/` directory. If your ZIPs live elsewhere, update the paths in `terraform.tfvars` accordingly (e.g. `"../../lambdas/lambda.zip"`).
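
As a minimal illustration, overriding the artifact paths in `terraform.tfvars` might look like the following sketch. The variable names come from the table above; the `../../lambdas/` location is only an example, so adjust it to wherever your ZIPs actually live:

```bash
# Illustrative only: write relocated artifact paths into terraform.tfvars
# (HCL content in a heredoc; this overwrites any existing file)
cat > terraform.tfvars <<'EOF'
lambda_trigger_filename      = "../../lambdas/lambda.zip"
lambda_status_filename       = "../../lambdas/status_lambda.zip"
lambda_logs_filename         = "../../lambdas/logs_lambda.zip"
lambda_results_filename      = "../../lambdas/results_lambda.zip"
lambda_history_list_filename = "../../lambdas/history_list_lambda.zip"
EOF
```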

@@ -81,7 +83,8 @@ vpc
│ │ ├── lambda_batch_trigger
│ │ ├── lambda_job_status
│ │ ├── lambda_job_logs
│ │ └── lambda_job_results
│ │ ├── lambda_job_results
│ │ └── lambda_job_history_list
│ ├── batch_job_definition_fargate
│ └── batch_job_definition_ec2
@@ -100,6 +103,7 @@ api_gateway ──► lambda_batch_trigger
──► lambda_job_status
──► lambda_job_logs
──► lambda_job_results
──► lambda_job_history_list

api_key_usage_plan ──► api_gateway
```
@@ -125,6 +129,7 @@ api_key_usage_plan ──► api_gateway
| `lambda_job_status` | `modules/lambda/lambda_job_status` | Handles GET /batch/jobs/{id} |
| `lambda_job_logs` | `modules/lambda/lambda_job_logs` | Handles GET /batch/jobs/{id}/logs |
| `lambda_job_results` | `modules/lambda/lambda_job_results` | Handles GET /batch/jobs/{id}/results |
| `lambda_job_history_list` | `modules/lambda/lambda_job_history_list` | Handles GET /batch/jobs/history_list |
| `api_gateway` | `modules/api_gateway` | REST API with all routes, Lambda integrations, CORS, and auto-redeployment trigger |
| `api_key_usage_plan` | `modules/api_key_usage_plan` | API key and usage plan with throttling and optional quota |

14 changes: 8 additions & 6 deletions lifewatch_batch_platform/terraform/modules/api_gateway/readme.md
@@ -22,12 +22,14 @@ This module is primarily **OpenAPI-driven**:

| Resource | Description |
|---|---|
| `aws_api_gateway_rest_api` | REST API body imported from `openapi.yaml` (including functional methods and OPTIONS preflight methods) |
| `aws_api_gateway_gateway_response` | Gateway-level error responses with CORS headers (`DEFAULT_4XX`, `DEFAULT_5XX`, `MISSING_AUTHENTICATION_TOKEN`, `RESOURCE_NOT_FOUND`) |
| `aws_api_gateway_deployment` | Versioned deployment with OpenAPI hash trigger |
| `aws_api_gateway_stage` | Deploys the active API to `stage_name` |

---
| `aws_api_gateway_rest_api` | The REST API with multipart/form-data binary support |
| `aws_api_gateway_resource` | Routes: `/batch`, `/batch/jobs`, `/batch/jobs/{job_id}`, `/batch/jobs/{job_id}/logs`, `/batch/jobs/{job_id}/results`, `/batch/jobs/history_list` |
| `aws_api_gateway_method` | `POST /batch/jobs`, `GET` on status/logs/results/history_list, `OPTIONS` on all routes |
| `aws_api_gateway_integration` | `AWS_PROXY` Lambda integrations for all functional routes; `MOCK` integrations for CORS preflight |
| `aws_api_gateway_method_response` | `200` responses declaring CORS headers on all OPTIONS routes |
| `aws_api_gateway_integration_response` | Injects `Access-Control-Allow-*` headers into preflight responses |
| `aws_api_gateway_deployment` | Versioned deployment with sha1 change-detection trigger |
| `aws_api_gateway_stage` | Deploys the API to the configured stage name |

## Routes (from OpenAPI)

270 changes: 145 additions & 125 deletions readme.md
@@ -1,155 +1,175 @@
# Notebook Platform - Local Setup Guide
# LifeWatch Notebook Platform DevOps

Infrastructure, CI/CD, and validation assets for running notebooks through AWS Batch behind an API Gateway and Lambda control plane.

## Components

- Terraform infrastructure for the dev environment.
- Lambda handlers and API request client scripts.
- Worker container build and publish pipeline.
- End-to-end notebook validation on AWS.
- Frontend operator UI for job submission and history.
- Demo notebook fixtures, including lightweight examples.

## Setup Decisions

The dev environment is shared by teammates.

- Notebook E2E workflows must not run terraform destroy.
- Cleanup is limited to transient AWS Batch EC2 instances created during execution.
- Terraform-managed infrastructure is intentionally preserved after E2E runs.

## Prerequisites

- [Docker](https://docs.docker.com/get-docker/)
- [Docker Compose](https://docs.docker.com/compose/install/)
- Terraform 1.6 or newer.
- AWS CLI configured with credentials that can manage the target dev stack.
- Docker (for worker image build and local checks).
- Node.js 18+ and npm (for frontend local runs).
- Python 3.10+ (for helper scripts and selected workflows).

## Project Structure
```
## Repository Layout

```text
dev-ops/
├── server/ # Django web app + Celery worker config
│ ├── jobs/ # Core app (models, views, tasks)
│ ├── notebook_platform/ # Django project settings
│ ├── k8s-manifests/ # Kubernetes manifests (optional, not sure if we will use these w/ Terraform)
│ ├── docker-compose.yml
│ ├── Dockerfile
│ └── requirements.txt
├── worker/ # Notebook execution worker
│ ├── Dockerfile
│ ├── worker.py
│ └── inputs/
│ └── environment.yaml # Base env for all runners, can be updated at runtime based on input files
└── .github/
└── workflows/
└── ci.yaml # Define actions
├── .github/workflows/ # CI/CD and E2E workflows
├── lifewatch_batch_platform/terraform/ # Terraform env + reusable modules + client scripts
├── frontend/ # React + TypeScript operator UI
├── worker/ # Batch worker image definition
├── demo_input/ # Notebook and payload fixtures for tests/manual runs
├── terraform-bootstrap/ # Remote-state bootstrap resources
└── job_profiles.json # Shared execution-profile catalogue
```

## Core Workflows

| Workflow | File | Purpose |
|---|---|---|
| Smoke Test Worker Containerization | .github/workflows/smoke-test-worker-containerization.yml | Builds worker image and validates container startup. |
| Deploy Worker Image to ECR | .github/workflows/deploy-worker-ecr.yml | Publishes worker image tags to ECR. |
| Terraform CI | .github/workflows/test-terraform-plan.yml | Runs fmt, validate, plan, and optional terraform test. |
| Notebook E2E Deploy and Run | .github/workflows/e2e-notebook-deploy-and-run.yml | Applies infra, runs notebook E2E matrix, uploads artifacts, then performs EC2-only cleanup. |

## Documentation Index

- [Notebook E2E Runbook](.github/workflows/e2e-notebook-testing.md)
- [Lightweight Notebook Fixtures](demo_input/lightweight-notebooks/README.md)
- [Terraform dev Environment](lifewatch_batch_platform/terraform/environments/dev/readme.md)
- [Frontend Guide](frontend/README.md)
- [Terraform Bootstrap Guide](terraform-bootstrap/readme.md)

## Quick Start

### 1. Build and Start All Services
You will need to do this every time you want to develop locally.
```
cd server
docker compose up --build -d
```
This will spin up the following server components:
- PostgreSQL Database # Stand-in for RDS, not used in production
- Redis Cache (Used by Celery) # Stand-in for ElastiCache, not used in production
- Minio # Stand-in for S3, not used in production
- Celery Service
- Django Server

### 2. Run Database Migrations (First Time ONLY)
This is usually needed when running for the first time, or when making changes to the models.
In a separate terminal:
```
docker compose exec web python manage.py migrate
```
### 3. Create a Superuser (First Time ONLY)
```
docker compose exec web python manage.py createsuperuser
```
### 4. Start minikube
```
minikube start --driver=docker
```
### 5. Apply rbac.yaml (First Time ONLY)
```
cd server
kubectl apply -f minikube-rbac.yaml
```
### 6. Build Worker
This will build the image inside minikube.
```
minikube image build -t r-notebook-worker:latest .
```
### 7. Create Network Proxy (Windows Only)
```
kubectl proxy --address='0.0.0.0' --port=8001 --accept-hosts='^.*'
```
### Frontend

### 8. Access the Application
```
http://localhost:8000
```bash
cd frontend
npm install
npm run dev
```

## Useful Commands
```
# View Jobs (-w to keep watching)
kubectl get pods -w
### Terraform Dev Environment

# Get the logs of a specific job
kubectl logs <NAME>
```bash
cd lifewatch_batch_platform/terraform/environments/dev
terraform init
terraform plan
```

# Run commands inside the web container
docker compose exec web python <SOME COMMAND>
## Deployment Order (Recommended)

# Open a shell inside the web container
docker compose exec web bash
1. Initialize remote state bootstrap resources from terraform-bootstrap if not already provisioned.
2. Apply the dev environment Terraform stack.
3. Build and publish worker image to ECR.
4. Run notebook E2E workflows (a condensed sketch of these steps follows below).

# Run tests inside docker
docker compose exec web python manage.py test
```
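
The deployment order above condenses to roughly the following sketch. The account ID, region, and image name are placeholders, and the ECR repository name is an assumption; substitute your own values:

```bash
# Condensed sketch of the recommended deployment order
terraform -chdir=terraform-bootstrap init
terraform -chdir=terraform-bootstrap apply        # once, if remote state is not yet provisioned
terraform -chdir=lifewatch_batch_platform/terraform/environments/dev init
terraform -chdir=lifewatch_batch_platform/terraform/environments/dev apply

# Build and publish the worker image (repo name "lifewatch-worker" is hypothetical)
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com"
docker build -t "$ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/lifewatch-worker:latest" ./worker
docker push "$ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/lifewatch-worker:latest"
# Finally, trigger the Notebook E2E workflow (see Troubleshooting for a gh CLI example).
```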
## Onboarding Checklist

## CI/CD
A GitHub Actions workflow is located at `.github/workflows/ci.yaml` and runs automatically on push or can be triggered manually.
It builds everything as described above, uses a cache to speed things up, and runs a GET against the login screen to verify everything is running.
Use this checklist for new contributors working on the dev environment.

## Pre-commit Hooks
1. Clone the repository and verify toolchain versions from Prerequisites.
2. Confirm AWS access by running aws sts get-caller-identity.
3. Ensure access to required GitHub repository secrets (see matrix below).
4. Run Terraform init and plan from lifewatch_batch_platform/terraform/environments/dev.
5. Run worker smoke test locally with docker build ./worker and a quick container run (see the sketch after this list).
6. Start frontend locally (npm run dev) and verify API base URL configuration.
7. Review the E2E runbook before triggering shared-environment workflows.
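
A quick pass over checklist items 2-6 might look like this sketch; the image tag is illustrative and the bare `docker run` assumes the worker's default entrypoint is safe to invoke locally:

```bash
aws sts get-caller-identity                       # confirm AWS access
terraform -chdir=lifewatch_batch_platform/terraform/environments/dev init
terraform -chdir=lifewatch_batch_platform/terraform/environments/dev plan
docker build -t lifewatch-worker:smoke ./worker   # worker smoke build
docker run --rm lifewatch-worker:smoke            # quick container run; entrypoint may differ
(cd frontend && npm install && npm run dev)       # frontend local run
```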

This repository includes a `.pre-commit-config.yaml` with fast checks for:
- file hygiene (whitespace, EOF, YAML/JSON/TOML)
- Python lint/format for worker and backend Lambda/client scripts
- Terraform format + validation on changed Terraform directories
- frontend ESLint for TypeScript/React sources
- notebook output stripping under `demo_input`
## Local Validation Commands

Install and enable hooks:
```bash
pip install pre-commit
pre-commit install
```
# Terraform style and validation
terraform fmt -recursive
terraform -chdir=lifewatch_batch_platform/terraform/environments/dev validate

Run all hooks manually:
```bash
pre-commit run --all-files
# Module tests (if present)
terraform -chdir=lifewatch_batch_platform/terraform/modules/api_gateway test

# Frontend checks
cd frontend && npm run build
```

## AWS Deployment with Zappa
## Terraform Environment Notes

The Django server is deployed to AWS Lambda using Zappa. To keep secrets out of version control, we use a template for the Zappa configuration and inject environment variables during deployment.
For detailed infrastructure inputs, outputs, and module dependencies, use:

### AWS Prerequisites
Before deploying, ensure the following resources and credentials exist in your AWS account:
- **IAM User:** Credentials configured locally or in the CI/CD pipeline with permissions to manage Lambda, API Gateway, S3, and IAM roles.
- **S3 Deployment Bucket:** A bucket for Zappa to store its zip packages during deployment. Different from the Django storage bucket.
- **RDS PostgreSQL Instance:** Production database (replacing the local Docker Postgres).
- **S3 Storage Bucket:** A bucket for Django's static and media files (replacing the local Minio setup).
lifewatch_batch_platform/terraform/environments/dev/readme.md

### Environment
We use `zappa_settings.template.json` as our base. The secrets are provided and replaced through environment variables. If deploying locally, export these variables in the terminal or use an env file. In CI/CD, set these as pipeline secrets.
## Security and Secrets

These must be set:
```
export AWS_REGION="eu-west-1"
export S3_ZAPPA_BUCKET_NAME="your-zappa-deploy-bucket"
export DJANGO_SECRET_KEY="your_secure_secret_key"
export DATABASE_URL="postgres://user:url_encoded_password@rds-host:5432/dbname"
export AWS_STORAGE_BUCKET_NAME="your-app-storage-bucket"
export ALLOWED_HOSTS="your-api-gateway-url.amazonaws.com"
export WORKER_CALLBACK_URL="https://your-api-gateway-url.amazonaws.com"
export WORKER_WEBHOOK_SECRET="your_secret"
```
- Never commit API keys, AWS keys, or Terraform state.
- Required credentials must be injected via GitHub Actions secrets or secured local environment configuration.
- E2E API authentication is validated through both positive and negative tests.
- For local Terraform runs, var files may be used, for example terraform apply -var-file secrets.tfvars (sketched below).
- Preferred CI pattern is environment variables and secret stores instead of long-lived local secret files.
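
A minimal sketch of the var-file pattern follows. The variable names are assumptions mirroring the CONTAINER_IMAGE and BATCH_EXECUTION_ROLE_ARN secrets, and all values are placeholders; keep the file out of version control:

```bash
# Illustrative only: a local, git-ignored var file for dev runs
cat > secrets.tfvars <<'EOF'
container_image          = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/lifewatch-worker:latest"
batch_execution_role_arn = "arn:aws:iam::123456789012:role/lifewatch-batch-execution"
EOF
terraform apply -var-file secrets.tfvars
```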

### Build and Deploy
Generate the final configuration file and deploy:
```
# Generate the active settings file by replacing placeholders
envsubst < zappa_settings.template.json > zappa_settings.json
## GitHub Secrets by Workflow

# Deploy for the first time
zappa deploy dev
The following repository secrets are required by CI/CD workflows.

# OR, update an existing deployment
zappa update dev
```
| Workflow | Required secrets |
|---|---|
| Smoke Test Worker Containerization | None |
| Deploy Worker Image to ECR | AWS_ACCESS_KEY, AWS_SECRET_KEY |
| Terraform CI | TERRAFORM_ENV_DIR, AWS_ACCESS_KEY, AWS_SECRET_KEY, CONTAINER_IMAGE, BATCH_EXECUTION_ROLE_ARN |
| Notebook E2E Deploy and Run | TERRAFORM_ENV_DIR, AWS_ACCESS_KEY, AWS_SECRET_KEY, CONTAINER_IMAGE, BATCH_EXECUTION_ROLE_ARN |

Notes:
- TERRAFORM_ENV_DIR should point to the active environment directory, for example lifewatch_batch_platform/terraform/environments/dev.
- Rotate AWS credentials regularly and prefer short-lived credentials where possible.
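
One way to populate these secrets is the GitHub CLI; a sketch with placeholder values:

```bash
gh secret set AWS_ACCESS_KEY --body "AKIA..."
gh secret set AWS_SECRET_KEY --body "..."
gh secret set TERRAFORM_ENV_DIR --body "lifewatch_batch_platform/terraform/environments/dev"
gh secret set CONTAINER_IMAGE --body "123456789012.dkr.ecr.eu-west-1.amazonaws.com/lifewatch-worker:latest"
gh secret set BATCH_EXECUTION_ROLE_ARN --body "arn:aws:iam::123456789012:role/lifewatch-batch-execution"
```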

## Troubleshooting

### Terraform CI fails at plan

- Confirm TERRAFORM_ENV_DIR points to a valid environment folder.
- Verify CONTAINER_IMAGE and BATCH_EXECUTION_ROLE_ARN are set in repository secrets.
- Re-run terraform init locally in the same environment directory to reproduce.
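
To reproduce the CI plan locally, something like the following sketch may help. The `-var` names are assumptions mirroring the repository secrets; use whatever variable names the environment actually declares:

```bash
cd lifewatch_batch_platform/terraform/environments/dev
terraform init
terraform plan \
  -var "container_image=$CONTAINER_IMAGE" \
  -var "batch_execution_role_arn=$BATCH_EXECUTION_ROLE_ARN"
```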

### terraform test fails after module refactor

- Check whether tests still reference removed resources after architecture changes.
- Update assertions to match source of truth (for example OpenAPI body imports versus explicit method resources).
- Run module tests locally before pushing.

### E2E fails with API auth errors

- Validate API key output from Terraform and ensure request header uses x-api-key.
- Run the negative API key check: a 401 or 403 response to an invalid key should be treated as a pass (see the sketch below).
- Confirm gateway-level CORS responses are present for API Gateway-generated errors.
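
A quick positive/negative check with curl, assuming `API_URL` and `API_KEY` are taken from the Terraform outputs (both values here are placeholders):

```bash
curl -s -o /dev/null -w "%{http_code}\n" -H "x-api-key: $API_KEY" "$API_URL/batch/jobs/history_list"   # expect 200
curl -s -o /dev/null -w "%{http_code}\n" -H "x-api-key: invalid"  "$API_URL/batch/jobs/history_list"   # 401 or 403 = pass
```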

### E2E completes but leaves compute instances

- Verify EC2 cleanup step logs in the E2E workflow artifacts.
- Check for instances tagged with lifewatch-batch-ec2-* and terminate manually if needed.
- Keep terraform destroy disabled for shared dev environment policy compliance.
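
A sketch of the manual cleanup, assuming the instances carry the `lifewatch-batch-ec2-*` prefix on their Name tag (the tag key is an assumption; adjust to how your instances are actually tagged):

```bash
# List candidate instance IDs, then terminate only the ones you have verified
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=lifewatch-batch-ec2-*" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # substitute the IDs found above
```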

### Worker image deploy does not trigger downstream E2E

- Confirm Deploy Worker Image to ECR ran on main and completed successfully.
- Check workflow_run trigger conditions and branch filters in dependent workflows.
- Use manual workflow_dispatch as a controlled fallback.
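
The manual fallback can be driven from the GitHub CLI; the workflow file name matches the Core Workflows table above:

```bash
gh workflow run e2e-notebook-deploy-and-run.yml --ref main
gh run list --workflow e2e-notebook-deploy-and-run.yml --limit 1   # confirm it started
```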

## Notes on Scope

- This repository is focused on infrastructure and delivery workflows.
- Application-level backend behavior and frontend runtime details are documented in their respective subproject guides.