Skip to content

Fine-tuned Llama 3.1 model designed to detect accidental hardcoded secrets and credentials in source code

License

Notifications You must be signed in to change notification settings

bcgov/Smart-Secrets-Scanner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

67 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Smart-Secrets-Scanner

Project Overview and Repo Purpose

This project fine‑tunes Meta Llama 3.1 (8B) using LoRA/QLoRA to detect accidental hardcoded secrets (API keys, tokens, passwords, etc.) in source code. It uses a comprehensive WSL/CUDA environment setup for GPU‑accelerated training and deterministic inference.

πŸš€ Systematic Workflow Execution: Following the CUDA-ML-ENV-SETUP.md protocol, we've successfully completed environment setup, dataset preparation (72 examples), base model download (15-30GB), and are currently executing LoRA fine-tuning. The project uses a task-based tracking system with real-time git commits to maintain reproducible progress.

🎯 Current Status (November 2025): LoRA fine-tuning completed successfully with 8 training examples (1:43 runtime, train loss: 0.776). Next steps: adapter merging β†’ GGUF conversion β†’ Ollama deployment.

The repository provides a reproducible pipeline and scripts to prepare JSONL datasets, train adapters, merge and export models (GGUF), run evaluations, and deploy to runtimes such as Ollama or Hugging Face. This project demonstrates systematic GPU‑accelerated fine‑tuning with comprehensive task tracking and follows BC Gov licensing and governance guidance.

⚠️ Important: this repository is a demonstration and research project. The embedded "Smart Secrets Scanner" examples are intended for experimentation and testing of CUDA-accelerated fine-tuning only. They are not a production-grade secret-scanning solution and must not be used as a replacement for established commercial or enterprise secret-scanning tools (for example, Snyk or Wiz). Use this project to learn and validate model workflows, and rely on proven scanning products for operational security.

Prerequisites

  • WSL2 (Ubuntu) with NVIDIA GPU drivers (latest version)
  • Python 3.11 (managed by our custom setup)
  • CUDA 12.6 with PyTorch 2.9.0+cu126
  • Git LFS for large file handling

One-time activities

1. ensure in windows you have ubuntu and WSL setup

see tasks\done\01-setup-wsl2-ubuntu.md

2. install NVDIA cuda drivers

see tasks\done\02-install-nvidia-cuda-drivers.md

For detailed environment setup instructions, see CUDA-ML-ENV-SETUP.md.

Quick Environment Check: If you have an existing ~/ml_env from another compatible project, try activating it first and running the verification scripts. Only run the full setup if verification fails.


Fine-tuning overview

Complete Fine-Tuning Workflow

sequenceDiagram
    participant User
    participant Data as πŸ“Š Data Preparation
    participant Base as πŸ”½ Base Model
    participant Train as πŸŽ“ Fine-Tuning
    participant Export as πŸ“¦ Export
    participant Test as πŸ§ͺ Testing
    participant Deploy as πŸš€ Deployment

    User->>Data: 1. Create JSONL training data
    Note over Data: data/processed/smart-secrets-scanner-dataset.jsonl<br/>(72 validated examples - LLM-driven generation)
    
    User->>Base: 2. Download base model
    Note over Base: models/base/Meta-Llama-3.1-8B/<br/>(15-30 GB from Hugging Face)
    
    Data->>Train: 3. Load training data
    Base->>Train: 4. Load base model
    Train->>Train: 5. Fine-tune with LoRA/QLoRA
    Note over Train: outputs/checkpoints/<br/>outputs/logs/ (currently running)
    
    Train->>Export: 6. Merge base + adapter
    Note over Export: models/merged/<br/>(full model weights)
    
    Export->>Export: 7. Convert to GGUF
    Note over Export: models/gguf/<br/>smart-secrets-scanner.gguf
    
    Export->>Test: 8. Test with JSONL
    Note over Test: evaluation/test.jsonl<br/>Calculate precision/recall/F1
    
    Export->>Test: 9. Test with raw files
    Note over Test: Feed .py, .js, .yaml files<br/>Check secret detections
    
    Test->>Deploy: 10. Deploy to Ollama/llama.cpp
    Note over Deploy: Create Modelfile<br/>Run pre-commit scans
    
    Deploy-->>User: βœ… Fine-tuned model ready!
Loading

Workflow Overview

graph TD
    subgraph "Phase 1: Environment & Data (30-60 min)"
        A["<i class='fa fa-cogs'></i> setup_cuda_env.py<br/>*Unified environment setup*<br/>*Installs CUDA + PyTorch*<br/>&nbsp;"]
        A1["<i class='fa fa-check-circle'></i> Verify Environment<br/>*Run test scripts*<br/>*Check CUDA availability*<br/>&nbsp;"]
        B["<i class='fa fa-pen-ruler'></i> LLM-Driven Dataset Creation<br/>*72 training examples*<br/>*JSONL format with secrets*<br/>&nbsp;"]
        B1["<i class='fa fa-search'></i> validate_dataset.py<br/>*Validate data quality*<br/>*Check schema & balance*<br/>&nbsp;"]
        A_out(" <i class='fa fa-folder-open'></i> ~/ml_env venv")
        A1_out(" <i class='fa fa-check'></i> Environment Verified")
        B_out(" <i class='fa fa-file-alt'></i> smart-secrets-scanner-train.jsonl")
        B1_out(" <i class='fa fa-certificate'></i> Data Validated")
    end

    subgraph "Phase 2: Model Forging (1-3 hours)"
        C0["<i class='fa fa-download'></i> download_model.sh<br/>*Downloads Llama 3.1 8B*<br/>*15-30 GB from HF*<br/>&nbsp;"]
        C1["<i class='fa fa-cog'></i> training_config.yaml<br/>*Configure LoRA params*<br/>*Set batch size, epochs*<br/>&nbsp;"]
        C["<i class='fa fa-microchip'></i> fine_tune.py<br/>*QLoRA fine-tuning with optimizations*<br/>*15 epochs, 3-4x faster training (pending validation)*<br/>*Resume from checkpoints, system monitoring*<br/>&nbsp;"]
        C2["<i class='fa fa-chart-line'></i> Monitor Training<br/>*TensorBoard logs*<br/>*Check validation loss*<br/>&nbsp;"]
        C0_out(" <i class='fa fa-cube'></i> Base Model (15-30GB)")
        C1_out(" <i class='fa fa-file-code'></i> Training Config")
        C_out(" <i class='fa fa-puzzle-piece'></i> LoRA Adapter")
        C2_out(" <i class='fa fa-chart-bar'></i> Training Metrics")
    end

    subgraph "Phase 3: Packaging & Publishing (30-60 min)"
        D1["<i class='fa fa-compress-arrows-alt'></i> merge_adapter.py<br/>*Merge base + LoRA*<br/>*Create full model*<br/>&nbsp;"]
        D2["<i class='fa fa-compress-arrows-alt'></i> convert_to_gguf.py<br/>*Convert to GGUF*<br/>*Q4_K_M + Q8_0 quantization*<br/>&nbsp;"]
        D3["<i class='fa fa-vial'></i> Test Merged Model<br/>*Quick inference test*<br/>*Verify model works*<br/>&nbsp;"]
        E["<i class='fa fa-upload'></i> upload_to_huggingface.py<br/>*Publish to HF Hub*<br/>*Share with community*<br/>&nbsp;"]
        D1_out(" <i class='fa fa-puzzle-piece'></i> Merged Model (15GB)")
        D2_out(" <i class='fa fa-cube'></i> GGUF Model (4-8GB)")
        D3_out(" <i class='fa fa-check'></i> Model Verified")
        E_out(" <i class='fa fa-cloud'></i> Hugging Face Hub")
    end

    subgraph "Phase 4: Local Deployment (15-30 min)"
        F["<i class='fa fa-file-code'></i> create_modelfile.py<br/>*Generate Ollama Modelfile*<br/>*Configure model settings*<br/>&nbsp;"]
        F1["<i class='fa fa-terminal'></i> ollama create<br/>*Import GGUF to Ollama*<br/>*Create local model*<br/>&nbsp;"]
        F2["<i class='fa fa-play'></i> Test Ollama Model<br/>*Interactive testing*<br/>*Verify secret detection*<br/>&nbsp;"]
        F_out(" <i class='fa fa-terminal'></i> Ollama Modelfile")
        F1_out(" <i class='fa fa-robot'></i> Ollama Model")
        F2_out(" <i class='fa fa-comments'></i> Model Tested")
    end
    
    subgraph "Phase 5: E2E Verification (30-60 min)"
        H["<i class='fa fa-power-off'></i> python scripts/evaluate.py<br/>*Run comprehensive eval*<br/>*Calculate metrics*<br/>&nbsp;"]
        I["<i class='fa fa-bolt'></i> Evaluate JSONL<br/>*Test with eval dataset*<br/>*Precision/Recall/F1*<br/>&nbsp;"]
        J["<i class='fa fa-brain'></i> Test Raw Files<br/>*Feed source code files*<br/>*Check secret detection*<br/>&nbsp;"]
        I_out(" <i class='fa fa-file-invoice'></i> Evaluation Results")
        J_out(" <i class='fa fa-file-signature'></i> Inference Outputs")
        K_out(" <i class='fa fa-check-circle'></i> Model Verified & Ready")
    end

    A -- Creates --> A_out;
    A_out --> A1;
    A1 -- Verifies --> A1_out;
    A1_out --> B;
    B -- Creates --> B_out;
    B_out --> B1;
    B1 -- Validates --> B1_out;
    B1_out --> C0;
    C0 -- Downloads --> C0_out;
    C0_out --> C1;
    C1 -- Configures --> C1_out;
    C1_out --> C;
    C -- Trains --> C_out;
    C_out --> C2;
    C2 -- Monitors --> C2_out;
    C2_out --> D1;
    D1 -- Creates --> D1_out;
    D1_out --> D2;
    D2 -- Creates --> D2_out;
    D2_out --> D3;
    D3 -- Verifies --> D3_out;
    D3_out --> E;
    E -- Pushes to --> E_out;
    E_out -- Pulled for --> F;
    F -- Creates --> F_out;
    F_out --> F1;
    F1 -- Imports --> F1_out;
    F1_out --> F2;
    F2 -- Tests --> F2_out;
    F2_out -- Enables --> H;
    H -- Executes --> I;
    H -- Executes --> J;
    I -- Yields --> I_out;
    J -- Yields --> J_out;
    I_out & J_out --> K_out;

    classDef script fill:#e8f5e8,stroke:#333,stroke-width:2px;
    classDef artifact fill:#e1f5fe,stroke:#333,stroke-width:1px,stroke-dasharray: 5 5;
    classDef planned fill:#fff3e0,stroke:#888,stroke-width:1px,stroke-dasharray: 3 3;

    class A,A1,B,B1,C0,C1,C,C2,D1,D2,D3,E,F,F1,F2,H,I,J script;
    class A_out,A1_out,B_out,B1_out,C0_out,C1_out,C_out,C2_out,D1_out,D2_out,D3_out,E_out,F_out,F1_out,F2_out,I_out,J_out,K_out artifact;
    class D1,D2,E,F,H,I,J,D1_out,D2_out,E_out,F_out,I_out,J_out,K_out planned;
Loading

Step-by-Step Process

⚠️ Note on Datasets: Training data files (.jsonl) are intentionally excluded from this repository via .gitignore to avoid triggering GitHub's secret scanning on example secrets. The dataset structure and templates are documented in data/README.md. You can generate your own training data using the provided validation script.

Phase 1: Data Preparation πŸ“Š

  1. Create JSONL training data β†’ data/processed/smart-secrets-scanner-dataset.jsonl
    • 72 validated examples with instruction/input/output format (LLM-driven generation)
    • Covers secrets (AWS, Stripe, GitHub tokens, API keys) and safe patterns (env vars, test data)
    • Architecture Decision Record: See adrs/0007-llm-driven-dataset-creation.md
  2. Validate training dataset β†’ Comprehensive quality checks
    • Schema validation, balance verification, secret pattern coverage

Phase 2: Model Fine-Tuning πŸŽ“

  1. Download base model β†’ models/base/Meta-Llama-3.1-8B/
    • βœ… Completed: Successfully downloaded 15-30GB from Hugging Face
    • Uses authenticated token from .env file
  2. Fine-tune with LoRA/QLoRA β†’ Creates adapter in models/fine-tuned/
    • πŸ”„ Currently Running: python scripts/fine_tune.py executing in WSL
    • Production script with system diagnostics, checkpoint resume, and structured logging
    • Optimized configuration: 4-bit quantization, LoRA r=16, alpha=32, dropout=0.05
    • Real-time monitoring: tail -f outputs/logs/training.log
    • Expected completion: 1-3 hours on RTX 30-series GPU
  3. Review training logs β†’ outputs/logs/ (available during/after training)
    • Loss curves, validation metrics, GPU utilization tracking

Phase 3: Model Export πŸ“¦

  1. Merge base model + LoRA adapter β†’ models/merged/
    • Combine base weights with fine-tuned adapter
    • Creates full model ready for inference
  2. Convert to GGUF format β†’ models/gguf/smart-secrets-scanner.gguf
    • Use llama.cpp converter
    • Optimized for CPU/GPU inference
  3. Quantize GGUF (optional)
    • Q4_K_M (smaller, faster), Q8_0 (larger, more accurate)

Phase 4: Testing & Deployment πŸ§ͺπŸš€

  1. Test with evaluation JSONL β†’ Calculate precision, recall, F1 score
  2. Test with raw files β†’ Feed complete .py/.js/.yaml files to model
  3. Deploy to Ollama β†’ Create Modelfile, import GGUF, run locally
  4. Integrate with pre-commit hooks β†’ Scan code before git commits

Current Project Status (November 2025)

βœ… Completed Phases

  • Phase 0: Repository setup, WSL configuration, NVIDIA drivers, llama.cpp build
  • Phase 1: Environment setup with setup_cuda_env.py, surgical CUDA binary installations
  • Phase 1: Dataset creation (72 examples via LLM-driven approach), validation completed
  • Phase 2: Base model download (Llama-3.1-8B, 15-30GB) successfully completed

πŸ”„ Currently Executing

  • Phase 2: LoRA fine-tuning actively running in WSL terminal
  • Real-time progress: Loading base model β†’ configuring 4-bit quantization β†’ training loop
  • Monitoring: tail -f outputs/logs/training.log
  • Expected completion: 1-3 hours depending on GPU

πŸ“‹ Next Steps (Pending Fine-tuning Completion)

  • Phase 2: Merge LoRA adapter with base model
  • Phase 3: Convert merged model to GGUF format (Q4_K_M quantization)
  • Phase 3: Create Ollama Modelfile and test deployment
  • Phase 4: Comprehensive evaluation and Hugging Face upload

πŸ“Š Key Metrics

  • Dataset: 72 validated training examples (JSONL format)
  • Model: Llama-3.1-8B base with LoRA fine-tuning (r=16, alpha=32)
  • Hardware: RTX 30-series GPU with CUDA 12.6
  • Environment: Python 3.11, PyTorch 2.9.0+cu126

πŸ“ Task Tracking

All progress tracked in tasks/ directory with real-time git commits:

  • βœ… Tasks 58-67: Environment setup and model preparation
  • πŸ”„ Task 68: Fine-tuning (in progress)
  • πŸ“‹ Tasks 69-74: Post-training workflow (pending)

Folder Structure

Smart-Secrets-Scanner/
β”œβ”€β”€ adrs/                           # Architecture Decision Records
β”œβ”€β”€ scripts/                        # Bash scripts for setup, training, inference
β”œβ”€β”€ notebooks/                      # Jupyter notebook templates
β”œβ”€β”€ tasks/                          # Task tracking (backlog, in-progress, done)
β”œβ”€β”€ data/                           # Training and evaluation datasets
β”‚   β”œβ”€β”€ raw/                        # Original unprocessed data
β”‚   β”œβ”€β”€ processed/                  # JSONL training data (put your .jsonl files here)
β”‚   └── evaluation/                 # Test sets and benchmarks
β”œβ”€β”€ models/                         # Model files
β”‚   β”œβ”€β”€ base/                       # Base pre-trained models (downloaded)
β”‚   β”œβ”€β”€ fine-tuned/                 # Fine-tuned adapters (LoRA/QLoRA)
β”‚   β”œβ”€β”€ merged/                     # Merged models (base + adapter)
β”‚   └── gguf/                       # Quantized models for deployment
β”œβ”€β”€ outputs/                        # Training outputs
β”‚   β”œβ”€β”€ checkpoints/                # Training checkpoints
β”‚   β”œβ”€β”€ logs/                       # Training logs and metrics
β”‚   └── temp/                       # Temporary files during training
β”œβ”€β”€ .gitignore                      # Excludes large model files and sensitive data
└── README.md                       # Project documentation

**Key directories for your Smart Secrets Scanner use case:**
- **`data/processed/`** β†’ Put your JSONL training data here (e.g., `smart-secrets-scanner-train.jsonl`)
- **`models/base/`** β†’ Base model downloads (e.g., [Llama 3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B))
- **`models/fine-tuned/`** β†’ Your trained LoRA adapters
- **`models/merged/`** β†’ Merged full models (base + adapter)
- **`models/gguf/`** β†’ Quantized models ready for Ollama deployment

---

### CLI Scripts for workflow

#### Approach 1: CLI Scripts (Production Workflow)

**Best for**: Automation, reproducibility, CI/CD pipelines, production deployment

#### Quick Start - CLI

```bash
# Phase 1: Environment & Data Preparation (Completed βœ…)
# Step 1: Environment setup completed with setup_cuda_env.py
# Step 2: Dataset created and validated (72 examples)
# Step 3: Base model downloaded successfully

# Phase 2: Model Fine-Tuning (Currently Running πŸ”„)
# Step 4: Fine-tuning in progress
python scripts/fine_tune.py  # Currently executing in WSL

# Step 5: Monitor training progress
tail -f outputs/logs/training.log

# Phase 3: Model Export (Next Steps πŸ“‹)
# Step 6: Merge adapter with base model
python scripts/merge_adapter.py --skip-sanity

# Step 7: Convert to GGUF format
python scripts/convert_to_gguf.py --quant Q4_K_M --force

# Step 8: Create Ollama Modelfile
python scripts/create_modelfile.py

# Step 9: Deploy to Ollama
ollama create smart-secrets-scanner -f Modelfile
ollama run smart-secrets-scanner

# Phase 4: Testing & Publishing
# Step 10: Run comprehensive evaluation
python scripts/evaluate.py

# Step 11: Upload to Hugging Face
python scripts/upload_to_huggingface.py --repo yourusername/smart-secrets-scanner --gguf --modelfile --readme

python scripts/merge_adapter.py

Step 8: Convert to GGUF

python scripts/convert_to_gguf.py

Step 9: Quantize (automatic in step 8)

Phase 4: Testing & Deployment (Steps 10-13)

Step 10: Test with evaluation JSONL

python scripts/evaluate.py

Step 11: Test with raw files

python scripts/inference.py --batch data/raw/

Step 12: Deploy to Ollama

python scripts/create_modelfile.py ollama create smart-secrets-scanner -f Modelfile

Step 13: Integrate with pre-commit hooks

python scripts/scan_secrets.py --file examples/test.py


### Important note: local GGUF vs deployed model differences

If you see differing behaviour between a locally-run GGUF model (for example, running with `ollama run` or a local gguf file) and a model referenced via a Modelfile that points to a Hugging Face-deployed version, this is expected to sometimes occur β€” and is important to document:

- Observed: the local GGUF model produced correct/expected detections, but switching the Modelfile to point at the Hugging Face hosted model produced different outputs (fewer alerts / different classification).

Possible causes and quick checklist to diagnose:

1. Model variant mismatch: ensure the Modelfile `FROM` path references the exact same model and revision you tested locally (same commit/gguf file). A different model ID, revision, or quantization will change outputs.
2. Quantization and format: local GGUF may be quantized (Q4/K/M etc.); the deployed HF model may be a different numeric format (FP16, FP32) or different tokenizer settings.
3. Tokenizer / special tokens: confirm the tokenizer and special tokens are identical. Small tokenizer differences change prompts and results.
4. Generation/config differences: check temperature, top_p, seed, max_new_tokens, and any system/prompt wrappers used by the hosted deployment.
5. Prompt formatting and instruction wrapper: some deployments add instruction/system wrappers or strip newlines. Compare the exact prompt payload you send locally vs to HF endpoint.
6. Merge step: verify you merged the LoRA adapter into the base correctly for the deployed model (or that the hosted model already includes the adapter weights).
7. Environment and runtime differences: model serving infra (batching, safety filters, or post-processing) can change outputs.

Suggested reproduction steps:

1. Save the exact prompt you used locally (including wrappers).
2. Run a deterministic local inference with a fixed seed and generation params and save the response.
3. Call the hosted model with the identical prompt and generation parameters (where possible) and compare responses.
4. If outputs differ, check the model revision, tokenizer files, and Modelfile `FROM` line. Also compare any Modelfile-run-time environment variables or post-processing steps.

If you'd like, I can add a small troubleshooting section to `EXECUTION_GUIDE.md` with exact curl/examples for calling the HF endpoint and `ollama run` side-by-side to make this comparison repeatable.

πŸ“– **For detailed CLI instructions, see [EXECUTION_GUIDE.md](EXECUTION_GUIDE.md)**  
πŸ“‹ **For quick command reference, see [QUICK_REFERENCE.md](QUICK_REFERENCE.md)**

---

## Approach 2: Jupyter Notebooks (Interactive Workflow)

**Best for**: Learning, experimentation, data exploration, iterative development

> ⚠️ NOTE: The notebooks in `notebooks/` are provided as illustrative examples only. The project’s tested and validated workflow uses the scripts in the `scripts/` folder (for example, `scripts/fine_tune.py`, `scripts/validate_dataset.py`, `scripts/evaluate.py`). I have exercised the scripts-based approach locally (path example: `C:\Users\RICHFREM\source\repos\Smart-Secrets-Scanner\scripts`) β€” the notebooks themselves have not been executed or validated as part of the current tests. Use the notebooks for exploration only and prefer the scripts for reproducible runs.
⚠️ NOTE: The notebooks in `notebooks/` are provided as illustrative examples only. The project’s tested and validated workflow uses the scripts in the `scripts/` folder (for example, `scripts/fine_tune.py`, `scripts/validate_dataset.py`, `scripts/evaluate.py`). I have exercised the scripts-based approach locally (path example: `./scripts`) β€” the notebooks themselves have not been executed or validated as part of the current tests. Use the notebooks for exploration only and prefer the scripts for reproducible runs.

### Quick Start - Notebooks

Open and run notebooks in order:

1. **`notebooks/01_data_exploration.ipynb`** (Phase 1: Steps 1-3)
   - Inspect and validate JSONL training data
   - Analyze class balance and examples
   - Visualize dataset statistics

2. **`notebooks/02_fine_tuning_interactive.ipynb`** (Phase 2: Steps 4-6)
   - Setup environment and download model
   - Fine-tune with real-time progress tracking
   - Visualize loss curves and metrics

3. **`notebooks/03_model_evaluation.ipynb`** (Phase 3-4: Steps 7-11)
   - Merge adapter and convert to GGUF
   - Run comprehensive evaluation
   - Test inference on code samples

4. **`notebooks/04_deployment_testing.ipynb`** (Phase 4: Steps 12-13)
   - Create Modelfile and deploy to Ollama
   - Test pre-commit scanning workflow
   - Validate production deployment

πŸ““ **Interactive learning path - run cells sequentially for guided workflow**

---

## Project Structure

- **`scripts/`** - Shell and Python scripts for training and deployment
- **`config/`** - Configuration files (training hyperparameters)
- **`data/`** - Training datasets (JSONL format)
- **`models/`** - Base models, fine-tuned adapters, and GGUF exports
- **`outputs/`** - Training checkpoints, logs, and merged models
- **`tasks/`** - Task tracking and project management

## Key Scripts

| Script | Purpose | Task |
|--------|---------|------|
| `scripts/fine_tune.py` | Train LoRA adapter | Task 36 |
| `scripts/merge_adapter.py` | Merge adapter with base model | Task 38 |
| `scripts/convert_to_gguf.py` | Convert to GGUF format | Task 39 |
| `scripts/evaluate.py` | Calculate metrics | Task 32 |
| `scripts/scan_secrets.py` | Pre-commit scanning | Task 41 |

See [SCRIPTS_TASKS_MAPPING.md](SCRIPTS_TASKS_MAPPING.md) for complete script documentation.

## Documentation

- **[EXECUTION_GUIDE.md](EXECUTION_GUIDE.md)** - Step-by-step execution instructions
- **[SCRIPTS_TASKS_MAPPING.md](SCRIPTS_TASKS_MAPPING.md)** - Scripts to tasks mapping
- **[data/README.md](data/README.md)** - Data folder structure and workflow
- **[tasks/](tasks/)** - Task tracking (backlog, in-progress, done)
- **[adrs/](adrs/)** - Architecture Decision Records

## Governance, licensing and repository requirements

This repository is published under the Apache License 2.0. The following items are recommended or required for BC Government GitHub repositories and our internal policies:

- Include a `LICENSE` file at the repository root (this project already contains `LICENSE` β€” Apache-2.0).
- Keep a short license/footer in top-level documentation (see footer at the bottom of this README).
- Add an Apache license header to distributed source files (scripts, Python files, etc.). See the `LICENSE` file for the exact wording. A short example header to include near the top of scripts is:

```text
Copyright 2025 British Columbia

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

If you run CI checks, add a check to ensure new/modified source files include the required header.

Prefer referencing repository-provided verification scripts rather than generic commands. See "Verify the environment & scripts" below.

Verify the environment & scripts

For environment setup and verification, see CUDA-ML-ENV-SETUP.md.

Repository-specific verification:

# Validate training data
python scripts/validate_dataset.py data/processed/smart-secrets-scanner-train.jsonl

# Run evaluation (after training/merge)
python scripts/evaluate.py --test-data data/evaluation/smart-secrets-scanner-test.jsonl

# Run pre-commit / scanner locally against a file
python scripts/scan_secrets.py --file examples/test.py


---

Copyright 2025 British Columbia

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Notes

  • Environment: All scripts designed for Bash/WSL2 (not PowerShell/Windows CMD)
  • Dependencies: Validated with PyTorch 2.9.0+cu126, CUDA 12.6, Python 3.11
  • Dataset: 72 validated training examples (LLM-driven generation, comprehensive secret coverage)
  • Current Status: Fine-tuning actively running in WSL (1-3 hours remaining)
  • Task Tracking: Comprehensive progress documented in tasks/ with git commits
  • Next Steps: Adapter merging β†’ GGUF conversion β†’ Ollama deployment β†’ evaluation

About

Fine-tuned Llama 3.1 model designed to detect accidental hardcoded secrets and credentials in source code

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published