
Manually Generated Multi-Model Image Dataset (HF-Hosted) + Import Script #49

@lukhsaankumar


Issue Type

  • Model: ML model bug, training issue, or architecture problem
  • Data: Dataset issue, preprocessing bug, or data pipeline problem
  • Web: Frontend bug or UI issue in the Next.js dashboard
  • API: Backend API bug or FastAPI endpoint issue
  • Research: Research question or experimental feature request
  • Documentation: Documentation bug or improvement needed
  • Bug: General bug fix needed
  • Enhancement: New feature or improvement request

Description

Create a manually generated, prompt-aligned dataset of AI-generated images using multiple image generation models (DALL-E, Midjourney, Stable Diffusion, etc.), upload it to Hugging Face Datasets as a raw file repository, and provide a local import script that pulls and parses the dataset into a usable directory structure for training and evaluation.

This ticket intentionally separates:

  • Data creation & hosting (manual, simple)
  • Data consumption (scripted, flexible)

Acceptance Criteria

  • Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
  • Images generated for as many generators as possible (based on usage limits and free trials)
  • Directory structure exactly matches spec
  • Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
  • README.md present and complete
  • scripts/import_manual_dataset.py exists and runs
  • Script successfully downloads and parses dataset locally

Deliverables

Hugging Face Dataset

Upload dataset to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images

This dataset is hosted as raw files (images + prompts), NOT as a pre-built datasets.Dataset object.

Import Script

Create a script at: scripts/import_manual_dataset.py

This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.

Additional Context

Prompt Source (MANDATORY)

All suggested prompts already exist; try not to change them unless a change is necessary for the generator to produce a realistic image.

To generate the prompts:

  1. Checkout the branch: data/manual-gen
  2. Run: scripts/create_manual_prompts.py
  3. This will create prompts at: dataset/manual-gen-images/prompts/

Files:

  • p001.txt through p100.txt

Each file contains one complete prompt.

Prompt Usage Rules

  • Use the prompt text from prompts/pXXX.txt
  • Generate one image per prompt per generator
  • Try not to embellish, rewrite, or "optimize" prompts to help realism
  • If a generator rejects a prompt:
    • Make the smallest possible policy-safe edit
    • Keep the same prompt_id
    • Document the change in README.md
  • IMPORTANT: If a prompt is adjusted for a specific generator, note the change in that prompt's pXXX.txt file as:

    *Original Prompt*
    *Generator*: *Adjusted Prompt*

Required Directory Structure (Local + HF)

Dataset root: manual-gen-images/

Prompts

prompts/
├── p001.txt
├── p002.txt
├── ...
└── p100.txt

Images (implicit metadata via paths)

images/
├── dalle/
│   ├── p001.png
│   ├── p002.png
│   └── ...
├── midjourney/
│   ├── p001.png
│   └── ...
├── bing/
│   ├── p001.png
│   └── ...
├── stable_diffusion/
│   ├── p001.png
│   └── ...
├── ideogram/
│   ├── p001.png
│   └── ...
├── flux/
│   ├── p001.png
│   └── ...
└── nanobanana/
    ├── p001.png
    └── ...

From this structure, metadata is inferred as:

  • generator = parent directory name
  • prompt_id = filename (e.g., p042)
  • prompt_text = contents of prompts/p042.txt
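As a rough illustration of this path-based metadata scheme (the `parse_image_path` helper and its error handling are this sketch's own, not part of the spec):

```python
import re
from pathlib import Path

PROMPT_ID_RE = re.compile(r"^p\d{3}$")  # p001 ... p100

def parse_image_path(image_path, dataset_root):
    """Infer (generator, prompt_id, prompt_file) from images/{generator}/pXXX.png."""
    image_path, dataset_root = Path(image_path), Path(dataset_root)
    generator = image_path.parent.name   # generator = parent directory name
    prompt_id = image_path.stem          # prompt_id = filename without extension
    if not PROMPT_ID_RE.match(prompt_id):
        raise ValueError(f"unexpected filename: {image_path.name}")
    # prompt_text = contents of prompts/pXXX.txt
    return generator, prompt_id, dataset_root / "prompts" / f"{prompt_id}.txt"
```

For example, `parse_image_path("manual-gen-images/images/dalle/p042.png", "manual-gen-images")` yields `("dalle", "p042", <path to prompts/p042.txt>)`.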

Generators to Use

Attempt as many as possible from the list below. Partial completion is acceptable.

| Generator | Notes |
| --- | --- |
| DALL-E | Use ChatGPT image generation (Plus recommended) |
| Bing Image Creator | Free standard generation available |
| Stable Diffusion | Local install OR Stability platform |
| Midjourney | Subscription required (try if access exists) |
| Ideogram | Limited free credits |
| FLUX | Available via Black Forest Labs |
| Nano Banana (Google) | Use Google AI Studio free trial (students) |

Generator Access Notes (MUST APPEAR IN README)


  • If you do not have ChatGPT Plus access for DALL-E, notify lukhsaankumar.
  • Nano Banana (Google DeepMind) can be accessed via Google AI Studio free trial (student accounts supported).
  • Midjourney availability varies; free trials may or may not be active.
  • If a generator cannot be accessed, skip it and document the reason in the dataset's README.md.

Image Saving Rules

  • Save images as PNG
  • Filename format: pXXX.png
  • No upscaling, cropping, filters, or post-processing
  • No watermarks
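A small pre-upload check along these lines could catch rule violations early (the `check_image` helper and its message strings are illustrative, not part of the ticket; the eight-byte magic signature is the standard PNG file header):

```python
import re
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"          # standard PNG signature
NAME_RE = re.compile(r"^p\d{3}\.png$")    # required filename format: pXXX.png

def check_image(path):
    """Return a list of saving-rule violations for one image file."""
    path = Path(path)
    problems = []
    if not NAME_RE.match(path.name):
        problems.append("filename must look like p001.png")
    with open(path, "rb") as f:
        if f.read(8) != PNG_MAGIC:
            problems.append("file is not a PNG (wrong magic bytes)")
    return problems
```

Note this only verifies the format and naming; the no-post-processing and no-watermark rules still have to be checked by eye.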

Hugging Face Upload Strategy (IMPORTANT)

  • Upload the dataset as raw files (folder structure preserved)
  • Do NOT convert into a structured HF Dataset object in this ticket
  • Hugging Face will host the dataset as a file-based repo
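Assuming `huggingface_hub` is installed and the uploader has authenticated via `huggingface-cli login`, a raw-file upload might look like this (`upload_dataset` is a hypothetical wrapper; `HfApi.upload_folder` with `repo_type="dataset"` pushes the folder structure as-is, without building a `datasets.Dataset` object):

```python
REPO_ID = "DeepFakeDetector/manual-gen-images"

def upload_dataset(local_dir="manual-gen-images"):
    """Push the raw prompts/ and images/ tree to the HF dataset repo."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import HfApi  # pip install huggingface_hub

    HfApi().upload_folder(
        folder_path=local_dir,
        repo_id=REPO_ID,
        repo_type="dataset",  # file-based repo, NOT a structured Dataset
    )

if __name__ == "__main__":
    upload_dataset()
```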

Implementation Notes

Import Script Requirements

Script location: scripts/import_manual_dataset.py

Purpose: This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.

Script responsibilities:

  1. Download the dataset snapshot from Hugging Face:

    • repo: DeepFakeDetector/manual-gen-images
    • repo_type: dataset
  2. Parse the directory structure:

    • Iterate over images/{generator}/pXXX.png
    • Read corresponding prompts/pXXX.txt
  3. Expose a clean Python representation, e.g.:

    {
        "image_path": "...",
        "generator": "dalle",
        "prompt_id": "p042",
        "prompt_text": "...",
    }
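The three responsibilities above might be sketched as follows (`snapshot_download` is the standard `huggingface_hub` call for pulling a file-based repo; `build_records` and `download_snapshot` are names invented for this sketch):

```python
from pathlib import Path

REPO_ID = "DeepFakeDetector/manual-gen-images"

def download_snapshot():
    """1. Download the dataset snapshot from Hugging Face."""
    # Imported lazily so build_records() works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return Path(snapshot_download(repo_id=REPO_ID, repo_type="dataset"))

def build_records(root):
    """2.-3. Walk images/{generator}/pXXX.png and join each with its prompt."""
    root = Path(root)
    records = []
    for image_path in sorted(root.glob("images/*/*.png")):
        prompt_id = image_path.stem
        prompt_file = root / "prompts" / f"{prompt_id}.txt"
        records.append({
            "image_path": str(image_path),
            "generator": image_path.parent.name,
            "prompt_id": prompt_id,
            "prompt_text": prompt_file.read_text().strip()
                           if prompt_file.exists() else None,
        })
    return records

if __name__ == "__main__":
    for rec in build_records(download_snapshot()):
        print(rec["generator"], rec["prompt_id"])
```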

README.md Requirements

The dataset README must include:

  • Dataset description
  • Prompt-aligned design explanation
  • Directory structure
  • Generator list
  • Access notes
  • Limitations (manual generation, policy constraints)

Definition of Done

  • Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
  • Images generated for as many generators as possible
  • Directory structure exactly matches spec
  • Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
  • README.md present and complete
  • scripts/import_manual_dataset.py exists and runs
  • Script successfully downloads and parses dataset locally
