Description
Issue Type
- Model: ML model bug, training issue, or architecture problem
- Data: Dataset issue, preprocessing bug, or data pipeline problem
- Web: Frontend bug or UI issue in the Next.js dashboard
- API: Backend API bug or FastAPI endpoint issue
- Research: Research question or experimental feature request
- Documentation: Documentation bug or improvement needed
- Bug: General bug fix needed
- Enhancement: New feature or improvement request
Description
Create a manually generated, prompt-aligned dataset of AI-generated images using multiple image generation models (DALL-E, Midjourney, Stable Diffusion, etc.), upload it to Hugging Face Datasets as a raw file repository, and provide a local import script that pulls and parses the dataset into a usable directory structure for training and evaluation.
This ticket intentionally separates:
- Data creation & hosting (manual, simple)
- Data consumption (scripted, flexible)
Acceptance Criteria
- Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
- Images generated for as many generators as possible (based on usage limits and free trials)
- Directory structure exactly matches spec
- Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
- README.md present and complete
- scripts/import_manual_dataset.py exists and runs
- Script successfully downloads and parses dataset locally
Deliverables
Hugging Face Dataset
Upload dataset to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
This dataset is hosted as raw files (images + prompts), NOT as a pre-built datasets.Dataset object.
Import Script
Create a script at: scripts/import_manual_dataset.py
This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.
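A minimal sketch of the download step, assuming the `huggingface_hub` package is installed and that `snapshot_download` is used to fetch the raw file repo. The function name `download_snapshot` and the default `local_dir` are illustrative choices, not part of this ticket.

```python
from pathlib import Path

# Repo coordinates as given in this ticket.
REPO_ID = "DeepFakeDetector/manual-gen-images"
REPO_TYPE = "dataset"


def download_snapshot(local_dir: str = "dataset/manual-gen-images") -> Path:
    """Download the raw file repo and return the local root directory.

    The default local_dir is an assumption; pick whatever location the
    training code expects.
    """
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type=REPO_TYPE,
        local_dir=local_dir,
    )
    return Path(path)
```

Because the repo is hosted as raw files, the returned directory should contain the same `prompts/` and `images/` folders that were uploaded.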
Additional Context
Prompt Source (MANDATORY)
All suggested prompts already exist. Try not to change them unless necessary to make the image realistic.
To generate the prompts:
- Check out the branch: data/manual-gen
- Run: scripts/create_manual_prompts.py
- This will create prompts at: dataset/manual-gen-images/prompts/
Files:
p001.txt → p100.txt
Each file contains one complete prompt.
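A quick way to confirm the prompt files were generated is to check for the expected names. This is only a sketch: `missing_prompts` is a hypothetical helper, and treating an empty file as missing is an assumption.

```python
from pathlib import Path
from typing import List


def missing_prompts(prompts_dir: Path, count: int = 100) -> List[str]:
    """Return expected prompt filenames (p001.txt ...) that are absent or empty."""
    expected = [f"p{i:03d}.txt" for i in range(1, count + 1)]
    missing = []
    for name in expected:
        path = prompts_dir / name
        # A prompt file counts as present only if it has non-whitespace content.
        if not path.is_file() or not path.read_text(encoding="utf-8").strip():
            missing.append(name)
    return missing
```

For example, `missing_prompts(Path("dataset/manual-gen-images/prompts"))` should return an empty list after the prompt script has run.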
Prompt Usage Rules
- Use the prompt text from prompts/pXXX.txt
- Generate one image per prompt per generator
- Try not to embellish, rewrite, or "optimize" prompts to help realism
- If a generator rejects a prompt:
  - Make the smallest possible policy-safe edit
  - Keep the same prompt_id
  - Document the change in README.md
- IMPORTANT: If any changes are made to any prompts for specific generators, note the change in the pXXX.txt file for that prompt as:
*Original Prompt*
*Generator*: *Adjusted Prompt*
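One way to keep these notes consistent is a small helper that appends the adjustment line to the prompt file. This is a sketch under an assumed reading of the format above: the original prompt text stays at the top of the file, and each adjustment is added as a `generator: adjusted prompt` line below it. `record_adjustment` is a hypothetical name.

```python
from pathlib import Path


def record_adjustment(prompt_file: Path, generator: str, adjusted_prompt: str) -> None:
    """Append a '<generator>: <adjusted prompt>' line below the original prompt."""
    # Keep the original text untouched; only normalise the trailing newline.
    original = prompt_file.read_text(encoding="utf-8").rstrip("\n")
    prompt_file.write_text(
        f"{original}\n{generator}: {adjusted_prompt}\n",
        encoding="utf-8",
    )
```

Any code that reads `prompts/pXXX.txt` afterwards would need to account for these extra lines, for example by treating only the text above the first adjustment line as the original prompt.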
Required Directory Structure (Local + HF)
Dataset root: manual-gen-images/
Prompts
prompts/
├── p001.txt
├── p002.txt
├── ...
└── p100.txt
Images (implicit metadata via paths)
images/
├── dalle/
│ ├── p001.png
│ ├── p002.png
│ └── ...
├── midjourney/
│ ├── p001.png
│ └── ...
├── bing/
│ ├── p001.png
│ └── ...
├── stable_diffusion/
│ ├── p001.png
│ └── ...
├── ideogram/
│ ├── p001.png
│ └── ...
├── flux/
│ ├── p001.png
│ └── ...
└── nanobanana/
  ├── p001.png
  └── ...
From this structure, metadata is inferred as:
- generator = parent directory name
- prompt_id = filename without extension (e.g., p042)
- prompt_text = contents of prompts/p042.txt
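The inference rules above can be sketched with the standard library alone. `load_records` is a hypothetical name, and skipping images that have no matching prompt file is an assumption; the real script may prefer to raise an error instead.

```python
from pathlib import Path
from typing import Dict, List


def load_records(root: Path) -> List[Dict[str, str]]:
    """Build one record per image under root/images/{generator}/pXXX.png."""
    records = []
    for image_path in sorted((root / "images").glob("*/p*.png")):
        prompt_id = image_path.stem  # e.g. "p042"
        prompt_file = root / "prompts" / f"{prompt_id}.txt"
        if not prompt_file.is_file():
            continue  # assumption: silently skip images without a prompt
        records.append(
            {
                "image_path": str(image_path),
                "generator": image_path.parent.name,  # parent directory name
                "prompt_id": prompt_id,
                "prompt_text": prompt_file.read_text(encoding="utf-8").strip(),
            }
        )
    return records
```

Since all metadata comes from paths, adding a new generator only requires adding a new directory under `images/`.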
Generators to Use
Attempt as many as possible from the list below. Partial completion is acceptable.
| Generator | Notes |
|---|---|
| DALL-E | Use ChatGPT image generation (Plus recommended) |
| Bing Image Creator | Free standard generation available |
| Stable Diffusion | Local install OR Stability platform |
| Midjourney | Subscription required (try if access exists) |
| Ideogram | Limited free credits |
| FLUX | Available via Black Forest Labs |
| Nano Banana (Google) | Use Google AI Studio free trial (students) |
Generator Access Notes (MUST APPEAR IN README)
Access Notes
- If you do not have ChatGPT Plus access for DALL-E, notify lukhsaaankumar.
- Nano Banana (Google DeepMind) can be accessed via Google AI Studio free trial (student accounts supported).
- Midjourney availability varies; free trials may or may not be active.
- If a generator cannot be accessed, skip it and document the reason in the dataset's README.md.
Image Saving Rules
- Save images as PNG
- Filename format: pXXX.png
- No upscaling, cropping, filters, or post-processing
- No watermarks
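The mechanical parts of these rules (file format and filename) can be checked before upload. The sketch below assumes three-digit prompt IDs, as in p001 to p100, and uses the standard 8-byte PNG signature; `find_invalid_images` is a hypothetical helper. Upscaling, cropping, and watermarks cannot be detected this way and still need manual review.

```python
import re
from pathlib import Path
from typing import List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
NAME_PATTERN = re.compile(r"p\d{3}\.png")  # assumption: three-digit prompt IDs


def find_invalid_images(images_dir: Path) -> List[str]:
    """Return a description of each file that breaks the naming or format rule."""
    problems = []
    for path in sorted(images_dir.glob("*/*")):
        if not path.is_file():
            continue
        if not NAME_PATTERN.fullmatch(path.name):
            problems.append(f"{path}: name does not match pXXX.png")
            continue
        with path.open("rb") as handle:
            if handle.read(8) != PNG_SIGNATURE:
                problems.append(f"{path}: not a PNG file")
    return problems
```

An empty return value means every file under `images/{generator}/` is named correctly and starts with a PNG header.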
Hugging Face Upload Strategy (IMPORTANT)
- Upload the dataset as raw files (folder structure preserved)
- Do NOT convert into a structured HF Dataset object in this ticket
- Hugging Face will host the dataset as a file-based repo
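A raw-file upload could be done through the web UI or, as sketched here, with `HfApi.upload_folder` from `huggingface_hub`. This assumes the package is installed, that you are already authenticated with write access to the repo, and that `local_root` points at the `manual-gen-images/` directory; `upload_raw_dataset` is a hypothetical name.

```python
REPO_ID = "DeepFakeDetector/manual-gen-images"  # target repo from this ticket


def upload_raw_dataset(local_root: str) -> None:
    """Upload the folder as-is so prompts/ and images/ keep their layout."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import HfApi

    api = HfApi()
    api.upload_folder(
        folder_path=local_root,
        repo_id=REPO_ID,
        repo_type="dataset",
    )
```

Uploading the folder directly keeps the repo file-based, with no conversion to a `datasets.Dataset` object.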
Implementation Notes
Import Script Requirements
Script location: scripts/import_manual_dataset.py
Purpose: This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.
Script responsibilities:
- Download the dataset snapshot from Hugging Face:
  - repo: DeepFakeDetector/manual-gen-images
  - repo_type: dataset
- Parse the directory structure:
  - Iterate over images/{generator}/pXXX.png
  - Read corresponding prompts/pXXX.txt
- Expose a clean Python representation, e.g.:
{
    "image_path": "...",
    "generator": "dalle",
    "prompt_id": "p042",
    "prompt_text": "...",
}
README.md Requirements
The dataset README must include:
- Dataset description
- Prompt-aligned design explanation
- Directory structure
- Generator list
- Access notes
- Limitations (manual generation, policy constraints)
Definition of Done
- Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
- Images generated for as many generators as possible
- Directory structure exactly matches spec
- Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
- README.md present and complete
- scripts/import_manual_dataset.py exists and runs
- Script successfully downloads and parses dataset locally