
Manually Generated Multi-Model Image Dataset (HF-Hosted) + Import Script #49

@lukhsaankumar


Issue Type

  • Model: ML model bug, training issue, or architecture problem
  • Data: Dataset issue, preprocessing bug, or data pipeline problem
  • Web: Frontend bug or UI issue in the Next.js dashboard
  • API: Backend API bug or FastAPI endpoint issue
  • Research: Research question or experimental feature request
  • Documentation: Documentation bug or improvement needed
  • Bug: General bug fix needed
  • Enhancement: New feature or improvement request

Description

Create a manually generated, prompt-aligned dataset of AI-generated images using multiple image generation models (DALL-E, Midjourney, Stable Diffusion, etc.), upload it to Hugging Face Datasets as a raw file repository, and provide a local import script that pulls and parses the dataset into a usable directory structure for training and evaluation.

This ticket intentionally separates:

  • Data creation & hosting (manual, simple)
  • Data consumption (scripted, flexible)

Acceptance Criteria

  • Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
  • Images generated for as many generators as possible (based on usage limits and free trials)
  • Directory structure exactly matches spec
  • Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
  • README.md present and complete
  • scripts/import_manual_dataset.py exists and runs
  • Script successfully downloads and parses dataset locally

Deliverables

Hugging Face Dataset

Upload dataset to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images

This dataset is hosted as raw files (images + prompts), NOT as a pre-built datasets.Dataset object.

Import Script

Create a script at: scripts/import_manual_dataset.py

This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.

Additional Context

Prompt Source (MANDATORY)

All suggested prompts already exist; try not to change them unless a change is necessary for the generator to produce a realistic image.

To generate the prompts:

  1. Checkout the branch: data/manual-gen
  2. Run: scripts/create_manual_prompts.py
  3. This will create prompts at: dataset/manual-gen-images/prompts/

Files:

  • p001.txt through p100.txt

Each file contains one complete prompt.

Prompt Usage Rules

  • Use the prompt text from prompts/pXXX.txt
  • Generate one image per prompt per generator
  • Try not to embellish, rewrite, or "optimize" prompts to help realism
  • If a generator rejects a prompt:
    • Make the smallest possible policy-safe edit
    • Keep the same prompt_id
    • Document the change in README.md
  • IMPORTANT: If a prompt is adjusted for a specific generator, note the change in that prompt's pXXX.txt file as:

    *Original Prompt*
    *Generator*: *Adjusted Prompt*

Required Directory Structure (Local + HF)

Dataset root: manual-gen-images/

Prompts

prompts/
├── p001.txt
├── p002.txt
├── ...
└── p100.txt

Images (implicit metadata via paths)

images/
├── dalle/
│   ├── p001.png
│   ├── p002.png
│   └── ...
├── midjourney/
│   ├── p001.png
│   └── ...
├── bing/
│   ├── p001.png
│   └── ...
├── stable_diffusion/
│   ├── p001.png
│   └── ...
├── ideogram/
│   ├── p001.png
│   └── ...
├── flux/
│   ├── p001.png
│   └── ...
└── nanobanana/
    ├── p001.png
    └── ...

From this structure, metadata is inferred as:

  • generator = parent directory name
  • prompt_id = filename (e.g., p042)
  • prompt_text = contents of prompts/p042.txt
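As a rough illustration of this path-based metadata scheme (the `parse_image_path` helper and its error handling are this sketch's own, not part of the spec):

```python
import re
from pathlib import Path

PROMPT_ID_RE = re.compile(r"^p\d{3}$")  # p001 ... p100

def parse_image_path(image_path, dataset_root):
    """Infer (generator, prompt_id, prompt_file) from images/{generator}/pXXX.png."""
    image_path, dataset_root = Path(image_path), Path(dataset_root)
    generator = image_path.parent.name   # generator = parent directory name
    prompt_id = image_path.stem          # prompt_id = filename without extension
    if not PROMPT_ID_RE.match(prompt_id):
        raise ValueError(f"unexpected filename: {image_path.name}")
    # prompt_text = contents of prompts/pXXX.txt
    return generator, prompt_id, dataset_root / "prompts" / f"{prompt_id}.txt"
```

For example, `parse_image_path("manual-gen-images/images/dalle/p042.png", "manual-gen-images")` yields `("dalle", "p042", <path to prompts/p042.txt>)`.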

Generators to Use

Attempt as many as possible from the list below. Partial completion is acceptable.

| Generator | Notes |
| --- | --- |
| DALL-E | Use ChatGPT image generation (Plus recommended) |
| Bing Image Creator | Free standard generation available |
| Stable Diffusion | Local install OR Stability platform |
| Midjourney | Subscription required (try if access exists) |
| Ideogram | Limited free credits |
| FLUX | Available via Black Forest Labs |
| Nano Banana (Google) | Use Google AI Studio free trial (students) |

Generator Access Notes (MUST APPEAR IN README)


  • If you do not have ChatGPT Plus access for DALL-E, notify lukhsaankumar.
  • Nano Banana (Google DeepMind) can be accessed via Google AI Studio free trial (student accounts supported).
  • Midjourney availability varies; free trials may or may not be active.
  • If a generator cannot be accessed, skip it and document the reason in the dataset's README.md.

Image Saving Rules

  • Save images as PNG
  • Filename format: pXXX.png
  • No upscaling, cropping, filters, or post-processing
  • No watermarks
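A small pre-upload check along these lines could catch rule violations early (the `check_image` helper and its message strings are illustrative, not part of the ticket; the eight-byte magic signature is the standard PNG file header):

```python
import re
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"          # standard PNG signature
NAME_RE = re.compile(r"^p\d{3}\.png$")    # required filename format: pXXX.png

def check_image(path):
    """Return a list of saving-rule violations for one image file."""
    path = Path(path)
    problems = []
    if not NAME_RE.match(path.name):
        problems.append("filename must look like p001.png")
    with open(path, "rb") as f:
        if f.read(8) != PNG_MAGIC:
            problems.append("file is not a PNG (wrong magic bytes)")
    return problems
```

Note this only verifies the format and naming; the no-post-processing and no-watermark rules still have to be checked by eye.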

Hugging Face Upload Strategy (IMPORTANT)

  • Upload the dataset as raw files (folder structure preserved)
  • Do NOT convert into a structured HF Dataset object in this ticket
  • Hugging Face will host the dataset as a file-based repo
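Assuming `huggingface_hub` is installed and the uploader has authenticated via `huggingface-cli login`, a raw-file upload might look like this (`upload_dataset` is a hypothetical wrapper; `HfApi.upload_folder` with `repo_type="dataset"` pushes the folder structure as-is, without building a `datasets.Dataset` object):

```python
REPO_ID = "DeepFakeDetector/manual-gen-images"

def upload_dataset(local_dir="manual-gen-images"):
    """Push the raw prompts/ and images/ tree to the HF dataset repo."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import HfApi  # pip install huggingface_hub

    HfApi().upload_folder(
        folder_path=local_dir,
        repo_id=REPO_ID,
        repo_type="dataset",  # file-based repo, NOT a structured Dataset
    )

if __name__ == "__main__":
    upload_dataset()
```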

Implementation Notes

Import Script Requirements

Script location: scripts/import_manual_dataset.py

Purpose: This script pulls the dataset from Hugging Face and reconstructs a local, training-ready view of the dataset.

Script responsibilities:

  1. Download the dataset snapshot from Hugging Face:

    • repo: DeepFakeDetector/manual-gen-images
    • repo_type: dataset
  2. Parse the directory structure:

    • Iterate over images/{generator}/pXXX.png
    • Read corresponding prompts/pXXX.txt
  3. Expose a clean Python representation, e.g.:

    {
        "image_path": "...",
        "generator": "dalle",
        "prompt_id": "p042",
        "prompt_text": "...",
    }
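The three responsibilities above might be sketched as follows (`snapshot_download` is the standard `huggingface_hub` call for pulling a file-based repo; `build_records` and `download_snapshot` are names invented for this sketch):

```python
from pathlib import Path

REPO_ID = "DeepFakeDetector/manual-gen-images"

def download_snapshot():
    """1. Download the dataset snapshot from Hugging Face."""
    # Imported lazily so build_records() works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return Path(snapshot_download(repo_id=REPO_ID, repo_type="dataset"))

def build_records(root):
    """2.-3. Walk images/{generator}/pXXX.png and join each with its prompt."""
    root = Path(root)
    records = []
    for image_path in sorted(root.glob("images/*/*.png")):
        prompt_id = image_path.stem
        prompt_file = root / "prompts" / f"{prompt_id}.txt"
        records.append({
            "image_path": str(image_path),
            "generator": image_path.parent.name,
            "prompt_id": prompt_id,
            "prompt_text": prompt_file.read_text().strip()
                           if prompt_file.exists() else None,
        })
    return records

if __name__ == "__main__":
    for rec in build_records(download_snapshot()):
        print(rec["generator"], rec["prompt_id"])
```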

README.md Requirements

The dataset README must include:

  • Dataset description
  • Prompt-aligned design explanation
  • Directory structure
  • Generator list
  • Access notes
  • Limitations (manual generation, policy constraints)

Definition of Done

  • Prompts generated by running scripts/create_manual_prompts.py on branch data/manual-gen
  • Images generated for as many generators as possible
  • Directory structure exactly matches spec
  • Dataset uploaded to: https://huggingface.co/datasets/DeepFakeDetector/manual-gen-images
  • README.md present and complete
  • scripts/import_manual_dataset.py exists and runs
  • Script successfully downloads and parses dataset locally
