
Research: AIGenBench Benchmark Analysis — Architecture, Dataset Overlap & Submission Strategy#67

Open
Vihaan-Singhal1 wants to merge 2 commits into development from research/aigenbench-analysis


@Vihaan-Singhal1
Collaborator

Summary

Closes #65

This PR adds docs/research/aigenbench_analysis.md, a paper-style analysis of the AI-GenBench benchmark that addresses all three requirements of issue #65.

Contents

1. Architecture Analysis of Top-Performing Models

  • Benchmark results table for all evaluated baselines (ViT-L/14 DINOv2, ViT-L/14 CLIP, ResNet-50 CLIP) with AUROC and accuracy
  • Full mathematical treatment: multi-head self-attention, DINOv2 self-distillation loss (teacher-student), residual block formula
  • Key design choices from the companion paper (arXiv:2511.21507): why full fine-tuning beats linear probing, why resizing beats multi-crop, and why JPEG augmentation is the single most impactful factor
  • Comparison table of our current models vs AIGenBench SOTA with estimated generalization AUROC and gap analysis
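For reviewers, the standard published forms of the formulas listed above (scaled dot-product and multi-head attention from Vaswani et al., 2017; the DINO self-distillation objective that DINOv2 builds on; the ResNet residual block) are summarized below. The notation is ours and may not match the analysis document. Here $x$ and $x'$ are two augmented views of the same image, $c$ is the teacher centering vector, and $\tau_t < \tau_s$ are the teacher and student temperatures. DINOv2 adds a patch-level iBOT term and a KoLeo regularizer on top of this image-level loss; both are omitted here.

```math
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(XW_i^{Q},\, XW_i^{K},\, XW_i^{V}\right), \qquad
\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \\
\mathcal{L}_{\mathrm{DINO}} &= -\sum_{k=1}^{K} P_t(x)^{(k)} \log P_s(x')^{(k)}, \qquad
P_s = \mathrm{softmax}\!\left(g_{\theta_s}(x') / \tau_s\right), \;
P_t = \mathrm{softmax}\!\left((g_{\theta_t}(x) - c) / \tau_t\right) \\
\theta_t &\leftarrow \lambda\, \theta_t + (1 - \lambda)\, \theta_s \\
\mathbf{y} &= \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}
\end{aligned}
```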

2. Dataset Overlap Analysis (Critical Finding)

  • DRAGON directly consolidates Synthbuster and GenImage — both are AIGenBench source datasets
  • OpenFake contains SD 1.5/2.1/XL, DALL-E 3, and FLUX — overlapping AIGenBench windows w7–w8
  • WildFake's GAN content overlaps ForenSynths (windows w0–w3)
  • Result: 7 of 9 evaluation windows are contaminated, so our current training data cannot be used for a valid leaderboard submission
  • Includes Jaccard overlap coefficients per window and a clean data decision matrix
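The per-window overlap check can be sketched as below, assuming each evaluation window and the training set are described by sets of generator names. The generator labels are placeholders, not the real AIGenBench window assignments, and `jaccard` / `overlap_report` are hypothetical helpers rather than code from the analysis. A window is flagged as contaminated whenever any generator is shared, because a single leaked generator already invalidates that window, even when the Jaccard coefficient itself is small.

```python
"""Hypothetical sketch: per-window generator overlap between training and evaluation data."""
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_report(
    train_generators: Set[str], windows: Dict[str, Set[str]]
) -> Dict[str, Tuple[float, List[str]]]:
    """Map each window name to (Jaccard coefficient, sorted list of shared generators)."""
    report = {}
    for name, window_generators in windows.items():
        shared = train_generators & window_generators
        report[name] = (jaccard(train_generators, window_generators), sorted(shared))
    return report


# Placeholder labels only; NOT the real AIGenBench generator-to-window mapping.
train = {"gen_a", "gen_b", "gen_e"}
windows = {
    "w0": {"gen_a", "gen_c"},
    "w1": {"gen_d"},
    "w2": {"gen_e", "gen_f", "gen_g"},
}

if __name__ == "__main__":
    for name, (j, shared) in overlap_report(train, windows).items():
        status = "CONTAMINATED" if shared else "clean"
        print(f"{name}: J={j:.2f} shared={shared} -> {status}")
```

With these toy sets, `w0` and `w2` are flagged (J = 0.25 and 0.20) and `w1` is clean.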

3. Testing Strategy and Submission Process

  • Dataset access, environment setup, and PyTorch Lightning framework overview
  • Full DetermAugment pipeline specification (JPEG compression, blur, noise, resize)
  • Algorithm 1 temporal training pseudocode
  • Submission requirements (public codebase + arXiv report)
  • Step-by-step adaptation plan for integrating our models into the AIGenBench framework
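The PR does not reproduce Algorithm 1, so the sliding-window schedule below is only a sketch under a stated assumption: at step t the detector has been trained on windows 0..t and is then evaluated on the not-yet-seen windows t+1 onward. `mean_future_score` shows one possible reading of the temporal generalization score $\bar{G}$ (the mean AUROC over every step and future-window pair). The analysis document's exact definition may differ, and both function names are hypothetical.

```python
"""Hypothetical sketch of a sliding-window temporal protocol and a mean future-window score."""
from statistics import mean
from typing import Dict, Iterator, List, Tuple


def sliding_window_schedule(num_windows: int) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (step, seen_windows, future_windows) for steps 0..num_windows-2.

    Assumption: after step t the model has seen windows 0..t and is
    evaluated on windows t+1..num_windows-1.
    """
    for t in range(num_windows - 1):
        yield t, list(range(t + 1)), list(range(t + 1, num_windows))


def mean_future_score(scores: Dict[Tuple[int, int], float], num_windows: int) -> float:
    """Average scores[(step, window)] over all (step, future window) pairs.

    One possible definition of G-bar, not necessarily the one in the analysis.
    """
    pairs = [
        (t, w)
        for t, _seen, future in sliding_window_schedule(num_windows)
        for w in future
    ]
    return mean(scores[pair] for pair in pairs)


if __name__ == "__main__":
    # 4 windows for brevity; the benchmark described in this PR uses 9 (w0-w8).
    for step, seen, future in sliding_window_schedule(4):
        print(f"step {step}: trained on {seen}, evaluated on {future}")

    # Made-up AUROC values for a 3-window run, keyed by (step, window).
    dummy = {(0, 1): 0.90, (0, 2): 0.80, (1, 2): 0.85}
    print(mean_future_score(dummy, 3))
```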

Document Highlights

  • 13 tables covering the generator taxonomy, benchmark results, overlap matrices, clean-data strategy, and performance projections
  • Math throughout: AUROC integral, temporal generalization score $\bar{G}$, self-attention, DINOv2 distillation loss, F1/Precision/Recall
  • ASCII visualizations: sliding window protocol diagram, dataset overlap coverage diagram
  • 13 references, ranging from CycleGAN (2017) to DINOv2 (2024)
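The binary metrics named in the highlights can be checked with a short, dependency-free sketch. AUROC is computed here in its pairwise (Mann-Whitney) form: the probability that a randomly chosen fake image scores higher than a randomly chosen real one, with ties counted as 1/2. This equals the area under the ROC curve, i.e. the integral of TPR with respect to FPR. The O(P·N) loop is meant for illustration only; a library implementation such as scikit-learn's `roc_auc_score` would be used on real evaluation sets.

```python
"""Minimal sketch of AUROC and precision/recall/F1 for a real-vs-fake detector."""
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score_pos > score_neg), counting ties as 0.5."""
    pos: List[float] = [s for s, y in zip(scores, labels) if y == 1]
    neg: List[float] = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(preds: Sequence[int], labels: Sequence[int]) -> Tuple[float, float, float]:
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: label 1 = AI-generated (fake), 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
preds = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

if __name__ == "__main__":
    print(auroc(scores, labels))
    print(precision_recall_f1(preds, labels))
```

On the toy data, AUROC is 8/9 (one fake image is outranked by one real image), and precision, recall, and F1 are all 2/3 at the 0.5 threshold. Note that AUROC is threshold-free, whereas the other three depend on the chosen cut-off.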

Actions Recommended

  1. Upgrade ViT-Base → ViT-L/14 DINOv2 for AIGenBench submission
  2. Add JPEG compression simulation (quality 50–99) to all training augmentation pipelines immediately
  3. Exclude DRAGON and overlap-contaminated OpenFake subsets before any AIGenBench evaluation
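For action 2 and the DetermAugment pipeline, the PR gives only the operations and the JPEG quality range, so the sketch below rests on an assumption: "deterministic" means each image's augmentation parameters are a pure function of a seed and an image identifier, which makes evaluation runs reproducible. `AugParams` and `determ_aug_params` are hypothetical names, and the blur, noise, and resize ranges are placeholders. Only the JPEG quality range 50 to 99 comes from this PR. The sketch samples parameters only; the actual JPEG re-encoding could then be done with Pillow, e.g. `img.save(buffer, format="JPEG", quality=params.jpeg_quality)`.

```python
"""Hypothetical sketch: deterministic per-image augmentation parameters."""
import hashlib
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AugParams:
    jpeg_quality: Optional[int]  # None means no JPEG re-encoding
    blur_sigma: float            # 0.0 means no Gaussian blur
    noise_std: float             # 0.0 means no additive noise
    resize_scale: float          # 1.0 means keep the original size


def determ_aug_params(image_id: str, seed: int = 0, p: float = 0.5) -> AugParams:
    """Derive augmentation parameters purely from (seed, image_id).

    Each operation is enabled with probability p. Only the JPEG quality
    range (50-99) comes from the PR; the other ranges are placeholders.
    """
    # sha256 rather than hash(): Python's str hash is salted per process,
    # so it would not reproduce the same parameters across runs.
    digest = hashlib.sha256(f"{seed}:{image_id}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return AugParams(
        jpeg_quality=rng.randint(50, 99) if rng.random() < p else None,
        blur_sigma=round(rng.uniform(0.1, 2.0), 2) if rng.random() < p else 0.0,
        noise_std=round(rng.uniform(0.0, 0.05), 3) if rng.random() < p else 0.0,
        resize_scale=round(rng.uniform(0.5, 1.5), 2) if rng.random() < p else 1.0,
    )


if __name__ == "__main__":
    first = determ_aug_params("img_0001.png", seed=42)
    again = determ_aug_params("img_0001.png", seed=42)
    assert first == again  # same (seed, image_id) gives the same parameters
    print(first)
```

Seeding a private `random.Random` instance per image, instead of the global generator, keeps the parameters independent of data-loader ordering and worker count.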

🤖 Generated with Claude Code

Vihaan-Singhal1 and others added 2 commits February 24, 2026 19:20
…erlap, submission strategy

Adds docs/research/aigenbench_analysis.md covering:
- Architecture analysis of top AIGenBench models (ViT-L/14 DINOv2, CLIP, ResNet-50 CLIP)
  with full math: attention, DINOv2 self-distillation loss, AUROC/F1 formulas
- Critical dataset contamination finding: DRAGON consolidates Synthbuster + GenImage
  (both AIGenBench sources); OpenFake/WildFake overlap 7 of 9 evaluation windows
- Clean data strategy for valid leaderboard submission
- Testing and submission protocol (PyTorch Lightning framework, DetermAugment pipeline,
  sliding window evaluation)
- Roadmap: upgrade to ViT-L/14 DINOv2, add JPEG compression augmentation, ensemble with FFT

Closes #65

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
24-page styled PDF generated from aigenbench_analysis.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Development: merging this pull request may close the issue "Analyze AIGenBench: Top Models, Dataset Overlap, and Testing Strategy".
