This repository presents research on evaluating high-resolution text-to-3D mesh generation, using CLIP and loss functions tailored to the task.
- Introduction
- Background
- The Unsolved Problem of CLIP-Loss
- Methodology
- Dataset
- Experimental Design
- References
This project explores advancements in high-resolution text-to-3D mesh generation using OpenAI’s CLIP, differentiable rendering, and various optimization techniques. The focus is on overcoming challenges in current loss functions to improve the quality and fidelity of generated 3D models.
Generating high-resolution 3D meshes from text means transforming a textual description (e.g., "Evergreen tree") into an accurate and detailed 3D shape. Challenges include heavy computational requirements and reliance on paired text-3D datasets. Recent approaches use zero-shot learning powered by models like CLIP.
- Differentiable Rendering: Converts 3D models into 2D images for evaluation.
- CLIP-Based Optimization: Measures similarity between rendered images and text descriptions, driving iterative improvements.
Refines embeddings based on text-image pairs for improved consistency.
Maintains smooth shapes using techniques like Laplacian Regularization.
Optimizes textures to avoid artifacts using maps like normal maps.
Measures cosine similarity between text prompts and rendered images.
Encourages consistency across different angles and viewpoints.
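The cosine-similarity term described above can be sketched in a few lines. This is a minimal illustration rather than the project's implementation: the embeddings are toy vectors standing in for real CLIP outputs, and the names `cosine_similarity` and `clip_loss` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_loss(text_embedding, image_embeddings):
    """Negative mean text-image similarity over all rendered views.

    Assumption: one embedding per rendered view; lower loss means the
    renders match the prompt more closely.
    """
    sims = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return -sum(sims) / len(sims)


# Toy 3-dimensional embeddings (hypothetical; real CLIP vectors are much larger).
text = [1.0, 0.0, 0.0]
views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = clip_loss(text, views)
```

Averaging over several views, rather than scoring a single render, is one simple way to reward consistency across viewpoints.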
Because CLIP is trained on 2D images, its guidance is biased toward 2D appearance, which makes it hard to generate consistent 3D shapes. Without sufficient constraints, optimization can lead to tangled or noisy meshes. Techniques like regularization constraints and viewpoint augmentation help mitigate these issues but increase computational demands.
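As an illustration of a regularization constraint, the sketch below computes a uniform Laplacian smoothness penalty: each vertex is compared with the centroid of its neighbours, so spiky or tangled geometry scores higher. It assumes a triangle mesh given as plain Python lists; the function name and the uniform (non-cotangent) weighting are choices made for this example, not details taken from the project.

```python
def uniform_laplacian_loss(vertices, faces):
    """Mean squared distance between each vertex and its neighbours' centroid.

    vertices: list of (x, y, z) tuples
    faces:    list of (i, j, k) vertex-index triples
    """
    # Build vertex adjacency from the triangle list.
    neighbors = [set() for _ in vertices]
    for i, j, k in faces:
        neighbors[i].update((j, k))
        neighbors[j].update((i, k))
        neighbors[k].update((i, j))

    total = 0.0
    counted = 0
    for idx, nbrs in enumerate(neighbors):
        if not nbrs:
            continue  # isolated vertex: no Laplacian defined
        centroid = [sum(vertices[n][d] for n in nbrs) / len(nbrs) for d in range(3)]
        total += sum((centroid[d] - vertices[idx][d]) ** 2 for d in range(3))
        counted += 1
    return total / counted if counted else 0.0


# A square fan around a centre vertex: flat versus pulled up into a spike.
faces = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
corners = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
flat_loss = uniform_laplacian_loss(corners + [(0.5, 0.5, 0.0)], faces)
spiky_loss = uniform_laplacian_loss(corners + [(0.5, 0.5, 1.0)], faces)
```

In an optimization loop a term like this would be added to the CLIP loss with a small weight, trading some prompt fidelity for smoother surfaces.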
- Dream Fields and CLIP-Mesh: Evaluate single views at specific elevations.
- DreamFusion: Averages across azimuths to reduce variance.
- Generate four distinct views per mesh.
- Compute Shannon entropy for each view as a measure of informational complexity.
- Apply exponential weighting based on entropy using a hyperparameter p.
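The entropy-weighting steps above might look like the following sketch. Two assumptions are made because the text does not give the details: entropy is computed over the grayscale intensity histogram of each view, and the weights take the normalized form w_i ∝ exp(p · H_i). The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of pixel intensities."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(entropies, p):
    """Normalized exponential weights, w_i proportional to exp(p * H_i).

    Assumed form of the weighting; p = 0 gives uniform weights.
    """
    if not entropies:
        raise ValueError("at least one view is required")
    scaled = [p * h for h in entropies]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Four toy "views", each a flat list of 8-bit grayscale values.
views = [
    [0, 0, 0, 0],          # constant image
    [0, 255, 0, 255],      # two intensity levels
    [0, 85, 170, 255],     # four intensity levels
    [0, 0, 255, 255],      # two intensity levels
]
entropies = [shannon_entropy(v) for v in views]
weights = entropy_weights(entropies, p=1.0)
```

With a positive p, views that carry more information receive larger weights, while a constant, uninformative view contributes the least.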
Each 3D mesh is enriched with:
- Human-annotated quality scores (MOS).
- View-specific weights and R-Precision scores.
- Compute individual R-Precision scores for each view.
- Apply entropy-based weights.
- Generate a weighted mean R-Precision score.
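A rough sketch of this scoring procedure follows. It assumes that each view is scored by whether the true caption ranks among the top R candidate captions by CLIP similarity, and that the per-view weights come from the entropy step; both helper names and all numbers are illustrative.

```python
def r_precision_at_r(similarities, true_index, r=1):
    """1.0 if the true caption is among the r most similar candidates, else 0.0.

    similarities: one view's similarity to each candidate caption.
    """
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return 1.0 if true_index in ranked[:r] else 0.0


def weighted_mean(scores, weights):
    """Weighted mean of per-view scores; weights need not sum to one."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Similarity of each of four views to three candidate captions;
# caption 0 is the true one (made-up numbers).
view_similarities = [
    [0.31, 0.22, 0.18],
    [0.29, 0.25, 0.20],
    [0.19, 0.27, 0.21],  # this view retrieves the wrong caption
    [0.33, 0.30, 0.12],
]
view_weights = [0.4, 0.3, 0.2, 0.1]

view_scores = [r_precision_at_r(s, true_index=0) for s in view_similarities]
mesh_score = weighted_mean(view_scores, view_weights)  # roughly 0.8
```

Here the one failing view carries a weight of 0.2, so the mesh-level score drops by that share rather than by a flat quarter.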
The project uses the LIRIS lab's Graphics-LPIPS dataset:
- Over 343,000 stimuli from 55 source models.
- A subset of 3,000 stimuli with human-annotated quality scores (MOS).
Compute weighted CLIP R-Precision scores for various values of p.
Use 5-fold cross-validation to train a linear probe mapping R-Precision scores to MOS values.
Evaluate predicted MOS values against human-annotated ground truth using Mean Squared Error (MSE).
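The evaluation protocol above can be sketched with the standard library alone. The example assumes the linear probe is a one-dimensional least-squares fit from a mesh's weighted R-Precision score to its MOS; the data are synthetic, and `fit_line` and `cross_validated_mse` are hypothetical names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mean_y  # constant input: predict the mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Mean of the per-fold MSEs from k-fold cross-validation."""
    if len(xs) < k:
        raise ValueError("need at least k samples")
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errors = [(slope * xs[i] + intercept - ys[i]) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Synthetic data: MOS depends exactly linearly on the score (illustration only).
scores = [i / 19 for i in range(20)]
mos = [1.0 + 4.0 * s for s in scores]
mse = cross_validated_mse(scores, mos, k=5)
```

Repeating this for each candidate value of p and comparing the resulting MSEs would indicate which weighting best tracks the human ratings.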
- Khalid et al., 2022: CLIP-Mesh
- Jain et al., 2022: Dream Fields
- Nehmé et al., 2023: Graphics-LPIPS Dataset
For more details, refer to the full research paper.