NOVAGEN is an end-to-end generative AI pipeline for discovering novel, thermodynamically stable, and functionally specific semiconductor materials. Using a curriculum-learning approach, the system progresses from understanding basic crystallographic geometry to predicting complex quantum electronic properties.
The project is divided into three major stages: Generative Modeling, Multi-Phase Training, and Industrial Deployment. It bridges the gap between deep learning and materials engineering by ensuring all generated materials are chemically valid, physically stable, and synthesizable.
Powered by CrystalFormer, a causal (decoder-only) Transformer adapted for crystallographic data.
- Autoregressive Generation: Predicts elements, Wyckoff positions, and 3D fractional coordinates atom-by-atom.
- Symmetry-Aware: Enforces space-group rules before any physics is applied.
- Lattice Bias Head: A learnable scalar parameter that dynamically optimizes the unit cell volume.
- Continuous Coordinates: Uses a Von Mises Mixture Density Network (MDN) to handle periodic boundary conditions.
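Periodic boundary conditions mean a fractional coordinate of 0.99 is nearly identical to 0.01, so an ordinary Gaussian head would be a poor fit for atomic positions. As an illustrative sketch only (not the project's actual MDN head, which would emit these parameters from the Transformer), sampling a periodic coordinate from a von Mises mixture can be done with the Python standard library:

```python
import math
import random

def sample_fractional_coord(weights, mus, kappas, rng=random):
    """Draw one periodic fractional coordinate in [0, 1) from a
    von Mises mixture. `mus` are component means in [0, 1);
    `kappas` are concentrations (larger = sharper peak).
    Illustrative sketch -- a trained MDN would predict these
    parameters per atom and coordinate."""
    # Choose a mixture component in proportion to its weight.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    # vonmisesvariate samples an angle in [0, 2*pi); that circular
    # topology maps directly onto the unit cell's periodic wrap.
    theta = rng.vonmisesvariate(2 * math.pi * mus[k], kappas[k])
    return (theta / (2 * math.pi)) % 1.0
```

With a large concentration, samples cluster tightly around the chosen component's mean and wrap cleanly across the 0/1 boundary, which is exactly the behavior a Gaussian head cannot provide.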
The model is fine-tuned with Proximal Policy Optimization (PPO), a policy-gradient reinforcement learning algorithm, using the Adam optimizer. Training progresses through three distinct phases:
- Phase I: Geometric Stabilization (Spatial): Penalizes atomic overlaps and singularities. Teaches the model how to pack atoms into a solid.
- Phase II: Thermodynamic Physics (Stability): Minimizes the "Free Energy" of the crystal using the CHGNet Graph Neural Network (GNN).
- Phase III: Functional Properties (Lab-Grade): Targets specific electronic properties, such as a 2.8 eV Band Gap, using the MEGNet Oracle.
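The three phases can be viewed as one reward function whose active terms expand as training advances. A minimal sketch, where the 0.7 Å overlap threshold, the -10 penalty, and the equal term weights are illustrative assumptions rather than the project's tuned settings:

```python
def phased_reward(phase, min_dist, energy_per_atom=None, band_gap=None,
                  target_gap=2.8):
    """Hypothetical phased reward for PPO fine-tuning.
    phase 1: geometry only -- penalize overlapping atoms.
    phase 2: add thermodynamics -- reward low energy per atom (eV).
    phase 3: add function -- reward band gaps near the 2.8 eV target."""
    # Phase I: hard geometric penalty for atoms closer than 0.7 A.
    reward = -10.0 if min_dist < 0.7 else 0.0
    if phase >= 2 and energy_per_atom is not None:
        # Phase II: lower predicted energy (e.g. from CHGNet) is better.
        reward += -energy_per_atom
    if phase >= 3 and band_gap is not None:
        # Phase III: distance from the target gap (e.g. from MEGNet).
        reward += -abs(band_gap - target_gap)
    return reward
```

Keeping the earlier terms active in later phases prevents the policy from "forgetting" geometric validity while it chases energy or band-gap targets.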
High-throughput batch generation of crystal candidates, followed by rigorous deep-relaxation validation to output lab-ready .cif files.
- config.yaml: The architectural blueprint defining Transformer layers, attention heads, embedding sizes, and vocabulary limits.
- generator_service.py: Initializes the Transformer instance from the config and injects the pre-trained state dictionary (.pt checkpoint) for inference and training.
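A hedged sketch of what such a config might look like; the field names and values below are illustrative assumptions, not the project's actual schema:

```yaml
# Illustrative config.yaml sketch -- keys and values are hypothetical.
model:
  n_layers: 8          # Transformer decoder blocks
  n_heads: 8           # attention heads per block
  d_model: 256         # embedding size
  vocab_size: 512      # elements + Wyckoff labels + special tokens
  coord_mixture: 4     # von Mises mixture components per coordinate
```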
- train_phase1_spatial.py: RL script for geometric stability. Uses low learning rates and gradient accumulation.
- train_phase2_physics.py: RL script integrating CHGNet to reward low-energy states. Includes active teaching filters (e.g., penalties for structures with more than 40 atoms).
- train_phase3_properties.py: RL script targeting specific band gaps using the CPU-bound MEGNet Oracle.
- sentinel.py / reward_phase1.py: The "Geometric Bouncer" that quickly rejects physically impossible structures.
- product_relaxer.py: Wraps CHGNet to perform GPU-accelerated atomic relaxation and force prediction.
- product_oracle.py: Wraps MEGNet to predict final electronic properties (band gap) in milliseconds.
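At its core, a "Geometric Bouncer" reduces to a minimum-image distance check between all atom pairs. A simplified sketch for an orthorhombic cell (the real sentinel.py presumably handles arbitrary lattices; the 0.7 Å hard-sphere cutoff is an illustrative assumption):

```python
import itertools
import math

def passes_geometric_check(frac_coords, cell_lengths, min_dist=0.7):
    """Reject structures with overlapping atoms.
    frac_coords: list of (x, y, z) fractional coordinates in [0, 1).
    cell_lengths: (a, b, c) in Angstroms. Assumes an orthorhombic
    cell for simplicity -- a general check needs the full lattice
    matrix and neighboring image cells."""
    for p, q in itertools.combinations(frac_coords, 2):
        d2 = 0.0
        for c1, c2, length in zip(p, q, cell_lengths):
            delta = abs(c1 - c2) % 1.0
            delta = min(delta, 1.0 - delta)  # minimum-image convention
            d2 += (delta * length) ** 2
        if math.sqrt(d2) < min_dist:
            return False  # atoms overlap: physically impossible
    return True
```

Because this is pure arithmetic with no neural network in the loop, it can veto a hopeless structure in microseconds, long before CHGNet or MEGNet are ever invoked.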
- generate_crystals.py: The high-throughput "Factory" script. Features memory purging, timeout fail-safes, and fast density filtering to mass-produce 5,000+ candidate crystals overnight.
- final_relaxation.py: The "QA Lab." Subjects surviving candidates to 500-step deep relaxation and re-verifies band gaps to confirm ground-state stability.
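The fast density filter can be implemented as a cheap mass-over-volume sanity gate that discards unphysical candidates before any expensive GNN call. A sketch, where the 1-20 g/cm³ acceptance window is an illustrative assumption, not the project's actual bounds:

```python
AMU_IN_GRAMS = 1.66053906660e-24  # one atomic mass unit in grams
A3_IN_CM3 = 1e-24                 # one cubic Angstrom in cm^3

def density_g_per_cm3(atomic_masses_amu, cell_volume_A3):
    """Crystal density from atomic masses (amu) and cell volume (A^3)."""
    total_mass_g = sum(atomic_masses_amu) * AMU_IN_GRAMS
    return total_mass_g / (cell_volume_A3 * A3_IN_CM3)

def passes_density_filter(atomic_masses_amu, cell_volume_A3,
                          lo=1.0, hi=20.0):
    """Cheap sanity gate: real inorganic solids rarely fall outside
    roughly 1-20 g/cm^3 (illustrative bounds)."""
    return lo <= density_g_per_cm3(atomic_masses_amu, cell_volume_A3) <= hi
```

As a check, silicon's conventional cell (8 atoms of 28.0855 amu in about 160.2 Å³) comes out near its known density of 2.33 g/cm³.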