We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has been trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally-driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make a final prediction.
For more information, please read our paper: https://ui.adsabs.harvard.edu/abs/2025arXiv250101496S/abstract
If you use any of this code in your own work, please cite the associated paper and software using the following BibTeX entries:
@ARTICLE{oracle1,
  author        = {{Shah}, Ved G. and {Gagliano}, Alex and {Malanchev}, Konstantin and {Narayan}, Gautham and {The LSST Dark Energy Science Collaboration}},
  title         = "{ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier for the LSST}",
  journal       = {arXiv e-prints},
  keywords      = {Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - High Energy Astrophysical Phenomena, Computer Science - Artificial Intelligence, Computer Science - Machine Learning},
  year          = 2025,
  month         = jan,
  eid           = {arXiv:2501.01496},
  pages         = {arXiv:2501.01496},
  doi           = {10.48550/arXiv.2501.01496},
  archivePrefix = {arXiv},
  eprint        = {2501.01496},
  primaryClass  = {astro-ph.IM},
  adsurl        = {https://ui.adsabs.harvard.edu/abs/2025arXiv250101496S},
  adsnote       = {Provided by the SAO/NASA Astrophysics Data System}
}
@software{oracle1-software,
  author    = {Shah, Ved and Gagliano, Alexander and Malanchev, Konstantin and Narayan, Gautham and Malz, A.I. and The LSST Dark Energy Science Collaboration},
  title     = {ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier for the LSST},
  month     = mar,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15099699},
  url       = {https://doi.org/10.5281/zenodo.15099699},
  swhid     = {swh:1:dir:731eb324342ca22b473e5ff6710483c721d417aa;origin=https://doi.org/10.5281/zenodo.15099698;visit=swh:1:snp:8fc4b2c5a108b4542b5133dcfcd1dc2d70872fca;anchor=swh:1:rel:4aa4f29a46142866d1d3d01829504631c8b02696;path=ELAsTiCC-Classification-main},
}
ORACLE is a pip-installable package and was developed on Python 3.10.16. We recommend creating a new environment for every project. If you are using conda, you can do this with:
conda create -n <ENV_NAME> python=3.10
Next, you can install ORACLE using:
pip install git+https://github.com/uiucsn/Astro-ORACLE.git
This should set up ORACLE and its dependencies.
If you only care about using ORACLE to classify your own light curves, please check notebooks/tutorial.ipynb for an example.
Once the package has been installed, much of the functionality is exposed to the user through CLI commands. Specifically:
- oracle-train - Can be used to train new models
- oracle-test - Can be used to test the models
- oracle-runAnalysis - Can be used to generate summaries of the model's performance
- oracle-classSummaries - Can be used to summarize the training/testing data sets
- oracle-combineParquet - Can be used to combine parquet files of different classes
- oracle-fitsToParquet - Can be used to convert the SNANA FITS files to parquet files
- oracle-prepArrays - Can be used to prepare the arrays for training/testing the models
- oracle-timeBenchmark - Can be used to benchmark the inference performance of ORACLE
The package source includes the following scripts and modules:
- fits_to_parquet.py - Convert SNANA FITS files to parquet files
- combine_parquet.py - Combine the parquet files into a training and a testing set
- prep_array.py - Convert the test/train parquet files into pickle objects that can be ingested directly
- LSST_Source.py - Class for storing relevant data from the parquet files. Has additional functionality for data augmentation, flux curve plotting, etc.
- RNN_model.py - Class for the RNN classifier (a simplified sketch of this kind of architecture follows this list)
- train_RNN.py - Script for training the RNN classifier
- test_RNN.py - Script for testing the RNN classifier
- class_summaries.py - Summarize the number of objects in each class and the length of the time-series data for each of those objects
- dataloader.py - Convert the parquet rows to tensors. Augment the data with padding/truncation and transforms if necessary
- loss.py - Loss function for hierarchical classification
- taxonomy.py - Utility functions for the taxonomy used in this work
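For orientation, here is a heavily simplified sketch of the kind of architecture described at the top of this README: a GRU encodes the light curve, its final hidden state is concatenated with the contextual (host-galaxy) features, and a small head produces class logits. The framework choice (PyTorch), layer sizes, and the single flat output head are placeholder choices for illustration; the actual architecture lives in RNN_model.py and is described in the paper.

```python
import torch
import torch.nn as nn

class ToyLightCurveClassifier(nn.Module):
    """Simplified GRU classifier: light curve -> embedding, concatenated with
    contextual (host-galaxy) features, then a small classification head.
    Layer sizes and the flat output head are placeholders, not ORACLE's."""

    def __init__(self, n_lc_features=5, n_context=4, n_classes=32, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_lc_features, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + n_context, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, light_curve, context):
        # light_curve: (batch, time, features), context: (batch, n_context)
        _, h_n = self.gru(light_curve)   # h_n: (1, batch, hidden)
        embedding = h_n[-1]              # final hidden state for each sequence
        return self.head(torch.cat([embedding, context], dim=1))

# Example: a batch of 8 light curves, 50 epochs each, with 4 contextual features.
model = ToyLightCurveClassifier()
logits = model(torch.randn(8, 50, 5), torch.randn(8, 4))
print(logits.shape)  # torch.Size([8, 32])
```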
The ELAsTiCC 2 training data set contains 32 different classes of astrophysical objects. Each class has 80 FITS files: 40 PHOT files that contain the photometry, and 40 HEAD files that contain other information, such as the host-galaxy properties.
We want all of this information (i.e. light curves + host-galaxy properties) for each object in a convenient format before we start any data augmentation. For this reason, we bind the HEAD and PHOT FITS files, extract the relevant information, and store it as parquet files.
The code used for this conversion is in fits_to_parquet.py. This code is modified from an earlier version written by Kostya here.
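To illustrate the binding step, here is a minimal sketch assuming the standard SNANA layout, in which each HEAD row carries PTROBS_MIN/PTROBS_MAX pointers into the matching PHOT table. The file paths and column names here are illustrative and may not match the ELAsTiCC files exactly; the real conversion (including the host-galaxy columns that ORACLE uses) is in fits_to_parquet.py.

```python
import pandas as pd
from astropy.table import Table

def bind_head_phot(head_path: str, phot_path: str) -> pd.DataFrame:
    """Join SNANA HEAD metadata with the corresponding PHOT light curves."""
    head = Table.read(head_path, format="fits").to_pandas()
    phot = Table.read(phot_path, format="fits").to_pandas()

    rows = []
    for _, obj in head.iterrows():
        # SNANA pointers are 1-indexed and inclusive.
        lo, hi = int(obj["PTROBS_MIN"]) - 1, int(obj["PTROBS_MAX"])
        lc = phot.iloc[lo:hi]
        rows.append({
            "SNID": obj["SNID"],
            "HOSTGAL_PHOTOZ": obj["HOSTGAL_PHOTOZ"],
            "MJD": lc["MJD"].to_numpy(),
            "BAND": lc["BAND"].to_numpy(),
            "FLUXCAL": lc["FLUXCAL"].to_numpy(),
            "FLUXCALERR": lc["FLUXCALERR"].to_numpy(),
        })
    return pd.DataFrame(rows)

# df = bind_head_phot("<...>_HEAD.FITS.gz", "<...>_PHOT.FITS.gz")
# df.to_parquet("SNIa.parquet")
```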
- Download the data with this link
- Unpack it to data/data/elasticc2_train/raw
- Change into the data directory:
cd data/data/elasticc2_train
- Convert all the data to parquet:
ls raw | sed 's/ELASTICC2_TRAIN_02_//' | xargs -IXXX -P32 python3 ../../../fits_to_parquet.py raw/ELASTICC2_TRAIN_02_XXX parquet/XXX.parquet
- Next, we combine the per-class parquet files into a train parquet file and a test parquet file.
- Finally, we convert these to arrays, which we pickle and ingest into the classifier. A conceptual sketch of these last two steps follows.
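These last two steps are handled by oracle-combineParquet and oracle-prepArrays. Conceptually, the combine step amounts to something like the sketch below; the split fraction, random seed, and output paths are placeholders, and the real scripts (combine_parquet.py and prep_array.py) also build the pickled arrays that the dataloader ingests.

```python
from pathlib import Path
import pandas as pd

# Placeholder paths and split fraction; the real logic lives in combine_parquet.py.
per_class_files = sorted(Path("parquet").glob("*.parquet"))
frames = [pd.read_parquet(f) for f in per_class_files]
full = pd.concat(frames, ignore_index=True)

# Shuffle, then split into training and testing sets.
full = full.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_train = int(0.8 * len(full))
full.iloc[:n_train].to_parquet("train.parquet")
full.iloc[n_train:].to_parquet("test.parquet")
```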
You can find much more detailed information about the whole process in the paper (linked above).
There is no universally correct classification taxonomy; however, we want to build one that best serves real-world science cases. The leaf nodes clearly need to be the true classes of the objects, but what we choose for the nodes higher up in the taxonomy is ultimately determined by the science case.
For this work, we implement a hierarchy that will be of interest to the TVS (Transients and Variable Stars) community, since it overlaps well with the classes in the ELAsTiCC data set. The exact taxonomy used is shown below:
A trap we wanted to avoid was mixing different "metaphors" for classification. For instance, we decided against a Galactic vs. extragalactic split, since it would mix a spatial distinction with temporal ones (like periodic vs. non-periodic). This makes the problem trickier because some objects, like Cepheids, can be both Galactic and extragalactic, which would artificially inflate the number of leaf nodes without adding much value to the science case.
Note: There is some inconsistency around the Mira/LPV object in the ELAsTiCC documentation; however, we have confirmed that this class was removed from the ELAsTiCC 2 data set that we use in this work.
Once again, we have a much more detailed discussion in the paper.
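As a toy illustration of how such a hierarchy can be encoded and traversed (for example, to obtain the root-to-leaf path of labels that a hierarchical loss needs), consider the sketch below. The node names are placeholders chosen for brevity and are not the exact ORACLE taxonomy; see taxonomy.py and the paper for that.

```python
# Toy nested-dict taxonomy; leaf classes are stored in lists.
TOY_TAXONOMY = {
    "Alert": {
        "Transient": {"SN-like": ["SNIa", "SNII"], "Fast": ["KN"]},
        "Variable": {"Periodic": ["RR Lyrae", "Cepheid"], "Non-periodic": ["AGN"]},
    }
}

def leaf_paths(tree, prefix=()):
    """Yield (leaf, path-from-root) pairs for a nested dict/list taxonomy."""
    if isinstance(tree, dict):
        for node, child in tree.items():
            yield from leaf_paths(child, prefix + (node,))
    else:  # a list of leaf class names
        for leaf in tree:
            yield leaf, prefix + (leaf,)

for leaf, path in leaf_paths(TOY_TAXONOMY):
    print(leaf, "->", " / ".join(path))
```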
The loss function defines what the machine-learning model is incentivized to learn. In our case, we want the activations in the final layer to represent (pseudo-)probabilities for the classes in our taxonomy. The first loss function that we experiment with is the Villar et al. hierarchical cross-entropy loss for classification. The paper describing this loss function is available on arXiv.
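To make the idea concrete, here is a minimal PyTorch sketch of a weighted, per-level hierarchical cross-entropy: a standard cross-entropy term is evaluated at every level of the taxonomy and the terms are combined with weights that decay with depth. The exponential weighting with a single alpha parameter is one common choice, not necessarily the one used here; the exact conditioning, masking, and weighting that ORACLE uses live in loss.py.

```python
import torch
import torch.nn.functional as F

def hierarchical_cross_entropy(level_logits, level_targets, alpha=0.5):
    """
    level_logits : list of tensors, one (batch, n_classes_at_level) per level,
                   ordered root -> leaf.
    level_targets: list of integer class-index tensors, one (batch,) per level.
    alpha        : controls how quickly deeper levels are down-weighted.
    """
    loss = 0.0
    for depth, (logits, target) in enumerate(zip(level_logits, level_targets)):
        weight = torch.exp(torch.tensor(-alpha * depth))  # shallower levels count more
        loss = loss + weight * F.cross_entropy(logits, target)
    return loss

# Example: a two-level taxonomy with 3 coarse and 8 fine classes, batch of 4.
coarse, fine = torch.randn(4, 3), torch.randn(4, 8)
y_coarse, y_fine = torch.tensor([0, 1, 2, 1]), torch.tensor([0, 3, 7, 4])
print(hierarchical_cross_entropy([coarse, fine], [y_coarse, y_fine]))
```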
We would appreciate any support in the form of bug reports so that we can provide the best possible experience. Bugs can be reported in the Issues tab.
If you like what we're doing and want to contribute to the project, please start by reaching out to the developer. We are happy to accept suggestions in the Issues tab.
Overall model performance, shown first at the root of the taxonomy, then at the next level in the hierarchy, and finally at the leaves (see the performance figures in the paper).