[Paper] [Dataset] [Model] [Colab]
Sophon (Signature-Oriented Pre-training for Heavy-resonance ObservatioN) is a method for developing foundation AI models tailored for future use in LHC experimental analyses. The approach pre-trains a model on a comprehensive jet dataset designed to capture a wide variety of jet signatures.
This work introduces:
- The JetClass-II dataset: a large-scale and comprehensive large-R jet dataset.
- The Sophon model: a Particle Transformer model pre-trained on a 188-class classification task utilizing the JetClass-II dataset.
Further details are provided below.
JetClass-II is a large-scale and comprehensive dataset covering extensive large-radius jet signatures and a wide range of jet
The dataset consists of three major parts:
- `Res2P`: Generic $X \to$ 2-prong resonant jets.
- `Res34P`: Generic $X \to$ 3- or 4-prong resonant jets.
- `QCD`: Jets from QCD multijet background.
Each part is further subdivided into detailed categories, indicating which partons, leptons, or combinations thereof initiated the jet.
The dataset can be downloaded from [HuggingFace]. The three major parts (`Res2P`, `Res34P`, and `QCD`) are packed separately and can be downloaded individually for ease of use. The sizes of the training sets are 20M, 86M, and 28M entries, respectively. The dataset also includes validation and test sets, with the sizes for training/validation/test following a 4:1:1 ratio.
Each Parquet file stores 100k entries (jets). A complete view of the JetClass-II data files is shown in the table below.
| Type | File name range | Number of files | Total entries |
|---|---|---|---|
| `Res2P`, train | `Res2P_0000.parquet` to `Res2P_0199.parquet` | 200 | 20M |
| `Res2P`, val | `Res2P_0200.parquet` to `Res2P_0249.parquet` | 50 | 5M |
| `Res2P`, test | `Res2P_0250.parquet` to `Res2P_0299.parquet` | 50 | 5M |
| `Res34P`, train | `Res34P_0000.parquet` to `Res34P_0859.parquet` | 860 | 86M |
| `Res34P`, val | `Res34P_0860.parquet` to `Res34P_1074.parquet` | 215 | 21.5M |
| `Res34P`, test | `Res34P_1075.parquet` to `Res34P_1289.parquet` | 215 | 21.5M |
| `QCD`, train | `QCD_0000.parquet` to `QCD_0279.parquet` | 280 | 28M |
| `QCD`, val | `QCD_0280.parquet` to `QCD_0349.parquet` | 70 | 7M |
| `QCD`, test | `QCD_0350.parquet` to `QCD_0419.parquet` | 70 | 7M |
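The file layout in the table can be encoded in a small helper for building per-split file lists. This is a minimal sketch: the names `SPLITS`, `list_files`, and `total_entries` are illustrative and not part of the Sophon repository; only the index ranges and the 100k-entries-per-file figure come from the table.

```python
# Inclusive (first, last) file indices per part and split, copied from the table.
SPLITS = {
    "Res2P": {"train": (0, 199), "val": (200, 249), "test": (250, 299)},
    "Res34P": {"train": (0, 859), "val": (860, 1074), "test": (1075, 1289)},
    "QCD": {"train": (0, 279), "val": (280, 349), "test": (350, 419)},
}
ENTRIES_PER_FILE = 100_000  # each Parquet file stores 100k jets


def list_files(part, split):
    """Return the Parquet file names belonging to one part and split."""
    first, last = SPLITS[part][split]
    return [f"{part}_{i:04d}.parquet" for i in range(first, last + 1)]


def total_entries(part, split):
    """Number of jets in one part and split, assuming every file is full."""
    return len(list_files(part, split)) * ENTRIES_PER_FILE


if __name__ == "__main__":
    for part in SPLITS:
        counts = [len(list_files(part, s)) for s in ("train", "val", "test")]
        print(part, counts)  # file counts follow the 4:1:1 ratio
```

For example, `list_files("QCD", "val")` starts at `QCD_0280.parquet` and contains 70 names.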
Use [Colab] to inspect and visualize data in JetClass-II.
Here are some visualizations of jets marked with the top-5 probability scores interpreted by the Sophon model (see the Sophon model's section below).
The dataset is generated using MadGraph + Pythia + Delphes.
During the Delphes (fast simulation) step, the pileup (PU) effect is emulated with an average of 50 PU interactions to mimic the realistic LHC collision environment. The PUPPI algorithm is then applied to remove the PU, correcting the E-flow objects used to cluster jets. This pileup treatment distinguishes JetClass-II from the original JetClass dataset. The Delphes card can be found in the `jetclass2-generation` repository.
The complete generation script (the one-stop MadGraph + Pythia + Delphes production) and the n-tuplizer script are provided in the `jetclass2-generation` repository to facilitate reproducibility.
The JetClass-II dataset includes the following variables:
- `part_*`: Features for jet constituent particles (i.e., E-flow objects in Delphes).
- `jet_*`: Features for jets. One notable variable is `jet_label`, which indicates the label among the 188 classes.
- `genpart_*`: Features for generator-level jet (GEN-jet) constituent particles. The GEN-jet is clustered from the stable particles generated by Pythia, excluding neutrinos, using the same clustering configuration. The GEN-jets are matched with jets based on angular separation. The entry is left empty if no matched GEN-jet is found.
- `genjet_*`: Jet-level features for the matched GEN-jet.
- `aux_genpart_*`: Auxiliary variables storing features of selected truth particles. Five types of particles are chosen if they are valid:
  1. The initial resonance $X$ (in both 2-prong and 3/4-prong resonance cases).
  2. The two secondary resonances $Y$ produced by $X$ ($X \to Y_1Y_2$) in the 3/4-prong resonance case.
  3. The direct decay products (partons and leptons) from $X$ and $Y$.
  4. The subsequent decay products of tau leptons in item 3.
  5. The partons ($p_{\rm T} > 5$ GeV) matched within a QCD jet.
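The GEN-jet matching by angular separation mentioned above can be illustrated with a short sketch. It assumes the usual $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$ distance and a hypothetical matching threshold `max_dr`; the actual criterion and threshold used by the n-tuplizer are not stated here, so treat both as assumptions.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def match_gen_jet(jet, gen_jets, max_dr=0.8):
    """Return the index of the closest GEN-jet within max_dr, or None.

    `jet` and each element of `gen_jets` are (eta, phi) tuples.
    `max_dr` is a hypothetical threshold chosen for illustration only.
    """
    best_index, best_dr = None, max_dr
    for i, (gen_eta, gen_phi) in enumerate(gen_jets):
        dr = delta_r(jet[0], jet[1], gen_eta, gen_phi)
        if dr < best_dr:
            best_index, best_dr = i, dr
    return best_index


if __name__ == "__main__":
    # The first GEN-jet sits just across the phi = +-pi boundary from the jet.
    print(match_gen_jet((0.0, 3.1), [(0.0, -3.1), (1.5, 0.0)]))
```

A return value of `None` corresponds to the case where the `genpart_*` and `genjet_*` entries are left empty.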
**Detailed descriptions of the JetClass-II variables and a comparison with JetClass variables**
| Variable | Type | Description | Exists in JetClass? |
|---|---|---|---|
| **For jet constituent particles** | | | |
| `part_px` | `vector<float>` | particle's $p_x$ | ✔️ |
| `part_py` | `vector<float>` | particle's $p_y$ | ✔️ |
| `part_pz` | `vector<float>` | particle's $p_z$ | ✔️ |
| `part_energy` | `vector<float>` | particle's energy | ✔️ |
| `part_deta` | `vector<float>` | difference in pseudorapidity $\eta$ | ✔️ |
| `part_dphi` | `vector<float>` | difference in azimuthal angle $\phi$ | ✔️ |
| `part_d0val` | `vector<float>` | particle's transverse impact parameter value | ✔️ |
| `part_d0err` | `vector<float>` | error of the particle's transverse impact parameter | ✔️ |
| `part_dzval` | `vector<float>` | particle's longitudinal impact parameter value | ✔️ |
| `part_dzerr` | `vector<float>` | error of the particle's longitudinal impact parameter | ✔️ |
| `part_charge` | `vector<int32_t>` | particle's electric charge | ✔️ |
| `part_isElectron` | `vector<bool>` | if the particle is an electron (`abs(pid)==11`) | ✔️ |
| `part_isMuon` | `vector<bool>` | if the particle is a muon (`abs(pid)==13`) | ✔️ |
| `part_isPhoton` | `vector<bool>` | if the particle is a photon (`pid==22`) | ✔️ |
| `part_isChargedHadron` | `vector<bool>` | if the particle is a charged hadron (`charge!=0 && !isElectron && !isMuon`) | ✔️ |
| `part_isNeutralHadron` | `vector<bool>` | if the particle is a neutral hadron (`charge==0 && !isPhoton`) | ✔️ |
| **For jet** | | | |
| `jet_pt` | `float` | jet's transverse momentum $p_{\rm T}$ | ✔️ |
| `jet_eta` | `float` | jet's pseudorapidity $\eta$ | ✔️ |
| `jet_phi` | `float` | jet's azimuthal angle $\phi$ | ✔️ |
| `jet_energy` | `float` | jet's energy | ✔️ |
| `jet_sdmass` | `float` | jet's soft-drop mass | ✔️ |
| `jet_nparticles` | `int32_t` | number of jet constituent particles | ✔️ |
| `jet_tau1` | `float` | jet's $\tau_1$ | ✔️ |
| `jet_tau2` | `float` | jet's $\tau_2$ | ✔️ |
| `jet_tau3` | `float` | jet's $\tau_3$ | ✔️ |
| `jet_tau4` | `float` | jet's $\tau_4$ | ✔️ |
| `jet_label` | `int32_t` | jet's label index in JetClass-II, detailed in the above table | 🆕 |
| **For GEN-jet constituent particles (if a GEN-jet is found matched to a jet)** | | | |
| `genpart_px` | `vector<float>` | particle's $p_x$ | 🆕 |
| `genpart_py` | `vector<float>` | particle's $p_y$ | 🆕 |
| `genpart_pz` | `vector<float>` | particle's $p_z$ | 🆕 |
| `genpart_energy` | `vector<float>` | particle's energy | 🆕 |
| `genpart_jet_deta` | `vector<float>` | difference in pseudorapidity $\eta$ | 🆕 |
| `genpart_jet_dphi` | `vector<float>` | difference in azimuthal angle $\phi$ | 🆕 |
| `genpart_x` | `vector<float>` | | 🆕 |
| `genpart_y` | `vector<float>` | | 🆕 |
| `genpart_z` | `vector<float>` | | 🆕 |
| `genpart_t` | `vector<float>` | | 🆕 |
| `genpart_pid` | `vector<int32_t>` | particle's PDGID | 🆕 |
| **For GEN-jet (if matched to a jet)** | | | |
| `genjet_pt` | `float` | GEN-jet's transverse momentum $p_{\rm T}$ | 🆕 |
| `genjet_eta` | `float` | GEN-jet's pseudorapidity $\eta$ | 🆕 |
| `genjet_phi` | `float` | GEN-jet's azimuthal angle $\phi$ | 🆕 |
| `genjet_energy` | `float` | GEN-jet's energy | 🆕 |
| `genjet_sdmass` | `float` | GEN-jet's soft-drop mass | 🆕 |
| `genjet_nparticles` | `int32_t` | number of GEN-jet constituent particles | 🆕 |
| **For selected truth particles** | | | |
| `aux_genpart_pt` | `vector<float>` | selected truth particles' $p_{\rm T}$ | ✔️ (different rules to select truth particles) |
| `aux_genpart_eta` | `vector<float>` | selected truth particles' $\eta$ | ✔️ (different rules to select truth particles) |
| `aux_genpart_phi` | `vector<float>` | selected truth particles' $\phi$ | ✔️ (different rules to select truth particles) |
| `aux_genpart_mass` | `vector<float>` | selected truth particles' mass | ✔️ (different rules to select truth particles) |
| `aux_genpart_pid` | `vector<int32_t>` | selected truth particles' PDGID | 🆕 |
| `aux_genpart_isResX` | `vector<bool>` | if the particle is the initial resonance $X$ | 🆕 |
| `aux_genpart_isResY` | `vector<bool>` | if the particle is the secondary resonance $Y$ | 🆕 |
| `aux_genpart_isResDecayProd` | `vector<bool>` | if the particle is the direct decay product (parton and lepton) from $X$ and $Y$ | 🆕 |
| `aux_genpart_isTauDecayProd` | `vector<bool>` | if the particle is the subsequent decay product of tau leptons | 🆕 |
| `aux_genpart_isQcdParton` | `vector<bool>` | if the particle is the parton with $p_{\rm T} > 5$ GeV matched within a QCD jet | 🆕 |
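The boolean `aux_genpart_is*` flags can serve as masks over the other `aux_genpart_*` vectors. The sketch below assumes one jet entry has already been loaded into plain Python lists; the `entry` values, including the PDG IDs, are made up for illustration and do not come from the dataset, and the helper `select` is hypothetical.

```python
def select(values, flags):
    """Keep the values whose corresponding boolean flag is True."""
    return [value for value, keep in zip(values, flags) if keep]


# Toy entry with invented values: one resonance and its two decay products.
entry = {
    "aux_genpart_pid": [25, 5, -5],
    "aux_genpart_isResX": [True, False, False],
    "aux_genpart_isResDecayProd": [False, True, True],
}

resonance_pids = select(entry["aux_genpart_pid"], entry["aux_genpart_isResX"])
product_pids = select(entry["aux_genpart_pid"], entry["aux_genpart_isResDecayProd"])

print(resonance_pids, product_pids)  # [25] [5, -5]
```

The same pattern applies to `aux_genpart_pt`, `aux_genpart_eta`, `aux_genpart_phi`, and `aux_genpart_mass`, since all `aux_genpart_*` vectors of one entry describe the same selected particles.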
The Sophon model is based on the ParT architecture. It is implemented in PyTorch, with training based on the `weaver` framework for dataset loading and transformation. To install `weaver`, run:
pip install git+https://github.com/hqucms/weaver-core.git@dev/custom_train_eval
Note: We are temporarily using a development branch of `weaver`.
For instructions on setting up Miniconda and installing PyTorch, refer to the `weaver` page.
Download the JetClass-II dataset from [HuggingFace]. The training and validation files are used in this work, while the test files are not used.
Ensure that all data files are accessible from:
./datasets/JetClassII/Pythia/{Res2P,Res34P,QCD}_*.parquet
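Before training, it can help to confirm that the expected files are in place. The following is a minimal sketch, not part of the repository: the function `missing_files` is hypothetical, and the expected counts cover only the training and validation files (indices `0000` up to the last validation file of each part, as listed in the data file table).

```python
from pathlib import Path

# Train + val file counts per part (test files are not used in this work).
EXPECTED = {"Res2P": 250, "Res34P": 1075, "QCD": 350}


def missing_files(root="./datasets/JetClassII/Pythia"):
    """Return the names of expected train/val Parquet files not found under root."""
    root = Path(root)
    missing = []
    for part, count in EXPECTED.items():
        for i in range(count):
            name = f"{part}_{i:04d}.parquet"
            if not (root / name).is_file():
                missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_files()
    print(f"{len(missing)} file(s) missing")
```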
Step 1: Generate dataset sampling weights according to the `weights` section in the data configuration. The processed config with pre-calculated weights will be saved to `data/JetClassII`.
./train_sophon.sh make_weight
Step 2: Start training.
./train_sophon.sh train
Note: Depending on your machine and GPU configuration, additional settings may be useful. Here are a few examples:
- Enable PyTorch's DDP for parallel training, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3 DDP_NGPUS=4 ./train_sophon.sh train --start-lr 2e-3` (the learning rate should be scaled according to `DDP_NGPUS`).
- Configure the number of data loader workers and the number of splits for the entire dataset. The script uses the default configuration `--num-workers 5 --data-split-num 200`, which means there are 5 workers, each responsible for 1/5 of the data files and reading the data synchronously; the data assigned to each worker is split into 200 parts, and each worker reads its parts sequentially, one at a time.
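The effect of `--num-workers 5 --data-split-num 200` can be pictured with a toy model. This is not `weaver`'s implementation: the round-robin assignment of files to workers and the slicing of each file into equal entry ranges are assumptions made only to illustrate the description above, and the function names are hypothetical.

```python
def assign_files(files, num_workers):
    """Give each worker every num_workers-th file (round-robin; an assumption)."""
    return [files[w::num_workers] for w in range(num_workers)]


def part_ranges(entries_per_file, split_num):
    """Entry ranges [start, stop) of one file for each of the split_num parts."""
    return [
        (k * entries_per_file // split_num, (k + 1) * entries_per_file // split_num)
        for k in range(split_num)
    ]


if __name__ == "__main__":
    files = [f"Res2P_{i:04d}.parquet" for i in range(200)]  # Res2P training files
    per_worker = assign_files(files, num_workers=5)
    ranges = part_ranges(100_000, split_num=200)
    # Each worker handles 40 files and, in this toy model, reads 500 entries
    # per file for each of the 200 parts.
    print(len(per_worker[0]), ranges[0], ranges[-1])
```

Under these assumptions, a larger `--data-split-num` means less data held in memory per part, at the cost of more read operations.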
Step 3 (optional): Convert the model to ONNX.
./train_sophon.sh convert
We provide two ways to run inference with the Sophon model: in Python and in C++ (with C++ macros for analyzing Delphes files).
For the Python workflow, please refer to our Jupyter notebook example on [Colab]; the section "Inferring Sophon model" gives detailed instructions.
For details on using the C++ workflow, please see the `./analyzers` directory.
If you use the JetClass-II dataset or the Sophon model, please cite:
@article{Li:2024htp,
author = "Li, Congqiao and Agapitos, Antonios and Drews, Jovin and Duarte, Javier and Fu, Dawei and Gao, Leyun and Kansal, Raghav and Kasieczka, Gregor and Moureaux, Louis and Qu, Huilin and Suarez, Cristina Mantilla and Li, Qiang",
title = "{Accelerating Resonance Searches via Signature-Oriented Pre-training}",
eprint = "2405.12972",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
month = "5",
year = "2024"
}