This repository is an implementation of ML-based jet-flavor tagging for full simulation using the key4hep framework.
The default tagger in this repo is trained on
- CLD full simulation (CLD_o2_v05)
- $\sqrt{s} = 240$ GeV
- $Z(\nu\nu)H(jj)$ events

The tagger distinguishes seven flavors (U, D, S, C, B, G, TAU).
We build a Gaudi Transformer `JetTagger` (in `k4MLJetTagger/k4MLJetTagger/src/components/JetTagger.cpp`) that works as follows:
- First, it extracts jet constituent variables (such as kinematics, track parameters, PID, ...) from every jet in the event with `JetObservablesRetriever`.
- Then, it uses these variables as input to a neural network (Particle Transformer). Here, we run inference on an ONNX export of a network trained with weaver on 2 million jets/flavor. The code is in `WeaverInterface` and `ONNXRuntime`.
- Finally, it creates $N$ (here: 7) new collections `RefinedJetTag_X` that save the probability for each flavor.
This code base also allows you to
- extract the MC jet flavor assuming H(jj)Z(vv) events by checking the PDG of the daughter particles of the Higgs boson created. This is also implemented as a Gaudi Transformer, `JetMCTagger` (a small illustrative sketch follows below this list).
- write the jet constituent observables used for tagging into a root file (e.g., for retraining a model) using `JetObsWriter`, which accesses the observables retrieved in `JetObservablesRetriever`.
- write the jet tags (MC and reco) into a root file (e.g., to create ROC curves) using `JetTagWriter`.
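For illustration, the sketch below shows the same idea (finding the Higgs and inspecting its daughters' PDG codes) with the podio Python reader. It is not the `JetMCTagger` implementation, and the file and collection names are assumptions.

```python
from podio.root_io import Reader

reader = Reader("events.edm4hep.root")  # hypothetical input file name
for event in reader.get("events"):
    mc_particles = event.get("MCParticles")  # collection name may differ in your files
    for particle in mc_particles:
        if particle.getPDG() == 25:  # Higgs boson
            daughter_pdgs = [d.getPDG() for d in particle.getDaughters()]
            print("Higgs daughters:", daughter_pdgs)  # e.g. [5, -5] for H -> bb
```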
This project depends on:
- ROOT
- PODIO
- Gaudi
- k4FWCore
Within the `k4MLJetTagger` directory run:

```bash
cd /your/path/to/this/repo/k4MLJetTagger/
source ./setup.sh
k4_local_repo
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=../install -G Ninja -DPython_EXECUTABLE=$(which python3)
ninja install
```
Run the tagger by including the transformer `JetTagger` in a steering file like `createJetTags.py` and run it like this:

```bash
k4run ../k4MLJetTagger/options/createJetTags.py --num_ev 20
```

This will return your edm4hep input file with the added `RefinedJetTag_X` collections.
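To quickly verify that the new collections are present in the output, you can loop over them with the podio Python reader. This is only a sketch; the output file name depends on your steering file.

```python
from podio.root_io import Reader

reader = Reader("output_with_tags.edm4hep.root")  # hypothetical output file name
for event in reader.get("events"):
    for flavor in ["U", "D", "S", "C", "B", "G", "TAU"]:
        coll = event.get(f"RefinedJetTag_{flavor}")
        print(f"RefinedJetTag_{flavor}: {len(coll)} entries")
    break  # inspecting the first event is enough for a sanity check
```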
There are four steering files provided in this repo in `/k4MLJetTagger/k4MLJetTagger/options/`. They either start with `create`, which refers to a steering file that will append a new collection to the input edm4hep files provided, or they start with `write` and only produce root files as an output.
- `createJetTags.py`: tags every jet using ML and appends 7 new PID collections `RefinedJetTag_X`, with `X` being the 7 flavors (U, D, S, C, B, G, TAU).
- `createJetMCTag.py`: appends one PID collection, `MCJetTag`, that refers to the MC jet flavor. Warning: this assumes H(jj)Z(vv) events, as it checks the PDG of the daughter particles of the Higgs boson in the event.
- `writeJetConstObs.py`: creates a root file with jet constituent observables that can be used to train a model or to plot the input parameters to the network for insights about the data.
- `writeJetTags.py`: creates a root file with reco and MC jet tags that can be used to create ROC curves.
If you want to use the jet tagger for your analyses, you most likely only need a setup like in the steering file `createJetTags.py`, which includes the tagger in your steering file. As said above, this will attach 7 new PID collections to your edm4hep input file. You can then access the PID collections with a PIDHandler, as done in `JetTagWriter` (check it out).
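For orientation, here is a minimal steering-file sketch, assuming the k4FWCore/IOSvc steering style. The file names are placeholders, and the `JetTagger` properties shown (`model_path`, `json_path`, `flavor_collection_names`) are the ones described later in this README; take the real `createJetTags.py` in this repo as the authoritative reference, since it sets further properties (e.g. the input collections).

```python
from Gaudi.Configuration import INFO
from k4FWCore import ApplicationMgr, IOSvc
from Configurables import EventDataSvc, JetTagger

io_svc = IOSvc("IOSvc")
io_svc.Input = "my_events.edm4hep.root"             # hypothetical input file
io_svc.Output = "my_events_with_tags.edm4hep.root"  # hypothetical output file

flavors = ["U", "D", "S", "C", "B", "G", "TAU"]
tagger = JetTagger(
    "JetTagger",
    model_path="/path/to/your/model.onnx",
    json_path="/path/to/your/model_config.json",
    flavor_collection_names=[f"RefinedJetTag_{f}" for f in flavors],
)

ApplicationMgr(
    TopAlg=[tagger],
    EvtSel="NONE",
    EvtMax=-1,
    ExtSvc=[EventDataSvc("EventDataSvc")],
    OutputLevel=INFO,
)
```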
For every flavor there is one PID collection, and every jet has an associated probability of being of each flavor. E.g., if you want to use all jets that have at least a probability `X` of being a b-jet:
```cpp
// loop over all jets and get the PID likelihood of being a b-jet
for (const auto& jet : jet_coll) {
  auto jetTags_B = jetTag_B_Handler.getPIDs(jet);
  // [0] because there should only be ONE b-tag PID object associated to one jet
  float score_recojet_isB = jetTags_B[0].getLikelihood();
  if (score_recojet_isB > X) {
    // your code here
  }
}
```
Here is a quick overview of the source files in this repo:

Gaudi Transformers:
- `JetTagger.cpp`: Gaudi Transformer to attach the jet tags (7) as PID collections to the input edm4hep file
- `JetMCTagger.cpp`: Gaudi Transformer to attach the jet MC tag as a PID collection to the input edm4hep file

Gaudi Algorithms:
- `JetTagWriter`: Gaudi Algorithm to write reco and MC jet tags into a root file
- `JetObsWriter`: Gaudi Algorithm to write jet constituent observables into a root file

Other C++ helpers:
- `JetObservablesRetriever`: defines a class to retrieve jet constituent observables from the jet collection and vertex collection.
- `ONNXRuntime`: interacts with the ONNX model for inference.
- `WeaverInterface`: wrapper around `ONNXRuntime` to match the format expected from training the network with weaver.
- `Structs.h`: defines the struct `Pfcand` for saving information about the jet constituents, the struct `Helix` for saving track parameters, and the struct `Jet`, which is a vector of `Pfcand`.
- `Helpers`: other helpers
You will need to retrain the model
- if you want to run at a different energy,
- if you want to use different input observables for tagging (check out this section), or
- if you want to use a different detector setup.

So, generally speaking, retraining is needed whenever the input to the network changes. The network implemented here was trained on CLD full simulation at 240 GeV (`/eos/experiment/fcc/prod/fcc/ee/test_spring2024/240gev/Hbb/CLD_o2_v05/rec/`). Check out the performance in this publication.
- Do some changes that require retraining.
- Create a dataset to train the model on. The data should be stored in root files, and you can use the `JetObsWriter` to create them. You normally need ~1-2 million jets per flavor to train a new model; that is why I recommend creating the root files using condor. Follow the instructions here to do so.
- I recommend using weaver to retrain a model and to make use of Dolores' implementation of the ParticleTransformer using her repository/version of weaver. The branch `Saras-dev` also has the implementation of the L-GATr network, which, however, does not outperform the ParticleTransformer. Here is how to set up the training with weaver:
  - Clone weaver.
  - Clone your choice of network, e.g. the ParT.
  - Set up the environment using a docker image. First, create the `.sif` file with `singularity pull docker://dologarcia/gatr:v0` (or the corresponding `:v...` image tag for L-GATr). You only need to do this once. Then export the cache with `export APPTAINER_CACHEDIR=/your/path/to/cache`. You can then activate the env with `singularity shell -B /your/bindings/ --nv colorsinglet.sif`. Bindings could be `-B /eos -B /afs` if running on CERN resources.
  - Create a `.yaml` file to specify how to train the network. To use the jet observable convention used here, I have created a dummy config file in the `extras` section: `config_for_weaver_training.yaml`. Please check out the open issues to adapt the convention in the source code, but I highly recommend using the convention used in key4hep / here / in the dummy config when retraining the model and making the changes in the source code.
  - Create a wandb account. You need to connect your docker env once with your wandb account: activate your env with `singularity shell -B /your/bindings/ --nv colorsinglet.sif` and then do `wandb login`.
- Run the training:
  - Go to a machine of your choice with at least one GPU. The command below shows how to run on 4 GPUs.
  - (Optionally export your cache directory and) activate your env with `singularity shell -B /your/bindings/ --nv colorsinglet.sif`.
  - Go to `/path/to/weaver-core/`.
  - Run the following command:
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=4 /your/path/to/weaver-core/weaver/train.py \
  --data-train /your/path/to/data/*.root \
  --data-config /your/path/to/configs/config_for_weaver_training.yaml \
  --network-config /your/path/to/particle_transformer/networks/example_ParticleTransformer.py \
  --model-prefix /your/path/to/model_weights/mykey4hepmodel/ \
  --num-workers 0 --gpus 0,1,2,3 \
  --batch-size 2048 --start-lr 1e-3 --num-epochs 60 \
  --optimizer ranger --fetch-step 0.01 \
  --log-wandb --wandb-displayname myfirsttraining \
  --wandb-projectname FCC-tagging-with-key4hep \
  --lr-scheduler reduceplateau --backend nccl
```
The `--nproc_per_node=4` option specifies that you have 4 GPUs, and `--gpus 0,1,2,3` which ones you would like to use. For every training on a different dataset, create a new config file and use the `.auto.yaml` created by weaver to run inference later. Create a folder where to store the model weights (`--model-prefix /your/path/to/model_weights/mykey4hepmodel/`). You can load model weights by adding `--load-model-weights /your/path/to/model_weights/old_training/_epoch-X_state.pt` to the command, with `X` being the epoch you want to load. To find out more, check out the wiki.
- Once satisfied with the training, you should have your model weights saved like `/your/path/to/model_weights/mykey4hepmodel/_best_epoch_state.pt`.
- Optional: You can also run inference using weaver. Please use data other than what you used for training to run inference. Use the auto-generated config file from training to run the inference. You can use the code provided in `extras/plotting` to check if your `results.root` file matches the expected performance.
```bash
python3 -m weaver.train --predict --data-test /your/path/to/test-data/*.root \
  --data-config your/path/to/configs/config_for_weaver_training.248dcd877468b36a361a73654bb4e913.auto.yaml \
  --network-config /your/path/to/particle_transformer/networks/example_ParticleTransformer.py \
  --model-prefix /your/path/to/model_weights/mykey4hepmodel/_best_epoch_state.pt \
  --gpus 0 --batch-size 64 \
  --predict-output /your/path/to/mykey4heptagging/results.root
```
- Export your `.pt` model to ONNX. Follow these instructions.
- Check if you need to make adjustments to the source code. Follow these instructions.
- Run inference and check the performance. Help yourself using the scripts in `extras/plotting`.
- Be proud of yourself. Well done :)
If you wish to use a different model for tagging, you will need to export the trained model to ONNX. Here, we describe how to transform a model trained with weaver from `.pt` to `.onnx`.

To export your favorite model `best_model.pt` (e.g. Particle Transformer) using `weaver` to ONNX, run:
```bash
python3 -m weaver.train \
  -c myConfigFromTraining.auto.yaml \
  -n /path-to/particle_transformer/networks/example_ParticleTransformer.py \
  -m /path-to/best_model.pt \
  --export-onnx my-onnx-model.onnx
```
For that, we need an appropriate environment, as the one provided by `weaver` does not work for the conversion to ONNX. The environment can be set up with the YAML file in the `extras` folder like:

```bash
conda env create -f env_for_onnx.yml
conda activate weaver
```

`torch_geometric` is still not supported by this environment (not needed for using the Particle Transformer, but e.g. for L-GATr).
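As an optional quick check (a sketch, not part of the repo), you can verify that the exported file loads and inspect its input/output names with the `onnx` and `onnxruntime` Python packages before wiring it into the tagger:

```python
import onnx
import onnxruntime as ort

model = onnx.load("my-onnx-model.onnx")
onnx.checker.check_model(model)  # raises if the exported graph is malformed

session = ort.InferenceSession("my-onnx-model.onnx")
print("inputs :", [(i.name, i.shape) for i in session.get_inputs()])
print("outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```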
To use the new model with the tagger in this repo:
- You need to change the paths to the model and its JSON config file in the steering file (here: `k4MLJetTagger/k4MLJetTagger/options/createJetTags.py`) by setting `model_path` and `json_path` in the `JetTagger` transformer initialization.
- You should not need to change anything apart from the steering file, assuming:
  - You adapted the `flavor_collection_names` in the steering file `createJetTags.py` to match the order, labels, and size that the network expects. E.g., if the network expects the first output to represent the probability of a $b$-jet, then the first item in the list `flavor_collection_names` needs to be `yourCollectionName_B`. If your network distinguishes between $n$ flavors, make sure to provide $n$ collection names.
  - You used weaver to train your model. (If not, you need to adapt a lot: start building your own `WeaverInterface` header and source file, adapt the way the struct `Jet` is transformed to fit the input format expected by your network (here done in `Helpers` with the function `from_Jet_to_onnx_input`), and change the handling of the `json` config file if needed, including the extraction of all necessary inputs in the `tagger` function in `JetTagger.cpp`.)
  - The `output_names` of the model in the JSON config file have the format `yourname_isX`. If this changes (e.g. to `_X`), you need to adapt the `check_flavors` function and the `to_PDGflavor` map in `Helpers`.
  - The naming of the input observables follows the FCCAnalyses convention. (As I don't like it, I use my own. Therefore, I have written a `VarMapper` class in `Helpers` that converts into my own key4hep convention. If you work with other conventions, just update the `VarMapper`.) I hope that in the future, people will adopt my convention for training the network, too, and then `VarMapper` will not be needed anymore. Read this section to find out how to adapt the code.
  - You use the same (or fewer) input parameters to the network. In case you want to extract more, have a look at `JetObservablesRetriever` and modify the `Pfcand` struct in `Structs.h`.
If you want to use different input observables for tagging:
- Extract the wanted parameter in `JetObservablesRetriever` and modify the `Pfcand` struct in `Structs.h` by adding the new observables as attributes.
- Modify `JetObsWriter` and add your new observable to be saved in the output root file.
- Retrieve a root file (default `jetconst_obs.root`) by running `k4run ../k4MLJetTagger/options/writeJetConstObs.py`, which uses the `JetObsWriter`. To create larger datasets, submit the jobs to condor (see `extras/submit_to_condor`), as explained here.
- Use the root output (`jetconst_obs.root`, or, to be more precise, the root files from your condor submission, because you need plenty of data to retrain a model) to retrain the model.
- Convert your trained model to ONNX as explained above.
You may find helpful resources in the `extras` folder.
- Creation of a conda env for ONNX export: To export a model trained with weaver, you need an appropriate environment. You can set it up with `env_for_onnx.yml`. See this section.
- Example weaver config file: `config_for_weaver_training.yaml` is an example config file to be used for training a network with weaver using the key4hep convention for jet constituent observable names.
- Submitting jobs to condor: You will find `jettagswriter.sub`, which submits the `writeJetTags.py` steering file to condor to run it on large data samples. It produces root files with the saved jet tags (reco from the tagger and MC). Please run `python write_sub_JetTagsWriter.py` to create the `.sub` file with your output paths, data, etc. by modifying the python script. The same is true for `jetobswriter.sub`, which submits a job to condor to run the `writeJetConstObs.py` steering file. For more information about condor jobs, see this documentation.
```bash
# go on lxplus
cd /path/to/k4MLJetTagger/extras/submit_to_condor/
# MODIFY the python script to match your data, paths and job
python write_sub_JetObsWriter.py
# check if jetobswriter.sub looks ok
condor_submit jetobswriter.sub
```
- Plots: You can find some plotting scripts in `extras/plotting`. `jetobs_comparison.ipynb` is a notebook that plots the distribution of jet constituent observables used for tagging, retrieved with a steering file like `writeJetConstObs.py` that uses the Gaudi algorithm `JetObsWriter`. The notebook compares the distributions of two jet observables from different root files. The helper functions for this notebook are defined in `helper_jetobs.py`. `rocs_comparison.ipynb` is a notebook that compares two ROC curves for different flavors. The data used for the ROCs should come from two root files retrieved with a steering file like `writeJetTags.py` that uses the Gaudi algorithm `JetTagWriter`. The helper functions for this notebook are defined in `helper_rocs.py`. If you don't like Jupyter notebooks and just want to save the plots, use `save_rocs.py` and adapt the path where you want to save the plots in `helper_rocs.py`. A minimal ROC-curve sketch follows below this list.
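For a quick look without the notebooks, here is a minimal sketch of building one ROC curve from a root file produced by `writeJetTags.py`. The file, tree, and branch names below are assumptions; check `JetTagWriter` / `helper_rocs.py` for the names actually written.

```python
import uproot
from sklearn.metrics import roc_curve

# Hypothetical file/tree/branch names - adapt to what JetTagWriter actually writes.
tree = uproot.open("jettags.root")["tree"]
arrays = tree.arrays(["recojet_isB", "score_recojet_isB"], library="np")

truth = arrays["recojet_isB"].astype(int)   # MC truth: is this jet a b-jet?
score = arrays["score_recojet_isB"]         # tagger probability for the b hypothesis

fpr, tpr, _ = roc_curve(truth, score)
# Plot tpr (b-tagging efficiency) vs fpr (mistag rate), e.g. with matplotlib,
# typically with a log scale on the mistag-rate axis.
```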
Two further remarks on the current implementation:
- The magnetic field $B$ of the detector is needed at one point to calculate the helix parameters of the tracks with respect to the primary vertex. The magnetic field is hard-coded at the moment. It would be possible to retrieve it from the detector geometry (code already added; see the `Helpers` file), but for that, one must load the detector in the steering file, e.g. like this. As we use the v05 version of CLD at the moment, loading the detector is slow and not worth it just to set $B_z = 2.0$ (in my opinion). With a newer detector version (e.g. v07) this might be worth investigating.
- Currently, the network used was trained using the FCCAnalyses convention for naming the jet constituent observables. The naming is quite confusing; this is why I use my own convention that matches the key4hep convention. The class `VarMapper` in `Helpers` helps to switch between the two conventions. In the future, if retraining a model, I highly suggest switching to the convention used here when training the model, to get rid of the FCCAnalyses convention. To do so, train the network with a yaml file like `extras/config_for_weaver_training.yaml` and root files created with `writeJetConstObs.py`, which use the key4hep convention. To run inference here in key4hep, you then only need to modify the function `from_Jet_to_onnx_input` in `Helpers` where the `VarMapper` is used: remove it; there should be no need to convert conventions anymore.
- ONNX implementation in FCCAnalyses: strongly inspired this code.
- k4-project-template
If you find this code helpful and use it in your research, please cite:
```bibtex
@manual{aumiller_2024_4pcr6-r0d06,
  title  = {Jet Flavor Tagging Performance at FCC-ee},
  author = {Aumiller, Sara and
            Garcia, Dolores and
            Selvaggi, Michele},
  month  = nov,
  year   = 2024,
  doi    = {10.17181/4pcr6-r0d06},
  url    = {https://doi.org/10.17181/4pcr6-r0d06}
}
```
The performance of this jet tagger is discussed in Section 4.1 of this publication.