Introducing the Number Token Loss (NTL): a regression-like loss on number tokens that improves language models' numerical reasoning, achieving better performance on math tasks without computational overhead.
The Number Token Loss (NTL) is a novel approach to enhancing language models' numerical reasoning capabilities. Unlike traditional cross-entropy loss, which treats all incorrect predictions equally, NTL incorporates the numerical proximity of tokens, providing regression-like behavior at the token level.
Cross entropy is nominal-scale and thus assigns equal loss to all incorrect predictions. This makes sense for normal tokens but not for number tokens: if the ground truth is `4`, predicting `3` or `9` should not give equal loss. NTL fixes this!
For all number tokens, NTL increases with the distance from the ground truth, just like a regression loss. Yet it needs no extra head: the regression-like loss is computed directly on the token head.
We propose two schemes:
NTL-WAS: Wasserstein-1 distance between the predicted and the one-hot number distributions (see plot above).
NTL-MSE: squared error between the ground truth and the expected numeric value of the prediction, computed as a dot product of predicted probabilities and token values (most intuitive, but has some undesired local minima). Both schemes are sketched in code below.
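To make this concrete, here is a minimal, self-contained sketch of both schemes for digit-level tokenization. It is an illustration rather than the repository's reference implementation; `digit_token_ids` and `digit_values` are assumed to map the ten digit tokens to their numeric values (one way to build them is sketched further below).

```python
# Minimal sketch of NTL-WAS and NTL-MSE (illustrative, not the repo's reference code).
# Assumes digit-level tokenization: digit_token_ids (LongTensor, 10) and
# digit_values (FloatTensor, 10, sorted with unit spacing 0..9).
import torch
import torch.nn.functional as F

def number_token_loss(logits, labels, digit_token_ids, digit_values, scheme="was"):
    """logits: (batch, seq, vocab); labels: (batch, seq), -100 for ignored positions."""
    # Only positions whose ground-truth token is a number token contribute.
    is_number = torch.isin(labels, digit_token_ids)
    if not is_number.any():
        return logits.new_zeros(())

    # Predicted probability mass over the number tokens, renormalized.
    p = F.softmax(logits[is_number][:, digit_token_ids], dim=-1)           # (n, 10)

    # One-hot distribution and numeric value of the ground-truth token.
    target = (labels[is_number].unsqueeze(-1) == digit_token_ids).float()  # (n, 10)
    y = target @ digit_values                                              # (n,)

    if scheme == "mse":
        # NTL-MSE: squared error between expected numeric value and ground truth.
        y_hat = p @ digit_values
        return ((y_hat - y) ** 2).mean()
    # NTL-WAS: Wasserstein-1 between p and the one-hot target; for unit-spaced
    # values this equals the L1 distance between the two CDFs.
    cdf_diff = torch.cumsum(p - target, dim=-1)
    return cdf_diff.abs().sum(dim=-1).mean()
```

The resulting term is simply added to the cross-entropy loss with a weight λ (exposed as `number_token_loss_weight` in the configs below).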
- Model-Agnostic: NTL is just a loss, applicable to any LM (e.g., Transformer, Mamba) and any architecture (encoder-decoder, decoder-only).
- Plug-and-Play: NTL requires only a mapping from tokens to numeric values and works with digit-level and multi-digit tokenizations (see the sketch after this list).
- No computational overhead: NTL adds only ~1% compute time to the loss calculation, which is negligible over a full training step.
- Consistently improves performance: NTL outperforms plain cross entropy across multiple architectures and math benchmarks.
- Performs true regression: on regression tasks, an LM head trained with NTL matches a dedicated regression head.
- Scales to large models: even Granite 3.2 2B and T5-3B benefit substantially from NTL on math tasks like GSM8K.
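As a concrete illustration of the mapping mentioned in the plug-and-play bullet, one way to derive `digit_token_ids` and `digit_values` from a Hugging Face tokenizer could look like this (an assumption for the digit-level case, not a utility shipped with this repository):

```python
# Sketch: derive the token-id -> numeric-value mapping NTL needs from any
# Hugging Face tokenizer (digit-level case; not the repository's own utility).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # any tokenizer with digit tokens

value_to_id = {}
for token, token_id in tokenizer.get_vocab().items():
    stripped = token.lstrip("▁Ġ ")  # drop SentencePiece/BPE word-start markers
    if len(stripped) == 1 and stripped.isdigit() and float(stripped) not in value_to_id:
        value_to_id[float(stripped)] = token_id

values = sorted(value_to_id)  # 0.0 ... 9.0, sorted so the CDF trick above applies
digit_values = torch.tensor(values)
digit_token_ids = torch.tensor([value_to_id[v] for v in values])
```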
- Paper: Regress, Don't Guess - A Regression-like Loss on Number Tokens for Language Models
- Project Page: https://tum-ai.github.io/number-token-loss/
- Interactive Demo: https://huggingface.co/spaces/jannisborn/NumberTokenLoss
- NeurIPS 2024 MathAI Workshop Poster: View Poster
- Lightweight Integration Example: loss_integration.ipynb - Easy integration into your own models
- Python 3.9 or higher
- CUDA-compatible GPU (recommended)
- Clone the repository
  git clone https://github.com/tum-ai/number-token-loss.git
  cd number-token-loss
- Create and activate the environment
  conda create -n ntl python=3.10
  conda activate ntl
- Install dependencies
  pip install -r requirements.txt
  pip install -e .
- Configure Weights & Biases
  wandb login
  export WANDB_ENTITY='<your_entity>'
  export WANDB_PROJECT='<your_project_name>'
For a minimal working example of how to integrate NTL into your existing Hugging Face model, check out our lightweight integration notebook. It demonstrates:
- How to add NTL to any decoder-only language model (e.g., LLaMA, GPT)
- Custom trainer implementation with CE+NTL loss
- Complete working example with TinyLLaMA
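For orientation only, a CE + NTL trainer along the lines described in the notebook might look like the sketch below; it relies on the hypothetical `number_token_loss`, `digit_token_ids`, and `digit_values` from the earlier sketches, so refer to loss_integration.ipynb for the actual, tested integration.

```python
# Sketch of a custom Hugging Face Trainer that adds NTL on top of the model's
# own cross-entropy loss (decoder-only case). Uses the hypothetical helpers above.
from transformers import Trainer

class NTLTrainer(Trainer):
    def __init__(self, *args, ntl_weight=0.3, digit_token_ids=None, digit_values=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.ntl_weight = ntl_weight
        self.digit_token_ids = digit_token_ids
        self.digit_values = digit_values

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(**inputs)          # the model already computes the CE loss
        labels = inputs["labels"]
        # Causal LMs predict the next token, so align logits and labels before NTL.
        shift_logits = outputs.logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        ntl = number_token_loss(
            shift_logits,
            shift_labels,
            self.digit_token_ids.to(shift_logits.device),
            self.digit_values.to(shift_logits.device),
            scheme="was",
        )
        loss = outputs.loss + self.ntl_weight * ntl   # CE + lambda * NTL
        return (loss, outputs) if return_outputs else loss
```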
The main training script uses Hydra for configuration management:
python src/ntl/run_language_modeling.py \
dataset_args=mathematics_dataset \
model_args=vanilla_t5_ntl \
training_args=train
- Datasets: `gsm8k`, `mathematics_dataset`, `arithmetic`, `rjokes`, `multirc`
- Models: `vanilla_t5`, `vanilla_t5_ntl`, `rt`, `rt_ntl`, `xval`
- Training: `train`, `eval`
Override default parameters via command line:
python src/ntl/run_language_modeling.py \
model_args=vanilla_t5_ntl \
training_args=train \
training_args.per_device_train_batch_size=8 \
model_args.number_token_loss_weight=0.3
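Here `number_token_loss_weight` is the weight λ of the NTL term. Assuming the usual weighted sum of the two losses, the objective being minimized is

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \cdot \mathcal{L}_{\text{NTL}},$$

so the override above trains with λ = 0.3.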
Download the datasets used in the experiments:
- Mathematics Dataset from DeepMind:
- Get the data from https://console.cloud.google.com/storage/browser/mathematics-dataset;tab=objects?pli=1&prefix=&forceOnObjectsSortingFiltering=false
- Execute create_data_splits.py
- Put the .txt files under data/mathematics_dataset-v1.0/
- Ablation Studies on part of the Mathematics Dataset
- Execute arith_create_splits.py
- The resulting files (arithmetic_train.txt, arithmetic_val.txt, arithmetic_test_interpolate.txt, arithmetic_test_extrapolate.txt) should be under data/mathematics_dataset-v1.0/
- NTL on a regression task (rjokes dataset)
- Download data from https://github.com/orionw/rJokesData
- Put train.tsv, dev.tsv and test.tsv under data/rjokes-dataset/data
- Execute generate_dataset.py
- MultiRC dataset
- Download the MultiRC dataset from https://dl.fbaipublicfiles.com/glue/superglue/data/v2/MultiRC.zip
- Put the train.jsonl, val.jsonl and test.jsonl files under data/multirc/data
- Execute generate_dataset.py
- The generated files should be under data/multirc/data/preprocessed
| Model | Configuration | Command |
|---|---|---|
| T5 Baseline | Standard Cross-Entropy | `python src/ntl/run_language_modeling.py run_specific_config@_global_=mathematics_dataset_run model_args=vanilla_t5 dataset_args=mathematics_dataset` |
| T5 + NTL-MSE | MSE-based NTL | `python src/ntl/run_language_modeling.py run_specific_config@_global_=mathematics_dataset_run model_args=vanilla_t5_ntl dataset_args=mathematics_dataset` |
| T5 + NTL-WAS | Wasserstein-based NTL | `python src/ntl/run_language_modeling.py run_specific_config@_global_=mathematics_dataset_run model_args=vanilla_t5_ntl model_args.number_token_loss_with_wasserstein=true dataset_args=mathematics_dataset` |
Comprehensive ablation studies on arithmetic subsets:
View Ablation Commands
NTL-MSE with Different Weights:
# λ = 0.3
python src/ntl/run_language_modeling.py dataset_args=arithmetic model_args=vanilla_t5_ntl model_args.number_token_loss_with_wasserstein=false model_args.number_token_loss_weight=0.3 training_args.special_name=NTL-MSE_Lambda0.3
# λ = 0.8
python src/ntl/run_language_modeling.py dataset_args=arithmetic model_args=vanilla_t5_ntl model_args.number_token_loss_with_wasserstein=false model_args.number_token_loss_weight=0.8 training_args.special_name=NTL-MSE_Lambda0.8
# λ = 2.0
python src/ntl/run_language_modeling.py dataset_args=arithmetic model_args=vanilla_t5_ntl model_args.number_token_loss_with_wasserstein=false model_args.number_token_loss_weight=2.0 training_args.special_name=NTL-MSE_Lambda2.0
Alternative Loss Functions:
# NTL-MAE
python src/ntl/run_language_modeling.py dataset_args=arithmetic model_args=vanilla_t5_ntl +model_args.number_token_loss_function=mae training_args.special_name=NTL-MAE_Lambda0.3
# NTL-Huber
python src/ntl/run_language_modeling.py dataset_args=arithmetic model_args=vanilla_t5_ntl +model_args.number_token_loss_function=huber training_args.special_name=NTL-Huber_Lambda0.3
Scale NTL to 3B parameter models:
# T5-3B Baseline
python src/ntl/run_language_modeling.py run_specific_config@_global_=gsm8k_runs model_args=vanilla_t5 dataset_args=gsm8k
# T5-3B + NTL-WAS
python src/ntl/run_language_modeling.py run_specific_config@_global_=gsm8k_runs model_args=vanilla_t5_ntl dataset_args=gsm8k model_args.number_token_loss_weight=0.3
python src/ntl/run_language_modeling.py \
model_args=vanilla_t5 \
training_args=train \
run_specific_config@_global_=debug_config
python src/ntl/run_language_modeling.py \
model_args=vanilla_t5_ntl \
training_args=eval \
model_args.model_name_or_path=<path_to_checkpoint>
If you find this work useful, please cite our paper:
@inproceedings{zausinger2025regress,
title = {Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models},
author = {Jonas Zausinger and Lars Pennig and Anamarija Kozina and Sean Sdahl
and Julian Sikora and Adrian Dendorfer and Timofey Kuznetsov
and Mohamad Hagog and Nina Wiedemann and Kacper Chlodny
and Vincent Limbach and Anna Ketteler and Thorben Prein
and Vishwa Mohan Singh and Michael Danziger and Jannis Born},
booktitle = {Proc. of the 42nd International Conference on Machine Learning (ICML)},
year = {2025},
url = {https://tum-ai.github.io/number-token-loss/}
}
This project is licensed under the MIT License - see the LICENSE file for details.