Repository of CodeRosetta (Paper, Website).
CodeRosetta is an encoder-decoder transformer model designed to translate code between C++ and CUDA, and between Fortran and C++.
- Extendable: CodeRosetta can be easily extended to support other programming languages and parallel programming libraries.
- Customized training objectives: CodeRosetta leverages tailored pretraining and training objectives to enhance translation between programming languages and parallel programming interfaces.
- Supports CUDA and Fortran: The current version of CodeRosetta supports translation from C++ to CUDA and from Fortran to C++.
If you're more of a paper person, check out our paper:
⚠️ Work in Progress: This is a public version of our code, published to foster further research efforts. Please note that the current version may not fully reproduce our work, as a complete end-to-end validation has not been conducted. Thank you!
CodeRosetta is built on top of Transformers and Accelerate.
pip install transformers[sentencepiece] accelerate datasets evaluate sacrebleu seqeval codebleu tree-sitter-cpp==0.21.0
CodeRosetta requires 6 files for translation from language A to language B, in the following format:
- `lang_a.mono.train.format`: Train set of the first language.
- `lang_b.mono.train.format`: Train set of the second language.
- `lang_a.para.valid.format`: Validation set of the first language.
- `lang_b.para.valid.format`: Validation set of the second language.
- `lang_a.para.test.format`: Test set of the first language.
- `lang_b.para.test.format`: Test set of the second language.

`mono` files are unpaired; there is no relation between them. `para` files are parallel/paired sets. For instance, the first sample in `lang_a.para.valid.format` should correspond to the first sample in `lang_b.para.valid.format`. The same applies to the test set.
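Because the parallel files are matched by line position, it is easy to break the pairing when regenerating data. Below is a minimal sanity-check sketch in Python; the helper names, the one-sample-per-line assumption, and the `tok` extension are illustrative placeholders, not part of the CodeRosetta codebase:

```python
from pathlib import Path

def count_samples(path: Path) -> int:
    # One sample per line is assumed here; adjust if your format differs.
    with path.open(encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(lang_a: str, lang_b: str, split: str, ext: str, data_dir: str = ".") -> None:
    # e.g. cpp.para.valid.tok paired with cuda.para.valid.tok
    file_a = Path(data_dir) / f"{lang_a}.para.{split}.{ext}"
    file_b = Path(data_dir) / f"{lang_b}.para.{split}.{ext}"
    n_a, n_b = count_samples(file_a), count_samples(file_b)
    assert n_a == n_b, f"{file_a.name} has {n_a} samples but {file_b.name} has {n_b}"
    print(f"{split}: {n_a} paired samples")

if __name__ == "__main__":
    # Check that each parallel split lines up sample-for-sample.
    for split in ("valid", "test"):
        check_parallel("cpp", "cuda", split, ext="tok")
```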
C++ to CUDA:
This dataset is retrieved from the BabelTower authors.
The test and validation sets are located here.
We have uploaded the whole dataset here.
Fortran To C++:
The train set of this dataset is retrieved from StackV2.
The paired train and test sets are retrieved from HPC_Fortran_CPP.
We have uploaded the whole dataset here.
For C++ to CUDA training:
If you've got one GPU:
python main.py --langs cpp cuda --train_mode mlm aer bt ft eval --dataset_format tok
If you have multiple GPUs, use `accelerate launch`:
accelerate launch --num_processes=2 main.py --langs cpp cuda --train_mode mlm aer bt ft eval --dataset_format tok
`main.py` has a set of configurable flags. Please run `python main.py -h` to see the list of all available flags.
For Fortran to C++ training:
python main.py --langs fortran cpp --train_mode mlm bt ft eval --dataset_format jsonl
To leverage multi-GPU training, use `accelerate launch`.
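For example, the Fortran to C++ run above can be launched on two GPUs following the same pattern as the C++ to CUDA example (adjust `--num_processes` to match your GPU count):

accelerate launch --num_processes=2 main.py --langs fortran cpp --train_mode mlm bt ft eval --dataset_format jsonl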
- `--whole_word_masking_mlm`: Enables whole word masking for MLM.
- Action: `store_true`
- Default: `True`
- `--train_mode`: List of training modes to use.
- Choices: `["mlm", "aer", "bt", "ft", "eval"]`
- `--only_bt`: Disables DAE and enables only BT.
- Action: `store_true`
- `--only_dae`: Disables BT and enables only DAE.
- Action: `store_true`
- `--wwm_probability`: Whole word masking probability.
- Type: `float`
- Default: `0.15`
- `--shuffle_batch_order`: Flag to shuffle batch order for seq2seq training.
- Action: `store_true`
- Default: `False`
- `--shuffle_within_batches`: Flag to shuffle samples within batches.
- Action: `store_true`
- Default: `True`
- `--learning_rate_mlm`: Learning rate for MLM training.
- Type: `float`
- Default: `0.00008`
- `--learning_rate_aer`: Learning rate for AER training.
- Type: `float`
- Default: `0.000005`
- `--learning_rate_bt`: Learning rate for BT training.
- Type: `float`
- Default: `0.00005`
- `--learning_rate_ft`: Learning rate for FT training.
- Type: `float`
- Default: `0.00004`
- `--weight_decay`: Weight decay factor.
- Type: `float`
- Default: `0.01`
- `--num_warmup_steps`: Number of warmup steps.
- Type: `int`
- Default: `0`
- `--percent_warmup_steps`: Percentage of warmup steps.
- Type: `float`
- Default: `0.01`
- `--scheduler_type`: Scheduler type for learning rate adjustment.
- Choices: `["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup", "inverse_sqrt", "reduce_lr_on_plateau"]`
- Default: `inverse_sqrt`
- `--chunk_size`: Maximum length of each sample.
- Type: `int`
- Default: `512`
- `--enable_early_stopping`: Enables early stopping.
- Action: `store_true`
- `--early_stopping_threshold`: Threshold for early stopping.
- Type: `int`
- Default: `0`
- `--early_stopping_patience`: Number of steps with no improvement before stopping early.
- Type: `int`
- Default: `5`
- `--batch_size`: Batch size for training.
- Type: `int`
- Default: `32`
- `--max_gpu_batch_size`: Maximum GPU batch size.
- Type: `int`
- Default: `16`
- `--num_train_epochs_mlm`: Number of training epochs for MLM.
- Type: `int`
- Default: `100`
- `--num_train_epochs_aer`: Number of training epochs for AER.
- Type: `int`
- Default: `10`
- `--num_train_epochs_bt`: Number of training epochs for BT.
- Type: `int`
- Default: `20`
- `--num_train_epochs_ft`: Number of training epochs for FT.
- Type: `int`
- Default: `10`
- `--max_steps`: Maximum number of steps for training (overrides num_train_epochs).
- Type: `int`
- Default: `0`
- `--evaluation_strategy`: Evaluation strategy during training.
- Type: `str`
- Default: `epoch`
- `--dae_warmup_steps`: Number of steps for DAE warmup phase.
- Type: `int`
- Default: `0`
- `--dae_word_mask`: Word masking ratio for DAE.
- Type: `float`
- Default: `0.15`
- `--dae_word_dropout`: Word dropout ratio for DAE.
- Type: `float`
- Default: `0.20`
- `--dae_word_replacement`: Word replacement ratio for DAE.
- Type: `float`
- Default: `0`
- `--dae_word_insertion`: Word insertion ratio for DAE.
- Type: `float`
- Default: `0.15`
- `--dae_word_shuffle`: Word shuffle ratio for DAE.
- Type: `float`
- Default: `1`
- `--ratio_steps_update`: Defines when DAE ratio should update (steps, epoch, half_epoch, or quarter_epoch).
- Type: `str`
- Default: `None`
- `--ratio_percent_update`: Percentage of changes for updating DAE ratios.
- Type: `float`
- Default: `0.05`
- `--ratio_percent_update_dropout`: Percentage of changes for updating the dropout ratio in DAE.
- Type: `float`
- Default: `0.025`
- `--max_corruption_percent`: Maximum allowed sentence corruption for DAE.
- Type: `float`
- Default: `0.6`
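As an illustration of how these flags compose, here is a hypothetical C++ to CUDA run that enables early stopping, lowers the batch size, and switches the scheduler, using only flags documented above:

python main.py --langs cpp cuda --train_mode mlm aer bt ft eval --dataset_format tok --enable_early_stopping --early_stopping_patience 3 --batch_size 16 --scheduler_type cosine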
We release the pretrained C++ to CUDA and Fortran to C++ models.
The C++ to CUDA model and tokenizer are located here.
The Fortran to C++ model and tokenizer are located here.
To run the evaluation, ensure that the model, tokenizer, and data are placed in their respective directories.
For C++ to CUDA evaluation:
python main.py --langs cpp cuda --train_mode eval --dataset_format tok
For Fortran to C++ evaluation:
python main.py --langs fortran cpp --train_mode eval --dataset_format jsonl
For inference with multiple GPUs, use `accelerate launch`.
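For example, following the same pattern as the training commands (set `--num_processes` to the number of available GPUs):

accelerate launch --num_processes=2 main.py --langs cpp cuda --train_mode eval --dataset_format tok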
In order to add new languages, ensure `src/util/langs_keyword.py` is updated and has information about the reserved keywords for the new languages. Depending on the preprocessing and filtering steps required for the new language, please check the dataset Python files in `src/dataset` and create your own `dataset.py` to reflect the preprocessing required for your datasets.
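The exact structure of `src/util/langs_keyword.py` is defined by the repository, but conceptually each supported language needs its reserved keywords enumerated. The snippet below is only a hypothetical illustration of the kind of entry a new language (here, Julia) would require; it does not reflect the file's actual layout, so adapt it to the existing structure:

```python
# Hypothetical example: the real langs_keyword.py may use a different
# variable name and container type. The point is that every supported
# language needs its reserved keywords listed so that preprocessing and
# tokenization can treat them specially.
JULIA_KEYWORDS = {
    "begin", "end", "function", "return", "if", "elseif", "else",
    "for", "while", "break", "continue", "struct", "mutable",
    "module", "using", "import", "macro", "quote", "try", "catch",
    "finally", "let", "const", "global", "local", "do", "true", "false",
}
```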
@inproceedings{coderosetta:neurips:2024,
title = {CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming},
author = {TehraniJamsaz, Ali and Bhattacharjee, Arijit and Chen, Le and Ahmed, Nesreen K and Yazdanbakhsh, Amir and Jannesari, Ali},
booktitle = {NeurIPS},
year = {2024},
}