WhAM is a transformer-based audio-to-audio model designed to synthesize and analyze sperm whale codas. Based on VampNet, WhAM uses masked acoustic token modeling to capture temporal and spectral features of whale communication. WhAM generates codas from a given audio context, enabling three core capabilities:
- Acoustic translation: style-transfer of arbitrary audio prompts (e.g., human speech, noise) into the acoustic texture of sperm whale codas.
- Synthesis of novel "pseudocodas".
- Audio embeddings for downstream tasks such as social unit and spectral-feature ("vowel") classification.
See our NeurIPS 2025 publication for more details.
- Clone the repository:

  ```bash
  git clone https://github.com/Project-CETI/wham.git
  cd wham
  ```

- Set up the environment:

  ```bash
  conda create -n wham python=3.9
  conda activate wham
  ```

- Install dependencies:

  ```bash
  # Install the wham package
  pip install -e .

  # Install VampNet
  pip install -e ./vampnet

  # Install madmom
  pip install --no-build-isolation madmom

  # Install ffmpeg
  conda install -c conda-forge ffmpeg
  ```

- Download model weights: download the weights and extract them to `vampnet/models/` (a small sanity-check sketch follows this list).
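After installation, a quick check like the following can confirm that the weights landed where the interface config will look and that a GPU is visible. This is not a repository script, and the weight file names are assumptions based on the files mentioned later in this README; adjust them to whatever the archive actually contains.

```python
# Hypothetical sanity check (not a repo script): verify extracted weights and GPU.
from pathlib import Path

import torch

weights_dir = Path("vampnet/models")  # where this README says to extract the weights

# File names are an assumption; adjust to the contents of the downloaded archive.
for name in ["c2f.pth", "codec.pth"]:
    path = weights_dir / name
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")

print("CUDA available:", torch.cuda.is_available())
```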
To run WhAM locally and prompt it in your browser:
```bash
python vampnet/app.py --args.load conf/interface.yml --Interface.device cuda
```

This will provide you with a Gradio link to test WhAM on inputs of your choice.
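Prompts are ordinary audio files. If you want to trim or resample a clip before uploading it to the Gradio interface, a generic helper like the following works. This is not part of the repository, and the target sample rate and length are assumptions; adjust them to your clip. Requires `librosa` and `soundfile`.

```python
# Hypothetical prompt-preparation helper (not a repo script).
import librosa
import soundfile as sf

SRC = "my_prompt.wav"     # any audio you want translated into coda texture (placeholder)
DST = "my_prompt_short.wav"
TARGET_SR = 44100         # assumed sample rate
MAX_SECONDS = 10.0        # keep prompts short so generation stays responsive (assumed)

audio, sr = librosa.load(SRC, sr=TARGET_SR, mono=True)
audio = audio[: int(MAX_SECONDS * TARGET_SR)]
sf.write(DST, audio, TARGET_SR)
print(f"Wrote {DST}: {len(audio) / TARGET_SR:.1f}s at {TARGET_SR} Hz")
```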
The following steps are only needed if you want to fine-tune your own version of WhAM. First, obtain the original VampNet weights by following the instructions in the VampNet repository. Download `c2f.pth` and `codec.pth` and replace the weights you previously downloaded in `vampnet/models`.
Second, obtain data:
- Domain adaptation data (a quick sanity-check sketch follows this list):
  - Download audio samples from the WMMS 'Best Of' Cut and save them under `vampnet/training_data/domain_adaptation`.
  - Download audio samples from the BirdSet Dataset and save them under the same directory.
  - Finally, download all samples from the AudioSet Dataset with the label `Animal` and once again save them into the same directory.
- Species-specific fine-tuning data: (Forthcoming later in December.)
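Before launching training, it can help to confirm how much audio actually landed in the domain-adaptation folder. A minimal sketch, not a repository script; requires `soundfile`:

```python
# Hypothetical sanity check: count files and total duration in the training folder.
from pathlib import Path

import soundfile as sf

data_dir = Path("vampnet/training_data/domain_adaptation")

n_files, total_seconds = 0, 0.0
for path in sorted(data_dir.rglob("*")):
    if path.suffix.lower() not in {".wav", ".flac", ".ogg", ".mp3"}:
        continue
    info = sf.info(str(path))
    n_files += 1
    total_seconds += info.frames / info.samplerate

print(f"{n_files} files, {total_seconds / 3600:.2f} hours of audio in {data_dir}")
```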
With data in hand, navigate into vampnet and perform Domain Adaptation:
```bash
python vampnet/scripts/exp/fine_tune.py "training_data/domain_adaptation" domain_adapted && \
python vampnet/scripts/exp/train.py --args.load conf/generated/domain_adapted/coarse.yml && \
python vampnet/scripts/exp/train.py --args.load conf/generated/domain_adapted/c2f.yml
```

Then fine-tune the domain-adapted model. Create the config file with the command:

```bash
python vampnet/scripts/exp/fine_tune.py "training_data/species_specific_finetuning" fine-tuned
```

To select which weights to use as a checkpoint, change `fine_tune_checkpoint` in `conf/generated/fine-tuned/[c2f/coarse].yml` to `./runs/domain_adaptation/[coarse/c2f]/[checkpoint]/vampnets/weights.pth`. `[checkpoint]` can be `latest` to use the last saved checkpoint from the previous run, though it is recommended to manually verify generation quality across several checkpoints, since overtraining often degrades audio quality, especially with smaller datasets (a small helper sketch for listing checkpoints follows this section). After making that change, run:

```bash
python vampnet/scripts/exp/train.py --args.load conf/generated/fine-tuned/coarse.yml && \
python vampnet/scripts/exp/train.py --args.load conf/generated/fine-tuned/c2f.yml
```

After following these steps, you should be able to generate audio in the browser by running:

```bash
python app.py --args.load vampnet/conf/generated/fine-tuned/interface.yml
```

Note: the coarse and fine (c2f) weights can be trained separately if compute allows. In that case, run the two scripts individually:

```bash
python vampnet/scripts/exp/train.py --args.load conf/generated/[fine-tuned/domain_adapted]/coarse.yml
python vampnet/scripts/exp/train.py --args.load conf/generated/[fine-tuned/domain_adapted]/c2f.yml
```

After both runs are finished, ensure that both resulting weights are copied into the same copy of WhAM.
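To see which checkpoints a domain-adaptation run produced before pointing `fine_tune_checkpoint` at one of them, something like the following works. It is not a repository script; the run path mirrors the one given above.

```python
# Hypothetical helper: list saved checkpoints from the domain-adaptation run.
from pathlib import Path

runs_dir = Path("runs/domain_adaptation")  # matches the checkpoint path referenced above

for stage in ("coarse", "c2f"):
    checkpoints = sorted((runs_dir / stage).glob("*/vampnets/weights.pth"))
    print(f"{stage}: {len(checkpoints)} checkpoint(s)")
    for ckpt in checkpoints:
        print("  ", ckpt)
```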
To reproduce our results, first obtain the testing data:

- Marine Mammal Data: Download audio samples from the WMMS 'Best Of' Cut. Save them under `data/testing_data/marine_mammals/data/[SPECIES_NAME]`. `[SPECIES_NAME]` must match the species names found in `wham/generation/prompt_configs.py`.
- Sperm Whale Codas: (Forthcoming later in December.)
- Artificial beeps: generate them for the experiments by running:

  ```bash
  data/generate_beeps.sh
  ```
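For a sense of what these control stimuli look like, here is an illustrative, entirely hypothetical synthesizer of a short click train. The actual beeps used in the experiments come from `data/generate_beeps.sh`, not from this snippet; every parameter below is an assumption. Requires `numpy` and `soundfile`.

```python
# Illustrative only: synthesize a short train of tone bursts and write it to a WAV file.
import numpy as np
import soundfile as sf

SR = 44100          # sample rate (assumed)
N_CLICKS = 5        # number of bursts (assumed)
INTERVAL = 0.2      # seconds between burst onsets (assumed)
BURST_LEN = 0.01    # 10 ms burst (assumed)
FREQ = 2000.0       # burst frequency in Hz (assumed)

t = np.arange(int(SR * BURST_LEN)) / SR
burst = (0.8 * np.sin(2 * np.pi * FREQ * t) * np.hanning(t.size)).astype(np.float32)

signal = np.zeros(int(SR * (N_CLICKS * INTERVAL + BURST_LEN)), dtype=np.float32)
for i in range(N_CLICKS):
    start = int(i * INTERVAL * SR)
    signal[start : start + burst.size] += burst

sf.write("synthetic_beeps.wav", signal, SR)
```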
Note: Access to the DSWP+CETI annotated dataset is required to reproduce all results; as of the time of publication, only part of this data is publicly available. Still, we include the following code, as it may be useful for researchers who can benefit from our evaluation pipeline.
To reproduce Table 1 (Classification Accuracies) and Figure 7 (Ablation Study):
Table 1 Results:
```bash
cd wham/embedding
./downstream_tasks.sh
```

- Runs all downstream classification tasks.
- Baselines: run once.
- Models (AVES, VampNet): run over 3 random seeds; the mean and standard deviation are reported (a minimal sketch of this protocol follows).
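For readers who want to probe their own exported embeddings outside of `downstream_tasks.sh`, this is a minimal sketch of the protocol described above. The file names `embeddings.npy` and `labels.npy` are placeholders, not files produced by the repository; requires `numpy` and `scikit-learn`.

```python
# Minimal sketch: linear probe on precomputed embeddings, averaged over 3 seeds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("embeddings.npy")  # shape: (n_samples, embedding_dim), placeholder file
y = np.load("labels.npy")      # e.g., social unit labels, placeholder file

accuracies = []
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracies.append(clf.score(X_te, y_te))

print(f"accuracy: {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```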
Figure 7 Results (Ablation):
```bash
cd wham/embedding
./downstream_ablation.sh
```

- Outputs accuracy scores for the ablation variants (averaged across 3 seeds, with error bars).
Figure 12: Fréchet Audio Distance (FAD) Scores. Calculate the distance between WhAM's generated results and real codas:

```bash
# Calculate for all species
bash wham/generation/eval/calculate_FAD.sh

# Calculate for a single species
bash wham/generation/eval/calculate_FAD.sh [species_name]
```

- Runtime: ~3 hours on an NVIDIA A10 GPU.
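For reference, the quantity these scripts report is the Fréchet distance between Gaussians fit to two sets of embeddings. A minimal, self-contained sketch of that computation (the `.npy` file names are placeholders, not outputs of `calculate_FAD.sh`; requires `numpy` and `scipy`):

```python
# Minimal sketch of the Fréchet distance between two embedding sets:
# ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
import numpy as np
from scipy import linalg

def frechet_distance(real: np.ndarray, generated: np.ndarray) -> float:
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    sigma_r = np.cov(real, rowvar=False)
    sigma_g = np.cov(generated, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))

real = np.load("real_coda_embeddings.npy")        # shape: (n_real, dim), placeholder
generated = np.load("generated_embeddings.npy")   # shape: (n_gen, dim), placeholder
print("FAD:", frechet_distance(real, generated))
```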
Figure 3: FAD with Custom/BirdNET Embeddings. To compare against other embeddings:

- Convert your `.wav` files to `.npy` embeddings (see the sketch below).
- Place raw coda embeddings in: `data/testing_data/coda_embeddings`
- Place comparison embeddings in subfolders within: `data/testing_data/comparison_embeddings`
- Run:

  ```bash
  python wham/generation/eval/calculate_custom_fad.py
  ```
For BirdNET embeddings, refer to the official repo.
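The conversion step is model-specific, so the repository does not prescribe it. Below is a hedged sketch of the expected shape of that step, assuming one `.npy` file per clip, with `embed_clip` standing in for whatever encoder you are comparing; it is not an API provided by this repository. Requires `librosa` and `numpy`.

```python
# Hypothetical ".wav -> .npy" conversion: one embedding file per audio clip.
from pathlib import Path

import librosa
import numpy as np

in_dir = Path("my_codas")                            # your .wav clips (placeholder)
out_dir = Path("data/testing_data/coda_embeddings")  # where calculate_custom_fad.py reads from
out_dir.mkdir(parents=True, exist_ok=True)

def embed_clip(audio: np.ndarray, sr: int) -> np.ndarray:
    """Placeholder: replace with your embedding model's forward pass."""
    raise NotImplementedError

for wav in sorted(in_dir.glob("*.wav")):
    audio, sr = librosa.load(wav, sr=None, mono=True)
    np.save(out_dir / f"{wav.stem}.npy", embed_clip(audio, sr))
```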
Table 2: Embedding Type Ablation. Calculate distances between raw codas, denoised versions, and noise profiles:

```bash
bash wham/generation/eval/FAD_ablation.sh
```

- Prerequisites: ensure `data/testing_data/ablation/noise` and `data/testing_data/ablation/denoised` are populated.
- Runtime: ~1.5 hours on an NVIDIA A10 GPU.
Figure 13: Tokenizer Reconstruction. Test the mean squared reconstruction error:
```bash
bash wham/generation/eval/evaluate_tokenizer.sh
```
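The metric itself is just a mean squared error between an original waveform and its codec reconstruction. A minimal sketch (the file names are placeholders; the reconstruction would come from your tokenizer round trip, not from this snippet; requires `librosa` and `numpy`):

```python
# Minimal sketch: mean squared reconstruction error between two waveforms.
import librosa
import numpy as np

original, sr = librosa.load("original.wav", sr=None, mono=True)
recon, _ = librosa.load("reconstruction.wav", sr=sr, mono=True)

n = min(original.size, recon.size)  # guard against small length mismatches
mse = float(np.mean((original[:n] - recon[:n]) ** 2))
print(f"reconstruction MSE: {mse:.6f}")
```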
Please use the following citation if you use this code, model, or data.

```bibtex
@inproceedings{wham2025,
  title={Towards A Translative Model of Sperm Whale Vocalization},
  author={Orr Paradise and Pranav Muralikrishnan and Liangyuan Chen and Hugo Flores Garcia and Bryan Pardo and Roee Diamant and David F. Gruber and Shane Gero and Shafi Goldwasser},
  booktitle={Advances in Neural Information Processing Systems 39: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025, San Diego, CA, USA},
  year={2025}
}
```