Acoustic Descriptor Extraction tool for processing sound on High Performance Computing clusters.
SoundADE recursively identifies audio files within a directory structure and reads them for processing by the pipeline. The pipeline then computes a set of acoustic descriptors and extracts BirdNET species detection probabilities. If site-level information such as latitude, longitude and timezone is provided in a separate sites file, it is used to extract solar and weather data.
Install the uv package, then install the locked dependencies. Once that is done you can run the test suite, and run the different parts of the pipeline either independently or as a whole using the CLI.
uv sync --locked
uv run pytest -v
uv run main.py --help
Install the docker-ce package. You can then build and run the full pipeline using:
docker compose up --build
Some HPC admins don't give users the sudo privileges required to run Docker. The project can instead be built using Singularity, which doesn't require these privileges and is usually installed on HPC systems.
singularity build --fakeroot app.sif app.def
On a machine where you have sudo privileges, fakeroot isn't required and you can run:
singularity build --ignore-fakeroot-command -F app.sif app.def
If you are using a SLURM-based HPC, scripts to schedule these builds can be found under the slurm directory.
As with docker, you must rebuild every time you make changes to the source code.
If you need to inspect your container once it's built, run:
singularity exec --env-file .env --bind $DATA_PATH:/data app.sif /bin/sh
Run the whole pipeline, specifying the relevant option for your setup (docker / singularity / local python environment):
./run.sh
Create your own custom dataset config file (see examples in ./config). This is where you should specify FFT and BirdNET parameters. The pipeline sets default FFT and BirdNET parameters if none are specified. However, you must specify, in the form of a regular expression, how the pipeline should extract information such as the timestamp and site-level information from the file paths. You can also specify additional site-level information; see below for more details.
.env contains run time settings for the pipeline:
- DATA_PATH: the path to the audio data you want processed. N.B. Your site_locations.parquet file must also be in this location.
- SAVE_PATH: the path to where you want the results saved. Your locations.parquet file must be in this location. For more details on the locations.parquet file, see below.
- CORES: the number of cores (local) or jobs (HPC) to be deployed to process your data.
- MEM_PER_CPU: the integer number of gigabytes of RAM deployed per core or job.
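To make these settings concrete, here is a minimal Python sketch that reads KEY=VALUE lines in the style of a .env file and checks the four variables. It is an illustration only, not how SoundADE itself loads the file; the helper names parse_env and check_settings and the sample values are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("DATA_PATH", "SAVE_PATH", "CORES", "MEM_PER_CPU")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def check_settings(settings: dict[str, str]) -> dict:
    """Check that all four keys exist and the numeric ones are positive integers."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    cores = int(settings["CORES"])
    mem_per_cpu = int(settings["MEM_PER_CPU"])
    if cores < 1 or mem_per_cpu < 1:
        raise ValueError("CORES and MEM_PER_CPU must be positive integers")
    return {
        "DATA_PATH": Path(settings["DATA_PATH"]),
        "SAVE_PATH": Path(settings["SAVE_PATH"]),
        "CORES": cores,
        "MEM_PER_CPU": mem_per_cpu,
    }


# Hypothetical example values, not defaults shipped with the project.
EXAMPLE_ENV = """
DATA_PATH=/path/to/audio
SAVE_PATH=/path/to/results
CORES=4
MEM_PER_CPU=8
"""

checked = check_settings(parse_env(EXAMPLE_ENV))
# MEM_PER_CPU is per core or job, so the total request is the product.
print(checked["CORES"] * checked["MEM_PER_CPU"], "GB requested in total")
```

Running a check like this before submitting a job can catch a missing variable or a non-integer MEM_PER_CPU early, before any audio is read.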
A regular expression is required to find and extract audio files along with their location information. See ./config for examples.
A file specifying site-specific information is required for the pipeline to run.
| | site_id | site_name | latitude | longitude | timezone |
|---|---|---|---|---|---|
| string | integer | string | float32 | float32 | string |
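As an illustration of the columns listed above, the following Python sketch checks a single site record held as a plain dictionary. It is a sketch under stated assumptions, not the pipeline's own validation: the helper check_site_row is hypothetical, the coordinates and timezone value are placeholders made up for the example, and the only format assumed for timezone is a non-empty string.

```python
REQUIRED_COLUMNS = ("site_id", "site_name", "latitude", "longitude", "timezone")


def check_site_row(row: dict) -> list[str]:
    """Return a list of problems found in one site record (empty if none)."""
    problems = [
        f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in row
    ]
    if problems:
        return problems
    # site_name is expected to look like a path of site levels.
    if not str(row["site_name"]).startswith("/"):
        problems.append("site_name should look like /<level_1>/<level_2>/...")
    if not -90.0 <= float(row["latitude"]) <= 90.0:
        problems.append("latitude must be within [-90, 90]")
    if not -180.0 <= float(row["longitude"]) <= 180.0:
        problems.append("longitude must be within [-180, 180]")
    if not str(row["timezone"]).strip():
        problems.append("timezone must be a non-empty string")
    return problems


# Placeholder record: the coordinates and timezone are illustrative only.
example_row = {
    "site_id": 1,
    "site_name": "/EC/TE/9",
    "latitude": -1.0,
    "longitude": -78.0,
    "timezone": "America/Guayaquil",
}

print(check_site_row(example_row) or "no problems found")
```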
You need to ensure the site_name field matches the regular expression defined in the data class.
For example, consider the following folder structure:
└── <site_level_1>
    └── <site_level_2>
        ├── <site_level_3>
        │   ├── <timestamp>.wav
        │   ├── ....
        │   └── <timestamp>.wav
        └── <site_level_3>
            ├── <timestamp>.wav
            ├── ....
            └── <timestamp>.wav
The site_name variable should match /<site_level_1>/<site_level_2>/<site_level_3>.
└── EC
    └── TE
        ├── 9
        │   ├── 20150619_0630.wav
        │   ├── ....
        │   └── 20150621_0317.wav
        └── 10
            ├── 20150619_0630.wav
            ├── ....
            └── 20150621_0317.wav
In this case <site_level_1> is the country (EC = Ecuador), <site_level_2> is a site identifier (TE), and <site_level_3> is a recorder ID number. The site_name column must therefore contain the records '/EC/TE/9' and '/EC/TE/10'. The depth of the site hierarchy is arbitrary: you can define as many levels as you like in the regular expression used to discover audio files.
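The mapping from a file path to a site_name and a timestamp can be sketched in Python as follows. The pattern, the group names site_name and timestamp, the /data root and the helper parse_audio_path are assumptions made for this example (the patterns SoundADE actually uses are in ./config), and the timestamp format %Y%m%d_%H%M is inferred from the file names above.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Hypothetical pattern for the example layout: three site levels, then a
# <YYYYMMDD_HHMM>.wav file name. Add or remove levels to suit your data.
AUDIO_PATH_PATTERN = re.compile(
    r"(?P<site_name>/[^/]+/[^/]+/[^/]+)/(?P<timestamp>\d{8}_\d{4})\.wav$"
)


def parse_audio_path(path: str, root: str) -> tuple[str, datetime]:
    """Return (site_name, timestamp) for an audio file located under root."""
    relative = "/" + PurePosixPath(path).relative_to(root).as_posix()
    match = AUDIO_PATH_PATTERN.fullmatch(relative)
    if match is None:
        raise ValueError(f"path does not match the expected layout: {path}")
    timestamp = datetime.strptime(match["timestamp"], "%Y%m%d_%H%M")
    return match["site_name"], timestamp


# Values that would appear in the site_name column of the sites file.
known_sites = {"/EC/TE/9", "/EC/TE/10"}

site_name, recorded_at = parse_audio_path(
    "/data/EC/TE/9/20150619_0630.wav", root="/data"
)
print(site_name, recorded_at.isoformat(), site_name in known_sites)
```

Checking each extracted site_name against the set of names in the sites file, as in the last line, is one way to catch a mismatch between the regular expression and the site_name column before a full run.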
Run the test suite:
uv run pytest -v
If you want to make a contribution to the codebase, please correspond with the package creators specified in the pyproject.toml.