PanDelos-plus: a parallel algorithm for computing sequence homology in pangenomic analysis.
PanDelos-plus implements a dictionary-based method for pan-genome content discovery. This updated version is a re-engineered and parallelized C++ implementation of the original PanDelos. It integrates several Python modules with a C++ library, coordinated via the pandelosp.sh Bash script, which facilitates streamlined access to the complete PanDelos-plus pipeline.
Important
Make sure to have git installed on your machine.
If you don't have git installed, you can install it on an Ubuntu machine with the following command:
sudo apt-get install git
If you don't have an Ubuntu machine, you can follow the instructions on the Git website.
Tool installation:
git clone https://github.com/synbionics/PanDelos-plus.git
For the local installation on an Ubuntu machine, you can run the following commands:
Required dependencies:
sudo apt update
sudo apt-get install -y bash python3 python3-pip python-virtualenv build-essential time git
Enter the cloned repository:
cd PanDelos-plus
Create and activate the virtual environment:
virtualenv -p python3 pdp_env
source pdp_env/bin/activate
Install the required Python packages:
python3 -m pip install -r pip-requirements.txt
Compile the tool:
bash compile.sh
If you don't have the virtual environment activated, you can activate it with the following command:
source pdp_env/bin/activate
Then run the pipeline on the example input file:
bash pandelosp.sh -i files/pdi/mycoplasma5.pdi -o mycoplasma5
This script will run the PanDelos-plus pipeline on the input file files/pdi/mycoplasma5.pdi and save the output in the mycoplasma5.clus and mycoplasma5.json files. The output files will contain the gene families computed by the pipeline.
Now you can deactivate the virtual environment with the following command:
deactivate
To run the tool with Docker instead, make sure to have git installed on your machine (see the instructions above), then clone the repository:
git clone https://github.com/synbionics/PanDelos-plus.git
Make sure that you have Docker installed on your machine.
If you don't have docker installed, you can install it following the instructions on the Docker website.
Move inside the PanDelos-plus folder:
cd PanDelos-plus
If you are running a Linux machine, you probably need to change the permissions of the following folders:
chmod -R 777 input
chmod -R 777 output
Build the container:
Important
If you are on a Windows machine, you probably have to start the Docker engine by opening the Docker Desktop application.
docker compose build --no-cache
Note that the docker compose command may raise some errors; if it does, try docker-compose instead.
Important
Check that the input and output folders are writable by the user running the docker container.
Copy the input file into the input folder:
cp files/pdi/mycoplasma5.pdi input/mycoplasma5.pdi
If you are on Windows, you probably have to use:
cp .\files\pdi\mycoplasma5.pdi input\mycoplasma5.pdi
Run the pipeline:
docker compose run --rm pandelosplus bash pandelosp.sh -i input/mycoplasma5.pdi -o output/mycoplasma5
Note that the docker compose command may raise some errors; if it does, try docker-compose instead.
This script will run the PanDelos-plus pipeline inside the Docker container on the input file input/mycoplasma5.pdi and save the output in the output/mycoplasma5.clus and output/mycoplasma5.json files.
The output files will contain the gene families computed by the pipeline.
PanDelos-plus takes as input the complete set of gene sequences belonging to the studied genomes, stored in a single .pdi text file.
This file must have a "2 line pattern" where:
- The first line represents the identification line, composed of 3 parts (genome identifier, the gene identifier and the gene product) separated by a tabulation character.
- The second line consists of the complete gene sequence in FASTA amino acid format reported in a single line.
IMPORTANT: No blank lines are allowed anywhere in the file.
Example of valid file composed of 5 genes grouped in 2 genomes
NC_000913 NC_000913:NC_000913.3:b0001:1 thr operon leader peptide
MKRISTTITTTITITTGNGAG
NC_000913 NC_000913:NC_000913.3:b0005:1 DUF2502 domain-containing protein YaaX
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR
NC_000913 NC_000913:NC_000913.3:b0018:1 regulatory protein MokC
MLNTCRVPLTDRKVKEKRAMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFTAYESE
NC_007946 NC_007946:NC_007946.1:UTI89_RS06140:1 DUF1382 family protein
MHKASPVELRTSIDLAHSLAQIGVRFVPIPAETDEEFHTLATSLSQKLEMMVAKAEADERDQV
NC_007946 NC_007946:NC_007946.1:UTI89_RS06145:1 DUF1317 domain-containing protein
MTHPHDNIRVGAITFVYSVTKRGWVFHGLSVIRNPLKAQRLAEEINNKRGAVCTKHLLLS
IMPORTANT
Make sure that gene identifiers are unique within the input file. A suggested format for building a unique gene identifier is genome_identifier:gene_identifier:unique_integer.
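The format rules above (two-line pattern, three tab-separated fields, no blank lines, unique gene identifiers) can be checked before running the pipeline. The following is a minimal sketch, not part of PanDelos-plus: the function name `validate_pdi_lines` and its `n_fields` parameter are hypothetical, and the pipeline's own input check may apply different or additional rules.

```python
from typing import Iterable, List


def validate_pdi_lines(lines: Iterable[str], n_fields: int = 3) -> List[str]:
    """Return a list of problems found in .pdi content (empty if none).

    Hypothetical helper: it only checks the rules stated in this README.
    """
    problems: List[str] = []
    seen_gene_ids = set()
    rows = [line.rstrip("\n") for line in lines]

    # The 2 line pattern implies an even number of lines.
    if len(rows) % 2 != 0:
        problems.append("odd number of lines: each identification line needs a sequence line")

    for number, row in enumerate(rows, start=1):
        if row.strip() == "":
            problems.append(f"line {number}: blank line")
            continue
        if number % 2 == 1:
            # Odd lines are identification lines: genome id, gene id, product.
            fields = row.split("\t")
            if len(fields) != n_fields:
                problems.append(
                    f"line {number}: expected {n_fields} tab-separated fields, found {len(fields)}"
                )
                continue
            gene_id = fields[1]
            if gene_id in seen_gene_ids:
                problems.append(f"line {number}: duplicate gene identifier {gene_id}")
            seen_gene_ids.add(gene_id)
        # Even lines are sequence lines; their content is not checked here.
    return problems
```

A file can be checked with `validate_pdi_lines(open("input.pdi"))`; an empty list means that none of the rules listed above were violated.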
Remember to activate the virtual environment:
source pdp_env/bin/activate
After you have prepared your input file (supposing it is named input.pdi), you can run the pipeline as follows:
bash pandelosp.sh -i input.pdi -o output
The output files will contain the gene families computed by the pipeline.
Remember to deactivate the virtual environment:
deactivate
If you are using Docker, after you have prepared your input file (supposing it is in the input folder and is named custom.pdi), you can run the pipeline as follows:
docker compose run --rm pandelosplus bash pandelosp.sh -i input/custom.pdi -o output/custom
This script will run the PanDelos-plus pipeline inside the Docker container on the input file input/custom.pdi and save the output in the output/custom.clus file.
The output files will contain the gene families computed by the pipeline.
Important: If you installed PanDelos-plus with Docker, you must enter the container to execute the following steps.
docker compose run --rm pandelosplus bash
You can generate an input file from a set of .gbff files by following these steps:
- Download the GenBank files of interest.
- Create a folder and copy all .gbff files into it.
- Run the execution script, adding the -g <path_to_folder_with_gbff_files> flag to the execution command to generate a json file in addition to the output file.
Example:
Extract the files from files/gbff.zip.
This example uses the files contained in the files/gbff/ folder, which holds 2 genomes downloaded from the NIH database:
- GCF_016028495.1, for Salmonella enterica (ASM1602849v1).
- GCF_000006945.2, for Salmonella enterica (ASM694v2).
So steps 1 and 2 are already done.
Step 3:
bash pandelosp.sh -i custom.pdi -o custom -g files/gbff/
If everything works fine you will get this output:
mar 18 mar 2025, 10:21:38, CET
Using files contained in: files/gbff/
Converting gbff to gbk
['GCA_000006945.2.gbff', 'GCA_016028495.1.gbff']
Checking gbk files
This may take a while
Reading gbk files from: files/gbff//gbk/
Files found: ['GCA_000006945.2.gbk', 'GCA_016028495.1.gbk']
Processing file: GCA_000006945.2.gbk
Processing genome: GCA_000006945.2
Genome ID: GCA_000006945.2 Sequence ID: AE006468.2
Genome ID: GCA_000006945.2 Sequence ID: AE006471.2
Processing file: GCA_016028495.1.gbk
Processing genome: GCA_016028495.1
Genome ID: GCA_016028495.1 Sequence ID: CP065718.1
Genome ID: GCA_016028495.1 Sequence ID: CP065719.1
All files processed successfully
Generating pdi input file
This may take a while
reading gbk files from files/gbff//gbk/
['GCA_000006945.2.gbk', 'GCA_016028495.1.gbk']
GCA_000006945.2.gbk
GCA_000006945.2
GCA_000006945.2 AE006468.2
GCA_000006945.2 AE006471.2
GCA_016028495.1.gbk
GCA_016028495.1
GCA_016028495.1 CP065718.1
GCA_016028495.1 CP065719.1
writing to custom.pdi
custom.pdi
k = 4
Checking input file (.pdi)
File is correct
Executing main
Computing clusters
Converting clusters to json with GeneBank information
mar 18 mar 2025, 10:21:54, CET
Now you can check the output file custom.clus and the json file custom.json.
Note that the json file is enriched with the information from the GenBank files.
You can customize the execution of the pipeline by using the following flags:
| Flag | Description |
|---|---|
| -i | Input file path |
| -o | Output file path |
| -t | Number of threads |
| -m | Enable a slower mode which requires less memory (default: False) |
| -d | Discard value (0 <= d <= 1, default 0.5), check the section below |
| -g | Path to gbk folder |
| -f | For fragmented genes, check the section below |
| -p | For a stronger threshold (similarity parameter), check the section below |
| -h | Display this help message |
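When the pipeline is launched from another program, the flags in the table can be assembled programmatically. The sketch below is only an illustration: `build_pandelosp_command` and its parameter names are hypothetical, and it covers only the -i, -o, -t, -m, -d and -g flags. The -f and -p flags are left out because the table does not state whether they take a value.

```python
from typing import List, Optional


def build_pandelosp_command(
    input_pdi: str,
    output_prefix: str,
    threads: Optional[int] = None,
    low_memory: bool = False,
    discard: Optional[float] = None,
    gbk_folder: Optional[str] = None,
) -> List[str]:
    """Build the argument list for pandelosp.sh (hypothetical helper)."""
    command = ["bash", "pandelosp.sh", "-i", input_pdi, "-o", output_prefix]
    if threads is not None:
        if threads < 1:
            raise ValueError("the number of threads must be at least 1")
        command += ["-t", str(threads)]
    if low_memory:
        # -m enables the slower mode that requires less memory.
        command.append("-m")
    if discard is not None:
        # The table documents the range 0 <= d <= 1 for the discard value.
        if not 0.0 <= discard <= 1.0:
            raise ValueError("the discard value must satisfy 0 <= d <= 1")
        command += ["-d", str(discard)]
    if gbk_folder is not None:
        command += ["-g", gbk_folder]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from inside the PanDelos-plus folder; passing a list rather than a single string avoids shell quoting issues with paths that contain spaces.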
You can also check the help message by running:
bash pandelosp.sh -h
You will get this output:
Usage: pandelosp.sh [-i input_file] [-o output_file] [-t thread_num] [-m] [-d discard_value] [-g path to gbks][-h]
Options:
-i: Input file path
-o: Output file path
-t: Number of threads
-m: Enable a different mode
-d: Discard value (0 <= d <= 1, default 0.5)
-g: Path to gbk folder
-f: For fragmented genes
-p: For a stronger threshold (similarity parameter)
-h: Display this help message
The discard value (-d) is a threshold that is used to decide whether to compare two genes.
This type of decision is made based on the length of the genes.
NC_000913 NC_000913:NC_000913.3:b0001:1 thr operon leader peptide
MKRISTTITTTITITTGNGAG
NC_000913 NC_000913:NC_000913.3:b0018:1 regulatory protein MokC
MLNTCRVPLTDRKVKEKRAMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFTAYESE
In the case of this pair the gene identified by NC_000913:NC_000913.3:b0001:1 has a length equal to
PanDelos-plus can handle fragmented genes. To use this feature the input file must be formatted as described in the following lines.
This file must have a "2 line pattern" where:
- The first line represents the identification line, composed of 4 parts (genome identifier, the gene identifier, the gene product and the number of inferred characters) separated by a tabulation character.
- The second line consists of the complete gene sequence in FASTA amino acid format reported in a single line.
IMPORTANT: No blank lines are allowed anywhere in the file.
Example of valid file composed of 5 genes grouped in 2 genomes for fragmented genes
NC_000913 NC_000913:NC_000913.3:b0001:1 thr operon leader peptide 5
MKRISTTITTTITITTGNGAG
NC_000913 NC_000913:NC_000913.3:b0005:1 DUF2502 domain-containing protein YaaX 20
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR
NC_000913 NC_000913:NC_000913.3:b0018:1 regulatory protein MokC 30
MLNTCRVPLTDRKVKEKRAMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFTAYESE
NC_007946 NC_007946:NC_007946.1:UTI89_RS06140:1 DUF1382 family protein 10
MHKASPVELRTSIDLAHSLAQIGVRFVPIPAETDEEFHTLATSLSQKLEMMVAKAEADERDQV
NC_007946 NC_007946:NC_007946.1:UTI89_RS06145:1 DUF1317 domain-containing protein 5
MTHPHDNIRVGAITFVYSVTKRGWVFHGLSVIRNPLKAQRLAEEINNKRGAVCTKHLLLS
IMPORTANT
Make sure that gene identifiers are unique within the input file. A suggested format for building a unique gene identifier is genome_identifier:gene_identifier:unique_integer.
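Records in the four-field layout can be generated from in-memory data while building identifiers in the suggested genome_identifier:gene_identifier:unique_integer format. The following is a sketch under assumptions: `format_fragment_records` is a hypothetical helper, and the per-gene counter is just one possible way to obtain the unique integer.

```python
from typing import Dict, Iterable, List, Tuple

# (genome identifier, gene name, gene product, sequence, inferred characters)
GeneRecord = Tuple[str, str, str, str, int]


def format_fragment_records(genes: Iterable[GeneRecord]) -> List[str]:
    """Return .pdi lines in the 2 line pattern with 4 tab-separated fields."""
    lines: List[str] = []
    counters: Dict[Tuple[str, str], int] = {}
    for genome_id, gene_name, product, sequence, inferred in genes:
        # Count occurrences of each (genome, gene) pair so that repeated
        # gene names still receive distinct identifiers.
        key = (genome_id, gene_name)
        counters[key] = counters.get(key, 0) + 1
        gene_id = f"{genome_id}:{gene_name}:{counters[key]}"
        lines.append("\t".join([genome_id, gene_id, product, str(inferred)]))
        # The sequence is written on a single line, with no blank lines added.
        lines.append(sequence)
    return lines
```

Writing the file is then a matter of `"\n".join(lines) + "\n"`; joining with a single newline keeps the file free of blank lines.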
This feature will be used inside the following pipelines:
- PanDelos-plus-frags, not released yet
Another pipeline that will use this feature is PanDelos-frags.
The similarity parameter (-p) is a threshold that is used to decide whether to keep or discard a value obtained from the comparison of two genes.
This decision is based on the similarity of the two genes: if the similarity is greater than a specific value, the computed value is kept.
Otherwise, it is discarded.
With the -p flag a higher threshold is used for this decision, so more values are discarded.
After the computation of the clusters, you can query the pangenome using the query_pangenome.py script.
If you are using docker:
docker compose run --rm pandelosplus python3 query_pangenome.py -i <path_to_json_file>.json -o <path_to_output_folder> -c <gene_threshold_for_core> [-f <list | multifasta | all>]
If you are not using docker:
Remember to activate the virtual environment before running the script. You can do this by running the following command:
source pdp_env/bin/activate
python3 query_pangenome.py -i <path_to_json_file>.json -o <path_to_output_folder> -c <gene_threshold_for_core> [-f <list | multifasta | all>]
| Flag | Description |
|---|---|
| -i | Input file path (.json). |
| -o | Output folder path. |
| -c | Core threshold: the minimum number of genomes in which a gene must be present to be considered core. |
| -f | Create additional output files. Use list if you are interested only in gene identifiers; use multifasta if you want each gene identifier together with its sequence in multifasta format. If you want both, use all. |
| -h | Display this help message. |
You can also check the help message by running:
python3 query_pangenome.py -h
You will get this output:
usage: query_pangenome.py [-h] -i INPUT -o OUTPUT -c CORE [-f {none,list,multifasta,all}]
Program to process files with output options.
options:
-h, --help show this help message and exit
-i, --input INPUT input file to process
-o, --output OUTPUT output folder for results
-c, --core CORE core threshold, the minimun number of genomes that a gene must be present in to be considered core
-f, --format {none,list,multifasta,all}
format of files to generate in addition to graphs
Supposing that you have run the example with custom gbff files above and you have the following files in the output folder:
- custom.clus
- custom.json
You can query the pangenome using the following command:
If inside docker:
docker compose run --rm pandelosplus python3 query_pangenome.py -i output/custom.json -o output/ -c 2 -f all
If outside docker:
python3 query_pangenome.py -i output/custom.json -o output/ -c 2 -f all
You will obtain the following output:
Processing file 'output/custom.json'...
Results will be saved to 'output/'
Running core analysis...
Core analysis completed successfully.
Running diffusivity analysis...
Diffusivity analysis completed successfully.
Generating presence/absence matrix...
Presence/absence matrix generated successfully.
All processing completed successfully.
Now your output folder will have the following contents:
output
├── custom.clus
├── custom.json
├── diffusivity
│   ├── list
│   │   ├── diffusivity_1.txt
│   │   └── diffusivity_2.txt
│   └── multifasta
│       ├── diffusivity_1.ffn
│       └── diffusivity_2.ffn
├── gene_type.png
├── hist_family_diffusivity.png
├── matrix.csv
├── pie_family_diffusivity.png
└── types
    ├── list
    │   ├── accessory.txt
    │   ├── core.txt
    │   └── singleton.txt
    └── multifasta
        ├── accessory.ffn
        ├── core.ffn
        └── singleton.ffn
Where:
- diffusivity folder contains the diffusivity analysis results.
  - list folder contains the diffusivity analysis results in a list format.
  - multifasta folder contains the diffusivity analysis results in a multifasta format.
- gene_type.png is a plot of the gene types (core, accessory, singleton) in the pangenome.
- hist_family_diffusivity.png is a histogram of the diffusivity distribution of the gene families.
- pie_family_diffusivity.png is a pie chart of the diffusivity distribution of the gene families.
- matrix.csv is a matrix of the presence/absence of the gene families in the genomes.
- types folder contains the gene types (core, accessory, singleton) in the pangenome.
  - list folder contains the gene types in a list format.
  - multifasta folder contains the gene types in a multifasta format.
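The three gene types can be illustrated with a small sketch. It assumes that the diffusivity of a gene family is the number of genomes in which the family is present, that a singleton is a family present in exactly one genome, and that a family is core when its diffusivity reaches the -c threshold. These definitions and the name `classify_families` are assumptions made for illustration; they are not taken from query_pangenome.py.

```python
from typing import Dict, Set


def classify_families(
    family_genomes: Dict[str, Set[str]], core_threshold: int
) -> Dict[str, str]:
    """Label each gene family as core, accessory or singleton.

    family_genomes maps a family name to the set of genomes containing it.
    """
    labels: Dict[str, str] = {}
    for family, genomes in family_genomes.items():
        diffusivity = len(genomes)  # assumed: number of genomes with the family
        if diffusivity >= core_threshold:
            labels[family] = "core"
        elif diffusivity == 1:
            labels[family] = "singleton"
        else:
            labels[family] = "accessory"
    return labels
```

In this sketch the core check comes first, so with a threshold of 1 every family is labelled core; choosing the threshold equal to the number of genomes gives the strictest notion of core.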
For the diffusivity analysis:
If you used -f list or -f all, you will find the following files in the diffusivity/list folder:
- diffusivity_1.txt contains the identifiers of the genes with diffusivity 1.
- diffusivity_2.txt contains the identifiers of the genes with diffusivity 2.
If you use more than 2 genomes, you will find more files in the list folder, named diffusivity_N.txt, where N is the diffusivity of the genes inside the file.
If you used -f multifasta or -f all, you will find the following files in the diffusivity/multifasta folder:
- diffusivity_1.ffn contains the genes with diffusivity 1, in multifasta format.
- diffusivity_2.ffn contains the genes with diffusivity 2, in multifasta format.
If you use more than 2 genomes, you will find more files in the multifasta folder, named diffusivity_N.ffn, where N is the diffusivity of the genes inside the file.
For the gene_type analysis:
If you used -f list or -f all, you will find the following files in the types/list folder:
- accessory.txt contains the identifiers of the accessory genes.
- core.txt contains the identifiers of the core genes.
- singleton.txt contains the identifiers of the singleton genes.
If you used -f multifasta or -f all, you will find the following files in the types/multifasta folder:
- accessory.ffn contains the accessory genes, in multifasta format.
- core.ffn contains the core genes, in multifasta format.
- singleton.ffn contains the singleton genes, in multifasta format.
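Since the number of diffusivity files depends on the number of genomes, downstream scripts may want to discover them by name rather than hard-code them. The sketch below relies only on the diffusivity_N.txt and diffusivity_N.ffn naming scheme described above; `index_diffusivity_files` is a hypothetical helper and works on file names alone, without reading the files.

```python
import re
from typing import Dict, Iterable

# Matches diffusivity_N.txt and diffusivity_N.ffn, capturing N.
_DIFFUSIVITY_NAME = re.compile(r"^diffusivity_(\d+)\.(txt|ffn)$")


def index_diffusivity_files(file_names: Iterable[str]) -> Dict[int, str]:
    """Map each diffusivity value N to its file name, sorted by N."""
    indexed: Dict[int, str] = {}
    for name in file_names:
        match = _DIFFUSIVITY_NAME.match(name)
        if match:
            indexed[int(match.group(1))] = name
    # Sort numerically so that diffusivity_10 comes after diffusivity_2.
    return dict(sorted(indexed.items()))
```

For example, `index_diffusivity_files(os.listdir("output/diffusivity/list"))` (after `import os`) gives one entry per diffusivity value. The function is meant for one folder at a time, because the .txt and .ffn files share the same N.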
PanDelos-plus is distributed under the MIT license. This means that it is free for both academic and commercial use. Note, however, that some third-party components in PanDelos-plus require you to reference certain works in scientific publications. You are free to link or use PanDelos-plus inside the source code of your own program. If you do so, please reference (cite) PanDelos-plus and this website. Bug fixes and collaboration for improvements are appreciated.
PanDelos-Plus has been presented at BBCC2024 - the 19th annual edition of the conference, November 27-29, 2024, in Naples, Italy.
Published 18 nov 2024 https://doi.org/10.7490/f1000research.1120001.1
Original PanDelos software:
Bonnici, V., Giugno, R., Manca, V.
PanDelos: a dictionary-based method for pan-genome content discovery
BMC bioinformatics 19.15 (2018): 437.
If you have used any of the PanDelos-plus project software, please cite the following paper:


