README.md
+22-23
@@ -46,7 +46,7 @@ Bakta exactly identifies known identical protein sequences (**IPS**) from RefSeq
46
46
This AFSI approach substantially accelerates the annotation process by avoiding computationally expensive homology searches for identified genes. Thus, Bakta can annotate a typical bacterial genome in 10 ±5 min on a laptop, plasmids in a couple of seconds/minutes.
47
47
48
48
-**Database cross-references**
49
-
Fostering the [FAIR](https://www.go-fair.org/fair-principles) principles, Bakta exploits its AFSI approach to annotate CDS with database cross-references (**dbxref**) to RefSeq (`WP_*`), UniRef100 (`UniRef100_*`) and UniParc (`UPI*`). By doing so, IPS allow the surveillance of distinct gene alleles and streamlining comparative analysis as well as posterior (external) annotations of `putative` & `hypothetical` protein sequences which can be mapped back to existing CDS via these exact & stable identifiers (*E. coli* gene [ymiA](https://www.uniprot.org/uniprot/P0CB62)[...more](https://www.uniprot.org/help/dubious_sequences)). Currently, Bakta identifies ~214.8 mio, ~199 mio and ~161 mio distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.
49
+
Fostering the [FAIR](https://www.go-fair.org/fair-principles) principles, Bakta exploits its AFSI approach to annotate CDS with database cross-references (**dbxref**) to RefSeq (`WP_*`), UniRef100 (`UniRef100_*`) and UniParc (`UPI*`). By doing so, IPS allow the surveillance of distinct gene alleles and streamlining comparative analysis as well as posterior (external) annotations of `putative` & `hypothetical` protein sequences which can be mapped back to existing CDS via these exact & stable identifiers (*E. coli* gene [ymiA](https://www.uniprot.org/uniprot/P0CB62)[...more](https://www.uniprot.org/help/dubious_sequences)). Currently, Bakta identifies ~350 mio, ~330 mio and ~290 mio distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.
50
50
51
51
-**FAIR annotations**
52
52
To provide standardized annotations adhering to FAIR principles, Bakta utilizes a versioned custom annotation database comprising UniProt's [UniRef100 & UniRef90](https://www.uniprot.org/uniref/) protein clusters (FAIR -> [DOI](http://dx.doi.org/10.1038/s41597-019-0180-9)/[DOI](https://doi.org/10.1093/nar/gkaa1100)) enriched with dbxrefs (`GO`, `COG`, `EC`) and annotated by specialized niche databases. For each DB version we provide a comprehensive log file of all imported sequences and annotations.
@@ -141,12 +141,11 @@ To download the most recent compatible database version we recommend to use the
If required, or desired, the AMRFinderPlus DB can also be updated manually:
@@ -546,22 +545,22 @@ Due due to uncertain nature of sORF prediction, only those identified via IPS /
546
545
The Bakta database comprises a set of AA & DNA sequence databases as well as HMM & covariance models.
547
546
At its core Bakta utilizes a compact read-only SQLite DB storing protein sequence digests, lengths, pre-assigned annotations and dbxrefs of UPS, IPS and PSC from:
548
547
549
-
- **UPS**: UniParc / UniProtKB (289,894,428)
550
-
- **IPS**: UniProt UniRef100 (270,638,882)
551
-
- **PSC**: UniProt UniRef90 (119,631,901)
552
-
- **PSCC**: UniProt UniRef50 (3,134,924)
548
+
- **UPS**: UniParc / UniProtKB (350,631,327)
549
+
- **IPS**: UniProt UniRef100 (330,865,009)
550
+
- **PSC**: UniProt UniRef90 (135,274,518)
551
+
- **PSCC**: UniProt UniRef50 (37,008,138)
553
552
554
553
This allows the exact identification of protein sequences via MD5 digests & sequence lengths as well as the rapid subsequent lookup of related information. Protein sequence digests are checked for hash collisions during the DB creation process. IPS & PSC have been comprehensively pre-annotated integrating annotations & database *dbxrefs* from:
555
554
556
-
- NCBI nonredundant proteins (IPS: 192,288,757)
555
+
- NCBI nonredundant proteins (UPS: 290,693,966)
557
556
- NCBI COG DB (PSC: 3,513,643)
558
-
- KEGG Kofams (PSC: 19,818,290)
559
-
- SwissProt EC/GO terms (PSC: 336,656)
560
-
- NCBI NCBIfams (PSC: 17,308,678)
561
-
- PHROG (PSC: 11,243)
562
-
- NCBI AMRFinderPlus (IPS: 7,611)
563
-
- ISFinder DB (IPS: 137,670, PSC: 12,380)
564
-
- Pfam families (PSC: 687,250)
557
+
- KEGG Kofams (PSC: 24,267,514)
558
+
- SwissProt EC/GO terms (PSC: 337,264)
559
+
- NCBI NCBIfams (PSC: 21,758,901)
560
+
- PHROG (PSC: 11,717)
561
+
- NCBI AMRFinderPlus (IPS: 8,382)
562
+
- ISFinder DB (IPS: 155,449, PSC: 14,481)
563
+
- Pfam families (PSC: 659,781)
565
564
566
565
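The digest-based lookup described above can be illustrated with a minimal sketch. It is not Bakta's actual implementation: the table name `ips`, its columns, the helper names and the demo sequence are all hypothetical, and an in-memory SQLite DB stands in for the real read-only database. It only shows the general idea of identifying a protein by its MD5 digest plus sequence length instead of running an alignment.

```python
import hashlib
import sqlite3
from typing import Optional


def normalize(aa_seq: str) -> str:
    """Upper-case the sequence and drop a trailing stop symbol, if any."""
    return aa_seq.rstrip("*").upper()


def digest(aa_seq: str) -> bytes:
    """Return the 16-byte MD5 digest of the normalized amino acid sequence."""
    return hashlib.md5(normalize(aa_seq).encode("ascii")).digest()


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory DB (hypothetical schema, demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ips (hash BLOB PRIMARY KEY, length INTEGER NOT NULL, product TEXT)"
    )
    known = {"MKTAYIAKQR": "hypothetical demo protein"}
    for seq, product in known.items():
        conn.execute(
            "INSERT INTO ips (hash, length, product) VALUES (?, ?, ?)",
            (digest(seq), len(seq), product),
        )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, aa_seq: str) -> Optional[str]:
    """Return the stored product for an exactly matching sequence, else None."""
    seq = normalize(aa_seq)
    row = conn.execute(
        "SELECT length, product FROM ips WHERE hash = ?", (digest(seq),)
    ).fetchone()
    # The length comparison is an inexpensive extra guard against digest collisions.
    if row is None or row[0] != len(seq):
        return None
    return row[1]
```

A sequence that differs by a single residue yields a different digest and is therefore not found, which is why sequences without an exact match still need the downstream homology search.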
To provide high quality annotations for distinct protein sequences of high importance (AMR, VF, *etc*) which cannot sufficiently be covered by the IPS/PSC approach, Bakta provides additional expert systems. For instance, AMR genes are annotated via NCBI's AMRFinderPlus.
567
566
An expandable alignment-based expert system supports the incorporation of high quality annotations from multiple sources. This currently comprises NCBI's BlastRules as well as VFDB and will be complemented with more expert annotation sources over time. Internally, this expert system is based on a Diamond DB comprising the following information in a standardized format:
@@ -577,8 +576,8 @@ An expandable alignment-based expert system supports the incorporation of high q
577
576
578
577
Rfam covariance models:
579
578
580
-
- ncRNA: 802
581
-
- ncRNA cis-regulatory regions: 270
579
+
- ncRNA: 779
580
+
- ncRNA cis-regulatory regions: 288
582
581
583
582
ori sequences:
584
583
@@ -589,13 +588,13 @@ To provide FAIR annotations, the database releases are SemVer versioned (w/o pat
589
588
590
589
As this taxonomically untargeted database is fairly demanding in terms of storage consumption, we also provide a lightweight DB type providing all non-coding feature information but only PSCC information from UniRef50 clusters for CDS. If download bandwidths or storage requirements become an issue or if shorter runtimes are favored over more-specific annotation, the `light` DB will do the job.
All database releases are hosted at Zenodo: [10.5281/zenodo.4247252](https://doi.org/10.5281/zenodo.4247252)
597
+
All database releases are hosted at Zenodo: [10.5281/zenodo.14916843](https://doi.org/10.5281/zenodo.14916843)