A Major Project: Automated Construction of Huge Protein Databases for Bacteria and Archaea

riboDB is a model database

riboDB is a database gathering high-quality sequences of 93 protein families linked to the structure and function of the ribosome in Bacteria and Archaea.
The riboDB database (July 2025) contains data from 261183 genomes (15725 Archaea, 245458 Bacteria) and 13356907 ribosomal protein sequences.
The riboDB web-site is the interface used to query this database and obtain a dataset tailored to a specific phylogenomic study.

Building this database is a complex problem that has driven many improvements in the rapid, automated construction of protein databases.

Protein database construction process

The riboDB approach (and engine) is versatile: it can be used to construct ribosomal protein databases from freshly assembled and annotated genomes, and also to build databases for other protein families.
Since the initial version, the approach has evolved to cope with the sharp increase in available genomes.
Since 2025, the new version of the riboDB construction program has been written in pure Julia, which led to a 20x speed-up of the construction process and a threefold reduction in storage needs compared to the previous Python version. This is the Magiotagos project.

The key idea

  1. To represent a genome as tree-like Julia structures (type "Struct"), with sequences stored only as LongSequence{DNAAlphabet{4}} to keep the information compact. All relevant information is either extracted from the assembly_summary files and GFF files (locally produced or from GenBank) or pre-computed during construction. Newly gathered information can be added in new sub-structures (for example, new protein identifications). More specialized information (e.g. protein 3D structure) may be held in other databases, with the interconnection written directly in the Julia structure.
  2. To save these genomes in binary format across a large number of directories, which speeds up the set-up of parallel processing. Thanks to the native parallelization approach, the whole set of directories is seen as a single source of data. Load balancing between cores relies only on randomization, which is adequate given the large number of genomes.
  3. To drop most of the precomputed information and the protein sequences, since the power and speed of Julia allow computation and sequence translation on the fly. This simplifies and lightens the database without impairing its use.
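As an illustration of points 1 and 3, here is a minimal sketch in Python (the actual engine is written in Julia). It shows a nested genome structure whose DNA is stored packed at 4 bits per base, with proteins translated on demand instead of being stored. All class, field and function names are hypothetical, and the 4-bit encoding is a toy one chosen for the example, not the internal layout of LongSequence{DNAAlphabet{4}}.

```python
from dataclasses import dataclass, field

# Toy 4-bit code per base (assumption for this sketch); N covers unknown bases.
_CODE = {"A": 1, "C": 2, "G": 4, "T": 8, "N": 15}
_BASE = {v: k for k, v in _CODE.items()}

# Standard genetic code laid out in the conventional TCAG order.
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: _AA[16 * i + 4 * j + k]
               for i, a in enumerate("TCAG")
               for j, b in enumerate("TCAG")
               for k, c in enumerate("TCAG")}


def pack(dna):
    """Pack a DNA string into two bases per byte; returns (bytes, length)."""
    out = bytearray((len(dna) + 1) // 2)
    for i, base in enumerate(dna.upper()):
        out[i // 2] |= _CODE[base] << (4 * (i % 2))
    return bytes(out), len(dna)


def unpack(data, length, start=0, stop=None):
    """Decode the bases in [start, stop) back into a string."""
    stop = length if stop is None else stop
    return "".join(_BASE[(data[i // 2] >> (4 * (i % 2))) & 15]
                   for i in range(start, stop))


@dataclass
class Feature:                      # hypothetical CDS record
    name: str
    start: int                      # 0-based, inclusive
    stop: int                       # 0-based, exclusive
    strand: int = 1                 # +1 or -1
    extra: dict = field(default_factory=dict)  # later annotations, cross-links


@dataclass
class Contig:
    packed: bytes
    length: int
    features: list = field(default_factory=list)


@dataclass
class Genome:
    accession: str
    taxonomy: str
    contigs: list = field(default_factory=list)


def translate(contig, feat):
    """Translate a CDS on the fly; codons containing N become 'X'."""
    dna = unpack(contig.packed, contig.length, feat.start, feat.stop)
    if feat.strand < 0:
        dna = dna.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


contig = Contig(*pack("ATGAAATAA"), features=[Feature("demo_cds", 0, 9)])
genome = Genome("DEMO_0001", "hypothetical taxon", [contig])
print(translate(contig, contig.features[0]))  # prints MK*
```

New annotations would go into the `extra` dictionary (or a new sub-structure) without touching the stored DNA, which is the property the list above relies on.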

In our team, this organisation is known as a "Knowledge Base", as all the knowledge about the genomes is contained in the nested structures or may be quickly reconstructed from them.
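The storage layout of point 2 (many directories, random assignment, one logical data source) can be sketched as follows. This is only an illustration in Python: `pickle` stands in for the Julia binary serialization, a thread pool stands in for the native parallelization, and the directory and file naming (`shard_0000`, `.bin`) is hypothetical.

```python
import pickle
import random
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_sharded(genomes, root, n_shards, seed=0):
    """Write each genome object into a randomly chosen shard directory."""
    rng = random.Random(seed)
    root = Path(root)
    for i in range(n_shards):
        (root / f"shard_{i:04d}").mkdir(parents=True, exist_ok=True)
    for accession, genome in genomes.items():
        shard = root / f"shard_{rng.randrange(n_shards):04d}"
        with open(shard / f"{accession}.bin", "wb") as fh:
            pickle.dump(genome, fh)


def scan_shard(shard, fn):
    """Apply fn to every genome stored in one shard directory."""
    results = []
    for path in sorted(shard.glob("*.bin")):
        with open(path, "rb") as fh:
            results.append(fn(pickle.load(fh)))
    return results


def scan_all(root, fn, workers=4):
    """Treat all shards as a single data source, one task per directory."""
    shards = sorted(Path(root).glob("shard_*"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shard = pool.map(lambda s: scan_shard(s, fn), shards)
        return [r for chunk in per_shard for r in chunk]
```

Because genomes are assigned at random, shard sizes even out as the number of genomes grows, so no explicit scheduler is needed to balance the workers.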

Building the Knowledge Base

The first step is to collect the genomes. Downloading from NCBI sources is the longest step, even though the downloads are lightly parallelized, because of network speed and the download limits imposed by the NCBI server. Around 400,000 genomes are downloaded. This collection step can also include genomes obtained by de novo sequencing and/or de novo annotation.
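A lightly parallelized download of this kind can be sketched with a small, bounded worker pool and a few retries. The sketch below is an assumption about the general pattern, not the project's downloader: `fetch` is a hypothetical callable supplied by the caller (it would wrap the actual transfer), and the worker and retry counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(accessions, fetch, workers=3, retries=2, pause=0.0):
    """Fetch every accession with a small pool; returns (done, failed)."""

    def attempt(acc):
        last = None
        for n in range(retries + 1):
            try:
                return fetch(acc)
            except Exception as exc:          # keep the last error, retry
                last = exc
                time.sleep(pause * (n + 1))   # simple linear back-off
        raise last

    done, failed = {}, {}
    # Few workers on purpose: the remote server, not the CPU, is the limit.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(attempt, acc): acc for acc in accessions}
        for fut in as_completed(futures):
            acc = futures[fut]
            try:
                done[acc] = fut.result()
            except Exception as exc:
                failed[acc] = repr(exc)
    return done, failed
```

Keeping the list of failed accessions separate makes it possible to rerun only the missing genomes later.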

The second step is to construct the Julia structure corresponding to the genome model.  

Once the Knowledge Base is available, annotation of the ribosomal proteins (and others) and any other use of the organized data become possible.

Building protein databases: de novo annotation of protein families

Building the ribosomal protein database

De novo annotation of the CDS sequences is done using our own set of HMM profiles for ribosomal proteins. The candidate proteins are then submitted to quality control through an MMSEQS clustering that includes expert-selected reference sequences. Sequences from a cluster in which references are also found are validated. Sequences occurring twice or more in a genome are set apart in the "multiples" sets. The new identification of the CDS is then added to the Knowledge Base.
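The validation logic described above (keep candidates that cluster with a reference, then set aside families found more than once per genome) can be sketched as below. The input shapes are assumptions made for the example: `clusters` maps a cluster identifier to its member sequence identifiers, as could be parsed from a clustering output, and `hits` lists candidate CDS as (sequence id, genome, family) tuples.

```python
from collections import defaultdict


def validate_hits(clusters, references):
    """Return candidate ids that share a cluster with at least one reference."""
    validated = set()
    for members in clusters.values():
        if any(m in references for m in members):
            validated.update(m for m in members if m not in references)
    return validated


def split_multiples(hits, validated):
    """Separate validated hits into single-copy and multi-copy sets."""
    groups = defaultdict(list)
    for seq_id, genome, family in hits:
        if seq_id in validated:
            groups[(genome, family)].append(seq_id)
    singles = {key: ids[0] for key, ids in groups.items() if len(ids) == 1}
    multiples = {key: ids for key, ids in groups.items() if len(ids) > 1}
    return singles, multiples


# Hypothetical identifiers, for illustration only.
clusters = {"c1": ["ref_uL2", "s1", "s2"], "c2": ["s3"], "c3": ["ref_uS3", "s4"]}
hits = [("s1", "gA", "uL2"), ("s2", "gA", "uL2"),
        ("s3", "gA", "uS3"), ("s4", "gB", "uS3")]
validated = validate_hits(clusters, {"ref_uL2", "ref_uS3"})
singles, multiples = split_multiples(hits, validated)
print(singles)    # {('gB', 'uS3'): 's4'}
print(multiples)  # {('gA', 'uL2'): ['s1', 's2']}
```

Here `s3` is rejected because its cluster contains no reference, and the two uL2 candidates of genome `gA` end up in the multiples set.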

Other protein families

The same process may also be used to build any set of protein families within minutes, as long as the HMM profiles are available along with reference sequences.

rDNA databases

Currently, riboDB also contains the rDNA when it is available in the genomes (in the common case of multiple operons, only one rDNA is retained, on the basis of its centrality).
The PkXplore web-site contains these rDNA (16S, 23S, 5S) sequences and nucleic-acid sequence databases for some frequently used proteins such as DNA-directed RNA polymerase β (rpoB) or the groEL and groES chaperonins. Thanks to a human-friendly web interface, it allows the phylogenetic placement of unknown nucleic acid sequences. It is oriented toward learning the basic methods of phylogeny.
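The text does not define "centrality". One plausible reading, assumed here purely for illustration, is the medoid: among the copies found in a genome, keep the one with the smallest total distance to all the others. The sketch uses a plain edit distance, which is fine for a toy example but would be replaced by something faster on full-length rDNA.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def most_central(copies):
    """copies: {copy_id: sequence}; returns the id of the medoid copy."""
    ids = sorted(copies)
    total = {i: sum(edit_distance(copies[i], copies[j]) for j in ids if j != i)
             for i in ids}
    return min(ids, key=lambda i: (total[i], i))   # ties broken by id


# Hypothetical operon copies.
copies = {"rrnA": "ACGTACGT", "rrnB": "ACGTACGA", "rrnC": "ACGAACGA"}
print(most_central(copies))  # prints rrnB
```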

Phylogeny

From "Knowledge Base" to protein family construction

In the Knowledge Base, either the original annotation of the CDS (from RefSeq or GenBank) or the new de novo annotation can be used to collect sequences bearing the same name. This is easy for ribosomal proteins or other re-annotated proteins; it may be more complicated when only the original annotation is available, because the same protein may be identified by various names. In some cases, essentially for universal or widely shared proteins, several identifications (for example "ChaperoninGroES" and "CoChaperoneGroES") may be used at the same time to extract the protein family. This direct use of the initial annotation is, however, risky without strict control.
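Extraction by several accepted names can be sketched as a lookup on normalized labels. The normalization rule (lowercase, keep only letters and digits) and the record layout with a `product` field are assumptions made for this example.

```python
import re


def normalize(name):
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def extract_family(cds_records, accepted_names):
    """Keep the records whose product label matches one of the accepted names."""
    wanted = {normalize(n) for n in accepted_names}
    return [rec for rec in cds_records if normalize(rec["product"]) in wanted]


# Hypothetical records as they might come from an original annotation.
records = [
    {"id": "cds1", "product": "Chaperonin GroES"},
    {"id": "cds2", "product": "co-chaperone GroES"},
    {"id": "cds3", "product": "chaperonin GroEL"},
]
family = extract_family(records, ["ChaperoninGroES", "CoChaperoneGroES"])
print([rec["id"] for rec in family])  # ['cds1', 'cds2']
```

Exact matching on a curated list of names keeps GroEL out of the GroES family; a looser substring match would be faster to write but is exactly the kind of uncontrolled use the paragraph warns against.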

From "Knowledge Base" to Core proteins

Paradoxically, a collection of "core" proteins, i.e. proteins shared by a high percentage of closely related genomes, is rather easy and quick to create using the Knowledge Base. This is possible because the annotation of a given protein is more stable at short taxonomic range and because the ribosomal proteins are the main component of the core genome.
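The core-protein selection amounts to counting, for each annotated name, the fraction of genomes that carry it. A minimal sketch follows; the 95% default threshold is an arbitrary assumption, and the input is a hypothetical mapping from genome to its list of protein names.

```python
from collections import Counter


def core_families(genome_to_families, min_fraction=0.95):
    """Return the names present in at least min_fraction of the genomes."""
    n = len(genome_to_families)
    # set() so that a name annotated twice in one genome is counted once.
    counts = Counter(f for fams in genome_to_families.values() for f in set(fams))
    return sorted(f for f, c in counts.items() if c / n >= min_fraction)


genomes = {
    "g1": ["uL2", "uS3", "accessory"],
    "g2": ["uL2", "uS3"],
    "g3": ["uL2", "uS3", "accessory"],
    "g4": ["uL2"],
}
print(core_families(genomes, min_fraction=0.75))  # ['uL2', 'uS3']
```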

The Yggdrasil project: automated preparation of super-matrix MSA

The extracted families can be processed directly and quickly by Yggdrasil to produce a multiple sequence alignment of the concatenated sequences of each genome. The process combines within-family alignment, trimming and creation of the super-alignment (super-matrix). It may also produce a control tree. The super-matrix can be used directly to reconstruct the phylogeny of the genome collection with suitable methods.

Yggdrasil is a Python program, heavily optimized with process-level parallelization.
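The last stage, building the super-matrix, can be sketched as a concatenation of per-family alignments in which a genome missing from a family is padded with gaps. This is not Yggdrasil's code: alignment and trimming are assumed to have been done already, and the function name and data shapes are hypothetical.

```python
def build_supermatrix(alignments, gap="-"):
    """Concatenate aligned families.

    alignments: list of {genome_id: aligned_sequence}, one dict per family.
    Returns (matrix, partitions), where partitions holds the column range
    (start, stop) occupied by each family in the concatenated alignment.
    """
    genomes = sorted({g for aln in alignments for g in aln})
    rows = {g: [] for g in genomes}
    partitions, offset = [], 0
    for aln in alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences of one alignment must share one length")
        width = widths.pop()
        for g in genomes:
            rows[g].append(aln.get(g, gap * width))   # pad absent genomes
        partitions.append((offset, offset + width))
        offset += width
    return {g: "".join(parts) for g, parts in rows.items()}, partitions


family_1 = {"g1": "MK-L", "g2": "MKAL"}
family_2 = {"g1": "AV", "g3": "AI"}
matrix, partitions = build_supermatrix([family_1, family_2])
print(matrix)      # {'g1': 'MK-LAV', 'g2': 'MKAL--', 'g3': '----AI'}
print(partitions)  # [(0, 4), (4, 6)]
```

The partition list is kept because many phylogeny programs accept per-family column ranges in order to fit a separate model to each family.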

Web Applications

PkXplore

The PkXplore web-site is an online workshop oriented toward learning the basics of phylogeny, but it may also be used as a phylogeny explorer for unknown Bacteria and Archaea.
See PkXplore (GitHub) to install a local version. MSGlimpse (GitHub) is the multiple-sequence viewer developed for the project.

TCPriboDB and riboDB

TCPriboDB (GitHub) is the TCP server used by the riboDB web-site (GitHub). TCPriboDB represents sequences in dictionary form. Another solution would have been a traditional DBMS. There are several reasons for not choosing the DBMS option, but the main one is that the TCP server can be used whenever a Knowledge Base (the new representation of genomes) is created, to allow local queries, and is not limited to riboDB. Requests to TCPriboDB can be made within a local network by sending a structured sentence; the server sends back a set of files containing the response (the extracted sequences corresponding to the keywords). It was developed for the riboDB web-site but may be used for any set of protein families. The databases are easy to build and share as long as the riboDB Fasta grammar is respected.
The riboDB web-site (GitHub) is the user-friendly interface to the TCP server. Although it was initially developed for the riboDB project, it may be used as a web server for exploring any collection of protein families. It enables the selection and extraction of protein-family sequences from Bacteria and Archaea using fully written or partial names. Its use is not limited to ribosomal proteins, as long as a dedicated TCP server for the other proteins has been set up.
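The idea of holding sequences in dictionaries and answering queries by full or partial family name can be sketched in a few lines. This is only an in-memory illustration: the class, its methods and the FASTA header layout are hypothetical and do not reproduce the TCPriboDB request format or the riboDB Fasta grammar.

```python
from collections import defaultdict


class FamilyIndex:
    """In-memory index: family name -> {genome id: sequence}."""

    def __init__(self):
        self._by_family = defaultdict(dict)

    def add(self, family, genome, sequence):
        self._by_family[family][genome] = sequence

    def families(self, pattern, partial=True):
        """Family names matching the pattern, case-insensitively."""
        p = pattern.lower()
        return sorted(f for f in self._by_family
                      if (p in f.lower() if partial else p == f.lower()))

    def to_fasta(self, pattern, partial=True):
        """One FASTA text per matching family (toy header: >genome|family)."""
        return {fam: "".join(f">{g}|{fam}\n{s}\n"
                             for g, s in sorted(self._by_family[fam].items()))
                for fam in self.families(pattern, partial)}


index = FamilyIndex()
index.add("uL2", "g1", "MAVK")
index.add("uL2", "g2", "MAIK")
index.add("uL22", "g1", "METK")
index.add("bS1", "g1", "MTES")
print(index.families("ul2"))                 # ['uL2', 'uL22']
print(index.families("ul2", partial=False))  # ['uL2']
```

A dictionary lookup of this kind needs no schema or external service, which fits the stated goal of setting up a query server for any newly built Knowledge Base.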

 

Publications
