GECO Evolutionary and Computational Genomics
Bioinformatics, Phylogeny and Evolutionary Genomics Group
Flandrois Jean-Pierre
Professor Emeritus
UCBL
A Major Project: Automated Construction of Huge Protein Databases for Bacteria and Archaea
riboDB is a model database
riboDB is a database gathering high-quality sequences of 93 families of proteins linked to the structure and function of the ribosome of Bacteria and Archaea.
The riboDB database (July 2025) contains data from 261,183 genomes (15,725 Archaea, 245,458 Bacteria) and 13,356,907 ribosomal protein sequences.
The riboDB web-site is the interface for querying this database and retrieving a dataset tailored to a specific phylogenomic study.
Building this database is a complex problem that has driven many improvements in the rapid automated construction of protein databases.
Protein database construction process
The riboDB approach (and engine) is versatile: it can be used to construct ribosomal protein databases from freshly assembled and annotated genomes, and also to build databases of other protein families.
Since the initial approach, the process has evolved to cope with the sharp increase in available genomes.
Since 2025, the new version of the riboDB construction program has been written in pure Julia, which made the construction process about 20 times faster and cut storage needs by a factor of 3 compared with the previous Python version. This is the Magiotagos project.
The key ideas
- To represent a genome as tree-like Julia structures (type "struct"), with sequences saved only as LongSequence{DNAAlphabet{4}} to compact the information. All relevant information is either extracted from the assembly_summary files or the GFF files (produced locally or taken from GenBank), or pre-computed during construction. Newly gathered information can be added in new sub-structures (for example, a new identification of proteins). More specialized information (e.g. protein 3D structure) may be held in other databases, with the cross-reference written directly in the Julia structure.
- To save these genomes in binary format across a large number of directories, which speeds up the set-up of parallel processing. With the native parallelization approach, the whole set of directories is seen as a single data source. Load balancing between cores relies only on randomization, which is adequate given the large number of genomes.
- To drop most of the pre-computed information and the protein sequences, since the power and speed of Julia allow computation and sequence translation on the fly. This simplifies and lightens the database without impairing its use.
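As an illustration of the first and third points, here is a minimal Python sketch (the riboDB engine itself is written in Julia). The class names `Genome`, `Contig` and `CDS`, their fields, and the `translate` helper are hypothetical and are not the riboDB types. A plain string stands in for the compact DNA encoding. Only the DNA and the CDS coordinates are stored, and the protein is recomputed when needed.

```python
from dataclasses import dataclass, field

# Codon table in TCAG order. NCBI table 11 (Bacteria, Archaea) maps codons to
# amino acids exactly like the standard code; only the allowed starts differ.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class CDS:
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: int  # +1 or -1
    # Sub-structure that can grow, e.g. original and de novo identifications.
    annotations: dict = field(default_factory=dict)


@dataclass
class Contig:
    name: str
    dna: str  # assumption: a plain string replaces the compact encoding
    cds: list = field(default_factory=list)


@dataclass
class Genome:
    accession: str
    taxonomy: list
    contigs: list = field(default_factory=list)


def translate(contig, cds):
    """Recompute the protein of one CDS from the stored DNA.

    Simplification: alternative start codons (e.g. GTG, TTG) are translated
    by the table rather than forced to M, and unknown codons give "X".
    """
    sub = contig.dna[cds.start:cds.end]
    if cds.strand < 0:
        sub = sub.translate(COMPLEMENT)[::-1]  # reverse complement
    protein = "".join(
        CODON_TABLE.get(sub[i:i + 3], "X") for i in range(0, len(sub) - 2, 3)
    )
    return protein.rstrip("*")  # drop the terminal stop codon
```

Because the protein is derived from the DNA and the coordinates, adding a new identification only means adding a key to `annotations`; no protein sequence has to be stored or updated.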
In our team, this organisation is known as a "Knowledge Base", as all the knowledge about the genomes is contained in the nested structures or can be quickly reconstructed from them.
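The storage layout (many directories, random balance, one logical data source) can be sketched in Python as follows. This is only an illustration under assumptions: the directory naming `shard_NNNN`, the `.bin` extension, the use of `pickle` as the binary format and all function names are hypothetical, and the real engine uses Julia serialization.

```python
import pickle
import random
from pathlib import Path


def save_sharded(genomes, root, n_shards):
    """Write each genome record to a randomly chosen shard directory."""
    root = Path(root)
    for accession, record in genomes.items():
        # Random assignment: with many genomes the shards end up with
        # similar sizes, so no explicit load balancing is needed.
        shard = root / f"shard_{random.randrange(n_shards):04d}"
        shard.mkdir(parents=True, exist_ok=True)
        with open(shard / f"{accession}.bin", "wb") as handle:
            pickle.dump(record, handle)


def iter_knowledge_base(root):
    """Iterate over all records as if the shards were one data source."""
    for path in sorted(Path(root).glob("shard_*/*.bin")):
        with open(path, "rb") as handle:
            yield pickle.load(handle)


def split_shards(root, n_workers):
    """Give each worker a disjoint subset of the shard directories."""
    shards = sorted(Path(root).glob("shard_*"))
    return [shards[i::n_workers] for i in range(n_workers)]
```

Each worker can then process its own list of directories independently, which keeps the parallel set-up to a simple directory listing.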
Building the Knowledge Base
The first step is to collect the genomes. Downloading from NCBI sources is the longest step, even though it is lightly parallelized, because of network speed and the download limits imposed by the NCBI server. Around 400,000 genomes are downloaded. This collection step can also include genomes obtained by de novo sequencing and/or de novo annotation.
The second step is to construct the Julia structure corresponding to the genome model.
Once the Knowledge Base is available, annotation of the ribosomal proteins (and others) and any use of the organized data become possible.
Building protein databases: de novo annotation of protein families
Building the ribosomal protein database
De novo annotation of the CDS sequences uses our own set of HMM profiles of ribosomal proteins. The candidate proteins then undergo quality control through MMSEQS clustering that includes expert-selected reference sequences. Sequences belonging to a cluster that also contains references are validated. Sequences occurring twice or more in a genome are set apart in the "multiples" sets. The new identification of the CDS is then added to the Knowledge Base.
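The validation rule can be sketched in Python as follows. This is an illustration under assumptions, not the riboDB code: the clustering result is assumed to be a two-column, tab-separated table (cluster representative, member), "occurring twice or more in a genome" is interpreted as the same family being found several times in one genome, and all function names are hypothetical.

```python
from collections import defaultdict


def read_cluster_tsv(lines):
    """Parse 'representative<TAB>member' lines into {representative: [members]}."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            representative, member = line.split("\t")
            clusters[representative].append(member)
    return dict(clusters)


def validate_candidates(clusters, references):
    """Keep candidates that share a cluster with at least one reference."""
    validated = set()
    for members in clusters.values():
        if any(m in references for m in members):
            validated.update(m for m in members if m not in references)
    return validated


def split_multiples(validated, genome_of, family_of):
    """Separate families found once in a genome from those found several times."""
    groups = defaultdict(list)
    for seq_id in validated:
        groups[(genome_of[seq_id], family_of[seq_id])].append(seq_id)
    single, multiples = [], []
    for ids in groups.values():
        (single if len(ids) == 1 else multiples).extend(ids)
    return sorted(single), sorted(multiples)
```

A candidate that clusters only with other candidates is never validated, which is what makes the expert-selected references act as the quality filter.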
Other protein families
The same process can also be used to build any set of protein families within minutes, as long as the HMM profiles and reference sequences are available.
rDNA databases
riboDB currently also contains the rDNA when it is available in the genomes (in the common case of multiple operons, only one rDNA is retained, chosen for its centrality).
The PkXplore web-site contains these rDNA (16S, 23S, 5S) sequences and nucleotide sequence databases of some frequently used protein-coding genes, such as DNA-directed RNA polymerase β (rpoB) or the groEL and groES chaperonins. Through a human-friendly web interface, it allows the phylogenetic placement of unknown nucleic acid sequences. It is oriented toward learning the basic methods of phylogeny.
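The text does not define how centrality is measured. One possible reading, shown here as an assumption only, is to keep the medoid: the copy with the smallest total distance to the other copies of the same genome. The k-mer Jaccard distance and the function names below are illustrative choices, not the riboDB method.

```python
def kmers(seq, k=4):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def most_central(seqs, k=4):
    """Index of the copy with the smallest summed distance to all copies."""
    if not seqs:
        raise ValueError("at least one sequence is required")
    profiles = [kmers(s, k) for s in seqs]
    totals = [sum(jaccard_distance(p, q) for q in profiles) for p in profiles]
    # Ties are resolved in favour of the first copy.
    return min(range(len(seqs)), key=totals.__getitem__)
```

With a single copy the function returns index 0, so the same call works whether a genome has one operon or several.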
Phylogeny
From "Knowledge Base" to protein family construction
In the Knowledge Base, the original annotation (from RefSeq or GenBank) of the CDS or the new de novo annotation can be used to collect sequences bearing the same name. This is easy for ribosomal proteins and other re-annotated proteins; it may be more complicated when only the original annotation is available, because the same protein may be identified by various names. In some cases, essentially for universal or widely shared proteins, several identifications (for example "ChaperoninGroES" and "CoChaperoneGroES") can be used at the same time to extract the protein family. This direct use of the initial annotation is, however, risky without strict control.
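Extraction by several identifications at once can be sketched as follows. The record layout (a dictionary with a `product` key) and the normalization rule are assumptions made for the example; the two synonyms are the ones quoted above.

```python
import re


def normalize(name):
    """Lower-case a product name and drop spaces, hyphens and punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def extract_family(records, synonyms):
    """Return the records whose product name matches any accepted synonym."""
    wanted = {normalize(s) for s in synonyms}
    return [r for r in records if normalize(r["product"]) in wanted]


records = [
    {"id": "cds1", "product": "chaperonin GroES"},
    {"id": "cds2", "product": "co-chaperone GroES"},
    {"id": "cds3", "product": "chaperonin GroEL"},
]
groes = extract_family(records, ["ChaperoninGroES", "CoChaperoneGroES"])
```

Exact matching on normalized names keeps "chaperonin GroEL" out of the GroES family, but it cannot catch a wrong original annotation, which is why a control step remains necessary.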
From "Knowledge Base" to Core proteins
Paradoxically, a collection of "core" proteins, i.e. proteins shared by a high percentage of closely related genomes, is rather easy and quick to create with the Knowledge Base. This is possible because the annotation of a given protein is more stable at short taxonomic range and because the ribosomal proteins are the main component of the core genome.
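A minimal sketch of this selection, assuming each genome is summarized by the list of protein names found in it. The function name and the 95% default threshold are illustrative assumptions; the text only says "a high percentage".

```python
from collections import Counter


def core_families(genome_to_families, min_fraction=0.95):
    """Names present in at least min_fraction of the genomes."""
    n_genomes = len(genome_to_families)
    if n_genomes == 0:
        return []
    # set() so that a name annotated twice in a genome is counted once.
    counts = Counter(
        name for names in genome_to_families.values() for name in set(names)
    )
    return sorted(
        name for name, count in counts.items() if count / n_genomes >= min_fraction
    )
```

For example, with four closely related genomes and `min_fraction=0.75`, a protein annotated in three of them is kept while one annotated in a single genome is not.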
The Yggdrasil project: automated preparation of super-matrix MSA
The extracted families can be processed directly and quickly by Yggdrasil to produce a multiple sequence alignment in which the sequences from the same genome are concatenated. The process combines within-family alignment, trimming and creation of the super-alignment (super-matrix). It can also produce a control tree. The super-matrix can be used directly to reconstruct the phylogeny of the genome collection with suitable methods.
This is a Python program, heavily optimized with process-level parallelization.
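The trimming and concatenation steps can be illustrated with the short sketch below. It is not the Yggdrasil code: the function names, the gap-fraction trimming rule and its 0.5 default are assumptions, and the within-family alignments are taken as already computed by an external aligner.

```python
def trim_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold."""
    names = list(alignment)
    if not names:
        return {}
    width = len(alignment[names[0]])
    keep = [
        i for i in range(width)
        if sum(alignment[n][i] == "-" for n in names) / len(names) <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def build_supermatrix(family_alignments):
    """Concatenate per-family alignments genome by genome.

    family_alignments: list of {genome: aligned sequence}. A genome missing
    from a family receives a block of gaps of that family's width.
    """
    genomes = sorted({g for aln in family_alignments for g in aln})
    rows = {g: [] for g in genomes}
    for aln in family_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences of one family must have equal length")
        width = widths.pop()
        for g in genomes:
            rows[g].append(aln.get(g, "-" * width))
    return {g: "".join(parts) for g, parts in rows.items()}
```

Filling missing families with gaps keeps every row of the super-matrix at the same length, which tree reconstruction programs require.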
Web Applications
PkXplore
The PkXplore web-site is an online workshop oriented toward learning the basics of phylogeny, but it can also be used as a phylogeny explorer for unknown Bacteria and Archaea.
See PkXplore (GitHub) to install a local version. MSGlimpse (GitHub) is the multiple-sequence viewer developed for the project.
TCPriboDB and riboDB
TCPriboDB (GitHub) is the TCP server used by the riboDB web-site (GitHub). TCPriboDB represents sequences in dictionary form. A traditional DBMS would have been another option. There are several reasons for not choosing a DBMS, but the main one is that the TCP server can be reused whenever a Knowledge Base (the new representation of genomes) is created, to allow local queries, and is therefore not limited to riboDB. Requests to TCPriboDB can be made within a local network by sending a structured sentence; the server sends back a set of files containing the response (the extracted sequences matching the keywords). It was developed for the riboDB web-site but can be used for any set of protein families. The databases are easy to build and share as long as the riboDB FASTA grammar is respected.
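The general idea of a dictionary-backed TCP server can be sketched with the Python standard library. Everything specific here is hypothetical and does not describe the actual TCPriboDB protocol or the riboDB FASTA grammar: the header layout `>family|taxon|accession`, the request sentence `GET <family> <taxon-prefix>`, and the reply sent as text on the socket instead of a set of files.

```python
import socketserver


def build_index(fasta_text):
    """Index FASTA records as {(family, taxon): [sequence, ...]}.

    Assumes hypothetical headers of the form '>family|taxon|accession'.
    """
    index = {}
    key = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            family, taxon, _accession = line[1:].split("|", 2)
            key = (family, taxon)
            index.setdefault(key, []).append("")
        elif key is not None:
            index[key][-1] += line
    return index


def answer(request, index):
    """Answer one hypothetical request sentence: 'GET <family> <taxon-prefix>'."""
    parts = request.split()
    if len(parts) != 3 or parts[0] != "GET":
        return "ERROR malformed request\n"
    _, family, prefix = parts
    hits = [
        (key, seq)
        for key, seqs in sorted(index.items())
        if key[0] == family and key[1].startswith(prefix)
        for seq in seqs
    ]
    return "".join(f">{fam}|{tax}\n{seq}\n" for (fam, tax), seq in hits) or "EMPTY\n"


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # One request line per connection; the reply is plain FASTA text.
        request = self.rfile.readline().decode("utf-8").strip()
        self.wfile.write(answer(request, self.server.index).encode("utf-8"))


def make_server(index, host="127.0.0.1", port=0):
    """Create (but do not start) a server; call serve_forever() to run it."""
    server = socketserver.TCPServer((host, port), Handler)
    server.index = index
    return server
```

The dictionary lives in the memory of the server process, so a query is a key lookup rather than a database round trip, and the same server code can hold any protein family collection that follows the agreed header layout.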
The riboDB web-site (GitHub) is the user-friendly interface to the TCP server. Although initially developed for the riboDB project, it can serve as a web server for exploring any collection of protein families. It enables the selection and extraction of protein family sequences from Bacteria and Archaea using fully written or partial names. Its use is not limited to ribosomal proteins, provided a dedicated TCP server has been set up for the other proteins.
Publications
Showing publications 1 to 30 of 138 in total
Multi-proteins similarity-based sampling to select representative genomes from large databases
BMC Bioinformatics. 26(1): 121
Journal article
14C dating of tsunami deposits in arid environments: How challenging can it be? The example of La Graciosa, Canary Islands
Marine Geology. 488: 107607
Journal article
The Inca child of the Quehuar volcano: Stable isotopes clue to geographic origin and seasonal diet, with putative seaweed consumption
Journal of Archaeological Science: Reports. 59: 104784
Journal article
Diversité et diversification des Pectobacteriaceae, une famille de bactéries phytopathogènes d’importance mondiale
Cahiers Scientifiques de la Fondation Pierre Vérots. (10): 19-29
Journal article
Description of a new genus of the Pectobacteriaceae family isolated from water in coastal brackish wetlands of the French Camargue region, Prodigiosinella gen. nov., including the new species Prodigiosinella aquatilis sp. nov
Systematic and Applied Microbiology. 47(2-3): 126497
Journal article
Geographic origin and social status of the Gallic warriors from Ribemont‐sur‐Ancre (France) studied through isotope systematics of bone remains
International Journal of Osteoarchaeology. 33(1): 39-50
DOI: 10.1002/oa.3172
Journal article
Mitigation of the diagenesis risk in biological apatite δ18O interpretation
Palaeogeography, Palaeoclimatology, Palaeoecology. 630: 111812
Journal article
δ2H and δ18O of river water from a high-altitude humid plain of the southern Alps: Implications for the interpretation of the isotopic compositions of bioapatite from humans living close to mountain areas
Journal of Archaeological Science: Reports. 49: 104020
Journal article
Hydrogen isotope measurements of bone and dental tissues from archaeological human and animal samples and their use as climatic and diet proxies
Journal of Archaeological Science. 147: 105676
Journal article
Climate conditions and dietary practices during the Second Iron Age studied through the multi-isotope analysis of bones and teeth from individuals of Thézy-Glimont, Picardie, France
Archaeological and Anthropological Sciences. 14(4): 61
Journal article
A divide-and-conquer phylogenomic approach based on character supermatrices resolves early steps in the evolution of the Archaea
BMC Ecology and Evolution. 22(1): 1-12
Journal article
Les isotopes de l'hydrogène contenus dans les tissus osseux et dentaires de populations archéologiques et leurs usages comme proxys paléoclimatiques et paléoalimentaires
27e édition de la Réunion des Sciences de la Terre.
Conference paper
A Comprehensive Evolutionary Scenario of Cell Division and Associated Processes in the Firmicutes
Molecular Biology and Evolution. 38(6): 2396-2412
Journal article
Climatic change and diet of the pre-Hispanic population of Gran Canaria (Canary Archipelago, Spain) during the Medieval Warm Period and Little Ice Age
Journal of Archaeological Science. 128: 105336
Journal article
A Sample-to-Report Solution for Taxonomic Identification of Cultured Bacteria in the Clinical Setting Based on Nanopore Sequencing
Journal of Clinical Microbiology. 58(6): 1128
DOI: 10.1128/JCM.00060-20
Journal article
Dickeya poaceiphila sp. nov., a plant-pathogenic bacterium isolated from sugar cane (Saccharum officinarum)
International Journal of Systematic and Evolutionary Microbiology. 70(8): 4508-4514
Journal article
δ18O and δ13C of diagenetic land snail shells from the Pliocene (Zanclean) of Lanzarote, Canary Archipelago: Do they still record some climatic parameters?
Journal of African Earth Sciences. 162: 103702
Journal article
Isotopic systematics point to wild origin of mummified birds in Ancient Egypt
Scientific Reports. 10(1)
Journal article
The Gauls experienced the Roman Warm Period: Oxygen isotope study of the Gallic site of Thézy-Glimont, Picardie, France
Journal of Archaeological Science: Reports. 34: 102595
Journal article
Metapopulation ecology links antibiotic resistance, consumption, and patient transfers in a network of hospital wards
eLife. 9
DOI: 10.7554/eLife.54795
Journal article
Taxonomic assignment of uncultured prokaryotes with long range PCR targeting the spectinomycin operon
Research in Microbiology. 170(6-7): 280-287
Journal article
P2857 Quantifying drivers of antimicrobial resistance in a large network of hospital wards: a meta-population approach.
29th ECCMID.
Conference paper
Metapopulation ecology links antibiotic resistance, consumption and patient transfers in a network of hospital wards.
DOI: 10.1101/771790
Preprint
The phytopathogenic nature of Dickeya aquatica 174/2 and the dynamic early evolution of Dickeya pathogenicity
Environmental Microbiology. 21(8): 2809-2835
Journal article
Highly Reduced Genome of the New Species Mycobacterium uberis, the Causative Agent of Nodular Thelitis and Tuberculoid Scrotitis in Livestock and a Close Relative of the Leprosy Bacilli
mSphere. 3(5): e00405-18
Journal article
Changing patterns of human migrations shaped the global population structure of Mycobacterium tuberculosis in France
Scientific Reports. 8(1): 5855
Journal article
Tsunami sedimentary deposits of Crete records climate during the ‘Minoan Warming Period’ (≈3350 yr BP)
The Holocene.
Journal article
d18O-Derived incubation temperatures of oviraptorosaur eggs.
Palaeontology: 1-15
DOI: 10.1111/pala.12311
Journal article
Record of Nile seasonality in Nubian neonates.
Isotopes in Environmental and Health Studies. 53(3): 223-242
Journal article