The software package SiLiX implements an ultra-efficient algorithm for the clustering of homologous sequences, based on single transitive links (single linkage) with alignment coverage constraints.
SiLiX is now incorporated in hundreds of projects in the world (in particular in France) and is widely cited.
SiLiX adopts a graph-theoretical framework to interpret similarity pairs as edges of a network. A very efficient algorithm, based on the Disjoint Sets Data Structure, allows the computation of sequence families with low time and space requirements.
A parallel version of SiLiX, based on MPI, is also available in this package and has been proved to be scalable, so that its allows the study of very large datasets.
SiLiX is already included in the analysis pipeline for HOGENOM.
SiLiX is licensed under the General Public License.
You can download the latest version HERE.
Alternatively, you can clone our git repository HERE.
SYSTEM REQUIREMENTS & DEPENDENCIES
SiLiX is written in ANSI C++ and has been tested on Linux and MacOsX.
* The C++ Boost:program_options package (include files AND shared library) must be installed. This is free and easy to install on every systems with one of the following procedures :
- Getting the Debian package on Ubuntu/Debian Linux :
sudo apt-get install libboost-dev
sudo apt-get install libboost-program-options-dev
- Building and installing from source on other Linux or MacOSX : download Boost and follow this guideline (replace xx_x by current version number) :
tar zxvf boost_1_xx_x.tar.gz
sudo ./b2 install
- Installing Boost on MacOSX with MacPorts
- To maximize performance, the C++ Boost:unordered_map class must be installed
- For the unit tests performed during the checking, CppUnit must be installed.
- To enable parallelism, the MPI library must be installed (tested with openMPI).
Compilation and installation are compliant with the GNU standard procedure
tar zxvf silix-1.x.x.tar.gz
but additional optional configure options :
- enabling the use of MPI library and switch to the parallel version of SiLiX
- enabling the use of Boost:unordered_map class
- specifying a path where the programs must be installed
- verbose mode (not important). Enabling the output of % identity and % coverage information in the file created with —net option
Two programs with man pages are available :
- Filtering+Clustering : The user provides a fasta file and the result file(s) of a all-against-all BLAST search in tabular format ( -outfmt 6 option in blastall, i.e. query id, subject id, percent identity, alignment length, number of mismatches, number of gap openings, query start, query end, subject start, subject end, Expect value, HSP bit score)
silix [OPTIONS] <FASTAFILE> <BLASTFILE>084D4F
To get information or help :
- Clustering : The user provides an input a list of pairs of sequence IDs
silixx <NB> <FILE>
To get information or help :
Running the parallel version of silix
First, the user must have a collection of N blast result files to be processed in parallel.
After having use ./configure with the option —enable-mpi, the user must adopt the classical way to run a program using MPI :
mpirun -np NP silix [OPTIONS] <FASTAFILE> <MULTIBLASTFILE>
with NP the chosen number of processors (in practise, NP<=N)
Many users contact us because they got the following error :
Error in Fam : unable to open file xxxx
where "xxxx" is the id of the first sequence in the blast output file.
What does this error mean ? If you configure silix with the option "—enable-mpi" (see previous section), then the command line interface is slightly different : a MULTIBLASTFILE is expected, not a BLASTFILE !
CLASSICAL SKETCH (important changes since versions 1.2.x)
In the following, we use auxiliary programs that are in the utils/ directory of the package, but not installed.
- Blasting all versus all
formatdb -i seq.fasta -n seq.db
blastp -db seq.db -query seq.fasta -outfm 6 -out blastall.out
or for older versions of Blast :
blastall -p blastp -d seq.db -i seq.fasta -m 8 -o blastall.out
- Running silix. Requires fasta files. The options are the filtering parameters, with the following default values (see Penel et al, BMC Bioinformatics, 2009) :
two sequences in a pair are included in the same family if remaining HSPs (Homologous Segment Pairs) cover at least 80% of the
protein length and if their similarity is over 35% ; a partial sequence is included if its length is ≥100 amino-acids or ≥50% of the length of the complete protein.
silix seq.fasta blastall.out -f FAM > seq.fnodes
Here, we specified a prefix "FAM" for the family ids.
Nota Bene: For running the parallel version, the user displays the list of blast results files to be processed in parallel
mpirun -np 4 silix seq.fasta filenames.txt -f FAM > seq.fnodes
where "filenames.txt" is
- Retrieving family sizes
utils/silix-fsize seq.fnodes > seq.fsize
- Splitting sequences in multiple fasta files
silix-split seq.fasta seq.fnodes
- With —net option, a file blastall.net is created and contains all the pairs taken into account after filtering. A message is sent to STDERR :
".net" are the extension for files of format
".fnodes" are the extension for files of format
SiLiX is developed by :
- Laurent Duret
- Vincent Miele
- Simon Penel
If you use SiLiX in a published work, please cite the following reference :
Miele,V., Penel, S. and Duret,L., Ultra-fast sequence clustering from similarity networks with SiLiX, BMC Bioinformatics
For any bugs, information or feedback, please contact :