Documentation

UrQt (Unsupervised read Quality trimming) is a fast C++ program that trims nucleotides of unreliable quality from NGS data in fastq or fastq.gz format (the file format is detected automatically). The default phred score encoding is 33 = Sanger (ASCII 33 to 126); this can be changed with the option --phred, for example to 64 = Illumina 1.3 or 59 = Solexa/Illumina 1.0.
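To illustrate what the phred offset means, here is a minimal Python sketch that decodes a fastq quality string into phred scores. It is not UrQt code: the function name `decode_phred` is hypothetical, and only the 33 and 64 offsets listed above are shown.

```python
def decode_phred(quality, offset=33):
    """Convert a fastq quality string to a list of phred scores.

    offset is the ASCII code that encodes a score of 0 under the chosen
    convention (33 for Sanger, 64 for Illumina 1.3).
    """
    return [ord(char) - offset for char in quality]


# The same quality character maps to different scores under each encoding.
print(decode_phred("I5!", offset=33))  # [40, 20, 0]
print(decode_phred("h@", offset=64))   # [40, 0]
```

Choosing the wrong offset shifts every score by a constant, which is why the encoding has to match the sequencing platform.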

Single-end

To use UrQt on a single-end fastq or fastq.gz file, use the following command:

Both input and output files must be accessible and writeable by UrQt to prevent errors.

Paired-end

To use UrQt on a paired-end fastq or fastq.gz file, use the following command:

By default UrQt removes empty reads (i.e. reads with zero nucleotides of good quality), and keeps the link between the paired-end files.

We recommend using the option --gz to write the output in fastq.gz format, which saves a significant amount of disk space.

Quality threshold

The quality threshold parameter --t threshold defines the minimum phred score above which a base is considered "good quality".
The default threshold is a phred score of 20.
The classical definition of the quality threshold is obtained with --t 3.0103, the phred score that corresponds to an error probability of 0.5.
Note that UrQt does not remove every base with a phred score below the threshold; instead it finds the best segmentation of the read into a segment of "good quality" framed by two segments of "bad quality".
This parameter is independent of the data and must be chosen according to the goal of the analysis.
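The relation between a phred threshold and the underlying error probability can be checked with a short Python sketch. It relies only on the standard phred definition, Q = -10 log10(p); the function names are hypothetical and this is not UrQt code.

```python
import math


def phred_to_error_probability(phred):
    """Probability that a base call is wrong, given its phred score."""
    return 10 ** (-phred / 10)


def error_probability_to_phred(probability):
    """Phred score corresponding to an error probability."""
    return -10 * math.log10(probability)


print(phred_to_error_probability(20))                 # 0.01
print(round(phred_to_error_probability(3.0103), 4))   # 0.5
print(round(error_probability_to_phred(0.5), 4))      # 3.0103
```

A threshold of 20 therefore treats a base as good when its estimated error probability is below 1%, while 3.0103 only requires that the base be more likely right than wrong.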

Example for setting a threshold of 10:

Homopolymer trimming

With the option --N letter you can define the poly-nucleotide to trim at the head or tail of the sequences.
For letters not present in the standard IUB/IUPAC dictionary, UrQt will perform QC trimming instead of poly-nucleotide trimming.

Example for trimming polyA at the head and tail of the reads:

Verbose mode

By default UrQt displays a minimal amount of information. Use the option --v to display all the options used, progress bars and the estimated time left.

Multi-threading

By default UrQt uses 3 threads (the main thread plus two sub-threads), for a total CPU usage of 100% of one processing unit. Use the option --m thread_number to use more than one processing unit.
Each additional thread uses a new processing unit.

To run UrQt on 10 processing units:

Head or Tail

By default (--pos both), UrQt first finds the best cut-point $k_1 \in [1, l]$, where $l$ is the length of the read, between a segment of "good quality" and a segment of "bad quality" (the tail). It then finds the best cut-point $k_2 \in [1, k_1]$ between a segment of "bad quality" (the head) and a segment of "good quality".
Instead of --pos both, you can use --pos head to trim only the head of the reads or --pos tail to trim only the tail.

Example to trim the reads’ head and tail:

Example to trim only the reads’ head:

Example to trim only the reads’ tail:

Minimum trimmed read size

You can tell UrQt to report only reads longer than n nucleotides with the option --min_read_size n.

Example to report only reads longer than 15 nucleotides:

Maximum number of nucleotides trimmed

You can constrain UrQt to remove no more than n nucleotides at the tail of the reads with the option --max_tail_trim n.
The complementary option for the head of the reads is --max_head_trim n.

Example to trim at most 10 nucleotides at the head of the reads:

Example to trim at most 10 nucleotides at the tail of the reads:

Example to trim at most 10 nucleotides at the head and at the tail of the reads:

By default, empty reads are removed from the output; you can keep them with the option --r.

Classical filter

You can tell UrQt to keep only reads with at least x percent of their length above a phred score of y, using the two options --min_QC_length x and --min_QC_phred y.
This filter is applied after UrQt's trimming procedure.
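The logic of such a filter can be sketched in a few lines of Python. This is an illustration, not UrQt's implementation: the function name is hypothetical, and treating "above" as a strict comparison (and rejecting empty reads) are assumptions.

```python
def passes_classical_filter(phred_scores, min_qc_length, min_qc_phred):
    """Return True if at least min_qc_length percent of the bases
    have a phred score strictly above min_qc_phred."""
    if not phred_scores:
        # Assumption: an empty (fully trimmed) read never passes.
        return False
    good = sum(1 for score in phred_scores if score > min_qc_phred)
    return 100 * good / len(phred_scores) >= min_qc_length


# 4 of 5 bases are above phred 20, i.e. exactly 80% of the read.
print(passes_classical_filter([30, 30, 30, 30, 10], 80, 20))  # True
# Only 3 of 5 bases (60%) are above phred 20.
print(passes_classical_filter([30, 30, 30, 10, 10], 80, 20))  # False
```

Unlike the trimming step, this filter accepts or rejects a read as a whole and does not change its length.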

For example, to retain only reads with a phred score above 20 on at least 80% of their length after trimming:

Nucleotides probability computation

By default, UrQt uses the EM algorithm to compute the proportions of the 4 different nucleotides in a read and to estimate the cut-points.
You can tell UrQt to use fixed proportions of 1/4 for each nucleotide with the option --S.
To compute the proportion of each nucleotide on a sample of n reads, you can use the option --s n.

These two options speed up the computation, but we recommend using the default behaviour (neither option) for a better estimate of the cut-points in the reads.
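To make the difference between fixed and sampled proportions concrete, the following Python sketch estimates nucleotide proportions by plain frequency counting over a sample of reads. It is only an illustration: it is not the EM procedure UrQt uses by default, the function name is hypothetical, and taking the first n reads as the sample is an assumption.

```python
from collections import Counter


def nucleotide_proportions(reads, sample_size=None):
    """Estimate the proportions of A, C, G and T by simple counting.

    Only the first sample_size reads are used when it is given.
    Characters other than A, C, G and T (for example N) are ignored.
    """
    sample = reads if sample_size is None else reads[:sample_size]
    counts = Counter(base for read in sample for base in read if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        # No usable base: fall back to fixed proportions of 1/4.
        return {base: 0.25 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


reads = ["AAAC", "GGTT", "ACGT"]
print(nucleotide_proportions(reads, sample_size=1))
# {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
```

With a sample of a single read the estimate is badly skewed, which illustrates why a small sample trades accuracy for speed.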