Documentation

SUMMARY

Installation and quick start
- Installation
- Quick start
Phylo-MCOA functions
Outlier detection functions
Graphical functions
Additional functions

Installation and quick start

  • Installation

Phylo-MCOA is written entirely in R, so it can be used on any machine and any platform on which R is installed. Phylo-MCOA also requires two external packages, ape and ade4, which will be installed automatically the first time you source the Phylo-MCOA functions.

To install R, go to the R web page at http://www.r-project.org/ and follow the installation instructions.

Once R is installed, download the program from the download page, extract it and save it to your working directory.

Launch R and type:

source("[path-to-working-directory]/pmcoa.R")

This checks whether the required packages are installed and, if they are not, offers to install them automatically.
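
If you prefer to check for the two packages yourself before sourcing the file, the following is a minimal sketch using only base R. It is not the check performed by pmcoa.R; the variable names "needed" and "not.installed" are ours, and the installation line is left commented out so that nothing is installed without your consent.

```r
# Packages that Phylo-MCOA depends on, according to this page
needed <- c("ape", "ade4")

# requireNamespace() returns TRUE when a package can be loaded
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not.installed <- needed[!available]

if (length(not.installed) > 0) {
  message("Missing packages: ", paste(not.installed, collapse = ", "))
  # install.packages(not.installed)  # uncomment to install from CRAN
} else {
  message("ape and ade4 are both installed")
}
```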

You can now start using Phylo-MCOA. See the next section for a description of the functions, and go to the example section to see examples and download datasets.

  • Quick start

If you do not want to go through the whole tutorial but want to start using Phylo-MCOA right away, here is a quick start that you can follow. From now on we assume that a recent version of R is installed on your computer and that you have downloaded the file containing all the functions from the download page.

For simplicity, we assume that the file containing the functions and the dataset(s) you want to analyze are in the same folder. We also recommend starting R from this folder if you are a Linux user (simply type "R" in the console) or, if you are a Windows or Mac user running the GUI version of R, setting this folder as the working directory.

Your dataset should be a list of trees in Newick format, one tree per line. The trees may or may not contain bootstrap values. We assume that the file containing your trees is called "mytrees.tr" (the file extension does not matter).
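
To make the expected input format concrete, here is a small sketch that writes three made-up Newick trees to a temporary file, one per line, and reads them back with read.tree() from the ape package. The trees and the temporary path are illustrative assumptions, not a real dataset.

```r
library(ape)

# Three toy gene trees on the same four species, one Newick string per line
newick <- c("((A,B),(C,D));",
            "((A,C),(B,D));",
            "(((A,B),C),D);")

# Written to a temporary folder so the sketch does not touch your data
path <- file.path(tempdir(), "mytrees.tr")
writeLines(newick, path)

# A file with several trees is read as an object of class "multiPhylo"
trees <- read.tree(file = path)
print(trees)
```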

Here are the basic steps for running Phylo-MCOA.

Open R.
At the R prompt, type:

source("pmcoa.R")  ## source all the functions
trees <- read.tree(file="mytrees.tr")  ## read the trees
step1 <- pMCOA(trees, distance="nodal")  ## first Phylo-MCOA analysis
out1 <- detect.complete.outliers(step1$mat2WR, k=1.5, thres=0.5)  ## detect complete outliers (if any)
## if some outliers are present (out1$outgn or out1$outsp is not NULL):
newtrees <- rm.gene.and.species(step1$trees, out1$outsp, out1$outgn)  ## remove complete outliers
step2 <- pMCOA(newtrees)  ## second Phylo-MCOA analysis
out2 <- detect.cell.outliers(step2$mat2WR)  ## detect cell-by-cell outliers

Note 1: if both "out1$outgn" and "out1$outsp" are "NULL", there are no "complete" outliers. In this case there is no need to call "rm.gene.and.species()". Simply type:

out2<-detect.cell.outliers(step1$mat2WR)

just after :

out1<-detect.complete.outliers(step1$mat2WR)
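
The decision described in Note 1 can be sketched as a small helper. The function name "has.complete.outliers" is ours and is not part of Phylo-MCOA. It only assumes, as stated above, that the result of detect.complete.outliers() carries "outgn" and "outsp" fields that are NULL when nothing was detected. The two mock lists stand in for real results so that the sketch runs without pmcoa.R.

```r
# TRUE if at least one complete outlier gene or species was reported.
# length() is 0 for NULL, so empty vectors are handled as well.
has.complete.outliers <- function(out) {
  length(out$outgn) > 0 || length(out$outsp) > 0
}

# Mock results (hypothetical values, for illustration only)
none <- list(outgn = NULL, outsp = NULL)
some <- list(outgn = "gene7", outsp = NULL)

has.complete.outliers(none)  # FALSE
has.complete.outliers(some)  # TRUE

# With pmcoa.R sourced, the branching would look like this:
# if (has.complete.outliers(out1)) {
#   newtrees <- rm.gene.and.species(step1$trees, out1$outsp, out1$outgn)
#   step2 <- pMCOA(newtrees)
#   out2 <- detect.cell.outliers(step2$mat2WR)
# } else {
#   out2 <- detect.cell.outliers(step1$mat2WR)
# }
```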

Note 2: a single function called pMCOA.complete performs all of the steps above in one call. To use it, simply type:

out<-pMCOA.complete(file="mytrees.tr",distance="nodal", k=1.5, thres=0.5)

You end up with a list of all complete and cell-by-cell outliers.

The next section describes each function used by Phylo-MCOA and gives the description of inputs and outputs of each of them.

Phylo-MCOA functions

Here is a list of the functions used by Phylo-MCOA. Functions to detect outliers and graphical functions are presented later.

  • trees2matrices(trees, distance="nodal", bvalue=0)

This function converts a list of trees into a list of distance matrices. The distance used can be either nodal or patristic. A bootstrap threshold can also be specified, below which nodes are collapsed. This makes it possible to exclude weakly supported splits from the trees.

"trees" : the list of trees of class "phylo" or "multiPhylo" that you want to analyse.
"distance" : the type of distance used to transform the trees into distance matrices. Can be either "nodal" (i.e. number of nodes separating the leaves) or "patristic" (i.e. sum of the branch lengths separating the leaves). Defaults to "nodal".
"bvalue" : This argument is only used if trees contain bootstrap values. It determines under what bootstrap values the nodes should be collapsed. Value 0 (the default) means that no nodes are collapsed.

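The difference between the two distance types can be illustrated with the ape package alone. This is a sketch on a made-up tree, not the code of trees2matrices: it obtains an edge-count distance by setting every branch length to 1, and whether trees2matrices counts edges or the nodes between them (one fewer) is an assumption we cannot confirm from this page.

```r
library(ape)

# Toy rooted tree with branch lengths (made up for illustration)
tr <- read.tree(text = "((A:1,B:2):0.5,(C:1,D:3):1.5);")

# Patristic distance: sum of the branch lengths on the path between two tips
patristic <- cophenetic(tr)

# Edge-count distance: same computation with every branch length set to 1
unit <- tr
unit$edge.length <- rep(1, nrow(unit$edge))
nodal <- cophenetic(unit)

patristic["A", "B"]  # 1 + 2 = 3
nodal["A", "C"]      # 4 edges, i.e. 3 nodes between A and C
```
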
  • gestion.mat(matrices)

This function takes as input the list of matrices created by ’’trees2matrices’’. It reorganizes the rows and columns and checks for missing data. If missing data are present, it computes the missing values as described in the paper. It gives as output a list of matrices without missing values and all matrices rearranged in the same way.

"matrices" : the list of pairwise distance matrices obtained from the ’’trees2matrices’’ function.

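The two tasks described for gestion.mat (putting every matrix in the same species order and filling in missing entries) can be sketched in base R as follows. The function names "align.matrices" and "fill.with.mean" are hypothetical. Filling a missing distance with the mean of that distance over the other genes is only a stand-in chosen for illustration; the method actually used by Phylo-MCOA is the one described in the paper.

```r
# Put all matrices on the same, sorted set of species; absent species give NA
align.matrices <- function(mats) {
  species <- sort(unique(unlist(lapply(mats, rownames))))
  lapply(mats, function(m) {
    out <- matrix(NA_real_, length(species), length(species),
                  dimnames = list(species, species))
    out[rownames(m), colnames(m)] <- m
    out
  })
}

# Stand-in imputation (assumption): mean of the same cell over the other genes.
# A cell that is missing in every gene would stay NaN.
fill.with.mean <- function(mats) {
  arr <- simplify2array(mats)                  # species x species x genes
  avg <- apply(arr, c(1, 2), mean, na.rm = TRUE)
  lapply(mats, function(m) {
    m[is.na(m)] <- avg[is.na(m)]
    m
  })
}

sp <- c("A", "B", "C")
m1 <- matrix(c(0, 2, 4, 2, 0, 4, 4, 4, 0), 3, dimnames = list(sp, sp))
m2 <- matrix(c(0, 3, 3, 0), 2, dimnames = list(c("B", "A"), c("B", "A")))

aligned <- align.matrices(list(m1, m2))
filled <- fill.with.mean(aligned)
filled[[2]]  # species C was missing from the second gene
```
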
  • mat2mcoa(matrices, wtts=NULL, scannf=TRUE, nf=3)

This function does most of the Phylo-MCOA analysis. It takes as input the clean list of matrices generated by "gestion.mat" and performs the multiple co-inertia analysis (MCOA) on this set of matrices. It returns the coordinates of all the genes for all the species in all the dimensions of a multidimensional space; the number of dimensions considered is determined by the "nf" argument. It also gives the reference position of each species. A weight can also be assigned to each tree in the analysis, for example the likelihood value of each tree.

"matrices" : the clean list of pairwise distance matrices returned by the "gestion.mat" function.
"wtts" : array of values, one per tree, specifying the weight that should be assigned to each tree (each matrix) in the MCOA analysis. Small absolute values represent small weights in the analysis. Values are normalized by the function prior to the analysis.
"scannf" : logical. Specifies whether the user should be asked how many dimensions to keep for the Multiple Co-Inertia analysis. If "TRUE", a plot showing the variance in the data explained by each axis is displayed and the user can choose the number of axes accordingly. If "FALSE", the number of axes is specified by "nf".
"nf" : number of axes to keep in the MCOA analysis. Only applicable if "scannf=FALSE".

  • mcoa2WRmat(mcoa)

This function takes as input the mcoa object created by "mat2mcoa". From the coordinates of all the species for all the genes in all the dimensions, it computes a 2-way matrix (the 2WR-matrix) with as many columns as there are gene trees and as many rows as there are species. Each cell in this matrix represents the distance, in the multidimensional space, between the position of a given species for a given gene tree and the reference position of this species. The 2WR-matrix is thus a summary of all the coordinate information carried by the mcoa object created by "mat2mcoa". See the paper for more details on this matrix.

"mcoa" : an object of class ’’mcoa’’ as created by ’’mat2mcoa’’.

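The construction of the 2WR-matrix can be sketched with plain arrays. The function "coords2WR" and the layout of its inputs (a species x genes x axes array and a species x axes matrix of reference positions) are hypothetical and do not reflect the structure of the real mcoa object. The use of the Euclidean distance is also an assumption.

```r
# coords: species x genes x axes array of positions
# ref:    species x axes matrix of reference positions
coords2WR <- function(coords, ref) {
  # Subtract the reference position of each species from all of its gene positions
  diffs <- sweep(coords, c(1, 3), ref)
  # Euclidean distance over the axes, giving a species x genes matrix
  sqrt(apply(diffs^2, c(1, 2), sum))
}

# Toy data: 2 species, 2 genes, 2 axes
ref <- rbind(A = c(0, 0), B = c(1, 1))
coords <- array(0, dim = c(2, 2, 2),
                dimnames = list(c("A", "B"), c("g1", "g2"), c("ax1", "ax2")))
coords["A", "g1", ] <- c(3, 4)
coords["B", "g1", ] <- c(1, 1)
coords["B", "g2", ] <- c(1, 2)

mat2WR <- coords2WR(coords, ref)
mat2WR["A", "g1"]  # 5: species A in gene g1 is far from its reference position
```
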
  • pMCOA(trees, distance="nodal", bvalue=0, wtts=NULL, scannf=FALSE, nf="auto", gene.names=NULL)

This is the first "pipeline" function of Phylo-MCOA. It calls in turn all the functions described so far and returns an object with five attributes: the trees, the initial distance matrices, the clean distance matrices, the mcoa object and the 2WR matrix. The arguments that can be passed to pMCOA are those that can be passed to each of the functions already described.

"trees" : the list of trees of class "phylo" or "multiPhylo" that you want to analyse.
"distance" : the type of distance used to transform the trees into distance matrices. Can be either "nodal" (i.e. number of nodes separating the leaves) or "patristic" (i.e. sum of the branch lengths separating the leaves). Defaults to "nodal".
"bvalue" : This argument is only used if trees contain bootstrap values. It determines under what bootstrap values the nodes should be collapsed. Value 0 (the default) means that no nodes are collapsed.
"wtts" : array of values, one per tree, specifying the weight that should be assigned to each tree (each matrix) in the MCOA analysis. Small absolute values represent small weights in the analysis. Values are normalized by the function prior to the analysis.
"scannf" : logical. Specifies whether the user should be asked how many dimensions to keep for the Multiple Co-Inertia analysis. If "TRUE", a plot showing the variance in the data explained by each axis is displayed and the user can choose the number of axes accordingly. If "FALSE", the number of axes is specified by "nf".
"nf" : number of axes to keep in the MCOA analysis. Only applicable if "scannf=FALSE". If nf="auto" (the default), the number of dimensions kept is the maximum possible: the number of trees analyzed minus 1.

The object returned by the function contains the following attributes:
"trees" All the input trees
"mat.init" Original distance matrices computed from the trees
"mat.ok" Distance matrices after reordering and correction for missing data
"mcoa" The mcoa object
"mat2WR" The 2WR-matrix used for outlier detection

Outlier detection functions

Phylo-MCOA provides two functions for detecting outliers: one for complete outliers, the other for cell-by-cell outliers.

  • detect.complete.outliers(mat2WR, k=1.5, thres=0.5)

"mat2WR" : the 2WR matrix obtained with the ’’pMCOA()’’ function.
"k" : the strength of outlier assignment (see article). The higher this value, the more relaxed the detection (more outliers detected).
"thres" : threshold above which genes or species are considered complete outliers. 0.5 means that a gene or a species is a complete outlier if it is detected as an outlier for more than 50% of the species or genes, respectively.

This function returns an object with the following attributes :
"mat2WR" The 2WR matrix used to detect outliers
"thres" The threshold used
"allgn" All the genes in the 2WR matrix
"allsp" All the species in the 2WR matrix
"scoregn" The outlier score of each gene
"scoresp" The outlier score of each species
"TFgn" Logical telling for each gene if it is or not a complete outlier
"TFsp" Logical telling for each species if it is or not a complete outlier
"outgn" Array containing all the complete outlier genes detected
"outsp" Array containing all the complete outlier species detected

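The role of "thres" can be sketched as follows. The function "complete.from.flags" is hypothetical: it starts from a logical species x genes matrix that already says which cells were flagged, and leaves out how Phylo-MCOA produces those flags (the part controlled by "k"). Only the thresholding step described above is shown, with field names borrowed from the list of returned attributes.

```r
# flags: logical matrix, species in rows, genes in columns (TRUE = flagged cell)
complete.from.flags <- function(flags, thres = 0.5) {
  scoregn <- colMeans(flags)  # share of species for which each gene is flagged
  scoresp <- rowMeans(flags)  # share of genes for which each species is flagged
  TFgn <- scoregn > thres
  TFsp <- scoresp > thres
  list(scoregn = scoregn, scoresp = scoresp, TFgn = TFgn, TFsp = TFsp,
       outgn = if (any(TFgn)) names(scoregn)[TFgn] else NULL,
       outsp = if (any(TFsp)) names(scoresp)[TFsp] else NULL)
}

# Made-up flags for 3 species and 3 genes
flags <- rbind(sp1 = c(g1 = TRUE, g2 = FALSE, g3 = FALSE),
               sp2 = c(TRUE, FALSE, FALSE),
               sp3 = c(TRUE, TRUE, FALSE))

res <- complete.from.flags(flags, thres = 0.5)
res$outgn  # "g1": flagged for all three species
res$outsp  # "sp3": flagged for two genes out of three
```
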
  • detect.cell.outliers(mat2WR, k=1.5, quiet=FALSE)

"mat2WR" : the 2WR matrix obtained with the ’’pMCOA()’’ function.
"k" : the strength of outlier assignment (see article). The higher this value, the more relaxed the detection (more outliers detected).
"quiet" : logical indicating whether the warning should be suppressed. The warning reminds the user that this function should only be used on a matrix from which complete outliers have previously been removed.

This function returns an object with the following attributes:
"mat2WR" The 2WR matrix used to detect cell-by-cell outliers
"matspgn" The doubly normalized 2WR-matrix
"matfinal" The final binary matrix containing 1s for outliers and 0 otherwise
"outcell" All cell-by-cell outliers as a matrix with two columns. Each row represents one cell-by-cell outlier

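The relation between the binary matrix and the two-column list of outliers can be sketched in base R. The toy "matfinal" below is made up, and the order and names of the two columns (species first, then gene) are assumptions; the real "outcell" may be laid out differently.

```r
# Toy binary matrix: species in rows, genes in columns, 1 = outlier cell
matfinal <- matrix(0, nrow = 3, ncol = 4,
                   dimnames = list(c("sp1", "sp2", "sp3"),
                                   c("g1", "g2", "g3", "g4")))
matfinal["sp2", "g3"] <- 1
matfinal["sp3", "g1"] <- 1

# Row and column indices of the outlier cells
idx <- which(matfinal == 1, arr.ind = TRUE)

# One row per outlier cell: the species and the gene concerned
outcell <- cbind(species = rownames(matfinal)[idx[, "row"]],
                 gene = colnames(matfinal)[idx[, "col"]])
outcell
```
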
  • pMCOA.complete(trees, distance="nodal", bvalue=0, wtts=NULL, scannf=FALSE, nf="auto", gene.names=NULL, k=1.5, thres=0.2, quiet=TRUE)

This function runs the complete analysis:
- First phylo-MCOA analysis
- Detection of complete outliers (if any)
- Removal of complete outliers (if any)
- Second phylo-MCOA analysis (if necessary)
- Detection of cell-by-cell outliers

This function returns an object with the following attributes:
"step1" All attributes returned by "pMCOA()" for the first analysis
"step2" All attributes returned by "pMCOA()" for the second analysis (can be the same as step1 if no complete outliers have been detected)
"outcompl" All attributes returned by ’’detect.complete.outliers()’’
"outcell" All attributes returned by ’’detect.cell.outliers()’’

Graphical functions

Phylo-MCOA provides numerous graphical functions to visualize easily and efficiently the results of the Phylo-MCOA analysis.
"plot.2WR" plots the 2WR-matrix OR all the values in the 2WR matrix in 3 barplots.
"plot.2WR.out" plots the 2WR-matrix but with outliers (complete or cell-by-cell) indicated.
"barplot.complete" plots complete outlier genes and species in barplots.
"plot.phylomcoa" plots the cohesion plot from the MCOA analysis.


  • plot.2WR(mat2WR, method="level", scale="none")

This function plots the 2WR matrix in two different ways. It can be informative to look at the complete 2WR-matrix before doing any further analysis, as it gives a visual idea of the overall congruence or incongruence in the dataset.

"mat2WR" : the 2WR matrix.
"method" : the kind of plot desired. Can be "level" for a levelplot or "all" for a triple barplot (see examples below).
"scale" : whether the matrix is normalized to give more weight to the species (scale="species") or to the genes (scale="genes"), or not normalized (scale="none"). Defaults to "none".

Example 1 : Graphical output of the plot.2WR function with method="level" and scale="none", "species" and "genes" from left to right.


Example 2 : Graphical output of the plot.2WR function with method="all". The top barplot is obtained after normalizing the 2WR matrix for genes, the second after normalizing it for species, and the last from the product of the gene-normalized and species-normalized 2WR matrices.


  • plot.2WR.out(OUT, mat2WR=NULL, lwd=2, lty=1, pch=19, col.lines="red", col.points="red")

This function plots the 2WR matrix and indicates the outliers (cell-by-cell or complete) in color. For complete outliers, it draws lines marking the outlier genes and species. For cell-by-cell outliers, it draws a colored point on each outlier cell.
"OUT" : the output of one of the two outlier detection functions (detect.complete.outliers and detect.cell.outliers).
"mat2WR" : a 2WR matrix to plot, if you want one different from the matrix used to compute the outliers.
"lwd" : line width. Default is 2.
"lty" : line type. Default is 1 (solid line).
"pch" : kind of points to plot. Default is filled circles.
"col.lines" : color of the lines. Defaults to red.
"col.points" : color of the points. Defaults to red.

Example : 30 species and 30 trees, 1 complete outlier gene and 1 complete outlier species.


  • barplot.complete(OUT, col.signif="red", col.non.signif="grey")

This function plots all genes and species in two barplots, with complete outliers in a different colour and the detection threshold shown.
"OUT" : the output of the detect.complete.outliers() function.
"col.signif" : color of the bars for significant outliers.
"col.non.signif" : color of the bars for genes or species that are not significant outliers.

Example : 30 genes and 30 species, 1 complete outlier species and 1 complete outlier gene.


  • plot.phylomcoa(mcoa, axe=2)

This function plots the first and second axes (if axe=2) of the cohesion plot (see the description of mcoa in the paper). Each label represents the reference position of a species and each point represents the position of a species in one of the individual genes. Note that the total number of points in this plot is the product of the number of genes and the number of species, so it can become crowded and hard to read for large datasets.
"mcoa" : the output of the mat2mcoa function.
"axe" : the second axis to plot. The first one is always axis 1.
An example of an output is given below, for the dataset of Aguileta et al. 2008 used in the paper.


Additional functions

  • normalize(mat, scale="none")

This function normalizes the 2WR matrix (or any matrix) according to the species (rows) or to the genes (columns). It returns a normalized matrix.
"mat" : a matrix. Here, the 2WR matrix.
"scale" : character string indicating whether the matrix should be normalized and how. If scale="none" (the default), the matrix is not normalized. If scale="species", the matrix is normalized so that the difference between species is increased. If scale="genes", the matrix is normalized so that the difference between genes is increased. See the "plot.2WR" function for an illustration.


  • gen.trees(Ntrees, Ntiptotal, Nspmove, NbweirdGenes=0)

This function generates a list of trees with a specified number of species, of genes and of complete outliers (genes and species).
"Ntrees" : the total number of gene trees.
"Ntiptotal" : the total number of species.
"Nspmove" : the number of outlier species to simulate.
"NbweirdGenes" : the number of outlier genes to simulate. Default set to 0.

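If you only need a small artificial dataset to try the functions, something similar in spirit can be put together with the ape package. This is a sketch and not how gen.trees works internally: it copies one random reference tree several times and replaces the last copy with an unrelated random topology, as a crude stand-in for an outlier gene. The sizes (10 species, 5 genes) and the seed are arbitrary.

```r
library(ape)

set.seed(1)  # arbitrary seed, for reproducibility only

# One random reference tree with 10 species
ref <- rtree(10)

# Five gene trees that all agree with the reference
trees <- rep(list(ref), 5)

# Replace the last one with an unrelated random tree on the same species
trees[[5]] <- rtree(10, tip.label = ref$tip.label)

class(trees) <- "multiPhylo"
print(trees)
```
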

  • add.outliers(trees, nbrep)

This function introduces cell-by-cell outliers in a list of trees.
"trees" : a list of trees or an object of class "multiPhylo".
"nbrep" : Number of cell-by-cell outliers to include.