
Bayesian Phylogenetic Inference: Input Ready - Submission



BioVeL – Biodiversity Virtual e-Laboratory

Workflow Documentation

Name: Perform Short Bayesian Phylogenetic Inference

Capacities Programme of Framework 7: EC e-Infrastructure Programme –
e-Science Environments - INFRA-2011-1.2.1
Grant Agreement No: 283359
Project Co-ordinator: Mr Alex Hardisty
Project Homepage: [http://www.biovel.eu][1]

[1]: http://www.biovel.eu

## 1 Description

The pack contains 3 workflows that perform and validate Bayesian phylogenetic inference and that differ in the kind of input they accept. The pack is called "short" because the workflows require the user to keep the Taverna engine running for the whole duration of the analysis. This can be problematic for large analyses; in that case, see the "Perform Long Bayesian Phylogenetic Inference" pack.

In this pack there are 3 workflows:

1) The user supplies a locally formatted input file (for details of the NEXUS format for MrBayes see [http://mrbayes.sourceforge.net/wiki/index.php/Manual_3.2][2])

2) A GUI (graphical user interface) helps the user define a partitioned evolutionary model on a user-defined MSA (multiple sequence alignment)

3) PartitionFinder defines the best candidate partitioned evolutionary model, starting from a maximal set of possible partitions proposed by the user

Note: a partitioned model is a model that allows different groups of sites (i.e. columns of the MSA) to follow the rules or parameter sets of different evolutionary models. MrBayes allows some of the parameters to be shared across partitions while others differ. Submit workflow 2, based on the GUI, takes full advantage of this feature, while submit workflow 3, based on PartitionFinder, optionally shares branch lengths across partitions and always shares the topology.

All 3 workflows perform 2 validations of the inference: one of the numerical integration (GeoKS) and one of the fit of the model (Posterior Predictive Test).

[2]: http://mrbayes.sourceforge.net/wiki/index.php/Manual_3.2

## 2 General


**2.1 Name of the workflow and myExperiment + BiodiversityCatalogue identifiers**
Name: Perform Short Bayesian Phylogenetic Inference
Download info: [http://www.myexperiment.org/packs/371.html][3]
BiodiversityCatalogue entry:

**2.2 Date, version and licensing**
Last updated: 21/02/13 @ 09:08:18
Version: 2
Licensing: Creative Commons Attribution ShareAlike CC-BY-SA

**2.3 How to cite this workflow**

These results come from the processing of data (personal source or others; cite which one, e.g. ENA) through BioVeL's services ([www.biovel.eu][5]). BioVeL is funded by the EU’s Seventh Framework Programme, grant no. 283359. Please also cite the article [http://journal.embnet.org/index.php/embnetjournal/article/view/557][6]

[3]: http://www.myexperiment.org/packs/371.html
[5]: http://www.biovel.eu
[6]: http://journal.embnet.org/index.php/embnetjournal/article/view/557

## 3. Scientific Specifications

**3.1. Keywords:** Phylogenetic inference, Bayesian, MrBayes, Posterior Predictive test, Convergence test
**3.2. Scientific workflow description:**

In this pack, the steps of a complete phylogenetic inference are divided as follows:

1. Define model framework
2. Estimate parameters of the model framework with a Markovian Integration within a Bayesian framework
3. Validate the convergence of the Markovian Integration with a test on the overlap of the tree posterior distribution of two or more independent runs (GeoKS)
4. Test the adequacy or goodness of fit of the model with a posterior predictive test


**3.2.1 Define model framework:**

MrBayes allows the user to define a partitioned model of evolution for the characters of the MSA in order to perform a given phylogenetic inference. A partitioned model is a model that allows different groups of sites (i.e. columns of the MSA) to follow the rules or parameter sets of different evolutionary models.

Within MrBayes it is possible to define 5 groups of parameters that can be shared or not across user-defined groups of sites:

1. Substitution Matrix
2. State frequencies
3. Site Specific rates
4. Branch lengths
5. Topology


Workflow 1, "All is ready to run", leaves the user the duty of defining and formatting a model with the criteria and tools they prefer.

Workflow 2, "Select Model", is based on a GUI that helps the user define a model in which each of the 5 groups of parameters is independently shared or not across user-defined partitions; the workflow takes care of the formatting.

Workflow 3, "Select Model for Me", is based on PartitionFinder ([http://www.robertlanfear.com/partitionfinder/][7]) and defines the best partitioning of the sites, starting from a maximal partitioning proposed by the user. The decision to share or not across partitions is taken jointly for groups 1, 2 and 3, while branch lengths can be shared across all partitions or none (sharing across only some partitions is not possible), and the topology must be shared across all sites.
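As a rough illustration of how sharing or not sharing parameter groups maps onto MrBayes commands, the sketch below builds a `mrbayes` block from a dictionary of site groups. The helper name `partition_block`, its arguments and the example site ranges are assumptions of this example and are not taken from the workflows. `charset`, `partition`, `set partition` and `unlink` are standard MrBayes commands; the mapping of groups 1 to 3 onto `revmat`, `statefreq` and `shape` is one plausible reading, and the topology stays shared in every case.

```python
def partition_block(charsets, unlink_params=("revmat", "statefreq", "shape"),
                    unlink_brlens=False):
    """Return a MrBayes block that declares charsets and unlinks parameters.

    charsets: dict mapping a charset name to a MrBayes site specification,
              e.g. {"gene1": "1-300", "gene2": "301-600"} (hypothetical data).
    unlink_params: parameter groups estimated separately for every partition.
    unlink_brlens: if True, branch lengths are also estimated per partition.
    """
    names = list(charsets)
    lines = ["begin mrbayes;"]
    for name in names:
        lines.append(f"  charset {name} = {charsets[name]};")
    # One partition scheme that uses every declared charset.
    lines.append(f"  partition by_group = {len(names)}: {', '.join(names)};")
    lines.append("  set partition = by_group;")
    if unlink_params:
        settings = " ".join(f"{p}=(all)" for p in unlink_params)
        lines.append(f"  unlink {settings};")
    if unlink_brlens:
        lines.append("  unlink brlens=(all);")
    lines.append("end;")
    return "\n".join(lines)


block = partition_block({"gene1": "1-300", "gene2": "301-600"})
print(block)
```

Passing `unlink_brlens=True` corresponds to the "none shared" branch-length choice described for workflow 3; leaving it `False` keeps one set of branch lengths for all partitions.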

[7]: http://www.robertlanfear.com/partitionfinder/

![][8]

**3.2.2 Estimate parameters of the model framework with a Markovian Integration within a Bayesian framework:**

All 3 submit workflows send their input, defined and formatted in different ways, to the same service, which starts a MrBayes run (refs 1, 2, 3).

**3.2.3 Validate the convergence of the Markovian Integration with a test on the overlap of the tree posterior distribution of two or more independent runs (GeoKS):**

The results of the MrBayes service are loaded by the retrieval workflow, which performs a convergence test based on the overlap of the tree posterior distributions of two or more independent runs. Details on this link

**3.2.4 Test the adequacy or goodness of fit of the model with a posterior predictive test**:

The service, written in Python, reads each estimated set of parameters from a sub-sample of the overall posterior distribution and simulates a new MSA for each of them (using the evolver utility from the PAML 1.4 package). The simulated distribution is compared with the original MSA on the basis of the sum of the site entropies, as proposed by Bollback (2002) [ref 4].

A histogram is drawn of the distribution of the complexities (log of the sum of site entropies, or maximal possible log-likelihood score) of the data simulated from the posterior distribution of the parameters, and compared with the complexity of the observed data. The 1-alpha highest posterior density (HPD) interval of the distribution shows the region where the simulated data complexity matches the observed one. A larger observed complexity indicates a model that is too simplistic, while the contrary indicates overparametrization of the model.
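The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the complexity score is the multinomial (unconstrained) log-likelihood used by Bollback (2002), computed from the counts of distinct alignment columns; the function names and the toy alignment are hypothetical, and the actual service may compute its statistic and interval differently.

```python
import math
from collections import Counter


def multinomial_statistic(alignment):
    """Unconstrained (multinomial) log-likelihood of an alignment.

    alignment: list of equal-length sequences (the rows of the MSA).
    T = sum over distinct site patterns of N_p * ln(N_p), minus N * ln(N),
    where N_p is the number of columns showing pattern p and N is the
    total number of columns.
    """
    columns = ["".join(col) for col in zip(*alignment)]
    n_sites = len(columns)
    counts = Counter(columns)
    return sum(c * math.log(c) for c in counts.values()) - n_sites * math.log(n_sites)


def hpd_interval(samples, alpha=0.05):
    """Narrowest interval containing a fraction (1 - alpha) of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, math.ceil((1 - alpha) * n))
    # Slide a window of fixed sample count and keep the shortest one.
    best = min(range(n - window + 1),
               key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]


# Toy observed alignment: 3 taxa, 4 sites, every column distinct.
observed = multinomial_statistic(["ACGT", "ACGA", "ACGT"])
```

In use, `multinomial_statistic` would be applied to every simulated MSA, `hpd_interval` to the resulting list of scores, and the observed score compared with the two interval bounds.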

[8]: /download/attachments/8619115/PhyloInfWFColorAll-renzo.png?version=1&modificationDate=1393242192541&api=v2 (BioVeL User Documentation > Define, perform and validate a Bayesian Phylogenetic Inference - Taverna engine always on or from BioVeL Portal > PhyloInfWFColorAll-renzo.png)

**3.2.5 Visualization of the consensus inference**

The Newick representation of the consensus of the inference is sent to the ITOL web service ([see details at http://itol.embl.de/][9]). The service provides an interactive graphical representation of the tree. After manipulation and editing, the tree can be exported in several graphical formats or downloaded in Newick or phyloXML format. Reloading the tree in one of these formats together with a user-defined annotation table allows very powerful graphical representations of the tree (see details at [http://itol.embl.de/help/help.shtml][10]).

[9]: http://itol.embl.de/
[10]: http://itol.embl.de/help/help.shtml


## 4. Technical Specifications

**4.1. Execution environment and installation requirements**

The 3 workflows have all been tested on Taverna Workbench. They should also load on Taverna Lite.
**4.2. Taverna installation, including updates and plug-ins**

[Taverna workbench installation][11]
[Interaction plug-in][12]
**4.2.1 Taverna Dependencies**

All 3 workflows require a local R server (Rserve) to draw the graph output of the Posterior Predictive Test.

Workflow 3, based on PartitionFinder, requires a Python interpreter installed on the path and accessible to the Taverna engine. The script requires no additional Python modules and, although tested only with Python 2.7, should also work with older Python versions.

[11]: https://wiki.biovel.eu/display/doc/Taverna+Workbench
[12]: https://wiki.biovel.eu/display/doc/Customising+for+BioVeL#CustomisingforBioVeL-InstallingtheInteractionplug-in

## 5. Tutorial - how to do it

**5.1. Introduction**

Depending on which of the 3 scenarios presented in 3.2.1 applies, the user chooses one of the 3 submit workflows.
**5.2. Input data**
**5.2.1. Data preparation/format**

Workflow 1: the format of the input (called "NexusFile") is NEXUS with a data block and a mrbayes block. The data block specifies the MSA together with the type of possible states, while in the mrbayes block the user needs to define the evolutionary model and the parameters of the Markovian integration (number of generations, temperature, number of chains for the Metropolis-coupled part of the algorithm, and number of replicate runs used to check convergence). See details in [http://mrbayes.sourceforge.net/wiki/index.php/Manual_3.2][13]
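To make the two-block layout concrete, the following sketch assembles a small NEXUS text with a data block and a mrbayes block. The function name `build_nexus`, the taxon names, the GTR+gamma model line and all numeric values are placeholders chosen for illustration; `ngen`, `nruns`, `nchains` and `temp` are the MrBayes `mcmc` options that correspond to the parameters listed above. The workflow may expect further commands that are not shown here.

```python
def build_nexus(sequences, ngen=100000, nruns=2, nchains=4, temp=0.1,
                datatype="dna"):
    """Return NEXUS text with a data block and a mrbayes block.

    sequences: dict mapping a taxon name to its aligned sequence
               (hypothetical data; names must not contain spaces).
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(sequences)} nchar={nchar};",
        f"  format datatype={datatype} gap=- missing=?;",
        "  matrix",
    ]
    for name, seq in sequences.items():
        lines.append(f"    {name} {seq}")
    lines += [
        "  ;",
        "end;",
        "begin mrbayes;",
        "  lset nst=6 rates=gamma;",  # placeholder model: GTR + gamma
        f"  mcmc ngen={ngen} nruns={nruns} nchains={nchains} temp={temp};",
        "end;",
    ]
    return "\n".join(lines)


nexus = build_nexus({"taxA": "ACGT-A", "taxB": "ACGTTA"}, ngen=50000)
print(nexus)
```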

Workflows 2 and 3 require an MSA in aligned multi-FASTA format (see [http://en.wikipedia.org/wiki/FASTA_format][14]). The recognized gap character is "-".

Workflow 3 also requires a text in which the user defines the maximum set of parts of the alignment, meaning the maximal subdivision into groups that the sites could have. Bear in mind that PartitionFinder tries all possible combinations of the parts, so proposing more than 14 parts is not advised.
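Before submitting, it can help to check locally that the multi-FASTA text really is aligned, i.e. that every sequence has the same length once gaps are counted. The sketch below is one way to do that; the function name `read_aligned_fasta` and the toy records are assumptions of this example and are not part of the workflows.

```python
def read_aligned_fasta(text):
    """Parse multi-FASTA text into {name: sequence} and check the alignment."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            name = fields[0]  # keep only the identifier, drop the description
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    sequences = {n: "".join(parts) for n, parts in records.items()}
    if len({len(s) for s in sequences.values()}) > 1:
        raise ValueError("sequences differ in length; the input is not aligned")
    return sequences


msa = read_aligned_fasta(">a first taxon\nAC-T\n>b\nAC\nGT\n")
```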

[13]: http://mrbayes.sourceforge.net/wiki/index.php/Manual_3.2
[14]: http://en.wikipedia.org/wiki/FASTA_format

The syntax to define a partition is the following:

Each part is defined by an alphanumeric string with no spaces (the name), connected by an equals sign to the list of sites to be included. The description ends with a semicolon.

A range of sites is described by its start and end sites separated by a minus sign (e.g. gene1= 10-30;), with both start and end included.

Discontinuous sections are separated by a space (e.g. gene1= 10-30 34 40;)

Ranges with a step (e.g. “every third base”) are expressed with a backslash followed by the step (e.g. gene1= 10-30\3;)

So a complex but realistic example could be:

utr5= 1-30;

cds_pos1 = 30-200\3 400-500\3;

cds_pos2 = 31-200\3 399-500\3;

cds_pos3 = 32-200\3 398-500\3;

intron= 200-397;

utr3= 501-600;
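The syntax rules above can be checked with a small parser that expands each part into an explicit list of sites. This is only a sketch of the rules as stated: the function name `parse_parts` is hypothetical, the workflow's own parser may behave differently, and accepting either a backslash or a forward slash before the step is a tolerance added by this example.

```python
import re


def parse_parts(text):
    """Expand part definitions such as 'gene1 = 10-30 34 40;' into site lists."""
    parts = {}
    for statement in text.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        name, sep, spec = statement.partition("=")
        name = name.strip()
        if not sep or not re.fullmatch(r"\w+", name):
            raise ValueError(f"invalid part definition: {statement!r}")
        sites = []
        for token in spec.split():
            # start, optional "-end", optional step after "\" or "/"
            match = re.fullmatch(r"(\d+)(?:-(\d+))?(?:[\\/](\d+))?", token)
            if match is None:
                raise ValueError(f"cannot parse site specification: {token!r}")
            start = int(match.group(1))
            end = int(match.group(2) or start)
            step = int(match.group(3) or 1)
            sites.extend(range(start, end + 1, step))  # end is inclusive
        parts[name] = sites
    return parts


parts = parse_parts(r"gene1= 10-13 20; pos1 = 1-10\3;")
```

Listing the expanded sites makes overlaps between parts easy to spot, for example by intersecting the site sets of two parts.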

**5.2.2. Other input**

file name: name of the NEXUS file produced on the basis of the PartitionFinder results

number of MCMCMC generations: integer giving the number of generations to be used in the Markovian integration

number of runs: integer giving how many independent runs of MrBayes are to be performed (the more runs, the better convergence is detected)

branch lengths linked or unlinked across partitions: see 3.2.1 for details

criterion to be used by PartitionFinder: the criteria are AIC, AICc and BIC, all of which are information criteria. In a general statistical framework AICc should always be preferred to AIC, but in phylogenetics there is a dispute about how to count constant sites. If your MSA has very few constant sites, or you selected your locus randomly, use AICc without doubt. For large MSAs AIC and AICc give similar results.
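For reference, the three criteria can be computed from a model's log-likelihood with their standard textbook formulas, as in the sketch below. Taking the number of alignment sites as the sample size is an assumption of this example (and is exactly the point under dispute mentioned above); the function name and the numbers in the usage line are hypothetical.

```python
import math


def information_criteria(log_likelihood, n_params, n_sites):
    """Return AIC, AICc and BIC for one candidate model.

    log_likelihood: maximized log-likelihood (ln L) of the model.
    n_params: number of free parameters (k).
    n_sites: sample size (n), here assumed to be the number of MSA columns.
    """
    k, n = n_params, n_sites
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n_sites > n_params + 1")
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = k * math.log(n) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


scores = information_criteria(log_likelihood=-1000.0, n_params=10, n_sites=500)
```

The correction term shrinks as `n_sites` grows relative to `n_params`, which is why AIC and AICc give similar results for large alignments.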



**5.3. Select Dialogue boxes**

For each web service called, a message tells the user the name of the service and the job id. The message disappears when the user pushes any of the buttons, or if another web service is called before any action is taken. The message lets the user know which point of the workflow has been reached and gives the job id that allows the service centre to identify the job in case of failure.

![][15]

Workflow 1 does not have a specific dialog box.

Workflow 2 has a large and complex, self-explanatory web page. To start, paste an aligned multi-FASTA file into the only visible text window and click confirm. The subsequent choices are explained by the yellow buttons beside each question or pull-down menu.

Workflow 3 has the following dialog box:

[15]: /download/attachments/8619115/Schermata%202013-08-09%20alle%2015.53.43.png?version=1&modificationDate=1376213644377&api=v2 (BioVeL User Documentation > Define, perform and validate a Bayesian Phylogenetic Inference - Taverna engine always on or from BioVeL Portal > Schermata 2013-08-09 alle 15.53.43.png)

partition to be selected: this window allows the user to choose a subset of partitions to be used. At the moment this is not relevant, but when this workflow is coupled with an alignment workflow the window may no longer be redundant.

![][16]




**5.4. Save data/results**


The 3 workflows have the following outputs:

[16]: /download/attachments/8619115/Schermata%202013-08-09%20alle%2015.39.49.png?version=1&modificationDate=1376213738892&api=v2 (BioVeL User Documentation > Define, perform and validate a Bayesian Phylogenetic Inference - Taverna engine always on or from BioVeL Portal > Schermata 2013-08-09 alle 15.39.49.png)

**Plot**

Description:

Result of the Posterior Predictive test. Histogram of the distribution of the complexities (log of the sum of site entropies, or maximal possible log-likelihood score) of the data simulated from the posterior distribution of the parameters, compared with the complexity of the observed data. The 1-alpha highest posterior density (HPD) interval of the distribution shows the region where the simulated data complexity matches the observed one. A larger observed complexity indicates a model that is too simplistic, while the contrary indicates overparametrization of the model.

**Newicktree**

Description:

Small XML document with a `res` tag containing one or more `tree` tags, each with a body that contains a tree in Newick format.

Each tree represents the consensus for a given partition or group of partitions, in the same order in which they are cited in the NEXUS input file of MrBayes.
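A short sketch of how this output could be read with the Python standard library follows. Only the tag names `res` and `tree` come from the description above; the rest of the layout, the function name and the example trees are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_newick_trees(xml_text):
    """Return the Newick strings found in the <tree> elements, in order."""
    root = ET.fromstring(xml_text)
    return [tree.text.strip() for tree in root.iter("tree") if tree.text]


# Hypothetical example of the output port content.
example = ("<res>"
           "<tree>((A:0.1,B:0.2):0.05,C:0.3);</tree>"
           "<tree>(A,(B,C));</tree>"
           "</res>")
trees = extract_newick_trees(example)
```

Because the order is preserved, the n-th string corresponds to the n-th partition or group of partitions cited in the NEXUS input.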

**Geoks**

Description:

Result of the GeoKS test of convergence. XML format

**Phyloinferenceoutput***

Description:

The output of the consensus service is a path from which to obtain a zipped folder that includes all the output of the phylogenetic inference and of the consensus calculation.

**Viewtree***

Description:

Link to visualize the consensus tree of the phylogenetic inference. There can be more than one link if the model assumes more than one tree (combination of topology and branch length set).

Use of the figure at the link should always be accompanied by the appropriate citation:

Letunic I and Bork P (2011) Nucleic Acids Res doi: 10.1093/nar/gkr201 Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy

**Mrbayesoutput***

Description:

original output of mrbayes

**Detailspptest**

Description:

Numeric Result of the Posterior Predictive test.

**Partitionfinderoutput**

Description:

Main output of PartitionFinder, where the preferred partitioned model is described

**Log**

Description:

Log file of all applications used in the workflow. For each external web service it reports the name of the application, the job id required to track errors on the service provider's side, and the last 200 lines of the standard output and standard error of the application, as seen by the local system where the job was run.

**Partitionfinderdetails***

Description:

Path from which to retrieve all the details of the PartitionFinder run

**Geoksdetails***

Description:

Path from which to get all the details of the convergence test calculation.



Outputs marked with * are also exposed as a downloadable link in the web browser as soon as they are produced.
**5.5. Results analysis**

If the GeoKS test fails, i.e. the p-value is lower than the risk accepted by the user, the workflow stops and performs neither the posterior predictive test nor the consensus tree calculation. The user is invited to look at the GeoKS result and to increase the number of generations by a factor of at least 1.5.

If the GeoKS test does not fail, i.e. the p-value is higher than the risk accepted by the user, the user is invited to check that GeoKS does not advise more generations. The program advises more generations if fewer than 300 Effective Sample Size (ESS) trees are present in the posterior distribution; with fewer than 300 ESS trees the test is not guaranteed to have sufficient power to detect convergence correctly. The user should then look at the plot output and check whether the red line (observed MSA complexity) lies within the high posterior density (HPD) region of the distribution. If the red line is to the right of the distribution, a more complex model needs to be considered. If the red line is to the left of the distribution, a simpler model could be entertained, and the branch supports are excessively conservative.

Once both tests are passed, the user can inspect the consensus tree on ITOL, decorate it at will and export the tree to a file for publication. The bare Newick tree can also be extracted using the "save tree" option of ITOL or directly from the newick output port.
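The decision rules in the two paragraphs above can be summarised as a small function. This is an illustrative sketch only: the function name, its arguments and the wording of the messages are invented for this example, and the bounds of the high posterior density region are assumed to have been read from the plot or from the numeric test output.

```python
import math


def review_run(geoks_p_value, accepted_risk, ess_trees, ngen,
               observed, hpd_low, hpd_high):
    """Return a list of suggested next steps for one finished run."""
    if geoks_p_value < accepted_risk:
        # The workflow stops here; rerun with at least 1.5 times the generations.
        return [f"convergence rejected: rerun with at least "
                f"{math.ceil(ngen * 1.5)} generations"]
    advice = []
    if ess_trees < 300:
        advice.append("fewer than 300 ESS trees: run more generations "
                      "before trusting the convergence test")
    if observed > hpd_high:
        advice.append("observed complexity right of the distribution: "
                      "consider a more complex model")
    elif observed < hpd_low:
        advice.append("observed complexity left of the distribution: "
                      "a simpler model could be entertained")
    else:
        advice.append("observed complexity inside the interval: "
                      "no sign of model misfit")
    return advice


print(review_run(0.01, 0.05, 500, 1000000, 0.0, 0.0, 0.0))
```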

## 6. Support


For questions about using the workflow, please write to [support@biovel.eu][17].
For definitions of technical and biological terms, please visit the BioVeL glossary page: [https://wiki.biovel.eu/display/BioVeL/Glossary][18]

[17]: mailto:support@biovel.eu
[18]: https://wiki.biovel.eu/display/BioVeL/Glossary

## 7. Bibliography



1. Huelsenbeck JP, Larget B, Miller RE, Ronquist F. Potential applications and pitfalls of Bayesian inference of phylogeny. Systematic biology. 2002;51(5):673–88. Available at: [http://sysbio.oxfordjournals.org/content/51/5/673.abstract][19] [Accessed January 29, 2013].

2. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics (Oxford, England). 2001;17(8):754–5. Available at: [http://www.ncbi.nlm.nih.gov/pubmed/11524383][20] [Accessed January 29, 2013].

3. Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics (Oxford, England). 2004;20(3):407–15. Available at: [http://www.ncbi.nlm.nih.gov/pubmed/14960467][21].

4. Bollback JP. Bayesian Model Adequacy and Choice in Phylogenetics. Molecular biology and evolution. 2002;19(7):1171–1180.

[19]: http://sysbio.oxfordjournals.org/content/51/5/673.abstract
[20]: http://www.ncbi.nlm.nih.gov/pubmed/11524383
[21]: http://www.ncbi.nlm.nih.gov/pubmed/14960467

BioVeL has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 283359.
