Phylogenetic Analysis I

Phylogenetic Analysis

Phylogenetic methods can be used for many purposes, including analysis of morphological and several kinds of molecular data. We concentrate here on the analysis of DNA and protein sequences.

Comparisons of more than two sequences

Analysis of gene families, including functional predictions

Estimation of evolutionary relationships among organisms

The basic concepts of phylogenetic analysis are quite easy to understand, but understanding what the results mean and avoiding analytical errors can be quite difficult. For detailed coursework you can take my graduate class on the topic.

COG analysis

A "quick and dirty" substitute for phylogenetic analysis

Using BLAST for multiple sequence comparisons

Emphasis is on reciprocal best hits, particularly among three genomes

This is probably an OK way to identify homologs, but it does not have the power of full phylogenetic analysis
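The reciprocal-best-hit idea can be sketched in a few lines. This is only an illustration for two genomes: the gene names and bit scores are made up, and the score tables stand in for parsed BLAST output (the parsing step is not shown). A COG-style analysis would extend the same test to triangles of mutually best hits among three genomes.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the (a, b) pairs in which each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores: genome A searched against B, and B against A.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 198.0, ("b2", "a2"): 175.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2's best hit is b2, but b2's best hit is a3, so a2 is left without a partner. That is the kind of case where a full phylogenetic analysis would tell you more.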

Example with Everyday Objects

The basic model of phylogenetic analysis.

Nearly all methods of phylogenetic analysis share a number of fundamental assumptions. These include:

Homologous sequences are in a multiple sequence alignment.

• Note that homology is an a priori assumption of most phylogenetic methods. If homology is uncertain, then the analytical results should be interpreted with great caution.

The alignment is also referred to as a data matrix

Each column in the alignment is referred to as a character.

The specific residue (nucleotide or amino acid) present in a given sequence is referred to as the character state.

They are assumed to have been derived from a single common ancestor (this statement is actually redundant; by definition homologous sequences must be derived from a common ancestor).

In most cases ancestral sequences are not known, and the ancestral states must be inferred

The ancestral sequences are assumed to have undergone mutation

Modeling mutation accurately is one of the challenges of phylogenetic analysis

They are assumed to be related by a dichotomously branching tree

A priori assumptions include (but are not necessarily limited to):

Accuracy of sequence

That the sequence itself is correct

That it was determined from the correct organism

Violations of this assumption are more common than one might suspect. Several kinds of laboratory errors can result in incorrect annotation of an otherwise legitimate sequence.

That homology has been correctly determined. This applies to both the sequences themselves and the alignment.

Paralogy can cause tremendous confusion.

The assumptions that went into making the multiple sequence alignment are among the assumptions of the phylogenetic analysis that is based on that alignment.

That sufficient similarity remains among the sequences that there is usable phylogenetic information present.

The assumptions of phylogenetic analysis described above

Other critical considerations

The information content of the sequences

Invariant sequences

Saturated sequences

Assumptions particular to the analytical method (this will constitute much of our discussion for the next few lectures)

Markov Model

Note that even if a gene phylogeny is correctly inferred, that phylogeny may not be helpful. For example, because of paralogy, hybridization, introgression, and horizontal gene transfer, gene phylogenies do not always correspond to the phylogeny of the genome as a whole.

The data matrix

Characters

Character states

Multiple sequence alignments as data matrices

The importance of homology assessment
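As a small illustration of the alignment-as-data-matrix idea, the sketch below turns a toy alignment into a list of characters (columns), each holding one character state per sequence. The sequence names and residues are invented for the example.

```python
# Hypothetical toy alignment: three sequences, six aligned positions.
alignment = {
    "seq1": "ACGTAC",
    "seq2": "ACGTTC",
    "seq3": "ATGTTC",
}


def characters(alignment):
    """Return the alignment column by column, one tuple of states per character."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return list(zip(*alignment.values()))


columns = characters(alignment)
print(len(columns))   # number of characters in the data matrix
print(columns[1])     # character states of seq1, seq2, seq3 at the second column
```

Note that the code takes the column assignments as given. Whether residues in the same column are really homologous is exactly the assumption inherited from the alignment.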

Phylogenetic methods can be divided into three general categories

Parsimony

Minimum Distance

Likelihood

Optimality criteria vs. tree-building algorithms

Parsimony

Part of a larger theoretical system referred to as "Cladistics"

Emphasises shared derived character states

The idea is that monophyletic groups can be recognized because they share derived character states ("synapomorphies").

Invariant, unique ("autapomorphic"), and ancestral character states are considered to be uninformative

Search for the tree that requires the smallest number of character-state changes
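A rough way to see which columns carry information for parsimony is the usual counting rule: a site is parsimony-informative when at least two different states each occur in at least two sequences. The sketch below applies that rule to single columns. The function name and labels are hypothetical, and the rule does not polarize states as ancestral or derived, which would require an outgroup.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column by the standard counting rule."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # States present in two or more sequences can potentially group taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "uninformative"


for column in ["AAAA", "AAAG", "AAGG", "ACGT"]:
    print(column, classify_site(column))
```

The column AAAG is variable but uninformative: the single G is unique to one sequence (an autapomorphy) and costs one step on every possible tree.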

Determining the length of a tree

Minimum number of steps for a given character can be determined in one pass

We will look at a simple case with unordered characters

(1) Assign a state to each terminal node.
(2) Visit the first unvisited internal node. Is the intersection of the daughter nodes' state sets non-empty?
  1. Yes: set the internal state to this intersection.
  2. Else:
    1. set the state to the smallest set containing the states of the daughter nodes (their union)
    2. increase the tree length by 1.
(3) Are you at the root of the tree?
  1. No: go to 2.
  2. Yes: go to 4.
(4) Is the state at this node the same as the outgroup state?
  1. Yes: proceed to the next character.
  2. Else: add one to the length of the tree; proceed to the next character.

This tells you the tree length, but does not map the characters onto the tree

Determining a most parsimonious reconstruction requires another pass

This reconstruction will not necessarily be unique!
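The single-pass length calculation can be sketched as a short recursive function. Some assumptions in this sketch: the tree is written as nested two-element tuples, the taxon names and sequences are invented, and the outgroup is simply attached as the sister of the ingroup at the root, so the final comparison with the outgroup state is handled by the same intersection test as every other node.

```python
def fitch(node, states):
    """Return (state set, number of steps) for one character on a rooted binary tree."""
    if isinstance(node, str):              # terminal node: its observed state
        return {states[node]}, 0
    left, right = node
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:                             # non-empty intersection: no extra step
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the minimum number of steps over all characters (unordered states)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


# Hypothetical data: four ingroup taxa and an outgroup, three characters.
alignment = {"A": "ACA", "B": "ACA", "C": "GCA", "D": "GCT", "Out": "GCA"}
tree1 = ((("A", "B"), ("C", "D")), "Out")
tree2 = ((("A", "C"), ("B", "D")), "Out")
print(tree_length(tree1, alignment), tree_length(tree2, alignment))
```

With these data the first tree needs fewer steps than the second, so it is preferred under the parsimony criterion. The function returns only lengths and state sets; it does not choose a particular reconstruction of ancestral states.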

The problem with uncorrected methods

Parsimony is easy to understand and can be a useful analytical method, but the method makes some assumptions that may not be immediately obvious. One of parsimony's most important assumptions is that it is relatively unusual for identical character-states to appear independently in different parts of the phylogenetic tree. In other words, it assumes that convergent evolution is a relatively rare phenomenon.

Unfortunately this is not a valid assumption for biological sequence data.

When the possible number of character states is limited, then one expects to observe convergent evolution. Because DNA has only four possible character states, two unrelated DNA sequences would be expected to have the same nucleotide present in roughly 25% of all positions. Two random aligned sequences would be expected to share somewhat more than 25% sequence identity (why?).
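The 25% figure follows from a simple calculation: if the two sequences are independent and each state i occurs with frequency p_i, the chance that a position matches is the sum of p_i squared. The sketch below evaluates that sum. The skewed frequencies are made up for illustration, and the calculation deliberately ignores any effect of the alignment procedure itself.

```python
def expected_identity(freqs):
    """Probability that two independent draws from freqs give the same state."""
    return sum(p * p for p in freqs.values())


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}   # hypothetical AT-rich composition

print(expected_identity(equal))    # four equally frequent states
print(expected_identity(skewed))   # unequal base composition
```

Equal frequencies give the minimum possible value, so any departure from equal base composition raises the chance identity expected between unrelated sequences.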

Because of this, under some conditions parsimony methods will be inconsistent

Although amino acid data have more character states than DNA and are therefore probably less

Models of DNA Sequence Evolution

Jukes-Cantor (JC)

All substitutions are equally likely

All nucleotides occur with equal frequency

Kimura Two Parameter (K2P)

Transitions and transversions can occur at different rates

All nucleotides occur with equal frequency

	A	C	G	T
A	-	Transversion	Transition	Transversion
C	Transversion	-	Transversion	Transition
G	Transition	Transversion	-	Transversion
T	Transversion	Transition	Transversion	-

In the evolution of real sequences transitions are typically observed more often than transversions.
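The transition/transversion distinction is easy to encode: a change within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any change between the two classes is a transversion. The sketch below counts both kinds of difference between two aligned sequences; the example sequences are invented.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_change(a, b):
    """Classify a pair of nucleotides as identical, transition, or transversion."""
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_changes(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_change(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts


print(count_changes("ACGTAC", "GCTTAT"))
```

Counts like these are the raw material for models such as K2P, which treat the two classes of change separately.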

Example of a substitution probability matrix consistent with the K2P model.

	A	C	G	T
A	0.6	0.1	0.2	0.1
C	0.1	0.6	0.1	0.2
G	0.2	0.1	0.6	0.1
T	0.1	0.2	0.1	0.6

These values represent the probability of the corresponding event occurring within a unit of time, t.

The values in the diagonals are selected such that each row adds up to one. Each row has to add up to one because the substitution matrix takes into account all possible events within the model.
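To make the matrix concrete, the sketch below checks that each row sums to one and then, assuming the process behaves as a Markov chain with the same matrix applying in every time unit, multiplies the matrix by itself to get the probabilities after two units of time. The helper name is hypothetical.

```python
BASES = "ACGT"

# The K2P-style matrix from the notes: rows are the starting nucleotide,
# columns the nucleotide observed one time unit later.
P = [
    [0.6, 0.1, 0.2, 0.1],   # A
    [0.1, 0.6, 0.1, 0.2],   # C
    [0.2, 0.1, 0.6, 0.1],   # G
    [0.1, 0.2, 0.1, 0.6],   # T
]


def multiply(a, b):
    """Plain matrix product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


row_sums = [sum(row) for row in P]
P2 = multiply(P, P)   # probabilities after two time units

print(row_sums)
print([round(x, 2) for x in P2[0]])   # starting from A
```

The A-to-G entry of the two-step matrix includes paths with an intermediate change (for example A to C to G), which is why observed differences underestimate the number of substitutions that actually occurred.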

Felsenstein 1984 and Hasegawa, Kishino, and Yano 1985 (F84/HKY85)

Transitions and transversions occur at different rates

The four nucleotides can occur with different frequencies

General Time Reversible

Each of the six possible substitutions occurs at a different rate, but rates are always symmetrical, i.e., the rate for A being substituted by C is equal to the rate for C being substituted by A.

Nucleotides can occur with different frequencies.
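One common way to write the GTR model is as an instantaneous rate matrix in which the rate from i to j is a symmetric exchangeability r_ij multiplied by the frequency of the target nucleotide j. The sketch below builds such a matrix and checks the time-reversibility condition. The parameter values are invented, and the matrix is left unscaled (no normalization to one expected substitution per unit time).

```python
BASES = "ACGT"


def gtr_rate_matrix(exchangeabilities, freqs):
    """Build Q with Q[i][j] = r_ij * pi_j for i != j; each row sums to zero."""
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                q[i][j] = exchangeabilities[frozenset((i, j))] * freqs[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    return q


# Hypothetical parameters: transitions (A-G, C-T) four times as exchangeable.
r = {frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 1.0,
     frozenset("CG"): 1.0, frozenset("CT"): 4.0, frozenset("GT"): 1.0}
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

Q = gtr_rate_matrix(r, pi)
print(Q["A"]["G"], Q["G"]["A"])
```

In this parameterization the symmetry lives in the exchangeabilities; with unequal base frequencies the rates Q[i][j] and Q[j][i] themselves can differ, while pi_i * Q[i][j] = pi_j * Q[j][i] still holds.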

Modeling site-to-site rate variation

Invariant sites model

Gamma model
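The two approaches can be combined in a simple simulation: each site is either invariant (rate 0) or draws a relative rate from a gamma distribution with mean 1. The sketch below uses Python's `random.gammavariate` with shape alpha and scale 1/alpha so that the mean is 1. The function name and parameter values are hypothetical, and real programs usually approximate the gamma distribution with a few discrete rate categories rather than sampling from it.

```python
import random


def draw_site_rates(n_sites, alpha, p_invariant, rng):
    """Draw one relative rate per site under an invariant-sites plus gamma model."""
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)                                  # site cannot change
        else:
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))  # mean rate of 1
    return rates


rng = random.Random(1)
rates = draw_site_rates(20000, alpha=0.5, p_invariant=0.2, rng=rng)
variable = [r for r in rates if r > 0.0]
print(len(variable) / len(rates), sum(variable) / len(variable))
```

A small alpha (below 1) gives a strongly skewed distribution: most sites change slowly and a few change very fast. A large alpha approaches equal rates across sites.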

Minimum Distance

Pairwise distances can be aggregated into a phylogenetic tree

Search for the tree that minimizes discrepancies among pairwise distances

May or may not use an explicit model of sequence evolution

How the distances are calculated and how the tree is found can be mixed and matched

To know what method is being used, you have to know both how the distance matrix was constructed, and how the tree was determined
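As an example of the "how the distances are calculated" half of the method, the sketch below computes an uncorrected distance (the proportion of differing sites) and the Jukes-Cantor corrected distance, d = -(3/4) ln(1 - 4p/3). The function names and sequences are illustrative, and the tree-building half (for example a clustering algorithm applied to the resulting matrix) is not shown.

```python
import math


def p_distance(seq1, seq2):
    """Uncorrected distance: proportion of aligned sites that differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def jc_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        return math.inf   # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACGTACCT"   # hypothetical: 2 of 20 sites differ
p = p_distance(seq1, seq2)
print(p, jc_distance(p))
```

The corrected distance is always at least as large as the uncorrected one, and the gap widens as sequences diverge. As p approaches 0.75 the sequences are saturated and no distance can be estimated.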

Likelihood

A model of sequence evolution can be used to relate the data to a hypothesis (typically a tree topology).

Maximum likelihood

Search for the tree that maximizes the likelihood function

The idea is to find the tree that is most likely given the data and the model
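The simplest possible case shows what "maximizing the likelihood function" means: two sequences joined by a single branch, under the Jukes-Cantor model. The sketch below evaluates the log-likelihood over a grid of branch lengths and picks the best one. The sequences and the grid are made up for the example; real programs optimize over tree topologies and many branch lengths at once, using numerical optimizers rather than a grid.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by JC distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e    # probability a site shows the same nucleotide
    p_diff = 0.25 - 0.25 * e    # probability of one specific different nucleotide
    # Each site also contributes the equilibrium frequency (1/4) of its first state.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACGTACCT"   # hypothetical: 2 of 20 sites differ
n_sites = len(seq1)
n_diff = sum(1 for a, b in zip(seq1, seq2) if a != b)

grid = [i / 1000.0 for i in range(1, 2001)]   # candidate distances 0.001 .. 2.0
best_d = max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))
print(best_d)
```

For this two-sequence case the maximum coincides with the Jukes-Cantor corrected distance, -(3/4) ln(1 - 4p/3), which is one way to see how distance corrections and likelihood are related.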

Bayesian analysis

Typically uses a Monte Carlo algorithm

Estimates probabilities for branch lengths and tree topologies
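A minimal Metropolis sampler gives the flavor of the Monte Carlo approach. This sketch samples the posterior distribution of a single branch length between two sequences under the Jukes-Cantor model. The data summary (20 sites, 2 differences), the exponential prior with rate 10, and the proposal window are all assumptions chosen for illustration; real Bayesian phylogenetics programs also propose changes to the tree topology.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate):
    """Unnormalized log posterior of JC distance d with an exponential prior."""
    if d <= 0.0:
        return -math.inf
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    if p_diff <= 0.0:
        return -math.inf
    log_lik = (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)
    return log_lik + math.log(prior_rate) - prior_rate * d


def metropolis(n_sites, n_diff, n_steps, rng, prior_rate=10.0, window=0.1):
    """Sample branch lengths with a symmetric sliding-window proposal."""
    d = 0.1
    current = log_posterior(d, n_sites, n_diff, prior_rate)
    samples = []
    for _ in range(n_steps):
        proposal = abs(d + rng.uniform(-window, window))   # reflect at zero
        candidate = log_posterior(proposal, n_sites, n_diff, prior_rate)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        samples.append(d)
    return samples


rng = random.Random(42)
samples = metropolis(n_sites=20, n_diff=2, n_steps=20000, rng=rng)
kept = samples[2000:]                       # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 3))
```

The output of the sampler is a set of values rather than a single estimate, so summaries such as the posterior mean or a credible interval come directly from the samples.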

Properties of analytical methods

Consistency: A method is consistent if it converges on the correct answer as more data are added.
Power: A method is powerful if it can find the correct answer with very few data.
Accuracy: A method is accurate if in multiple trials it produces answers that follow a normal distribution centered on the correct answer.
Precision: A method is precise if in multiple trials it finds answers that are very close to each other (i.e., have low variance).

