Modeling Nucleotide Sequence Evolution

  1. The characteristics of nucleotide sequences
    1. Four nucleotides
      1. A, C, G, T or U
      2. Ambiguity codes: R, Y, S, W, K, M, B, D, H, V, N, X
    2. Arranged in a series with directionality
    3. Some sequences are protein-coding, others are not
    4. Replicated and maintained by dedicated cellular machinery
      1. The characteristics of the replication and repair mechanisms of the cell affect how nucleotide sequences evolve.
      2. For example, some species have biased base compositions
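Base composition and ambiguity codes can be made concrete with a short sketch. The function name `base_composition` is hypothetical, and two choices here are assumptions rather than part of the outline: U is counted as T, and X is treated like N (any base). Ambiguous symbols are split evenly among the bases they may represent.

```python
from collections import Counter

# IUPAC nucleotide codes mapped to the bases each symbol may represent.
# Assumptions: U is counted as T, and X is treated as "any base" like N.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT", "X": "ACGT",
}

def base_composition(seq):
    """Return the fraction of A, C, G and T in seq.

    Ambiguous symbols contribute equal fractional counts to each base
    they may represent; gaps and unrecognized characters are skipped.
    """
    counts = Counter()
    for symbol in seq.upper():
        bases = IUPAC.get(symbol)
        if bases is None:
            continue
        for base in bases:
            counts[base] += 1.0 / len(bases)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no recognizable nucleotides")
    return {base: counts[base] / total for base in "ACGT"}

composition = base_composition("AACGTR")  # R adds half a count to A and to G
```

A strongly biased composition (for example, a GC fraction far from 50%) is one sign that models assuming equal base frequencies may fit poorly.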
  2. Markov models of nucleotide substitution
    1. Markov models assume that there is no "memory" in the system: the probability of future change depends only on the current state of a character, not on its history
    2. The probability of change from state i to state j depends upon the amount of time that has passed and the substitution rate.
    3. Because well determined time points are not usually available for molecular data, the product of rate and time (equivalent to a genetic distance) is more commonly used.
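The points above can be illustrated with a minimal sketch. It assumes the simplest possible rate matrix (every change equally likely, with `rate` meaning substitutions per site per unit time) and computes the transition probabilities as the matrix exponential of rate matrix times time. The helper names (`mat_mul`, `mat_exp`, `transition_probabilities`) are hypothetical, and the exponential is a plain scaling-and-squaring approximation written for readability, not a production routine.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=12):
    """Approximate exp(m): Taylor series on m / 2**squarings, then repeated squaring."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_probabilities(rate, time):
    """P[i][j] = probability that state i has become state j after `time`.

    Assumes all changes are equally likely, so each off-diagonal rate is rate / 3.
    """
    off = rate / 3.0
    q = [[-rate if i == j else off for j in range(4)] for i in range(4)]
    return mat_exp([[x * time for x in row] for row in q])

# Only the product rate * time matters: both calls describe a distance of 0.5.
fast_short = transition_probabilities(rate=2.0, time=0.25)
slow_long = transition_probabilities(rate=0.5, time=1.0)

# No memory: evolving for 0.2 and then for 0.3 is the same as evolving for 0.5.
two_steps = mat_mul(transition_probabilities(1.0, 0.2),
                    transition_probabilities(1.0, 0.3))
one_step = transition_probabilities(1.0, 0.5)
```

Because `fast_short` and `slow_long` agree, rate and time cannot be separated from sequence data alone, which is why their product is used as the distance.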
  3. Mutation
    1. Point mutation
    2. Categories of point mutation
      1. Transitions (purine to purine, or pyrimidine to pyrimidine) & transversions (purine to pyrimidine, or the reverse)
      2. Purines: A & G; pyrimidines: C & T
      3. Although each nucleotide has only one transition but two transversions available, transitions are found to occur more frequently in most sequence comparisons.
      4. This bias is thought to reflect both the chemistry of mutation and the cell's mechanisms for detecting mismatched base pairs, which do not recognize all mismatches equally well.
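The two categories of point mutation can be counted directly from a pair of aligned sequences. The sketch below is illustrative only: the function names `classify` and `count_changes` are hypothetical, and positions containing anything other than A, C, G or T are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(a, b):
    """Classify a pair of unambiguous bases as identical, transition or transversion."""
    if a == b:
        return "identical"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

def count_changes(seq1, seq2):
    """Count identical sites, transitions and transversions between aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"identical": 0, "transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in "ACGT" and b in "ACGT":  # skip gaps and ambiguity codes
            counts[classify(a, b)] += 1
    return counts

counts = count_changes("ACGTACGT", "GCGTTCAA")
# counts == {"identical": 4, "transition": 2, "transversion": 2}
```

Applying `classify` to every possible change from a single base confirms the asymmetry noted above: one transition and two transversions are available from each nucleotide.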
    3. Insertion and deletion events ("indels")
      1. We will not consider indels here, but they are an important part of sequence evolution. Development of effective models of indel evolution is a current area of research.
  4. Superimposed substitutions ("multiple hits")
    1. Critical to models of nucleotide evolution is the realization that, because there are only four possible character states, some sites are expected to undergo multiple superimposed substitutions as genetic distance increases.
      1. In this case, some sites that have undergone change will have reverted to the state they were originally in.
      2. To reliably reconstruct evolutionary relationships among divergent sequences, this expected reversion must be taken into account.
    2. Simple measures of distance that do not take multiple substitutions into account are said to be uncorrected. Corrected distances use one of several models of sequence evolution to estimate the number of sites that have undergone multiple substitutions.
      1. Uncorrected methods tend to underestimate genetic distance.
      2. While uncorrected measures are clearly inadequate, selecting the most appropriate measure to use requires an understanding of models of sequence substitution, and the assumptions that underlie each of them.
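A small simulation shows why uncorrected distances underestimate divergence. This is a sketch under simple assumptions (every site equally likely to be hit, every change equally likely), and the function name `simulate_divergence` is hypothetical. It applies a known number of substitutions to a random sequence and compares the true number of substitutions per site with the observed proportion of differing sites.

```python
import random

def simulate_divergence(n_sites, n_substitutions, seed=0):
    """Return (actual substitutions per site, observed proportion of differing sites)."""
    rng = random.Random(seed)
    ancestor = [rng.choice("ACGT") for _ in range(n_sites)]
    descendant = ancestor[:]
    for _ in range(n_substitutions):
        site = rng.randrange(n_sites)
        # A substitution always changes the base, but it may hit a site that
        # has already changed, or restore the ancestral state.
        alternatives = [b for b in "ACGT" if b != descendant[site]]
        descendant[site] = rng.choice(alternatives)
    differing = sum(a != d for a, d in zip(ancestor, descendant))
    return n_substitutions / n_sites, differing / n_sites

for n_subs in (50, 500, 2000):
    actual, observed = simulate_divergence(1000, n_subs)
    print(f"actual {actual:.2f} substitutions/site, observed difference {observed:.3f}")
```

At small distances the two numbers are nearly equal; as substitutions accumulate, the observed proportion falls further behind the actual value because repeated hits and reversions hide earlier changes.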
  5. Models that assume all nucleotides occur at equal frequencies (25%)
    1. The Jukes-Cantor (JC) model
      1. All substitutions are equally likely.
      2. All nucleotides occur at the same frequency (25%).
      3. One parameter: the rate of substitution (alpha).
    2. Kimura two parameter (K2P) model
      1. Transitions and transversions happen at different rates.
      2. All nucleotides occur at the same frequency.
      3. Two parameters: transition rate (alpha) and transversion rate (beta).
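Both models lead to standard closed-form distance corrections, which are not given in this outline but are widely published. The sketch below uses hypothetical function names; `p` is the observed proportion of differing sites, and `P` and `Q` are the observed proportions of sites differing by a transition and by a transversion, respectively.

```python
import math

def jc_distance(p):
    """Jukes-Cantor corrected distance from the proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k2p_distance(P, Q):
    """Kimura two-parameter distance.

    P: proportion of sites differing by a transition.
    Q: proportion of sites differing by a transversion.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K2P correction is undefined for these proportions")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

p = 0.30
corrected = jc_distance(p)  # larger than the uncorrected 0.30

# With one third of the differences being transitions (the JC expectation),
# the K2P distance reduces to the JC distance.
same_as_jc = k2p_distance(P=p / 3.0, Q=2.0 * p / 3.0)
```

Both corrections grow without bound as the observed differences approach saturation, so estimates for very divergent sequences have large variance.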
  6. Models that allow the four nucleotides to be present in different frequencies
    1. Felsenstein (F84) & Hasegawa-Kishino-Yano (HKY85) models
      1. Two closely related models -- they use different calculations to model essentially the same thing
      2. Transitions and transversions occur at different rates
      3. Nucleotides occur at different frequencies
    2. General time reversible (GTR) model
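      1. Assumes that the relative rate of exchange between each pair of nucleotides is symmetric (and thus the model is time reversible)
      2. In other words, at equilibrium the expected amount of change from A to T equals the expected amount of change from T to A.
      3. Each of the six pairs of nucleotides can have its own exchange rate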
      4. Nucleotides can occur at different frequencies
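One common way to write the GTR model is as a rate matrix whose entry for a change from i to j is a symmetric exchange parameter for the pair multiplied by the frequency of j. The sketch below follows that parameterization; the function name `gtr_rate_matrix` and the parameter values are illustrative assumptions, and the matrix is left unscaled (it is not normalized to one expected substitution per unit time).

```python
BASES = "ACGT"

def gtr_rate_matrix(exchange, freqs):
    """Build a GTR rate matrix as a dict keyed by (from_base, to_base).

    exchange: symmetric exchange parameters keyed by sorted pairs such as "AC".
    freqs: equilibrium base frequencies keyed by base, summing to 1.
    """
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))
                q[i, j] = exchange[pair] * freqs[j]
    for i in BASES:
        # Each diagonal entry makes its row sum to zero.
        q[i, i] = -sum(q[i, j] for j in BASES if j != i)
    return q

# Hypothetical parameter values, chosen only for illustration.
freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
exchange = {"AC": 1.0, "AG": 4.0, "AT": 0.5, "CG": 0.8, "CT": 5.0, "GT": 1.2}
q = gtr_rate_matrix(exchange, freqs)

# Time reversibility: the flux freqs[i] * q[i, j] equals freqs[j] * q[j, i],
# even though q["A", "T"] and q["T", "A"] themselves differ.
flux_a_to_t = freqs["A"] * q["A", "T"]
flux_t_to_a = freqs["T"] * q["T", "A"]
```

Setting the two transition parameters ("AG" and "CT") to one shared value and the four transversion parameters to another gives an HKY85-style matrix.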
  7. Relationship among these models
    1. These models are closely related
    2. Nested models are special cases of more general models.
    3. A model is said to be nested in another model if the simpler model is equivalent to a specific setting of the more complex model.
    4. Thus the JC model is a special case of the K2P model: if the transition and transversion parameters of the K2P model are set to the same value, it is equivalent to the JC model.
    More complex model(s)    Corresponding nested model(s)
    GTR                      JC, K2P, F84, HKY85
    F84, HKY85               JC, K2P
    K2P                      JC
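Nesting can be shown by building the simpler models as restricted settings of a more general one. The sketch below assumes an HKY85-style parameterization with a single transition/transversion ratio `kappa`; all function names are hypothetical and the matrices are left unscaled.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}
EQUAL_FREQS = {base: 0.25 for base in BASES}

def hky_rate_matrix(kappa, freqs):
    """HKY85-style rate matrix: transitions are weighted by kappa, transversions by 1."""
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                weight = kappa if frozenset((i, j)) in TRANSITIONS else 1.0
                q[i, j] = weight * freqs[j]
    for i in BASES:
        q[i, i] = -sum(q[i, j] for j in BASES if j != i)
    return q

def k2p_rate_matrix(kappa):
    """K2P is the HKY85-style model restricted to equal base frequencies."""
    return hky_rate_matrix(kappa, EQUAL_FREQS)

def jc_rate_matrix():
    """JC is K2P with transitions and transversions at the same rate."""
    return k2p_rate_matrix(1.0)

jc = jc_rate_matrix()
off_diagonal = {jc[i, j] for i in BASES for j in BASES if i != j}
# off_diagonal holds a single value: every substitution is equally likely under JC.
```

Each restriction removes free parameters, which is what allows nested models to be compared directly when choosing among them.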
  8. Other models
    1. Several other models of sequence change are available.
      1. Several other special cases of the GTR model have been described and named
      2. Models that are not time reversible (i.e., in which change in one direction is not balanced by change in the other) have been described
      3. Some methods (e.g., LogDet) are available that use dramatically different models of sequence substitution
        1. These methods are of particular interest because they are not nested within the GTR model, and consequently have different underlying assumptions
    2. The methods described here all assume that each position is evolving independently and identically
      1. Site to site rate variation has also been modeled
        1. Invariant sites model
        2. Rate distribution models
          1. Gamma model
          2. Van de Peer's method
      2. Special models for protein-coding sequences are in development
      3. Non-independence is a serious concern. Some work has been done to examine the effects of non-independence of sites, but this needs more attention.
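Site-to-site rate variation can be sketched by drawing a relative rate for each site. The example below combines an invariant-sites proportion with gamma-distributed rates; the function name `sample_site_rates` and the parameter values are illustrative assumptions. The gamma distribution uses shape `alpha` and scale `1 / alpha` so that its mean is 1, and variable sites are rescaled so the mean rate over all sites stays at 1.

```python
import random

def sample_site_rates(n_sites, alpha, p_invariant=0.0, seed=0):
    """Draw one relative substitution rate per site.

    alpha: gamma shape parameter (small values mean strong rate variation).
    p_invariant: proportion of sites that cannot change (rate 0); must be < 1.
    """
    if not 0.0 <= p_invariant < 1.0:
        raise ValueError("p_invariant must be in [0, 1)")
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_invariant:
            rates.append(0.0)
        else:
            # Mean-1 gamma draw, rescaled so the overall mean rate remains 1.
            gamma_rate = rng.gammavariate(alpha, 1.0 / alpha)
            rates.append(gamma_rate / (1.0 - p_invariant))
    return rates

rates = sample_site_rates(20000, alpha=0.5, p_invariant=0.2)
mean_rate = sum(rates) / len(rates)
fraction_invariant = sum(r == 0.0 for r in rates) / len(rates)
```

Phylogenetic software typically approximates the gamma distribution with a few discrete rate categories rather than sampling a rate for every site; random draws are used here only to keep the sketch short.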
    3. Lineage-specific models of sequence evolution
      1. A further complication is introduced if different lineages are evolving differently.
      2. Base compositional bias
        1. LogDet has been used successfully in cases where base compositional bias would violate the assumptions of the GTR family of models.
      3. Linked Markov Chains and other variations on Markov models can also model lineage specific evolution
    4. Are there other aspects of sequence evolution that can be modeled?
      1. Indels
      2. Constraints imposed by RNA secondary structure
      3. Others?
  9. How to choose what model to use
    1. In general, use the simplest model that adequately explains the data
    2. If a more complex model improves the tree score (or another measure of goodness of fit to the data) by more than its extra parameters would be expected to achieve by chance, then use the more complex model.
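One widely used way to make this comparison for nested models is a likelihood ratio test, sketched below for the case where the complex model has exactly one extra parameter (for example, K2P versus JC). The function name and the log-likelihood values are hypothetical. Twice the difference in log-likelihood is compared with a chi-square distribution with one degree of freedom, whose tail probability can be written with the complementary error function.

```python
import math

def lrt_one_extra_parameter(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one free parameter.

    Returns (test statistic, p-value). Assumes the statistic follows a
    chi-square distribution with 1 degree of freedom under the simpler model.
    """
    statistic = 2.0 * (lnl_complex - lnl_simple)
    if statistic < 0.0:
        raise ValueError("the more complex model should not fit worse than the nested one")
    # Chi-square (df = 1) tail probability: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods for the same tree under JC and K2P.
statistic, p_value = lrt_one_extra_parameter(lnl_simple=-1000.0, lnl_complex=-995.0)
prefer_complex = p_value < 0.05
```

Comparisons involving more than one extra parameter need the chi-square distribution with the matching degrees of freedom, which this sketch does not cover.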

Hillis, D.M., C. Moritz, and B.K. Mable, eds. 1996. Molecular Systematics, 2nd Ed. Sinauer Associates, Inc., Sunderland, MA.

Li, W.-H. 1997. Molecular Evolution. Sinauer Associates, Inc., Sunderland, MA.
