Multiple Sequence Alignment

The Problem

Align several homologous sequences

Analogous to pairwise sequence alignment, but applied to more than two sequences

This is a computationally challenging problem

Consider the calculations involved in the Smith-Waterman algorithm; in principle it would be possible to extend the Smith-Waterman algorithm to an n-dimensional matrix and calculate an exact alignment of multiple sequences. In practice this is not computationally feasible.

What are the applications of multiple sequence alignment

Phylogenetic analysis

A powerful technique in multiple sequence analysis, it requires a multiple sequence alignment

Infer relationships both within and among gene families

Can provide functional information

Distinguish orthologs from paralogs

Gene duplication is a critical phenomenon in genome evolution

Infer organismal relationships

Study the evolution of genes, gene families, and genomes

Identify patterns of conservation

Conserved sites, key functional domains are conserved, less important regions are much more variable

Development of a generic representation of a protein family (a profile)

This can be used for database searching, to quickly add additional sequences to the alignment, and to characterize the protein.

Substitution patterns

Amino Acid

PAM - Used an alignment of proteins with a "known" phylogeny

BLOSUM - Only uses ungapped alignments

Nucleotide

Infer models of nucleotide evoution

Why is multiple seqence alignment difficult?

If sequences are all the same length, alignment is trivial:

KNMITGAAQMDGAILVVAATDGPMPQTREHVLLARQVEVP KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP KNMITGAAQMDGAILVVSATDGAMPQTKEHILLARQVGVP KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP KNMITGAAQMDGAILVVSSADGPMPQTREHILLARQVGVP KNMITGAAQMDGAILVVSAPDGPMPQTREHILLARQVQVP KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP KNMITGAAQMDAAILLVAADSGAEPQTKEHLLLAQRMGIK KNMITGAAQMDGAILLVAADSGPEPQTREHILLAKQVGVA KNMITGAAQMDGAILVVAGTDGPMPQTREHILLARQVNVP KNMVTGAAQMDGAILVVAATDGPMPQTREHILLAHQVGVP KNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVP KNMITGAAQMDGAIIVVAATDGQMPQTREHLLLARQVGVQ KNMITGAAQMDGGILVVSGPDGPMPQTKEHILLARQVGVP KNMITGTAPLDGCILVVAANDGPMPQTREHLLLARQIGVE KNMITGAAQMEGAILVVAATDGPMPQTREHLLLARQVGVP KNMITGAAQMDGAILVCSAADGPMPQTREHILLARQVGVP KNMVTGAAQMDGAIIVVAATDGPMPQTREHILLARQVNVP KNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVP KNMITGAAQMDGAILVVASTDGPMPQTREHILLSRQVGVP KNMITGAAQMDGAILVVSAADGPMPQTREHILLSKNVGVP KNMITGAAQMDGAILVVSATDSVMPQTREHILLARQVGVP

This sequence alignment is unambiguous because there is no length variation among the sequences. No indels are needed to make the alignment, and the ungapped sequences can simply be arranged together.

However, if the sequenes are of various lengths, problem is more complex, potentially very complex:

RGSALKAVEAPNDPNHEA......YKPIQELLDAMDN.....YIPDPQRDVDKPFL RGSALKALEGDAAYIEKVR..........ELMQAVDD.....NIPTPEREIDKPFL RGSALKALE.....IEKVR..........ELMQAGDAAYVDDNIPTPEREIDKPFL RGSALLALEEMHKNPKTKRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL RGSALLALEQMHRNPKTRRGENEWVDKIWELLDAIDE.....YIPTPVRDVDKPFL KGSALQALEALQANPKTARGEDKWVDRIWELLDAVDS.....YIPTPERATDKTFL RGTARNALESPSKDIN....APEY.KCILELMNAVDE.....YIPTPQRAVDQPFL KGSALQALE....NAE....DEEKTKCIWELLQAMDD.....YIPAPERDIDKPFL KGSAFGAMS....NPE....DPESTKCVKELLESMDN.....YFDLPERDIDKPFL RGSAFAAMS....KPD....DPAATKCLDELLDTMDK.....YFVIPERALDKPFL KGSALNALN..........GDPEGEKAIMELMDAVDD.....YIPEPVRDVDKPFL RGSALKALE..........GDPEYQDKVMELMNACDE.....YIPLPQRDTDKPFL KGSAKLALEGDKGEL....GEGA....ILKLAEALDT.....YIPTPERAVDGAFL MGSALCALEGRQPEI....GEQA....IMKLLDAVDE.....YIPTPERDLNKPFL RGSALSALQGTNDEI....GRQA....ILKLMDAVDE.....YIPDPVRVLDKPFL VGSALCALEGRDPEL....GLKS....VQKLLDAVDT.....YIPVPARDLEKPFL FGSALCALEGKQPEI....GEEA....VKQLLEVLDN.....KFVIPERKVNEEPM KGSAKLALEGDTGEL....GEVA....IMSLADALDT.....YIPTPERAVDGAFL QGSALGALN..........GVEKWEDKVMELMEAVDT.....WIPLPPRDVDKPFL RGSALKALE..........GDAEWEAKILELAGFLDS.....YIPEPERAIDKPFL QGSALKALE..........GEPEWEAKILELAAALDS.....YIPEPQRDIDKPFL KGSALKALE..........GDAEWEAKIFELMDAVDE.....YIPTPERDTEKPFM YGSALKALE..........GDPKWEAKIHDLIKAVDE.....WIPTPTREVDKPFL QGSALKALE..........GDSKYEDIIMELMNTVDE.....YIPEPERDTEKPLL RTSALKALE..........GDPQWVKSVEDLMDAVDE.....YIPDPVRDKDKPFL RVSALKALE..........GDAKWVASVEELMNAVDE.....SIPDPVRETDKPFL AGSALMALEKMASEPKLIRGKDDWVDCIYSLMDAVDA.....YIPTPERAIDKPFL AGSALQAVDAMIAKGTTKQSENEWVDKIHKLMAEVDA.....FIPTPERIIDKPFL AGSALKAVEALQANSSIGKGEDEWVDKIHDLVAQVDE.....YIPAPERDIDKPFL SGSALKALDFLTENPKTTRGENDWVDKIHALMDEVDA.....YIPTPERDIDKGLL

Insert gaps to take indels (insertion/deletion events) into account

One basic strategy is to first align sequences that are most similar to each other

Once similar sequences have been aligned, align the resulting clusters of sequences against each other

This would be very simple in cases where there are no conficts among the alignments

However, in many cases the best position to place an indel is ambiguous

Ideally, one would know the phylogeny for the sequences; this would help infer the sequence of indels

Unfortunately one normally needs a multiple sequence alignment to make inferences about how the sequences are related

Most alignment algorithms make a quick approximation of phylogeny, and then base alignments on these

Alternatively, if external information on phylogeny is available, can use this information to guide alignment algorithms

Algorithms

Exact algorithms

Needleman-Wunsch, Smith-Waterman, etc. can be extended to an n-dimensional matrix

Not practical at present, so divide and conquer...

Cluster alignment algorithms

Clustal W, Clustal X

Calculate all possible pairwise alignments, record the score for each pair

Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining)

Find the two most closely related sequences

Align these sequences (algorithm: Smith-Waterman).

Calculate a consensus of this alignment

Replace the two sequences with the consensus

Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).

Iterate until all sequences have been aligned

Expand the consensus sequences with the (gapped) original sequences

Report the multiple sequence alignment

PILEUP (GCG)

Pileup is conceptually similar to Clustal, but it uses UPGMA to determine the guide tree, and Needleman-Wunsch for alignments.

Successive alignment algorithms

MultAlign

Motif-based alignment and related algorithms

These tools generally perform both database searching and alignment

Rely on motif libraries

Considerations in preparing alignments

Amino-acid vs nucleotide alignments

As with pairwise sequence alignment, amino acid sequences are more data-rich for alignment

Aligning divergent structural RNA sequences (as with rRNA or intron sequences) can be problematic

Spacer regions are generally under minimal selection, so they diverge very quickly and are difficult or impossible to align at large phylogenetic differences

The causes of sequence similarity

Sequence similarity does not guarantee homology

Possible reasons for sequence similarity

Phylogeny

Descent from a common ancestor

Such sequences are homologous

Function

Functionally similar sequences may undergo convergent evolution

While lengthy sequences are unlikely to be similar because of convergent evolution, sequence motifs may well be

Physical constraints

Membrane-spanning domains have to be hydrophobic, etc.

Repetitive sequences

The information content of repetitive sequences is low, so similarity may be deceptive

A fundamental goal of sequence analysis is to distinguish among these influences

Profile Analysis

Hidden Markov Model (HMM) alignment methods

meme, hmmalign

What is a Hidden Markov Model?

A statistical model of a system that can be in different states

The state of the system can change

Prior states held by the system do not affect the probability of future state changes

Clustal: Available from the EMBL Bioinformatics outstation (ftp://ftp.ebi.ac.uk//pub/)

Bioinformatics Home

Syllabus

Links

Reading