Become familiar with the BLAST family of alignment algorithms, including the empirical properties of estimating error and significance. Get a sense of the size and scope of GenBank and study how different BLAST variants behave when used to search GenBank.
By the end of 2002 the GenBank database held over 28 × 10^9 base
pairs of DNA sequence data. Some of this has been annotated, but much of
it has either no annotation or, even worse, incorrect annotation. Given this
tremendous quantity of uncharacterized information, how can one find
sequences of interest?
Among the most important algorithms currently used to search sequence databases (as of 2003) is a family of algorithms based on BLAST, the "Basic Local Alignment Search Tool." BLAST performs particularly well with protein-coding sequences. BLAST is much faster but potentially less sensitive than an older algorithm called FASTA, which is also used to search large sets of data. FASTA may indeed prove more effective with non-protein-coding DNA sequences.
Searching a large sequence database is a difficult problem because there are many ways the query sequence might align with sequences in the database. To expedite this process, BLAST looks for small regions of perfect match (words) between the query and target sequences, and then examines the sequence adjoining these regions to see whether the match can be extended into a longer, high-scoring stretch.
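The seed-and-extend idea described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's implementation: the function names, the scoring values (match +1, mismatch -3), and the drop-off threshold are assumptions chosen for the example, and the extension is ungapped.

```python
def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where a word of word_size matches exactly."""
    index = {}
    for j in range(len(target) - word_size + 1):
        index.setdefault(target[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, i, j, word_size, match=1, mismatch=-3, drop_off=5):
    """Extend one seed in both directions without gaps.

    Extension in a direction stops when the running score falls more than
    drop_off below the best score seen so far (illustrative values only).
    Returns (query_start, target_start, length, score).
    """
    best = score = word_size * match

    # Extend to the right of the seed.
    right = 0
    k = 0
    while i + word_size + k < len(query) and j + word_size + k < len(target):
        same = query[i + word_size + k] == target[j + word_size + k]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > drop_off:
            break

    # Extend to the left, starting again from the best score so far.
    score = best
    left = 0
    k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        same = query[i - k - 1] == target[j - k - 1]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > drop_off:
            break

    return (i - left, j - left, left + word_size + right, best)


query = "TTACGTACGTAGG"
target = "CCCACGTACGTACC"
for q_pos, t_pos in find_seeds(query, target, word_size=8):
    print(extend_seed(query, target, q_pos, t_pos, word_size=8))
```

Indexing the words of the target once, and only then scanning the query, is what makes the search fast: most of the database is never examined in detail because it shares no word with the query.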
Consider the following DNA sequence:
ATTTGGAGCATCATGCCTGCAAACTCCGAGAAGGAGCACCTCTCCATCGT GATTTGCGGCCATGTCGACAGTGGCAAGAGCACCACAACAGGGCGGCTC TCTTCGAGCTCGGTGGCCTTCCAGAGCGCGAACTTGACAAGCTGAAGCA GAGGCTGAGCGTCTTGGGAAAGGTTCTTTCGCCTTTGCATTCTACATGGA CCGGCAGAAGGAGGAGCGTGAGCGTGGGGTGACCATCGCTTGCACCACG AGGAGTTCTACACCGAGAAGTGGCACTACACAATCATTGATGCACCGGGC CACCGTGATTTCATCAAGAACATGATCACGGGTGCATCCCAGGCTGATGT CGCACTCATCATGGTTCCCGCAGACGGAAACTTCACGACAGCAATCGCCA AGGGCAACCACAAGGCGGGGGAAATCCAGGGCCAGACCAGGCAGCATTCC CGGCTCATCAACTTGCTTGGCGTGAAGCAGATCTGCATTGGCGTGAACAA GATGGACTGCGACACGGCGGCATACAAGCAGGCCCGTTATGATGAGATTG CAAATGAGATGAAGAGCATGCTCGTGAANGTCGGGTGGAAGAAGGACTTT ATTCGAGAAAACACACCCGTGATGCCCATCT
This DNA sequence was obtained by arbitrary screening of a cDNA library. We would like to learn more about the sequence. One easy way of gaining insight into a sequence is to find out whether or not it resembles sequences previously characterized. BLAST is capable of comparing the sequence to the GenBank database maintained by NCBI (the National Center for Biotechnology Information, a branch of the NIH National Library of Medicine). We will use the sequence above as a query sequence, and use BLAST to compare it to the GenBank database. The actual analysis will be run on a massively parallel supercomputer operated by NCBI as a service to the research community. There are several ways to submit searches to the BLAST server; we will start with the web interface.
First, copy the sequence. Then go to the NCBI web site,
follow the link for BLAST on the NCBI home page, and then the link for
Nucleotide-nucleotide BLAST [blastn].
The page will be replaced with a page called "Formatting BLAST." Notice that it provides you with a BLAST ID number, an estimate of how long it will take for the results to be returned, and some formatting options.
Return to the "Formatting BLAST" page and click on the FORMAT button. The results of your search will be displayed. There is information on how to cite this analysis in scientific publications and on the nature of your search, followed by a set of colored lines that illustrate the results of the search, then text describing the results, and below that more text showing examples of the best matches.
Mouse over the colored lines and notice how the display changes. Look at how this information correlates with the text further down the page, and notice that there are links to the sequences which the query sequence matched. Take some time here and try to look at all of the features on this web page.
Some questions about the test sequence:
Take a moment to read over the bit score, e-value, and each of the individual matches. Follow some of the links provided in order to get a sense of the web of information.
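To get a feel for how the bit score and the e-value relate, the short Python sketch below uses the standard Karlin-Altschul relations: a bit score S' = (lambda * S - ln K) / ln 2 for a raw score S, and an expected number of chance hits E = m * n * 2^(-S') for a query of length m and a database of length n. The function names are hypothetical, the numbers in the example are made up for illustration, and the sketch ignores the length corrections that the real program applies.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score.

    lam and k are the statistical parameters of the scoring system;
    the values passed in are assumptions supplied by the caller.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative numbers only: a 600-base query against 28 x 10^9 bases.
for bits in (40, 50, 60):
    print(bits, expect_value(bits, 600, 28e9))
```

Two things are worth noticing when you compare this to the results page: each additional bit of score halves the e-value, and the same alignment becomes less significant as the database grows.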
Recall that the sequence was from a cDNA library. That means that it is probably a protein-coding sequence. BLAST is more sensitive to subtle patterns in amino acid sequences than in nucleotide sequences, so it can be helpful to try a search that takes advantage of the information that this is a protein-coding sequence. Furthermore, we do not know the reading frame of this sequence, so we will want to search a translation of the sequence in all six frames against a protein database.
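The six reading frames are the three possible codon offsets on the given strand plus the three on its reverse complement. The Python sketch below shows what such a translation involves, using the standard genetic code. The function names and the frame labels (+1 to -3) are choices made for this example, and any codon containing an ambiguous base such as N is reported as X.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Return a dict mapping frame labels ('+1'..'-3') to translations."""
    dna = dna.upper().replace(" ", "")
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

A frame that is really protein-coding tends to show long stretches without stop codons, which is one informal way to guess the correct frame before running the search.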
Because you are working with a nucleotide sequence, you will need to perform a translated search. Return to the BLAST home page and under Translated BLAST Searches select Nucleotide query - Protein db [blastx].
Notice that there are a number of other options you can select, but do
not change them.
Some questions regarding the translated blast search:
Please recall that each amino acid is encoded by three nucleotides, so
that an amino acid sequence has one-third as many characters as its
corresponding nucleotide sequence. Given that information,
and the fact that the genetic code is degenerate:
Consider the different options, including parameters, that can be set from the BLAST page. Can you determine what effect each of these will have? Some control the way in which the BLAST results are formatted, while others control how the algorithm itself will function.
Change the word size from 11 to 7 and repeat the BLASTN search. Are the
results identical to the word size 11 search? How do the two searches differ?
What happens if you use a word size of 15?
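One way to reason about the word size questions is to count how many exact word matches (seeds) two related sequences share at each word size. The Python sketch below is only an illustration: the helper names are hypothetical, the query is the first 30 bases of the test sequence, and the target is the same stretch with every tenth base changed, so no exact run along the true alignment is longer than 9 bases.

```python
def count_seeds(query, target, word_size):
    """Count pairs of positions where query and target share an exact word."""
    words = {}
    for j in range(len(target) - word_size + 1):
        word = target[j:j + word_size]
        words[word] = words.get(word, 0) + 1
    return sum(
        words.get(query[i:i + word_size], 0)
        for i in range(len(query) - word_size + 1)
    )


def mutate_every(seq, step):
    """Return seq with every step-th base replaced by a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(
        swap[base] if (i + 1) % step == 0 else base
        for i, base in enumerate(seq)
    )


query = "ATTTGGAGCATCATGCCTGCAAACTCCGAG"
target = mutate_every(query, 10)  # roughly 90% identical to the query

for word_size in (7, 11, 15):
    print(word_size, count_seeds(query, target, word_size))
```

Smaller words find more seeds, so diverged matches are less likely to be missed, at the cost of more extensions to evaluate. Larger words make the search faster but can skip a related sequence entirely if it never matches the query exactly over a full word.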
Additional unknown sequences are available here. Choose at least two of these and perform BLAST searches. What observations can you make about how to use BLAST most effectively?
Running BLAST from a command line interface
NCBI makes available multiple BLAST clients, blastcl3 and blastall, that can be used to launch BLAST searches from a local computer without using a web interface. blastcl3 submits BLAST searches to NCBI for computing in a manner similar to submitting them via the web. blastall, on the other hand, requires a local sequence database, available at: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Read the online documentation for blastcl3(1). If your computer does not have a manual page for blastcl3, perform a search on the web to find a copy of the manual page.
Look in your sequences directory (created in exercise 1) and consider how one would perform a BLAST search on all of your sequences at one time. What if some files contain nucleotide information while others contain amino acid sequences? Write a script to perform a meaningful BLAST search for all of the sequences in this directory. The outputs of your BLAST searches should go into a separate directory, and all amino acid searches should be kept separate from nucleotide searches.
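As a starting point for this exercise, here is one possible outline in Python. It is a sketch under several assumptions, not a finished solution: it guesses the sequence type from the letters in the first record of each file, it assumes the legacy command line flags -p (program), -d (database), -i (input file) and -o (output file), and it assumes the database names nt and nr. Check all of these against the manual page before relying on them. The function names are hypothetical, and the functions only build the commands rather than running them.

```python
from pathlib import Path

NUCLEOTIDE_LETTERS = set("ACGTUN")

# Assumed program and database for each sequence type.
SEARCHES = {"nucleotide": ("blastn", "nt"), "protein": ("blastp", "nr")}


def read_sequence(path):
    """Return the residues of the first record in a FASTA (or plain) file."""
    residues = []
    seen_header = False
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header or residues:
                break  # stop at the second record
            seen_header = True
            continue
        residues.append(line.replace(" ", "").upper())
    return "".join(residues)


def guess_type(sequence, threshold=0.9):
    """Heuristic: mostly A, C, G, T, U or N means nucleotide, otherwise protein."""
    if not sequence:
        raise ValueError("empty sequence")
    hits = sum(1 for ch in sequence if ch in NUCLEOTIDE_LETTERS)
    return "nucleotide" if hits / len(sequence) >= threshold else "protein"


def plan_searches(sequence_dir, output_dir, client="blastcl3"):
    """Build one command per sequence file, with outputs separated by type."""
    commands = []
    for path in sorted(Path(sequence_dir).iterdir()):
        if not path.is_file():
            continue
        sequence = read_sequence(path)
        if not sequence:
            continue  # skip empty or unreadable files
        kind = guess_type(sequence)
        program, database = SEARCHES[kind]
        out_path = Path(output_dir) / kind / (path.name + ".blast")
        commands.append([client, "-p", program, "-d", database,
                         "-i", str(path), "-o", str(out_path)])
    return commands
```

To run the searches, create the nucleotide and protein output directories first and pass each command list to subprocess.run. Printing the commands before running them is a cheap way to confirm that every file was classified as you expected. Note that a short peptide made up mostly of A, C, G and T would be misclassified by this heuristic.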
Perform some of the same searches via the NCBI web interface and compare the results.
Try out different flags for a single BLAST search. Change the BLAST program (the -p flag), the gap penalty, the scoring matrix, and the word size for the alignment, and compare the results.