BSCI348s: Bioinformatics in Genomics and Evolution

FAQ

For information on Computational Biology and Bioinformatics as a career, see the page Preparing for a Career in Computational Biology.

Has the course BSCI348s been approved for Biology Specialty Areas?

As of 8/31/99, only the General Biology (BGEN) specialty has approved the course. However, the other specialty areas are currently considering it, and are likely to approve it for their specialization. For more information, contact Ann Smith (as38@umailsrv0.umd.edu).

How much Computer Science/Molecular Biology do I have to know for this course?

The course is designed primarily to serve biology majors, but it is also intended to serve majors in computer science and allied fields. Students with a reasonable background in the biological sciences (as indicated in the prerequisites) and reasonable comfort with computers should feel free to enroll in the course.

How can I be sure that I have found a gene family?

The original question: I have a question about the assignment for Friday. I have been experimenting with the NCBI webpage, and I am familiar with the concepts of the blast search and the entrez programs. I am unsure of the procedure for finding the gene family. I don't know what to look at in each of the Genbank files or the blast results to determine the relationship in genetic families. Is there something that I am missing? Thanks.

Answer: As I said, this is challenging, so don't feel bad. It will be hard for you to be sure whether or not you have found a genuine gene family, as opposed to a group of homologs referred to by different names. By the end of the semester you will know how to be sure, but for now (early in the semester) I'm mostly interested in having you explore blast and become familiar with that environment, and I also want to expose you to some of the ideas and terminology.

So here is a trick that can help you identify a gene family: because (by definition) gene families are the result of a gene duplication, if you look at the results of a blast search and find two good hits from the same genome, then there is a good chance that you have found two members of a gene family.
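If you want to apply this trick to a long list of hits, it can be sketched in a few lines of Python. This is only a rough illustration, not part of the assignment: the hit list, the accession labels, and the E-value cutoff of 1e-10 are all made up for the example, and you would have to pull the organism name for each hit out of the blast report yourself.

```python
from collections import defaultdict

def genomes_with_multiple_hits(hits, max_evalue=1e-10):
    """Group good hits by organism and keep organisms with two or more.

    `hits` is a list of (organism, accession, evalue) tuples.
    `max_evalue` is an arbitrary cutoff for what counts as a "good" hit.
    """
    by_organism = defaultdict(list)
    for organism, accession, evalue in hits:
        if evalue <= max_evalue:  # discard weak hits
            by_organism[organism].append(accession)
    # Two or more good hits from one genome suggest a gene family.
    return {org: accs for org, accs in by_organism.items() if len(accs) >= 2}

# Hypothetical hits with invented accessions and E-values.
example_hits = [
    ("Mus musculus", "HYPO_0001", 1e-120),
    ("Mus musculus", "HYPO_0002", 3e-45),
    ("Homo sapiens", "HYPO_0003", 2e-80),
    ("Mus musculus", "HYPO_0004", 0.5),  # too weak, filtered out
]

candidates = genomes_with_multiple_hits(example_hits)
print(candidates)
```

Keep in mind that two database entries from the same organism can still describe the same gene (for example, a cDNA and a genomic sequence), so anything this turns up is a candidate to check by hand, not a confirmed family.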

You will certainly find large and complex gene families if you search with actin or tubulin sequences. Because you want to find relatively distantly related sequences, you want to cast a wide net, so I would recommend that you either use blastp to search with the inferred amino acid sequence, or use blastx, which translates your nucleotide sequence in all six reading frames and searches the protein database with the translations.

I tried this: I searched genbank with the word "tubulin". This yielded several thousand hits, many of them cDNAs. Any of these would probably work just fine, but I didn't really want to work with a cDNA sequence, so I used entrez to further refine the search to exclude cDNAs. From that list, I selected the Mus musculus (mouse) delta tubulin (AF081568) to use for a blast search. I then cut and pasted the amino acid sequence from the genbank report and ran a BLASTP search with it. This pulled up both gamma and beta tubulins.

From there you can go back and check to see if mice really do have more than one tubulin gene. I used the entrez browser to search on ((tubulin[All Fields] AND Mus musculus[Organism])) BUTNOT (cDNA[All Fields]) -- that showed me six sequences from mice, including a beta and a delta tubulin. So the fact that mice have two different genes that encode different tubulins makes it pretty clear that you are dealing with a gene family.
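If you would rather script this kind of search than click through the browser, the sketch below builds the equivalent request for NCBI's E-utilities `esearch` service. Treat it as a sketch under assumptions: the function name `build_esearch_url` is hypothetical, the database name `nucleotide` is my guess at the right target, and I have written the operator as NOT because, as far as I know, current Entrez syntax uses NOT where the query above uses BUTNOT. The code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(gene_word, organism, exclude_word, db="nucleotide"):
    """Build an esearch URL for: gene word AND organism, excluding a word."""
    term = (
        f"({gene_word}[All Fields] AND {organism}[Organism]) "
        f"NOT {exclude_word}[All Fields]"
    )
    # urlencode escapes the spaces, brackets, and parentheses in the query.
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})

url = build_esearch_url("tubulin", "Mus musculus", "cDNA")
print(url)
```

The number of sequences you get back today will almost certainly differ from the six I saw, since the database keeps growing.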

But remember that entrez is searching on various annotations that people have entered into the database, and blast is searching for patterns of sequence similarity. Both of these kinds of searches can produce results that aren't exactly what you want. There is another story here too -- note that there are not a lot of "delta" tubulins. The mouse "delta" tubulin is probably more properly called a gamma tubulin. But tubulins are definitely part of a gene family with at least three genes (alpha, beta, and gamma tubulin) present in many organisms. I can explain that in more detail if you are interested.

Ultimately what you are looking for are sequences where a) some organisms have more than one copy of a given gene, b) the sequences are sufficiently divergent that it is unlikely that they are the products of a recent duplication, and c) the sequences are intact open reading frames, not pseudogenes.
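Criterion (c) is the easiest of the three to check mechanically. Here is a minimal sketch, assuming the standard genetic code and a coding sequence that is supposed to run from the start codon through the stop codon; the function name `is_intact_orf` and the toy sequences are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only

def is_intact_orf(seq):
    """Return True if seq reads as one uninterrupted open reading frame.

    Checks that the sequence is a whole number of codons, starts with ATG,
    ends with a stop codon, and has no stop codon in between.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # An internal stop codon interrupts the reading frame.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])

print(is_intact_orf("ATGGCCAAATGA"))     # start, two codons, stop
print(is_intact_orf("ATGTAAGCCTGA"))     # stop codon in the second position
```

A sequence that fails this test is not necessarily a pseudogene (it may just be a partial sequence), and one that passes is not guaranteed to be functional, so use it as a first screen only.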
