Consensus
It is well established that nearly all splice sites conform to consensus sequences (matrices). These consensus sequences
include nearly invariant dinucleotides at each end of the intron: GT at
the 5' end and AG at the 3' end.
Splice sites of U2 (major class) introns in pre-mRNA generally conform to the following consensus sequences:
3' splice sites: CAG|G
5' splice sites: MAG|GTRAGT where M is A or C and R is A or G
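As an illustration, a candidate site can be compared to the consensus strings above position by position. The Python sketch below is not taken from any of the tools mentioned on this page. The function name `consensus_matches` and the constants are hypothetical, and it only counts matching positions instead of scoring with a weight matrix.

```python
# Ambiguity codes used on this page (M = A or C, R = A or G, Y = C or T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "Y": "CT",
}

# Consensus strings from the text, with the exon/intron boundary "|" removed.
FIVE_PRIME = "MAGGTRAGT"   # MAG|GTRAGT: 3 exon bases, then 6 intron bases
THREE_PRIME = "CAGG"       # CAG|G: 3 intron bases, then 1 exon base


def consensus_matches(site, consensus):
    """Count positions where `site` agrees with an IUPAC consensus string."""
    site = site.upper().replace("U", "T")  # accept RNA or DNA input
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in IUPAC[code] for base, code in zip(site, consensus))
```

For example, `consensus_matches("CAGGTAAGT", FIVE_PRIME)` counts all nine positions as matches, while a GC site such as `"AAGGCAAGT"` differs from the consensus only at the second intron position.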
The most common class of nonconsensus splice sites
consists of 5' splice sites with a GC dinucleotide (Wu
and Krainer 1999). GC sites conform extremely well to the standard consensus sequences at other positions: 42 of
44 sites have a consensus G residue at both position -1 and position
5, and GC-AG introns have an enhanced match to the 3' splice site consensus (Thanaraj and Clark, 2001). It is reasonable to assume that GC sites are recognized by the standard
The second class of exception to
splice site consensus is U12 introns, a minor class of rare introns with
splice site sequences that are very different from the standard consensus
but very similar to each other (reviewed by Burge
et al. 1999 and Tarn
and Steitz 1997). U12 introns can be identified by highly conserved
sequences at the 5' splice site (RTATCCTY; R = A or G; Y = C or T) and
branch site (TCCTRAY). U12 introns are found in many eukaryotes, including
Drosophila melanogaster and Arabidopsis, but not
Finally, there is a small number of nonconsensus sites
that fit into neither of the two categories mentioned above. Many reports
of such variant splice sites can be traced to errors in annotation or
interpretation, polymorphic differences between the sources of cDNA and
genomic sequence, inclusion of pseudogene sequences, or failure to account
for somatic mutation. However, there are many examples of sites that match
the consensus very poorly, including cryptic splice sites, which are used when a nearby natural site is inactivated by mutation (see, for example, Roca et al. 2003). Experimental work has also established that splicing in vivo is possible even without the core dinucleotide GU or AG.
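The classes described above (standard GT-AG introns, GC-AG introns, U12 introns, and everything else) can be sketched as a crude classifier based on the intron's terminal sequences. This is only an illustration under stated assumptions: the function `classify_intron` is hypothetical, the U12 test assumes that the RTATCCTY motif begins at the first base of the intron, and the branch site is ignored. Real classification would use weight matrices and additional signals.

```python
import re

# U12 5' splice site motif from the text: RTATCCTY (R = A or G, Y = C or T).
# Assumption: the motif starts at the first intron base.
U12_5SS = re.compile(r"^[AG]TATCCT[CT]")


def classify_intron(intron):
    """Assign an intron sequence (5' to 3') to a rough class by its ends."""
    seq = intron.upper().replace("U", "T")  # accept RNA or DNA input
    # Check the U12 motif first, since some U12 introns also begin with GT.
    if U12_5SS.match(seq):
        return "U12-like"
    donor, acceptor = seq[:2], seq[-2:]
    if donor == "GT" and acceptor == "AG":
        return "GT-AG"
    if donor == "GC" and acceptor == "AG":
        return "GC-AG"
    return "nonconsensus"
```

An intron reported as "nonconsensus" by a check like this deserves a second look for the annotation and sequence errors listed above before it is treated as a genuine variant site.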
Splice site consensus sequences vary by species. Consensus sequences for humans and flies are reported here. A survey (making the point that species with more introns have weaker splice sites) was published by Irimia et al. (2007).
Splice site predictors are available on the web.
I recommend SplicePort.
In addition to splice site prediction, the web site allows you to browse the features that contribute to the strength (or weakness) of any given site. Right now, feature browsing is only available for mammalian sites (using a classifier trained on human data), but you can carry out splice site prediction on Arabidopsis as well.
For high-throughput assessment of splice sites I recommend GeneSplicer. For analysis of other species on the web I recommend NetGene (available through the Center for Biological Sequence Analysis at the Department of Biotechnology, The Technical University of Denmark). These programs use information in the region flanking a splice site. If you wish to evaluate only the core splice site in order to assess its strength independent of additional signals, then I recommend MaxEntScan, which looks at nine nucleotides at the 5' splice site or 23 nucleotides at the 3' splice site.
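To prepare input for a core-site scorer of this kind, the fixed-length windows have to be cut from the genomic sequence. The sketch below shows one way to do that in Python. It assumes the nine nucleotides are the last 3 exon bases plus the first 6 intron bases, and the 23 nucleotides are the last 20 intron bases plus the first 3 exon bases; confirm this split against the MaxEntScan documentation before relying on it. The function names and the 0-based coordinate convention are hypothetical, and the code only extracts windows; it does not compute scores.

```python
def five_prime_window(seq, intron_start):
    """Return the 9-nt window around a 5' splice site.

    `intron_start` is the 0-based index of the first intron base
    (the G of GT). Assumed layout: 3 exon bases + 6 intron bases.
    """
    start, end = intron_start - 3, intron_start + 6
    if start < 0 or end > len(seq):
        raise ValueError("not enough flanking sequence for a 9-nt window")
    return seq[start:end]


def three_prime_window(seq, exon_start):
    """Return the 23-nt window around a 3' splice site.

    `exon_start` is the 0-based index of the first base of the
    downstream exon. Assumed layout: 20 intron bases + 3 exon bases.
    """
    start, end = exon_start - 20, exon_start + 3
    if start < 0 or end > len(seq):
        raise ValueError("not enough flanking sequence for a 23-nt window")
    return seq[start:end]
```

With an upstream exon ending in CAG followed by an intron beginning GTAAGT, `five_prime_window` returns `CAGGTAAGT`, which is the form of the 5' consensus given earlier on this page.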