Statistical Inference and Mutation


I. Statistical inference

A. Significance and probability

1. How many heads in a row does it take to convince you that a coin is not fair?

2. The probability of 5 heads in a row is (1/2)^5 = 1/32 = 0.03. With 4 heads the probability is (1/2)^4 = 1/16 = 0.06. By convention, when a result has a probability of less than 1/20 (0.05) of occurring by chance, scientists consider chance alone an insufficient explanation. Consequently, we refer to such a result as significant.


3. Note that the probability of flipping 5 heads in a row is obtained by multiplying together the probability of a head on each successive flip. This assumes that each coin flip is independent of the last.


4. If you wanted to know the probability of flipping 5 heads *or* 5 tails, you would add the two probabilities: (1/2)^5 + (1/2)^5 = 1/16. The probabilities of all possible head-tail combinations must sum to 1.
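
The arithmetic in points 2-4 can be checked with a short Python sketch. The function name prob_all_heads is made up for this illustration, and exact fractions are used so the results match the hand calculation.

```python
from fractions import Fraction
from math import comb

def prob_all_heads(n_flips, p_head=Fraction(1, 2)):
    """Probability that n independent flips all come up heads."""
    return p_head ** n_flips

p5 = prob_all_heads(5)   # 1/32, below the 1/20 threshold
p4 = prob_all_heads(4)   # 1/16, above the 1/20 threshold

# 5 heads OR 5 tails: two mutually exclusive outcomes, so add them.
p_all_same = p5 + p5     # 1/16

# Probabilities of 0, 1, ..., 5 heads in 5 flips of a fair coin sum to 1.
total = sum(comb(5, k) * Fraction(1, 2) ** 5 for k in range(6))

print(p5, p4, p_all_same, total)
```

Multiplying across flips is only valid because the flips are treated as independent, as point 3 states.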


B. Chi-squared test

1. Used on event data, i.e. counts of individuals or things. Can be used to test goodness-of-fit to a theoretical distribution or to test for independence between two categorical traits.

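2. Goodness-of-fit: chi-square = sum[(obs-exp)^2/exp] over all categories. Because every category adds a non-negative term, this sum tends to increase with the number of categories, so the value of the test statistic expected by chance also increases with the number of categories. Degrees of freedom represent the number of independent categories, i.e. n-1 for n categories, since the counts must add up to the fixed total (once n-1 counts are known, the last is determined).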

3. Example: 45:55 versus 5:15 sex ratios

obs:     45    55    |  5      15
exp:     50    50    |  10     10

dev:     -5    5     |  -5     5
dev^2:   25    25    |  25     25
d^2/e:   1/2   1/2   |  25/10  25/10
chi-sq:  1           |  5
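
The same table can be reproduced in a few lines of Python. This is a minimal sketch: the helper name chi_square is hypothetical, and the comparison uses the standard tabulated critical value of 3.841 for 1 degree of freedom at the 5% level, which is not stated in the notes above.

```python
def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (obs - exp)^2 / exp over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Tabulated chi-square critical value for 1 degree of freedom, alpha = 0.05.
CRITICAL_05_DF1 = 3.841

large_sample = chi_square([45, 55], [50, 50])   # 0.5 + 0.5
small_sample = chi_square([5, 15], [10, 10])    # 2.5 + 2.5

for label, stat in [("45:55", large_sample), ("5:15", small_sample)]:
    verdict = "significant" if stat > CRITICAL_05_DF1 else "not significant"
    print(label, stat, verdict)
```

Both samples deviate from expectation by 5 individuals per category, but only the 5:15 sample exceeds the critical value, because the same deviation is large relative to an expected count of 10.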

C. Randomization approach

1. How do we know if Mendel's results from 929 flowers (705 red, 224 white) fit a 3:1 segregation ratio? You need to conduct a statistical test to see if your results deviate from this expectation. One way to do this is to simulate the process. Have a computer deliver 929 random numbers between 0 and 1. Every time the number is greater than 0.75, count a white flower; every time it is less than 0.75, count a red flower. Record the number of red flowers, then repeat the whole procedure 100 times.


2. Does Mendel's count fall within the central 95% of the simulated numbers of red flowers? If so, it does not deviate significantly from 3:1. You can use a chi-square goodness-of-fit test to make this same comparison. In both cases, you must specify the level of confidence you wish to place in the outcome. Normally, we reject the null hypothesis, that of no difference, only when a deviation as large as the one observed would occur by chance less than 5% of the time.
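
The randomization procedure described above might be sketched as follows. The function names, the fixed seed, and the simple way the central 95% range is read off the sorted counts are all choices made for this illustration rather than part of the original exercise.

```python
import random

def simulate_red_counts(n_flowers=929, p_red=0.75, n_reps=100, seed=1):
    """Repeat the experiment n_reps times; return the red count from each run."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = []
    for _rep in range(n_reps):
        # A draw below 0.75 counts as red, otherwise as white.
        red = sum(1 for _flower in range(n_flowers) if rng.random() < p_red)
        counts.append(red)
    return counts

def central_interval(counts, coverage=0.95):
    """Approximate central range: trim an equal share from each sorted tail."""
    ordered = sorted(counts)
    tail = int((1 - coverage) / 2 * len(ordered))
    return ordered[tail], ordered[len(ordered) - 1 - tail]

counts = simulate_red_counts()
low, high = central_interval(counts)
observed_red = 705
print(low, high, low <= observed_red <= high)
```

With only 100 replicates the ends of the range are rough; raising n_reps gives a more stable interval at the cost of more computation.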


II. Alleles in populations - Hardy-Weinberg equilibrium

A. The mathematician G. H. Hardy and the physician Wilhelm Weinberg independently showed in 1908 that Mendel's laws can be used to predict genotype frequencies in populations, under the following assumptions:

1. random mating
2. large population size (no drift)
3. no selection
4. no migration

B. Mendel conducted controlled crosses. Thus, the frequency of C and c in the gametes was always 0.5. What if alleles are present at different frequencies? Take p = f(A) = 0.1 and q = f(a) = 1 - 0.1 = 0.9.

1. Assuming that gametes combine at random (imagine starfish spewing millions of eggs and sperm into the water column), genotypes will form at the following frequencies:

a. AA = 0.1 x 0.1 = 0.01 (p^2)
b. Aa = 0.1 x 0.9 = 0.09 (pq)
c. aA = 0.9 x 0.1 = 0.09 (qp), so heterozygotes total 0.18 (2pq)
d. aa = 0.9 x 0.9 = 0.81 (q^2)

2. What are the allele frequencies in this generation?

p' = 0.01 + 1/2 (0.18) = 0.1. Note that the allele frequencies have not changed.

This can also be demonstrated algebraically:

p' = [p^2 + (1/2)(2pq)] / (p^2 + 2pq + q^2)
   = (p^2 + pq) / (p^2 + 2pq + q^2)
   = p(p+q) / (p+q)^2
   = p / 1 = p        (since p + q = 1)

3. Consequently, the Hardy-Weinberg equilibrium can be used to calculate genotype frequencies from allele frequencies, as long as individuals mate at random and there is no selection, migration, or drift. We will discuss how these forces cause evolution next week.

4. You can use the chi-square goodness-of-fit test to determine if observed frequencies fit Hardy-Weinberg expectations.
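
The calculations in points 1-4 can be sketched in Python. The function names are hypothetical, and the genotype counts fed to the goodness-of-fit test are invented purely for illustration; they are not data from the lecture. Because p is estimated from the same counts, the test on three genotype classes is usually taken to have 1 degree of freedom (3 classes - 1 - 1 estimated parameter).

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequency p = f(A)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}

def freq_of_A(genotypes):
    """Frequency of allele A from genotype frequencies or counts."""
    total = genotypes["AA"] + genotypes["Aa"] + genotypes["aa"]
    return (genotypes["AA"] + 0.5 * genotypes["Aa"]) / total

def hw_chi_square(counts):
    """Chi-square of observed genotype counts against Hardy-Weinberg expectations."""
    n = sum(counts.values())
    p = freq_of_A(counts)                      # p estimated from the data
    expected = {g: f * n for g, f in hardy_weinberg(p).items()}
    return sum((counts[g] - expected[g]) ** 2 / expected[g] for g in counts)

freqs = hardy_weinberg(0.1)       # AA 0.01, Aa 0.18, aa 0.81
p_next = freq_of_A(freqs)         # allele frequency is unchanged: 0.1

# HYPOTHETICAL counts (made up for this example): too few heterozygotes.
hypothetical_counts = {"AA": 20, "Aa": 160, "aa": 820}
stat = hw_chi_square(hypothetical_counts)
print(freqs, p_next, round(stat, 2))
```

For these invented counts the statistic is well above 3.841, the tabulated 5% critical value for 1 degree of freedom, so such a sample would be judged to deviate from Hardy-Weinberg expectations.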


III. The central dogma and mutation

A. DNA codes for RNA which is processed and then translated into proteins
B. The coding regions of genes are called exons; the noncoding regions between them are introns
C. The genetic code specifies which three-base codon codes for each amino acid

1. All but two amino acids (methionine and tryptophan) are coded by more than one codon, i.e. the code is redundant
2. Some amino acids affect the three-dimensional structure of a protein, e.g. methionine. Consequently, some changes can affect protein function

C. The source of all allelic variation is mutation. Mutations within genes can be

1. point (single nucleotide) mutation:

a. transition: pyrimidine -> pyrimidine or purine -> purine (e.g. C->T).
b. transversion: purine -> pyrimidine or pyrimidine -> purine (e.g. G -> T)
c. Transitions are usually much more common than transversions

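2. insertion: ATCGAT -> ATCATTGAT - can cause a frameshift (when the number of inserted bases is not a multiple of three)
3. deletion: ATCGAT -> ATGAT - can cause a frameshift (when the number of deleted bases is not a multiple of three)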
4. duplication

a. ATTATTATT; CTCTCTCT
b. fragile X & Huntington's chorea are due to trinucleotide repeats

5. inversion - AGCT -> TCGA
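
The classification rules above can be expressed as a small Python sketch. All function names are invented for this illustration, and the frameshift check uses only the length rule (a change in length that is not a multiple of three shifts the reading frame).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_point_mutation(old, new):
    """Return 'transition' or 'transversion' for a single-base change."""
    if old == new:
        raise ValueError("not a mutation: bases are identical")
    pair = {old, new}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def codons(seq):
    """Split a sequence into complete triplets, reading from the first base."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def shifts_frame(original, mutated):
    """True if the length change is not a multiple of three."""
    return (len(mutated) - len(original)) % 3 != 0

print(classify_point_mutation("C", "T"))     # same class (both pyrimidines)
print(classify_point_mutation("G", "T"))     # different classes
print(codons("ATCGAT"), codons("ATGAT"))     # reading frame before and after deletion
print(shifts_frame("ATCGAT", "ATGAT"))       # one base deleted
print(shifts_frame("ATCGAT", "ATCATTGAT"))   # three bases inserted
```

Note that the insertion example in the list adds exactly three bases, so by the length rule it adds a codon without shifting the frame, whereas the single-base deletion does shift it.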

D. The redundancy of the genetic code means that some point mutations will not alter the protein sequence, because they occur at a redundant site (most often the third codon position). These silent mutations are called synonymous substitutions
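
A minimal Python sketch of the synonymous/nonsynonymous distinction follows. It assumes a deliberately partial codon table (standard genetic code, DNA coding-strand codons, only four amino acids) and a hypothetical helper name; a real analysis would use the full 64-codon table.

```python
# Partial standard genetic code (an illustrative subset, not the full table).
PARTIAL_CODE = {
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "TTA": "Leu", "TTG": "Leu",
    "ATT": "Ile", "ATC": "Ile", "ATA": "Ile",
    "ATG": "Met",
}

def is_synonymous(codon_before, codon_after):
    """True if both codons specify the same amino acid (a silent change)."""
    return PARTIAL_CODE[codon_before] == PARTIAL_CODE[codon_after]

print(is_synonymous("GGT", "GGC"))   # third-position change, Gly -> Gly
print(is_synonymous("CTA", "TTA"))   # first-position change, Leu -> Leu
print(is_synonymous("ATG", "ATA"))   # third-position change, Met -> Ile
```

The last two cases show why position alone is not decisive: a first-position change can be silent, and a third-position change can alter the amino acid.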

E. A variety of agents can increase mutation rates and can cause cancer

1. UV radiation - high altitude populations experience more skin cancer
2. x-rays - wear lead aprons
3. chemical mutagens
4. natural ionizing radiation or from radioactive fallout (Chernobyl)
5. hormone disruptors

a. breast cancer has been linked to estrogen in the environment
b. a variety of pesticides can degrade and have estrogenic effects

F. Can compare synonymous with nonsynonymous substitutions in protein-coding genes between pairs of sister species to estimate the rate of deleterious mutations (Nature 1999 397:344-346)

a. Synonymous substitutions are assumed to be neutral with respect to survival, so their rate can be used to estimate how many nonsynonymous substitutions should have accumulated since hominids diverged from chimpanzees if none had been removed by selection (use gorilla as an outgroup). Compared 41 genes and 41,171 nucleotides.


estimated # nonsynonymous substitutions: 231, rate = 0.0056 /bp
observed # nonsynonymous substitutions: 143, rate = 0.0034 /bp

b. Assumptions: 60,000 protein-coding genes, 1,500 bp/gene, a generation time of 25 years, and a common ancestor 6 million years ago

c. rate of amino acid altering mutations = 4.2 ± 0.5 / diploid genome / generation
d. rate of deleterious mutations = 1.6 ± 0.8 / diploid genome / generation
e. Since every deleterious mutation must be eliminated by death, this means there is more than one genetic death for every individual! How can the species survive?

f. Answer may lie in removing mutations in groups. This requires that mutations co-occur in individuals. The only way for this to occur is recombination (i.e. sex)! An important evolutionary advantage of recombination is to eliminate deleterious alleles. This may explain why parthenogenetic lineages, such as whiptail lizards, are short-lived in evolutionary time.
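
The per-genome rates in points c and d can be approximately recovered from the counts in point a and the assumptions in point b. The sketch below is a back-of-the-envelope reconstruction, not the published method: it assumes the substitutions are counted along the hominid lineage only, that the per-bp rate is scaled to the diploid coding genome, and that it is then divided by the number of generations since divergence. Variable names are invented for this example.

```python
# Figures taken from the notes above.
n_bp_compared = 41_171
estimated_nonsyn = 231      # expected if no mutations were removed by selection
observed_nonsyn = 143       # actually seen

n_genes = 60_000
bp_per_gene = 1_500
generation_years = 25
divergence_years = 6_000_000

# ASSUMPTION: rates apply to one lineage over the full divergence time.
generations = divergence_years / generation_years          # 240,000
diploid_coding_bp = 2 * n_genes * bp_per_gene               # two copies of each gene

def per_genome_per_generation(n_substitutions):
    """Scale a substitution count to the diploid coding genome per generation."""
    rate_per_bp = n_substitutions / n_bp_compared
    return rate_per_bp * diploid_coding_bp / generations

altering_rate = per_genome_per_generation(estimated_nonsyn)
# Deleterious mutations are taken as the ones that are missing: estimated - observed.
deleterious_rate = per_genome_per_generation(estimated_nonsyn - observed_nonsyn)

print(round(altering_rate, 1), round(deleterious_rate, 1))
```

Under these assumptions the two values come out close to the 4.2 and 1.6 quoted above; the ± ranges in the notes are not reproduced by this sketch.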