A tradition in the computer science, information technology, and mathematical communities is to issue "challenges." Such challenges pose problems to fellow scientists, and provide an entertaining way to advance the discipline. The form of the challenge itself can be quite variable, but the challenge should pose a problem that is sufficiently difficult to be interesting (and indeed, challenging!), but should not normally be so absurdly difficult that it is unlikely anyone would ever be able to solve it. It is also important that the challenge be sufficiently well defined that it is possible to determine whether or not a person has successfully answered the challenge, or at least to provide an objective measure of performance.
Because the concept of challenges has not been widely applied in the life sciences, we hope that introducing them will promote additional interactions among the biological and analytical communities. Note that none of the challenges listed here currently has a significant prize associated with it. The winner (if any) will be selected arbitrarily by the person who proposed the challenge, who in turn will offer the winner a beer or other beverage of comparable value.
Use imagination and propose your own challenge!
To submit a challenge, send the proposed text for a challenge and relevant web links to Charles F. Delwiche (firstname.lastname@example.org).
A dataset comprising 64 carefully aligned small subunit ribosomal RNA (SSU rRNA) sequences with 1620 characters was used by Barns et al. (1996) to calculate a phylogenetic tree spanning all known major groups of living organisms. This analysis presented a maximum likelihood tree (as well as bootstrap values) based on the F84 model of sequence evolution, and assuming site-to-site rate homogeneity.
a) What is the maximum likelihood tree you can find using a GTR + gamma + invariant sites model of sequence evolution?
b) For your best tree, can you demonstrate that it is the globally optimal tree? If not, can you provide a quantitative estimate of the probability that there is another tree of higher likelihood? And finally, how long did it take to determine this tree, measured in both clock time and CPU-minutes?
c) Bonus: can you perform a bootstrap analysis using the same model of sequence evolution?
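For part (c), the resampling step of a nonparametric bootstrap can be sketched as follows. This is only an illustration in Python: the function name `bootstrap_alignment` and the toy alignment are hypothetical, and the tree search that would be run on each replicate under the GTR + gamma + invariant sites model is left to whatever phylogenetics program is used.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    `alignment` maps taxon name -> aligned sequence (all the same length).
    The same column indices are drawn for every taxon, so each resampled
    site is an intact column of the original alignment.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


# Toy data (hypothetical); the real alignment has 64 taxa and 1620 characters.
toy = {"A": "ACGTACGT", "B": "ACGTTCGT", "C": "ACGAACGA"}
rng = random.Random(0)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each replicate would then be analyzed exactly as the original alignment was, and the support for a group is the fraction of replicate trees that contain it.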
The 1620 character alignment is available on the web at:
Barns, S.M., C.F. Delwiche, J.D. Palmer, and N.R. Pace. 1996. Perspectives on archaeal diversity, thermophily and monophyly from environmental rRNA sequences. Proc. Natl. Acad. Sci. USA 93:9188-9193.
Sean Graham adds:
"With reference to the rDNA data set posted, my challenge would be to identify potentially problematical long branches, those that may have resulted in spurious placement of one or more taxa."
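One crude first screen for this challenge is to flag terminal branches that are unusually long relative to the rest of the tree. The sketch below assumes the branch lengths have already been extracted from a tree into a dictionary; the function name, the threshold of 3, and the toy values are all hypothetical. A flagged branch is only a candidate for closer inspection, not evidence of spurious placement.

```python
from statistics import median


def flag_long_branches(branch_lengths, threshold=3.0):
    """Return taxa whose terminal branch is an outlier on the long side.

    `branch_lengths` maps taxon name -> terminal branch length. A branch is
    flagged when it exceeds the median by more than `threshold` median
    absolute deviations (MAD). The threshold is an arbitrary assumption.
    """
    values = list(branch_lengths.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be called an outlier
    return sorted(
        name for name, v in branch_lengths.items() if (v - med) / mad > threshold
    )


# Toy branch lengths (hypothetical), in expected substitutions per site.
toy_lengths = {"t1": 0.10, "t2": 0.12, "t3": 0.11, "t4": 0.09, "t5": 0.60}
flagged = flag_long_branches(toy_lengths)
```

A natural follow-up is to rerun the analysis with each flagged taxon removed and check whether the placement of the remaining taxa changes.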
The analyses described in "Tree of Life Challenge #1" all assume stationarity of the mode of sequence evolution. Can you provide a quantitative measure of lineage-specific modes of sequence evolution that can be displayed graphically and apply it to those data? Because the tree presented in the paper was calculated under an assumption of stationarity, it can be expected to minimize inferred differences in mode of sequence evolution. Thus to address the issue of non-stationarity, one would ideally first identify a tree that is optimal under some justifiable, and biologically meaningful, probabilistic model of sequence evolution that allows for non-stationarity. Once such a tree has been determined, the changes in the mode of sequence evolution that were inferred by the analysis should be displayed on the tree in a graphical manner (i.e., without the use of text, and in a manner that can quickly and easily be interpreted with minimal explanation).
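As a starting point only, differences in base composition among taxa can be summarized with a chi-square statistic of homogeneity. The sketch below is an assumption-laden illustration, not a solution to the challenge: composition is just one aspect of the mode of sequence evolution, the statistic ignores the phylogenetic non-independence of the sequences, and no non-stationary model is fitted. The function names and toy sequences are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def base_counts(seq):
    """Count A, C, G, T in a sequence, treating U as T and skipping gaps."""
    counts = Counter(ch for ch in seq.upper().replace("U", "T") if ch in BASES)
    return [counts[b] for b in BASES]


def composition_chi_square(alignment):
    """Return (total chi-square, per-taxon contribution) for base composition.

    Expected counts assume every taxon shares the pooled base frequencies,
    which is the stationarity (homogeneity) assumption under scrutiny.
    """
    table = {name: base_counts(seq) for name, seq in alignment.items()}
    col_totals = [sum(row[i] for row in table.values()) for i in range(4)]
    grand = sum(col_totals)
    if grand == 0:
        raise ValueError("alignment contains no A, C, G, T or U characters")
    per_taxon = {}
    for name, row in table.items():
        row_total = sum(row)
        stat = 0.0
        for obs, col in zip(row, col_totals):
            expected = row_total * col / grand
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
        per_taxon[name] = stat
    return sum(per_taxon.values()), per_taxon


# Toy alignment (hypothetical): t3 is strongly GC-rich.
toy = {"t1": "ACGTACGTACGT", "t2": "ACGTACGTACGT", "t3": "GGCCGGCCGGCC"}
total, contributions = composition_chi_square(toy)
```

The per-taxon contributions could then be mapped onto branch colors or widths of a tree, which is one way to meet the requirement that the result be displayed without text.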
The TreeBASE database (http://www.herbaria.harvard.edu/treebase) currently contains over 1000 phylogenies with over 11,000 taxa among them. Many of these trees share taxa with each other and are therefore candidates for the construction of composite phylogenies, or "supertrees", by various algorithms. A challenging problem is the construction of the largest and "best" supertree possible from this database. "Largest" and "best" may represent conflicting goals, however, because resolution of a supertree can be easily diminished by addition of "inappropriate" trees or taxa.
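One widely used family of supertree methods, matrix representation with parsimony (MRP), recodes each source tree as binary characters and then analyzes the combined matrix by parsimony. The sketch below shows only the recoding step, and is offered as an illustration rather than as the method intended by the challenge. It assumes rooted source trees written as nested Python tuples of taxon names; the function names and toy trees are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon names at or below this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """List the leaf set of every internal node except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """Code each clade as one character: 1 = in clade, 0 = in tree but
    outside the clade, ? = taxon absent from that source tree."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in trees)))
    matrix = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon].append("?")
                elif taxon in clade:
                    matrix[taxon].append("1")
                else:
                    matrix[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in matrix.items()}


# Two toy source trees (hypothetical) that overlap in taxa B, C and D.
tree1 = (("A", "B"), ("C", "D"))
tree2 = ((("B", "C"), "D"), "E")
matrix = mrp_matrix([tree1, tree2])
```

Here the two source trees disagree about the placement of B and C, which is exactly the kind of conflict that can reduce the resolution of the resulting supertree.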
Originally posed following the discussions at Deep Green - Princeton, the 232-Taxon challenge is still open.
As for the challenge(s): My only suggestion is very simplistic, but I am not sure it has been done.
I would like to see an intensive analysis of a complete data matrix for green algae and land plants in which 3 different genes are used and for which every square in the matrix is filled (i.e., no missing data, which will cut the total number of taxa down a bit). Given one or more analyses of the massive data set, what is the perturbation of the results with random "extinctions" of taxa? With random eliminations of data points? Is there a % random reduction from the perfect data matrix (in terms of taxa and/or in terms of data points) that SIGNIFICANTLY alters the result of the analysis (analyses)?
I realize lots of this sort of thing has been done, but I don't know if much has been done with a real green plant data matrix. Of course, I am not sure of how big the current "perfect" matrix is yet for all green plants, but I thought it should now be large enough to set the stage for simulated taxon extinctions and simulated missing data for real organisms and real data.
[CFD adds: Another issue to consider is whether or not extinction affects taxa at random. I suspect that under at least some conditions certain clades would be disproportionately prone to extinction, so a logical extension of this challenge would be to examine the effects of "patchiness" in extinction on phylogenetic reconstruction.]
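The two kinds of random reduction described in this challenge can be sketched as follows. This is a minimal illustration under stated assumptions: the matrix is held as a dictionary of taxon name to aligned sequence, `?` marks missing data, and the function names and toy matrix are hypothetical. Both functions draw uniformly at random, so they do not model the clade-specific "patchiness" raised in the note above.

```python
import random


def drop_random_taxa(matrix, fraction, rng):
    """Simulate random "extinctions" by removing a fraction of the taxa."""
    names = sorted(matrix)
    n_drop = int(round(fraction * len(names)))
    dropped = set(rng.sample(names, n_drop))
    return {name: seq for name, seq in matrix.items() if name not in dropped}


def mask_random_cells(matrix, fraction, rng, missing="?"):
    """Simulate missing data by replacing a fraction of cells with `missing`."""
    names = sorted(matrix)
    n_sites = len(matrix[names[0]])
    cells = [(name, i) for name in names for i in range(n_sites)]
    n_mask = int(round(fraction * len(cells)))
    masked = {name: list(matrix[name]) for name in names}
    for name, i in rng.sample(cells, n_mask):
        masked[name][i] = missing
    return {name: "".join(chars) for name, chars in masked.items()}


# Toy complete matrix (hypothetical): 4 taxa by 10 characters, no missing data.
toy = {
    "alga1": "ACGTACGTAC",
    "alga2": "ACGTTCGTAC",
    "moss1": "ACGAACGAAC",
    "fern1": "ACGAACGTTC",
}
rng = random.Random(1)  # fixed seed for reproducibility
fewer_taxa = drop_random_taxa(toy, 0.25, rng)
with_gaps = mask_random_cells(toy, 0.20, rng)
```

Repeating each reduction many times at increasing fractions, and comparing the resulting trees with the tree from the complete matrix, would give an estimate of the reduction level at which the result changes.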