Wednesday, March 23, 2011

Characters, characters everywhere...

With the sort of technology available today for sequencing and the speed at which it proceeds, it's no wonder that researchers want assembly processes for double-barrel shotgun sequencing fragments to be faster as well. Thus this contest. Nature doesn't provide the species names or further details about the assembly procedures, but these don't concern me much.

When it comes to full genome sequencing projects, as a systematist I'm more concerned with the characters (the actual sequence of base pairs) and how they are used in phylogenetic inference. With pyrosequencing and other next generation sequencers, very soon it will become inexpensive and fast to sequence the entire genome of any organism. Ignoring the amount of information involved and the amount of digital space it will take to store all this information, what is one to do with all these characters?

Some people, or I should say, many modern systematists would like nothing better than to shove the entire genomes of species within a taxon (never mind that genomes are character sets of individuals, not species) into an analysis and let an algorithm sift through the mess and work it out. I'd like to point out that this has been done before, and is generally regarded as an unfortunate but necessary stepping stone on the way to more scientifically acceptable methods. Still, the promise of easy answers is alluring.

Consider this, however. We have a number of assembled genomes (by whatever method) and we have aligned them (hopefully not manually) so we can examine the shared areas. Could we possibly design a program which will automatically find shared sequence lengths, highlight them from longest to shortest, and discard those below a cutoff? Then we could actually look at the sequence lengths that may matter (there still will be homoplasy) and consider these entire shared sections to be our hypothetical homologs. We could then code the sequence lengths as individual characters and run a more traditional style of phylogenetic inference. This may actually be faster than the "mass shoving" scenario, as there are fewer potential relationships for a computer to compare. It also removes a great deal of homoplasy which interferes with our hypothesis testing. More characters (if the characters are not specially shared) are not always better. Millions of characters will not give greater resolution to a phylogeny if 80% of them are either different or shared single base pairs scattered among non-shared lengths. Part of scientific efficiency is designing a crucial experiment which will quickly eliminate alternative hypotheses (PDF). Using an entire genome in a phylogenetic inference is like setting sail on Lake Superior in a kayak without a map or compass and hoping you'll hit Isle Royale after several days' travel.
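To make the idea concrete, here is a rough Python sketch of one way such a program might work. It is only an illustration under several assumptions of my own: the genomes are already aligned to equal length, a "shared length" is a contiguous run of columns in which the same subset of taxa agree (at least two taxa share a base and at least one differs), gaps break a run, and the taxon names, toy sequences and function names are all hypothetical.

```python
from itertools import groupby


def column_pattern(column):
    """Map one alignment column to a grouping pattern, e.g. 'TTAA' -> (0, 0, 1, 1)."""
    if "-" in column:
        return None  # assumption of this sketch: a gap breaks any shared block
    seen = {}
    return tuple(seen.setdefault(base, len(seen)) for base in column)


def shared_blocks(alignment, min_length):
    """Find runs of adjacent columns with one shared pattern, longest first.

    Returns the sorted taxon names and a list of (length, start, pattern)
    tuples, keeping only runs at least min_length columns long.
    """
    names = sorted(alignment)
    columns = zip(*(alignment[name] for name in names))
    patterns = [column_pattern(col) for col in columns]

    blocks = []
    pos = 0
    for pattern, run in groupby(patterns):
        length = len(list(run))
        # Keep a run only if some taxa agree and some differ.
        informative = pattern is not None and 1 < len(set(pattern)) < len(pattern)
        if informative and length >= min_length:
            blocks.append((length, pos, pattern))
        pos += length

    blocks.sort(key=lambda block: (-block[0], block[1]))
    return names, blocks


# Toy alignment; the names and sequences are made up for illustration.
alignment = {
    "taxon_a": "ACGTTTGACA",
    "taxon_b": "ACGTTTGACA",
    "taxon_c": "ACGAAAGACT",
    "taxon_d": "ACGAAAGTCT",
}

names, blocks = shared_blocks(alignment, min_length=2)

# Code each surviving block as a single character (one state per taxon).
matrix = {
    name: [pattern[i] for _, _, pattern in blocks]
    for i, name in enumerate(names)
}
print(blocks)
print(matrix)
```

With the cutoff at two columns, the scattered single-base differences drop out and only the three-column stretch shared by taxon_a and taxon_b (versus taxon_c and taxon_d) survives as a character. A real implementation would need a more careful definition of "shared" and of how to treat gaps, but the filtering step is the point.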


Edit: The process of a priori selecting characters for a phylogenetic inference is not new nor is it unusual. All morphological cladistics works within this method. Ignoring and not including non-pattern (homoplasy) in sequences is no different than disregarding highly variable morphology (e.g. color) or bland uniformity. Like all good science, not including unnecessary data is a matter of efficiency.

Tuesday, March 22, 2011

The Extinction of Taxonomists?

This topic has been covered twice already this year. As one older systematist told me, this is a long term decline, not a sharp decrease. Whether caused by change of careers to molecular biology, a decline in museum and systematics funding, or a combination of the two in addition to correlating factors of alienation from nature, the number of systematists who focus on traditional taxonomy is and has been declining for a quarter of a century or more.

My first response to such doom and gloom is "I'm not dead yet!". The numbers are declining but there are still young people interested in morphology-based revisions and alpha taxonomy. My second comment is to suggest greater emphasis on decentralization of research. There's a reason I put "taxahacker" in my title box; individually driven research is the future of our field just as it was the origin. If it means a secret guild with unassuming 9-to-5 office jobs coming home to pore over specimens under the microscope in their basements, so be it. We are the oldest profession, and unless all life on this planet ceases to be there is no getting rid of us.

Monday, March 21, 2011

Book Review: Naming Nature.

My graduate adviser recently loaned me a copy of Carol Yoon's Naming Nature: the clash between science and instinct. Booklist's review stated it's "impossible to put down", which I found to be the case. I finished Naming Nature in one evening, finding it simultaneously inspiring and infuriating, and therefore deeply engaging. The next day I suggested it to all of my colleagues, calling it the best popular science book on systematics ever written.

Yoon's main premise is that there exists an innate human tendency to order and classify the natural world in distinct and evolutionarily conserved categories. She calls this the umwelt (pronounced oom-velt), from the German word meaning "the environment" or "the surrounding world". Psychologists use this term to refer to the collective phenomena in the environment capable of affecting an organism or individual. In ecology it has been used to refer to phenomena that individuals within a particular species are able to recognize. Yoon suggests there is an umwelt for every species, and that the human umwelt consists of a set of conserved categories and an instinctual need to classify the natural world in consistent ways. The evidence she uses to support the concept is varied but intuitive. I finally have a word to put on this concept, which has been floating around my head for several years, and that was, as I said, inspiring.

However, as I continued to read, I became increasingly frustrated with her conclusions. The human umwelt is very local in time and space, applying to what an individual sees on a daily basis and qualified by what matters most in terms of survival or aesthetics. This is in contrast to the deeply non-local scientific understanding of life, which extends across the entire planet and backwards in time billions of years. Humans are not generally prepared to drop their biologically and culturally grounded categories for something much more immense, so there is a conflict. Yoon's conclusion is that scientific discoveries (including progress in systematics from Linnaeus to evolutionary taxonomy all the way to Cladistics) have alienated humans from their own umwelten, and her solution is to return to the classic categories, going so far as to call a whale a fish in the final chapter of the book. All the while she complains about how those nasty cladists (myself included) have "killed the fish": "Fish" is not a monophyletic taxon (one including a common ancestor and all of its descendants) and is therefore invalid as a taxonomic grouping under the rules of Cladistics. Yoon claims that our new categories have made people alienated and apathetic, and therefore caused the extinction of many species.

After such an exquisite description of the human umwelt as revealed by science, this all seemed so backwards. Modern systematics through cladistics has added so much to our understanding of Life and our own evolutionary heritage. The many species we know of today were only revealed by the very methods that Yoon claims to be the cause of their demise. And it's very clear that we haven't lost our abilities or we would be as lost as the brain-damaged people she describes who cannot tell a lion from a raven. Even those systematists who work with molecules still retain this capacity, although their skills are not as strong as those of the classic morphological systematist. It seems her ire is misplaced.

The conclusion I have come to is very different. The human umwelt seems indeed to exist, and is in conflict with science in its classic form and categories. However, this conflict is not irreconcilable, nor is it ultimately the cause of species extinctions or human apathy. There are many other causes for these things, which are the subject of an entirely different discussion. The route to reconciliation is retraining of the umwelt to include evolutionarily real categories, which takes great time and effort but is entirely doable. The umwelt consists of a series of gestalten (singular: gestalt; German for "shape or form") which are the snapshots that allow immediate sensory identification of a living thing. Changing the umwelt consists of training one's gestaltenspeicher ("gestalt-memory") until the categories fall in line with science.

And from personal experience, I can say that I understand and accept these modern categories and do not feel any alienation from the natural world. In fact, my understanding has brought me closer to life. As Eliezer Yudkowsky, the AI researcher and rationalist, has stated on several occasions, "If we cannot take joy in the merely real, our lives will be empty indeed". I strongly suggest reading Yoon's book for the bits about umwelt and some insight into the history of systematics, but not taking her conclusions to heart. Instead, embrace the evolutionary understanding of life and train your own gestaltenspeicher; such things bring so much more joy than any foolish return to ignorance might grant over the short term.

Naming Nature at Amazon.com

More on umwelt from a biosemiotics perspective

The Six Principles of the ICZN: Coordination.

Now that we've established Binomial Nomenclature and Priority as founding principles of zoological nomenclature, we at least have a stable system of classification. However, there are still some loose ends to tie up.

Article 36. Principle of Coordination.

36.1. Statement of the Principle of Coordination applied to family-group names. A name established for a taxon at any rank in the family group is deemed to have been simultaneously established for nominal taxa at all other ranks in the family group; all these taxa have the same type genus, and their names are formed from the stem of the name of the type genus [Art. 29.3] with appropriate change of suffix [Art. 34.1]. The name has the same authorship and date at every rank.

Articles 43 and 46 provide the same statements in relation to genus-group and species-group names.

Essentially, when a new name is created in the family group (superfamily, family, subfamily, tribe, subtribe), genus group (genus and subgenus) or species group (species and subspecies), by this principle the names at all other ranks in that group are "created" on the same date, even if they are not discussed at that time. They exist in a sort of potential state until they are first talked about, at which point the author and date associated with the "new" taxon are referred to the publication by which it was created coordinately.

For example, if I published a new family, Ecksidae Burington 2011, by the Principle of Coordination Superfamily Ecksoidea, Subfamily Ecksinae, Tribe Ecksini, and Subtribe Ecksina would all be considered created at that point, even though I don't talk about them. If another author comes along later and decides to discuss the subfamily Ecksinae, even though it had never been formally discussed before that point, it would still be considered to be Ecksinae Burington 2011 by this principle.
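The Ecksidae example can be written out as a small Python sketch. This is only an illustration of the bookkeeping, not anything The Code prescribes as software: the function name is my own invention, the stem is passed in directly rather than derived from a type genus, and only the five family-group ranks listed above are covered.

```python
# Suffixes for the family-group ranks used in the example above.
FAMILY_GROUP_SUFFIXES = {
    "superfamily": "oidea",
    "family": "idae",
    "subfamily": "inae",
    "tribe": "ini",
    "subtribe": "ina",
}


def coordinate_names(stem, author, year):
    """Return the coordinate name at every family-group rank.

    Every rank gets the same authorship and date, as if all of the names
    had been established in the one original publication.
    """
    return {
        rank: f"{stem}{suffix} {author} {year}"
        for rank, suffix in FAMILY_GROUP_SUFFIXES.items()
    }


# Hypothetical family from the example; "Ecks" is the assumed stem.
names = coordinate_names("Ecks", "Burington", 2011)
for rank, name in names.items():
    print(f"{rank}: {name}")
```

Publishing any one of these names is treated as publishing all five, which is why a later author who first discusses the subfamily still cites it with the original author and date.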

This seems to be a completely unnecessary Principle until you consider the habits of taxonomists. We often enjoy raising or lowering the ranks of various groups to suit ease of classification and/or personal taste. If these groupings weren't already created in potential, you can imagine the cacophony of names and dates that could arise from this process. Again, like all other aspects of The Code, the Principle of Coordination is meant to preserve the stability of zoological nomenclature, in this case by stopping problems before they even arise. This principle also prevents the orphaning of taxa by the "type" clause; I'll be discussing typification more later.

Saturday, March 19, 2011

Changes in Science Publishing.

Recently, Jonathan Eisen and his colleagues at UC-Davis published this paper in PLoS ONE. The paper deals with shotgun sequencing (breaking up DNA into many small segments and sequencing the pieces, fitting them together where they overlap) of environmental samples from the ocean, and while it is interesting on its own merits, that is not why I bring it up. The very interesting thing about this paper is that the authors have chosen to forgo a press release through the university press office, and have instead used Eisen's professional blog to provide commentary on the article.

I note - we are not doing a press release for the paper, for a few reasons. But one of them is that, well, I am starting to hate press releases. So I guess this is kind of my press release. But this will be a bit longer than most press... releases. I note - my key fear here is that somehow in my communications with the press or in our text in the paper or in this post I will overstate our findings. Here is the punchline - we found some very phylogenetically novel forms of phylogenetic marker genes in metagenomic data. We do not have a conclusive explanation for the origin of these sequences. They may be from novel viruses. They may be ancient paralogs of the marker genes. Or they may be from a new branch of cellular organisms in the tree of life, distinct from bacteria, archaea or eukaryotes. I think most likely they are from novel viruses. But we just don't know.

PZ Myers pointed out just how revolutionary this is, and how it should scare people who are not currently up to snuff on communicating. This is the new model of science: public, accessible, and direct. The obvious advantage is that no misinterpretation or distortion occurs as the information moves from authors and journal article through science journalists and reporters to the public. Eisen has made it very clear that he has no explanation for where the strange sequences have come from, only hypotheses, and because of this it's impossible for this research to be interpreted any other way (though journalists may still try with hype words). It is very much like giving a seminar on your research for the whole world.

The disadvantage (to those inexperienced with blogging) is that every scientist may soon be expected to blog their discoveries simultaneously with publication. In addition, journals will continue to become more open access, so science will be more accessible to the public in raw form, necessitating explanations that the layperson can understand.

In this brave new world, I've decided to become an early adopter. I've changed my profile so this blog has my name and face, and I pledge to blog every single publication from this point on. As I am not yet published (one article nearly in press, two more on the way), this means that I pledge to blog every publication I ever write in my entire life.

In addition, I'm going to be updating this blog more often. I still need to finish the Principles of the ICZN series, which I had forgotten about.

Watch this space; the changes will continue.