There are some interesting criticisms in this paper. But they seem to rest on a definition of a term that is not shared between two roughly described "groups" of scientists. Ewen Birney, the lead analysis coordinator for the ENCODE project, discussed the controversial issue of "functional DNA" on his blog around the time of its publication:
Q. Hmmm. Let’s move onto the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word "functional" so let’s try to tackle this first. Like many English language words, "functional" is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of "functional" works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as "specific biochemical activity" - for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, "broad" histone modifications, "narrow" histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
Q. So remind me which one do you think is "functional"?
A. Back to that word "functional": There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I’ve pointed out in presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.
However, on the other end of the scale - using very strict, classical definitions of "functional" like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases - we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as "functional" by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people – certainly it was to me!
In addition, in this phase of ENCODE we did sample broadly but nowhere near completely in terms of cell types or transcription factors. We estimated how well we have sampled, and our most generous view of our sampling is that we’ve seen around 50% of the elements. There are lots of reasons to think we have sampled less than this (e.g., the inability to sample developmental cell types; classes of transcription factors which we have not seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20%...
Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an "80% overall" figure and a "20% conservative floor" figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we chose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best conveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to "4 million switches", and that represents the bound motifs and footprints.
We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity.
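To make the arithmetic behind that 18-20% extrapolation explicit, here is a rough sketch. All of the input numbers come straight from the quote above; the calculation itself is only an illustration, not part of ENCODE's own analysis:

    # Rough arithmetic behind the quoted 18-20% figure.
    # All inputs are taken from Birney's numbers above; this is only an
    # illustrative sketch, not ENCODE analysis code.
    strict_contacts = 0.08    # bound motifs + DNaseI footprints (8% of the genome)
    exons = 0.01              # exons add roughly another percentage point (8% -> 9%)
    sampling_fraction = 0.50  # ENCODE's "most generous" estimate of how many elements were seen

    observed = strict_contacts + exons          # ~9% of the genome
    extrapolated = observed / sampling_fraction
    print(f"Observed: {observed:.0%} -> extrapolated: {extrapolated:.0%}")
    # Prints "Observed: 9% -> extrapolated: 18%"; since sampling is likely
    # below 50%, Birney rounds the conservative floor up to roughly 20%.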
Ed Yong's blog pulls quotes from other scientists who think this definition is so broad as to be perhaps useless, while others question the interpretation but still praise the data:
Indeed, to many scientists, these are the questions that matter, and ones that ENCODE has dodged through a liberal definition of "functional". That, say the critics, critically weakens its claims of having found a genome rife with activity. Most of ENCODE's "functional elements" are little more than sequences being transcribed to RNA, with little heed to their physiological or evolutionary importance. These include repetitive remains of genetic parasites that have copied themselves ad infinitum, the corpses of dead and once-useful genes, and more.
To include all such sequences within the bracket of "functional" sets a very low bar. Michael Eisen from the Howard Hughes Medical Institute described ENCODE's definition as a "meaningless measure of functional significance" and Leonid Kruglyak from Princeton University noted that it's "barely more interesting" than saying that a sequence gets copied (which all of them are). To put it more simply: our genomic city's got lots of new players in it, but they may largely be bums.
This debate is unlikely to quieten any time soon, although some of the heaviest critics of ENCODE's "junk" DNA conclusions have still praised its nature as a genomic parts list. For example, T. Ryan Gregory from Guelph University contrasts their discussions on junk DNA to a classic paper from 1972, and concludes that they are "far less sophisticated than what was found in the literature decades ago." But he also says that ENCODE provides "the most detailed overview of genome elements we've ever seen and will surely lead to a flood of interesting research for many years to come." And Michael White from Washington University in St. Louis said that the project had achieved "an impressive level of consistency and quality for such a large consortium." He added, "Whatever else you might want to say about the idea of ENCODE, you cannot say that ENCODE was poorly executed."
While I think Birney did a better job of qualifying his statements in his blog post, there is simply no excuse for the way the 80% figure was handled in the actual publications and, more importantly, in the press releases and commentary put out by ENCODE. I also wasn't very happy with the way Birney essentially dismissed the noise argument out of hand. We know that ENCODE's data is noisy and full of false positives, which is OK, because ENCODE's job is to generate massive amounts of data to produce hypotheses that can be investigated in greater detail and combined with other data sources for different analyses. But that needs to be clear in their publications, and it flat out isn't. Transcription factors will bind to random non-promoter sites. Those sites COULD become functional in the future, but currently aren't. Transcription is noisy; a lot of transcripts get generated from pretty random portions of the genome. To call all of that functional is silly, plain and simple. I like the ENCODE project overall, but I was pretty pissed about the sloppy usage of "functional" and the amount of false positives generated by the nature of their cutoffs and analyses.
I understand the issue with the false positives. But personally I'd rather have the data and then be able to filter for whatever signal levels I think are appropriate. It's not hard to keep only signals over a certain value and proceed with your own analysis afterwards, as in the quick sketch below.
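For what it's worth, that filtering step really is only a few lines of code. Here is a minimal sketch assuming an ENCODE narrowPeak-style file, where signalValue sits in column 7; the file names and cutoff are placeholders, not real ENCODE settings:

    # Keep only peaks whose signalValue exceeds a chosen cutoff.
    # Assumes narrowPeak-style input (tab-separated, signalValue in column 7);
    # the paths and threshold are placeholders.
    THRESHOLD = 5.0

    with open("encode_peaks.narrowPeak") as infile, \
         open("filtered_peaks.narrowPeak", "w") as outfile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if float(fields[6]) >= THRESHOLD:  # column 7: signalValue
                outfile.write(line)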
I totally agree. I have no problem with the ENCODE data, I think it is a very valuable resource that I use all of the time. What I didn't like was the hype and commentary in the published papers surrounding the data release.
My problem is that the ENCODE concept of function shows a completely failed understanding of biochemistry, evolution, and genetics. We know that enzymes are not "perfect". We can reliably predict that DNA-binding proteins will bind in useless ways and that RNA transcription will occur in useless places. We also know how selection works. Knowing this, we know that lots of RNA will be transcribed that has no selective advantage, with the only disadvantage being the energy consumed. And we know that the totality of RNA transcription accounts for less than 1% of the energy costs of the cell. So spandrel transcription making up even 10% of all RNA transcribed would exert such minimal selective pressure as to be almost impossible to evolve away (see the quick arithmetic below). The same goes for added DNA length. This is not some blinding insight. We teach it to undergrads. Graduate students should be able to figure it out for themselves.
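Just to spell out that back-of-the-envelope number (both inputs are my own rough estimates from the paragraph above, not measured values):

    # Back-of-the-envelope cost of "useless" transcription, using the
    # rough estimates from the comment above.
    rna_energy_fraction = 0.01  # all RNA transcription: less than 1% of the cell's energy budget
    spurious_fraction = 0.10    # suppose 10% of that transcription is spandrel/useless

    wasted = rna_energy_fraction * spurious_fraction
    print(f"Energy wasted on spurious transcription: at most {wasted:.1%} of the budget")
    # Roughly 0.1% of the cell's energy budget, which is the point of the
    # argument above: a cost that small is very hard for selection to purge.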