In a paper published in Science and commented on in Nature News, Vivian Cheung and coworkers describe how they found 10,000 locations where the RNA differs from what you would expect from the DNA sequence. The Nature News comment mentions that other scientists have seen the same phenomenon. Have any of you observed it (or are you able to check)? Since a lot of people here should have related DNA and RNA sequences, we might be able to do a kind of community project, for instance checking whether the phenomenon occurs in multiple species.
BTW, there is an interesting comment on the "Genomes Unzipped" site, here, which indicates that the phenomenon might have causes other than RNA editing.
Nice idea - pool data from many sources and have a look. Of course, doing so may raise issues of consistent methods and technology. My sequencing machine may be more error-prone than yours, for example. Just such a switch in technologies is what got the longevity GWAS (the one from the USA, not the one from the Netherlands) into so much trouble.
One of the critical parts of the Cheung paper is the inclusion of peptide data to verify that some of the RNA editing changes also exist in protein and hence are not artifacts of sequencing from an RNA source (as opposed to DNA). Thus, if this were to become a community project, the same issue would need to be addressed: some peptide sequence data would be needed to check the validity of the RNA-DNA sequence differences.
I was thinking that we might turn the first argument around if we could show that different technologies and different datasets, maybe even different species, show the same RNA edits. But I agree that technology issues should be addressed very carefully.
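A minimal sketch of how such a cross-dataset comparison could look, assuming each contributor reduces their results to a set of (chromosome, position, DNA base, RNA base) tuples on a shared reference assembly. The function name, dataset labels, and coordinates are hypothetical and only for illustration:

```python
from collections import Counter


def recurrent_sites(datasets, min_datasets=2):
    """Return candidate sites reported in at least `min_datasets` datasets.

    `datasets` maps a dataset label to a set of
    (chrom, pos, dna_base, rna_base) tuples (a hypothetical format).
    """
    counts = Counter()
    for sites in datasets.values():
        # set() guards against a site being listed twice within one dataset
        counts.update(set(sites))
    return {site for site, n in counts.items() if n >= min_datasets}


# Made-up example data: three labs, two sequencing platforms
datasets = {
    "lab_a_platform_x": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    "lab_b_platform_y": {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")},
    "lab_c_platform_x": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
}

seen_everywhere = recurrent_sites(datasets, min_datasets=3)
seen_twice = recurrent_sites(datasets, min_datasets=2)
```

Note that a site recurring across platforms would argue against machine-specific sequencing errors, but not against a shared mapping artifact such as a paralog, which every pipeline using the same reference could reproduce.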
I agree that protein data would add to the strength of the findings. If we have large enough sets, we might even use existing (real) protein sequence data. Of course, automatically translated protein sequences wouldn't help, since they are derived from the nucleotide sequence itself.
As I remember, a couple of posters at the Biology of Genomes meeting came to a similar number: on the order of tens of thousands. Nonetheless, though I have not looked at these personally, I am more inclined to trust Joe and my friends: most of these are due to mapping errors. In my opinion, false mapping is becoming the leading source of errors in SNP calling. It only gets worse for RNA-seq, where most mappers have not reached the level of accuracy needed for reliable SNP calling, let alone for discovering rare events.
On the other hand, I cannot provide a convincing alternative explanation for the high validation rate using Sanger sequencing and MS. Mapping errors caused by paralogs, as Joe has mentioned, should be more evident given 1 kb Sanger reads. The authors should have noticed that.
I noticed a similar phenomenon in the fungi I work with before I came across that paper. We sequenced the genomes of a few diploid strains, and they turned out to be highly homozygous. Then, for many transcripts, I found apparent heterozygous SNPs that were not present in the genomic sequence. So far we are unable to explain that, and we don't have any experimental evidence.
There is little, if any, chance of incorrect mapping: i) very few paralogous regions in those genomes, ii) no introns, iii) similar results obtained using various mappers.
I haven't looked for particular error/editing patterns so far, but I think this phenomenon is worth studying in detail.
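For anyone who wants to run the same kind of check on their own data, here is a rough sketch of the comparison described above: find sites where the genome call is homozygous but the RNA reads show a second allele. The input formats, the function name, and both thresholds are assumptions made for illustration, not the pipeline actually used:

```python
def rna_only_variants(dna_calls, rna_counts, min_depth=10, min_alt_fraction=0.2):
    """List sites where RNA reads show a base absent from the genome call.

    dna_calls:  {(chrom, pos): base} for sites called homozygous in the genome
    rna_counts: {(chrom, pos): {base: read_count}} from the RNA-seq alignment
    Both thresholds are arbitrary illustrative defaults.
    """
    hits = []
    for site, counts in rna_counts.items():
        dna_base = dna_calls.get(site)
        depth = sum(counts.values())
        # Skip sites without a genomic call or with too little RNA coverage
        if dna_base is None or depth < min_depth:
            continue
        for base, n in counts.items():
            if base != dna_base and n / depth >= min_alt_fraction:
                hits.append((site, dna_base, base, n, depth))
    return sorted(hits)


# Made-up example data
dna_calls = {("c1", 10): "A", ("c1", 20): "C", ("c1", 30): "G"}
rna_counts = {
    ("c1", 10): {"A": 12, "G": 8},   # second allele in RNA only
    ("c1", 20): {"C": 30},           # RNA agrees with DNA
    ("c1", 30): {"G": 3, "T": 3},    # too shallow to judge
    ("c2", 5): {"A": 10, "T": 10},   # no genomic call at this site
}

candidates = rna_only_variants(dna_calls, rna_counts)
```

Tabulating the resulting DNA-to-RNA base changes would be a natural next step, since a bias toward particular substitutions could help distinguish editing from random error.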
Joe Pickrell has a nice post at Genomes Unzipped about this paper. He discusses ways in which paralogs and splice junctions could lead to false positives in their analysis.
Yes, that is the same post I mentioned. It is indeed very interesting. I will change the term "technical problem" since that might be misleading.
Oops, sorry Chris -- I hadn't seen your edit before commenting.
There are now several interesting comments on Joe Pickrell's post. It is worth checking back if you read it earlier.