I was working on a material that I put together as an example of studying the Ebola data and strangely code that ran fine in the Fall now did not produce any results. Quite the head scratcher ...
I ended up troubleshooting it for a while only to realize that most of the new data deposited into the main Ebola project does not actually map to Ebola. At all. In fact 541 files map at rates below 1%!
Below is the code used to evaluate the mapping percentages for all data (891 files) in this project, see the results at the end). Hopefully it might save some time to someone else.
I was curious to find out where the more than 2.5 million reads not mapping to Ebola come from. First I mapped them to some human genes (dhfr, actb, 18S rRNA), but does not got any hits. The serum seams to be free of human RNA. Then I mapped to some bacterial rRNA operons from several phyla (proteobacteria, firmicutes, actinomycete). I got diffuse mapping with thousands of reads for all of them. Thus the major fraction of RNA in this sample is bacterial rRNA from a broad spectrum of species. Presumably the serum was stored a room temperature for an elevated period of time.
I also noted that all of the read sets with high amount of reads mapping to Ebola belong to the 99 fully assembled sequences submitted to Genbank along with the Science paper. The other read sets seam to be just the trash of that great study.
That's pretty cool. Thanks for sharing, I wish more studies had posts like this were we can talks about it.
Also since we are talking about it, here is the original post that I was trying to reproduce: Mission Impossible: you have 1 minute to analyze the Ebola Genome As it turns out this time one needs to pick the right data.
I've selected the first 20K (paired) reads from the samples so that I could map all samples (891). I wonder was the mapping percentage 0.25% a good approximation?. But you are right in that this may be sufficient. I have also noted low sequencing accuracy for the samples with low coverage.
The ratio of mapped reads is 0.3 % in my mapping, which is very close to your result obtained with only the first 20K.