I have several lanes of paired-end Illumina RNA-Seq data from mouse. In some lanes, fewer than 20% of reads map to known exons, which suggests that much of the remaining 80% is contamination by genomic DNA (rather than cDNA derived from RNA). At the other extreme, the "best" sample has almost 80% of reads mapping to known exons.
What is a typical value for the fraction of genomic contamination in an RNA-Seq dataset? Can I do anything useful with a lane of RNA-Seq where 80% of the reads aren't RNA-derived? How about 50%? 30%? 20%? I was hoping to use these data to study alternative splicing, but I assume that the genomic reads would cause many false-positive calls of intron inclusion and of alternative 3' and 5' splice sites. Could I still study other types of splicing events, such as exon skipping and cassette exons, since these would produce long apparent insert lengths that could not be confused with genomic DNA?
How do you know it isn't rRNA?
I didn't personally compute the statistics, but I believe that for the purposes of this calculation, "known exons" included non-protein-coding transcribed sequences such as ribosome genes.
I've talked to the person who computed the statistics. Since there are many copies of the ribosomal DNA in the genome, any ribosomal reads would be filtered out because they align to too many locations.
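For anyone who wants to reproduce this kind of statistic, here is a minimal sketch of how I understand it: drop reads that align to more than one location, then report the fraction of the remaining reads that overlap a known exon. This is not the actual pipeline that was used. The function names, the tuple layout `(read_id, chrom, start, end)`, and the half-open coordinates are all assumptions made for illustration, and each alignment record is treated as a single unit rather than as a mate pair.

```python
from bisect import bisect_right
from collections import Counter


def build_exon_index(exons):
    """Merge overlapping exons per chromosome into sorted, disjoint intervals.

    exons: iterable of (chrom, start, end), half-open coordinates (assumed).
    Returns {chrom: (starts, ends)} with parallel sorted lists.
    """
    by_chrom = {}
    for chrom, start, end in exons:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_exon(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any merged exon."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged exon that starts before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def exonic_fraction(alignments, exons):
    """Return (fraction of uniquely mapped reads overlapping an exon,
    number of uniquely mapped reads, number of multi-mapped reads dropped).

    alignments: iterable of (read_id, chrom, start, end), one per alignment.
    """
    alignments = list(alignments)
    hits = Counter(read_id for read_id, _, _, _ in alignments)
    unique = [a for a in alignments if hits[a[0]] == 1]
    n_multi = sum(1 for count in hits.values() if count > 1)
    if not unique:
        return 0.0, 0, n_multi
    index = build_exon_index(exons)
    n_exonic = sum(overlaps_exon(index, c, s, e) for _, c, s, e in unique)
    return n_exonic / len(unique), len(unique), n_multi


# Toy data (made up): r5 aligns twice, so it is dropped like a ribosomal read would be.
exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
alignments = [
    ("r1", "chr1", 120, 170),
    ("r2", "chr1", 400, 450),
    ("r3", "chr2", 70, 120),
    ("r4", "chr1", 300, 350),
    ("r5", "chr1", 110, 160),
    ("r5", "chr2", 55, 75),
]
fraction, n_unique, n_multi = exonic_fraction(alignments, exons)
print(fraction, n_unique, n_multi)
```

Note that in this sketch the multi-mapped reads are removed from the denominator as well as the numerator, so they do not count toward the "non-exonic" fraction. Whether the real calculation did the same is something I would have to confirm.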