Hello,
I'm working with RNA-seq libraries coming from human plasma cell-free RNA, which is very fragmented/degraded (the bioanalyzer shows peaks around 100-200bp). When performing the alignment of those sequences we see that the majority of the reads map to introns/intergenic regions and we obtain less than 10% of the reads mapping to exons.
Our libraries are total RNA with rRNA depletion.
Seeing this, we thought that it was DNA contamination, but after performing a DNase treatment (and confirming with Qubit that no DNA remains), the samples don't seem to improve (we see the same exonic percentage). Also, when checking the BAM files in the UCSC Genome Browser we see that there are a lot of reads that map all over the genome, which is characteristic of DNA contamination.
Could there be an explanation other than DNA contamination for a high quantity of intronic/intergenic reads?
Thank you very much! Lluc
What is the length distribution of your reads after scanning/trimming? You knew that the sample was fragmented/degraded. Perhaps the reads have become too short post-trimming and are simply aligning by chance. A FastQC trace will be enough to check.
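If a FastQC report is not at hand, the same check can be roughed out in a few lines. This is only a minimal sketch, assuming standard four-line FASTQ records; the function names and the 30 bp cutoff in the usage note are illustrative choices, not anything from the thread.

```python
import gzip
from collections import Counter


def read_length_histogram(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    lengths = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths[len(line.strip())] += 1
    return lengths


def fraction_shorter_than(lengths, cutoff):
    """Fraction of reads strictly shorter than `cutoff` bases."""
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    short = sum(n for length, n in lengths.items() if length < cutoff)
    return short / total


def histogram_from_file(path):
    """Convenience wrapper: read a plain or gzipped FASTQ file (hypothetical path)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_length_histogram(handle)
```

For example, `fraction_shorter_than(histogram_from_file("trimmed_R1.fastq.gz"), 30)` (file name hypothetical) gives the share of very short reads, which are the ones most likely to align by chance.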
We also thought about that, but the STAR output shows that the majority of the aligned reads are uniquely aligned. The avg_input_read_length reported by STAR is >150, so I don't think this is the reason.
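For anyone wanting to pull these numbers out programmatically, here is a minimal sketch that assumes the usual `key | value` layout of STAR's `Log.final.out`. The sample text in the usage note is made up for illustration and is not data from this thread.

```python
def parse_star_log(text):
    """Parse 'key | value' lines from a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip section headers and blank lines
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_number(value):
    """Convert a STAR value such as '85.00%' or '151' to a float."""
    return float(value.rstrip("%"))


def length_summary(stats):
    """Return (average input length, average mapped length, unique mapping %)."""
    return (
        as_number(stats["Average input read length"]),
        as_number(stats["Average mapped length"]),
        as_number(stats["Uniquely mapped reads %"]),
    )
```

Usage would look like `length_summary(parse_star_log(open("Log.final.out").read()))`. Note that for paired-end data STAR reports the input length as the sum of both mates, so a value above 150 does not by itself say how long each mate is.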
Just to confirm. You are getting 150 bp unique read matches in intronic/intergenic regions? At what depth?
The avg_mapped_read_length is very similar to the avg_input_read_length (slightly higher). I don't know if there is a difference in mapping between exonic and intronic/intergenic regions; I will check that.
If it is one read or a few reads, that may be OK, but if you are seeing pileups equivalent to those of real data, then that is puzzling. Any chance you are dealing with a contaminated batch of reagents somewhere?
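One way to make that comparison is sketched below, under some loud assumptions: alignments have already been exported as `(chrom, start, end)` reference spans with half-open coordinates, exons come from whatever annotation was used for counting, and spliced reads are treated by their full span. All names here are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlaps_any(merged, starts, start, end):
    """True if [start, end) overlaps one of the sorted, disjoint merged intervals."""
    i = bisect_right(starts, start) - 1  # last interval starting at or before `start`
    if i >= 0 and merged[i][1] > start:
        return True
    j = i + 1  # first interval starting after `start`
    return j < len(merged) and merged[j][0] < end


def mean_length_by_class(alignments, exons_by_chrom):
    """Mean aligned span for reads that touch an exon versus reads that do not."""
    index = {}
    for chrom, intervals in exons_by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = (merged, [s for s, _ in merged])
    totals = {"exonic": [0, 0], "non_exonic": [0, 0]}  # [summed length, read count]
    for chrom, start, end in alignments:
        merged, starts = index.get(chrom, ([], []))
        label = "exonic" if overlaps_any(merged, starts, start, end) else "non_exonic"
        totals[label][0] += end - start
        totals[label][1] += 1
    return {k: (s / n if n else None) for k, (s, n) in totals.items()}
```

If the non-exonic reads turned out to be much shorter than the exonic ones, that would point toward chance alignments; similar lengths in both classes would argue against it.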
I don't think so, because this has happened with multiple batches of samples and with different RNA extraction/library preparation kits.
I can confirm that the distribution of mapped reads between genic and intergenic regions is the same.
This could be a bit far-fetched, but maybe you can clarify what else was sequenced on the same sequencing lane as your samples. A couple of times we saw a "leakage" of libraries within the same seq. lane.
It could be an option, but we have seen this in different sequencing runs. I will ask to see if this could be the reason, thanks!
This would only apply if your samples were mixed with others and sequenced as a super pool.