I understand that in RNA-seq we will always get some level of reads mapping to intergenic areas. However, I am seeing almost 85% of reads align outside coding regions. I was wondering what may be the cause of this. Does that mean contamination has overpowered the actual data?
However I am seeing almost 85% of reads align to intergenic region.
This is, IMO, extremely unlikely, and I have never seen it, even with the imperfect annotation of our model organism, the salmon louse. I did a small randomization experiment on our data by placing random gene models in intergenic regions, and found that for most samples the background read count expected in intergenic regions at the 99% confidence level is 1. I don't have exact figures for reads overlapping intergenic regions, but 85% is very high. I would check the following:
correct annotation version
ribosomal genes missing from the annotation, combined with a high level of rRNA
draft annotation with a large number of truncated gene models and missing or truncated UTRs (add flanks of a few kb to the genes and check again)
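To make the last check concrete, here is a minimal sketch of the "add flanks and check again" idea. It assumes a single chromosome, read positions reduced to one coordinate each, and half-open gene intervals; the function names and the toy coordinates are made up for illustration and are not taken from anyone's actual pipeline.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def intergenic_fraction(read_positions, genes, flank=0):
    """Fraction of reads outside every gene after padding genes by `flank` bp."""
    if not read_positions:
        return 0.0
    padded = merge_intervals(
        [(max(0, start - flank), end + flank) for start, end in genes]
    )
    starts = [start for start, _ in padded]
    outside = 0
    for pos in read_positions:
        # Index of the last padded interval starting at or before pos.
        i = bisect_right(starts, pos) - 1
        if i < 0 or pos >= padded[i][1]:
            outside += 1
    return outside / len(read_positions)


# Toy data (hypothetical): two genes and seven read positions.
genes = [(1000, 2000), (5000, 6000)]
reads = [500, 1500, 2100, 2500, 4800, 5500, 9000]

for flank in (0, 1000):
    print(flank, intergenic_fraction(reads, genes, flank))
```

If the intergenic fraction drops sharply once flanks are added, as it does in this toy example, truncated gene models or missing UTRs are a plausible explanation. If it barely moves, the reads are probably coming from somewhere else.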
Are you using the correct reference? ;-)
It sounds like a wet-lab problem. Are you sure you have removed DNA from your samples?
Yes, I used the correct genome for aligning. I am looking at the distribution of reads in the BAM file.
Almost every time this happens it's because someone's using the wrong genome in IGV :)
Do you define as "intergenic" everything that couldn't be assigned to a gene? What about intronic regions? This could possibly also be caused by looking at the wrong strand, or by using different genome versions for the annotation and the mapping. Did you already check these cases?
Is your genome/species of interest well characterized?
mm9, which is well characterized.
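A rough way to test the wrong-strand idea is to count how many gene-overlapping reads fall on the same strand as the gene versus the opposite strand. The sketch below assumes each read has already been reduced to a position and a strand, and uses made-up toy coordinates; `strand_agreement` is a hypothetical helper, not a function from any existing tool.

```python
def strand_agreement(reads, genes):
    """Count gene-overlapping reads on the gene's strand vs. the opposite strand.

    reads: iterable of (position, strand)
    genes: iterable of (start, end, strand), half-open intervals
    """
    same = opposite = 0
    for pos, read_strand in reads:
        for start, end, gene_strand in genes:
            if start <= pos < end:
                if read_strand == gene_strand:
                    same += 1
                else:
                    opposite += 1
                break  # count each read once, against the first gene it hits
    return same, opposite


# Toy data (hypothetical): one gene per strand, six reads.
genes = [(1000, 2000, "+"), (5000, 6000, "-")]
reads = [
    (1100, "-"), (1500, "-"), (1900, "+"),
    (5200, "+"), (5800, "+"), (9000, "+"),
]

same, opposite = strand_agreement(reads, genes)
print("same strand:", same, "opposite strand:", opposite)
```

If nearly all gene-overlapping reads land on the opposite strand, the strandedness setting used for counting is probably inverted relative to the library protocol. A roughly even split would point to an unstranded library instead.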
Are these ribo-depleted samples or polyA-enriched? If the former, then perhaps you're seeing expressed repeat regions (there can be quite a few).
Ribo-depleted. I think it is the sample preparation, but I want to make sure.
Try aligning against the Rn45S sequence and see how many reads you get. I routinely do that with our rRNA-depleted datasets to see how depleted they actually are. My guess is that Michael is right and you're getting a bunch of rRNA (and probably tRNA). Just look at some of the higher-coverage areas on the UCSC Genome Browser with the RepeatMasker track enabled. I suspect that'll be illuminating.
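Once the reads are aligned against the rRNA sequence, the fraction that maps can be computed from `samtools idxstats` output, which is tab-delimited with one line per reference: name, length, mapped count, unmapped count, plus a final `*` line for unplaced reads. The sketch below only parses that text. The reference name `Rn45s`, its length and the read counts are invented for illustration, not real results.

```python
def mapped_fraction(idxstats_text, reference):
    """Fraction of all reads mapped to `reference`, from samtools idxstats text."""
    mapped_to_ref = 0
    total = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, unmapped = line.split("\t")
        mapped, unmapped = int(mapped), int(unmapped)
        total += mapped + unmapped
        if name == reference:
            mapped_to_ref += mapped
    return mapped_to_ref / total if total else 0.0


# Hypothetical idxstats output for an alignment against a single rRNA sequence.
example = "Rn45s\t13400\t620000\t0\n*\t0\t0\t380000\n"

fraction = mapped_fraction(example, "Rn45s")
print(f"{fraction:.1%} of reads map to the rRNA reference")
```

A high fraction here would indicate that the ribo-depletion did not work well, which fits the suspicion about sample preparation.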
Does anyone know why this was deleted?
Nope, can we restore it?
I opened it, but I guess the OP deleted this.