Hi,
I have RNASeq data from mouse and am trying to (1) align it using STAR, and (2) estimate gene counts using HTSeq. I am using the .fa and .gtf files from Ensembl that correspond to mm9 (Mus_musculus.NCBIM37.67.dna.toplevel.fa
, Mus_musculus.NCBIM37.67.gtf
).
My question is what is acceptable percentage of reads that show "no features" and "not unique" as the result of htseq-count function?
Here is the parameters I used for htseq:
​samtools view \
-h $sorted.bam | \
python2.7 htseq-count \
-m intersection-nonempty \
-f sam \
-r name \
-s no \
-a 0 \
-i gene_id \
-o samout.out \
- \
Mus_musculus.NCBIM37.67.gtf > result.count
For example, one of my RNASeq samples has 36,269,180 uniquely mapped genes (as assessed by STAR), but following HTseq-count 6,065,105 (16%) of the reads show __no_feature
and 19,474,101 (54%) show __alignment_not_unique
. This means only 30% of original reads are mapped to the genes; Isn't this too low?
Thank You,
Farnoud
I've found that the fraction of intergenic reads varies considerably with the kit used for RNA-seq library prep. Do you know which kit(s) were used to prepare your libraries?
You might also want to consider running RNA-SeQC (Broad Institute), which will calculate the fraction of reads that are exonic, intronic, intragenic etc., and also the fraction of rRNA reads.
Thanks a lot Chris for your reply. It is very useful! I will figure out the kit and let you know.
54% does seems like a lot. Mine was around 15~23% for not unique and 6~13% for no feature. And the command I used for htseq is much simpler:
Thanks Sam. I too think these numbers are a lot!; But couldn't find a reference to compare these to.
Most of the parameters above are just default ones (except
-m=non-empty
,-s=no
and-a=0
, which I modified to lower the stringency of htseq-count).As for ambiguous calls, I don't think there is much I can do as it depends on how the reads are located, but the 16% no-feature (i.e. mapping to no exons/genes) is what worries me the most. I have checked several times that .gtf and .fa files are both the same Ensembl built, and don't know where else this problem may come from?
I wonder if this is due to rRNA or tRNA. Those are often not annotated in the genome, so if you have a sample with a really high amount of them then I could see this happening. That would indicate to me that something went wrong during library prep. You might look at where some of these are mapping along side a repeatmasker track. That'd help resolve whether this is the case. In either case, I suspect that you'll end up excluding this sample from further analysis.
but 36 million reads mapped on mouse genome uniquely, thats good number for downstream analysis
The problem is, even with a lot of uniquely mapped read, the high amount of non-unique reads might still suggest a possibility of having technical errors in library prep or other up-stream process, and you won't be certain whether if they will or will not affect the down-stream analysis. Maybe those uniquely aligned reads have special properties?
I guess for now, the best way to do is to follow the instruction provided by #Devon Ryan and hopefully understand whether if it is the rRNA / tRNA problem....
Agreed. I think that if the weird metrics are due to something like rRNA or some other species then perhaps the sample is salvageable. If, however, it's due to background DNA contamination, then you're unlikely to get anything useful from the samples, at least if the goal is standard differential expression. The reason for this is that the normal (edgeR/DESeq2) sample normalization methods aren't designed to compensate for this (they basically perform a scaling with a weight, but there's also an offset here).
Thank you all for your feedback and specially Devon for your insightful recommendation. Thinking about what/how these numbers come from, I came up with another ambiguity: I am using the output of STAR that only has "unique reads" (double checked the number of reads in STAR Aligned.sam output is the same as the reported number of unique reads in Log.final.out). Then, how it is possible that 54% of the reads that are aligned "uniquely" turn out as "aligned not unique" by HTSeq? Does that mean that the Ensembl gene model I am using for HTSeq-count (Mus_musculus.NCBIM37.67.gtf) has some issues and need to be curated before using it?
It depends on what STAR and htseq-count are describing as unique. Do a
samtools view -c -q 5 foo.bam
and see if the result from that matches what STAR is telling you. That should also match the results ofsamtools view foo.bam | grep -cw "NH:i:1"
. HTSeq-count is basing things off of the NH aux tag.Devon, I think I am not estimating the numbers correctly:
According to
STAR final.log
, there are 41,539,731 total reads in this RNASeq paired-end experiment, from which 34,373,992 are uniquely mapped to mm9.Since, my data is paired-end, I modified the command you advised above and did:
which gives me 66,183,904. I assume this should be the total number of properly mapped reads that both ends map to only to 1 region, right? So dividing this by 2 should give me the number of unique properly mapped paired reads (33,091,95). However the latter is not:
(1) equivalent to what STAR reported as "uniquely mapped" reads (34,373,992); nor
(2) what HTSeq numbers suggest:
19,474,101 (htseq.count
alignment_not unique
) - 41,539,731 (total number of reads based on STAR) = 22,065,630There must be something I am doing wrong here, but don't know exactly what it is! Appreciate your help.
Thanks,
Noushin
PS. I also estimated the samtools command without filtering
-f 2
flag, but it still was less than the number that STAR reported.