Hello everyone, I have a question related to RNA-Seq and HTSeq-Counts. A very brief description:
There is a research paper that applies RNA-Seq methods to few samples, from 2016. For my project, I follow the same steps on the same samples they used.
No adapters, I run tophat2 with the reference genome, then samtools, then HTSeq count (htseq-count -r pos -f bam -s yes -m intersection-strict --stranded=no .......)
According to the paper, authors found read counts around ~25,000,000 . I find ~1,000,000 less counts than the paper, for each sample. If It is ~24M, I find ~23M.
Do you have any idea what can cause this? Thanks! ✌️
In my opinion that's to be expected. Unless you match every piece of software and version, including the same aligner, and the same commands, then yeah that's normal.
Thank you for your reply! Also, ~1M loss in ~20M sounds very minor to me, would you agree on this?
Most NGS aligners are non-deterministic i.e. you may not get exactly identical alignments if you run an alignment more than once. That said there are aligners (e.g.
bbmap
) that will allow you to do deterministic alignments. Rather than focusing on number of reads check the final DE result and see if that is largely concordant (it may not be 100% identical).I would agree but the details are in the devil. Why and when were the reads excluded? Does it affect DE? do the DE counts match? etc etc etc