Hi all,
I am using samtools to sort my BAM files by position (the default) and then htseq-count to obtain read counts. Initially I got a large number of 'Mate records missing' warnings. I then realized that htseq-count assumes the files are sorted by name, so I added the '-r pos' option and re-ran it. I now get fewer 'Mate records missing' warnings, but they are still there. So my questions are: 1. Is there a way to eliminate the warnings entirely? 2. Which of the following pipelines is better?
- samtools sort by name + htseq-count without -r pos
- samtools sort by position + htseq-count with -r pos
I read the developer's comments at https://github.com/simon-anders/htseq/issues/37 but I still couldn't work out how to fix the process properly.
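For concreteness, the two pipelines in the question can be written out as commands. This is only a sketch: the file names (sample.bam, genes.gtf) and the helper name htseq_pipeline are made up for illustration, and the script only prints the commands rather than running them. The point is that the value passed to -r has to match how the BAM file was actually sorted.

```python
import shlex


def htseq_pipeline(bam, gtf, order):
    """Build the sort and count commands for one sort order.

    order is "name" or "pos". File names are placeholders, and this helper
    is hypothetical; it only assembles argument lists.
    """
    if order not in ("name", "pos"):
        raise ValueError("order must be 'name' or 'pos'")
    stem = bam[:-4] if bam.endswith(".bam") else bam
    sorted_bam = f"{stem}.{order}sorted.bam"

    # samtools sort orders by coordinate by default; -n orders by read name.
    sort_cmd = ["samtools", "sort"]
    if order == "name":
        sort_cmd.append("-n")
    sort_cmd += ["-o", sorted_bam, bam]

    # htseq-count: -f gives the input format, -r the sort order of the input.
    count_cmd = ["htseq-count", "-f", "bam", "-r", order, sorted_bam, gtf]
    return sort_cmd, count_cmd


if __name__ == "__main__":
    for order in ("name", "pos"):
        for cmd in htseq_pipeline("sample.bam", "genes.gtf", order):
            print(shlex.join(cmd))
```

To actually run a command, the same list could be passed to subprocess.run(cmd, check=True), with the htseq-count output redirected to a file.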
As @Devon suggested in an earlier question, you should use featureCounts instead. It is much faster, can sort files automatically as needed, and produces an analysis-ready count matrix from the set of BAM files you provide, which makes downstream import easy.
Thanks! In that case I do not need to sort the BAM file with samtools, right?
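For reference, a featureCounts call over several BAM files might be assembled as below. Treat it as a rough sketch: the file names, thread count, and the helper name featurecounts_cmd are placeholders. One caveat to check against your installed version: as far as I know, recent Subread releases also need --countReadPairs together with -p to count fragments instead of reads.

```python
import shlex


def featurecounts_cmd(bams, gtf, out="counts.txt", paired=True, threads=4):
    """Assemble a featureCounts command line (hypothetical helper).

    -T: number of threads, -a: annotation file, -o: output count table,
    -p: input is paired-end. All BAM files go at the end, so the result
    is one table with one count column per file.
    """
    cmd = ["featureCounts", "-T", str(threads), "-a", gtf, "-o", out]
    if paired:
        cmd.append("-p")
    return cmd + list(bams)


if __name__ == "__main__":
    # Placeholder file names, not taken from the thread.
    print(shlex.join(featurecounts_cmd(["s1.bam", "s2.bam"], "genes.gtf")))
```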
The BAM file still needs to be sorted, and AFAIK the requirements differ slightly between paired-end (fragment) and single-end (read) quantification. Basically, featureCounts will try to fix the mate pairs if it detects inconsistencies, but that is much slower than the actual read counting, so it's best to make sure your files are sorted correctly beforehand. Samtools has options to fix mate information or to remove unpaired reads altogether.
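One way the samtools cleanup could look is sketched below, with the caveat that this is my reading of the suggestion and not a tested recipe. samtools fixmate expects name-grouped input, so the file is name-sorted first; the final samtools view step keeps only reads flagged as paired (-f 1) and drops any read whose own or mate's unmapped flag is set (-F 12). Whether that removes every orphan depends on how the flags were set in your file. File names and the helper name mate_cleanup_cmds are placeholders.

```python
import shlex


def mate_cleanup_cmds(bam, stem):
    """Build a name-sort -> fixmate -> filter chain (hypothetical helper)."""
    name_sorted = f"{stem}.namesorted.bam"
    fixed = f"{stem}.fixmate.bam"
    paired_only = f"{stem}.paired.bam"
    return [
        # Group mates together by sorting on read name.
        ["samtools", "sort", "-n", "-o", name_sorted, bam],
        # Fill in mate coordinates and related fields.
        ["samtools", "fixmate", name_sorted, fixed],
        # Keep paired reads (flag 1) where neither the read (4) nor its
        # mate (8) is unmapped; -b writes BAM output.
        ["samtools", "view", "-b", "-f", "1", "-F", "12", "-o", paired_only, fixed],
    ]


if __name__ == "__main__":
    for cmd in mate_cleanup_cmds("sample.bam", "sample"):
        print(shlex.join(cmd))
```

The resulting file is still name-ordered, which is the order htseq-count assumes when -r is not given.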
Thanks! Actually, I have tried name-sorted, position-sorted, and unsorted BAM files with featureCounts. The outputs were pretty much the same, with only minor differences.
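If it helps to pin down where those minor differences are, two count tables can be compared gene by gene. The sketch below assumes the default featureCounts table layout (a '#' comment line, then a header with Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM) and integer counts; the toy tables are invented purely to show the call, not real results.

```python
import csv


def read_counts(text):
    """Parse a featureCounts-style table into {gene_id: count}.

    Only the first sample column (index 6) is read; integer counts assumed.
    """
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.reader(rows, delimiter="\t")
    next(reader)  # skip the header line
    return {row[0]: int(row[6]) for row in reader}


def diff_counts(a, b):
    """Return {gene_id: (count_a, count_b)} for genes whose counts differ."""
    genes = sorted(set(a) | set(b))
    return {g: (a.get(g, 0), b.get(g, 0)) for g in genes if a.get(g, 0) != b.get(g, 0)}


if __name__ == "__main__":
    # Invented toy tables for illustration only.
    header = "# toy table\nGeneid\tChr\tStart\tEnd\tStrand\tLength\tx.bam\n"
    name_sorted = header + "geneA\tchr1\t1\t100\t+\t100\t10\ngeneB\tchr1\t200\t300\t-\t101\t5\n"
    pos_sorted = header + "geneA\tchr1\t1\t100\t+\t100\t10\ngeneB\tchr1\t200\t300\t-\t101\t4\n"
    print(diff_counts(read_counts(name_sorted), read_counts(pos_sorted)))
```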