I am using htseq-count for getting the read count of my genes from RNASeq data.While I have got the read counts but I am getting the warning below:
Warning: 35062120 reads with missing mate encountered.
35138401 SAM alignment pairs processed.
I am guessing this could be because of how I sorted my sam file.I did not specify any parameters while sorting.This was the command I used(gnu parallel). {1} is used to specify my input files.
samtools sort - -m 40G {1/.}.sorted
My question is do I have to sort by any specific parameter such as -n ?For example do I need to rerun sorting and then produce results of htseq-count.
To be honest, I have used sorted as well as unsorted files in htseq-count and it gave me the same warnings. But either way it works fine. You can just ignore the warnings.
Actually you are right. Samtools manual says it is position, and htseq-count says it is by name. It's confusing.
From htseq-count manual:
For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name.
If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.
First check how many reads have proper mates. May be using samtools flagstat
If you find many reads with missing mates, like shown in htseq warning, there are no issues with htseq. If you find most of the reads have their mates aligned, then
Sort the samfile either by position ( default ) or name ( -n ). Then use latest version of htseq and set the -r option to eithe name ( -r name [default] ) or position ( -r pos ) based on how you have sorted the bam file. Make sure you have used the strandedness (-s) option appropriately.
htseq-count requires it sorted but it can be either name or alignment position. You can ignore the warnings. Use -q to suppress warnings.
So,can i just ignore the warning?
To be honest, I have used sorted as well as unsorted files in htseq-count and it gave me the same warnings. But either way it works fine. You can just ignore the warnings.
Okay.What is the default sort if you dont specify any parameter?Is it by name or position
name. Also read htseq-count for pair-end RNA-seq post.
I think the default sorting is by position not by name
Actually you are right. Samtools manual says it is position, and htseq-count says it is by name. It's confusing.
From htseq-count manual:
From samtools specification:
What version of htseq are you using? it's true that it can work also with coordinate sorted bam files, but only the latest versions have that option.
My suggestion is that you try to sort it by name and see if you get the same warnings. I wouldn't be confident in simply ignoring them
I am using HTSeq-0.6.1,its the latest version.What option can I specify for coordinated sorted bam files?