htseq-count for pair-end RNA-seq
2
0
Entering edit mode
10.6 years ago
Ming Tommy Tang ★ 4.5k

Hi there,

I used htseq-count to count the pair-end RNA-seq data.

at the end, there are warnings like

Warning: 8612298 reads with missing mate encountered.

my question is how htseq-count deals with the pairs with only one end existing in the sam file?

does htseq-count simply discard those counts?

Thanks,
Ming

RNA-Seq • 14k views
ADD COMMENT
0
Entering edit mode
are you sure that these reads really don't have a mate in the sam file? If you didn't sort the sam file by read name but you told htseq it is, htseq thinks there are lot of missing mates
ADD REPLY
0
Entering edit mode

I sorted it with samtools sort -n by name. any ideas? Thank you.

ADD REPLY
1
Entering edit mode
10.6 years ago
Ido Tamir 5.2k

HtSeq count counts them as 1

ADD COMMENT
0
Entering edit mode

I know HTseq count one pair of reads as 1, but how about one pair missing the other end? Does HTseq discard it or still count it as 1? Thank you.

ADD REPLY
2
Entering edit mode

This is what I meant. My investigations made me believe that each read of a read pair where only one is mapped was counted as 1.

You could check by randomizing your bam file or simply taking the position sorted one. I bet, you will have 2x read count per gene. Please try it and tell me the result. - But I think the last version does not need a read name sorted one anymore, so try it with an old version.

ADD REPLY
0
Entering edit mode

what do you mean by randomizing the bam file? It is now sorted by name.

ADD REPLY
1
Entering edit mode

I mean, you can simply test the question how htseq-count counts paired end reads where only one was mapped by sorting this file by position or by randomly sorting it, then doing htseq-count and compare name sorted vs randomly sorted counts. randomly sorted counts should be 2x name sorted counts (in old versions of htseq-count). Or you filter the name sorted by read number (take only 1. read) and then count. Counts should be the same. This should work with all versions of htseq-count and is the better one. Or maybe not, because the "mate aligned flag" is set in those reads.

ADD REPLY
1
Entering edit mode
9.8 years ago

According to htseq documentation : http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html

The fact that the records describe the same fragment can be seen from the fact that they have the same read name

So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

So simply do that:

(samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh - > in.samereadname.bam
ADD COMMENT
1
Entering edit mode

just a small correction for people passing by, the line should be:

(samtools view -H accepted_hits.bam; samtools view accepted_hits.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^   ]*//') | samtools view -bSh -> in.samereadname.bam
ADD REPLY

Login before adding your answer.

Traffic: 3088 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6