Question

Using HTSeq-count for paired-end data but unsorted by SAMTOOLS

3.2 years ago

ChocoParrot ▴ 20

Hi there, as per thread title.

If I run htseq-count on paired-end mapped BAM files that have not been sorted, and I leave -s at its default of yes, is that advisable?

htseq ngs • 1.6k views

score 1 · Accepted Answer · 2021-09-25

3.2 years ago

Carlo Yague 8.9k

Paired-end .bam files need to be sorted either by read name or by alignment position before running htseq-count. You can sort them with samtools sort (by position by default, or by read name with -n), then use the -r option of htseq-count to specify whether the BAM file is sorted by read name (name) or by alignment position (pos).
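As a minimal sketch of that workflow, the Python snippet below assembles the two command lines (name-sort, then count) and optionally runs them. The file names and the helper names build_commands and run are placeholders chosen for illustration; it assumes samtools and htseq-count are on your PATH and that your annotation is a GTF file.

```python
import subprocess


def build_commands(bam_in, gtf, sorted_bam="sorted_by_name.bam", stranded="yes"):
    """Return (sort_cmd, count_cmd) as argument lists.

    All file names here are placeholders; adjust them to your data.
    """
    # Sort by read name (-n) so both mates of a pair end up in adjacent records.
    sort_cmd = ["samtools", "sort", "-n", "-o", sorted_bam, bam_in]
    # -r name tells htseq-count the input is name-sorted;
    # -s is about library strandedness, not about sorting.
    count_cmd = ["htseq-count", "-f", "bam", "-r", "name",
                 "-s", stranded, sorted_bam, gtf]
    return sort_cmd, count_cmd


def run(bam_in, gtf, counts_out="counts.txt"):
    """Run both steps; htseq-count writes its table to stdout."""
    sort_cmd, count_cmd = build_commands(bam_in, gtf)
    subprocess.run(sort_cmd, check=True)
    with open(counts_out, "w") as out:
        subprocess.run(count_cmd, stdout=out, check=True)
```

Calling run("aln.bam", "genes.gtf") would then write the count table to counts.txt. If you prefer position-sorted input, drop -n from the sort step and pass -r pos instead.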

The -s option of htseq-count is completely unrelated to sorting. It specifies whether the data is stranded or not, which depends on how the sequencing library was prepared.

-s <yes/no/reverse>, --stranded=<yes/no/reverse>
    whether the data is from a strand-specific assay (default: yes)

    For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.
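To make the strand rules in that help text concrete, here is a small Python sketch of the decision they describe. The function name strand_compatible and its arguments are made up for illustration; this is a reading of the documented rules, not HTSeq's own code.

```python
def strand_compatible(read_strand, feature_strand, stranded="yes", second_mate=False):
    """Return True if a read may be counted for a feature under the -s rules.

    read_strand / feature_strand: "+" or "-".
    stranded: "yes", "no" or "reverse" (the value given to -s).
    second_mate: True for the second read of a pair; leave False for
                 single-end reads and for the first read of a pair.
    """
    if stranded not in ("yes", "no", "reverse"):
        raise ValueError("stranded must be 'yes', 'no' or 'reverse'")
    if stranded == "no":
        # Strand is ignored entirely.
        return True
    same_strand = read_strand == feature_strand
    # stranded=yes: first/single read on the same strand, second mate opposite.
    expect_same = not second_mate
    if stranded == "reverse":
        # stranded=reverse flips both expectations.
        expect_same = not expect_same
    return same_strand == expect_same
```

For example, strand_compatible("-", "+", "yes", second_mate=True) is True, because the second mate is expected on the opposite strand, while the same call with stranded="reverse" is False.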

So if I do not sort the BAM file and just plop it into htseq-count, will the counts be inaccurate?

3.2 years ago by ChocoParrot ▴ 20

Never mind, figured it out.

If name is indicated, htseq-count expects all the alignments for the reads of a given read pair to appear in adjacent records in the input data. For pos, this is not expected; rather, read alignments whose mate alignment have not yet been seen are kept in a buffer in memory until the mate is found. While, strictly speaking, the latter will also work with unsorted data, sorting ensures that most alignment mates appear close to each other in the data and hence the buffer is much less likely to overflow.
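A toy Python model of that buffering idea, to show why record order matters: each record is reduced to just its read name, and a read waits in a set until its mate shows up. This is a deliberately simplified sketch with a made-up function name (pair_mates), not how HTSeq is actually implemented.

```python
def pair_mates(read_names):
    """Pair records by read name.

    Returns (pairs_found, peak_buffer_size), where the buffer holds reads
    whose mate has not been seen yet.
    """
    pending = set()  # reads still waiting for their mate
    pairs = 0
    peak = 0
    for name in read_names:
        if name in pending:
            # Mate was already buffered: the pair is complete.
            pending.remove(name)
            pairs += 1
        else:
            pending.add(name)
            peak = max(peak, len(pending))
    return pairs, peak


# Mates next to each other (as in well-sorted data) vs. mates far apart.
adjacent = ["r1", "r1", "r2", "r2", "r3", "r3"]
scattered = ["r1", "r2", "r3", "r1", "r2", "r3"]
print(pair_mates(adjacent))
print(pair_mates(scattered))
```

Both orderings find the same three pairs, but the scattered order has to hold every read in the buffer at once, which is the overflow risk the documentation warns about on real, much larger files.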

Thanks a lot!

3.2 years ago by ChocoParrot ▴ 20