Does the BAM need to be sorted by name (sort -n) for the DESeq2 pipeline?
9.4 years ago
weixiaokuan ▴ 140

Guys,

I have read several manuals and tutorials for RNA-Seq analysis with DESeq2. Some tutorials say to sort the paired-end BAM by name, but others skip that step. So, does the paired-end BAM generated by TopHat2 need to be sorted by name?

I think it should, but why do the DESeq2 manual and workflow completely ignore this step?

Thank you.

-X

RNA-Seq DESeq2 • 3.1k views
9.4 years ago

DESeq2 takes read count data as input. People normally use HTSeq, from the same group that developed DESeq2 (Simon Anders, Wolfgang Huber and colleagues), to generate read counts for genic features. The HTSeq manual says: "For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name". So it doesn't matter whether your BAM file is sorted by name or by position; HTSeq should be able to use it to generate count data. TopHat2 produces a BAM file that is sorted by position, and it can be given directly to htseq-count (not to DESeq2 itself, which only sees the counts) by using "pos" for the order.
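To make the "name or pos" choice concrete, here is a rough sketch (not from the HTSeq docs) that reads the sort order recorded in the BAM header and picks the matching value for htseq-count's -r option. It assumes the aligner wrote an @HD line with an SO tag ("coordinate" or "queryname", as defined in the SAM format); the function names and file names are made up for illustration, and a BAM with no SO tag will simply be rejected rather than guessed at.

```python
import gzip
import struct


def bam_sort_order(path):
    """Return the SO value from the @HD header line of a BAM file, or 'unknown'."""
    # BAM files are BGZF-compressed, which Python's gzip module can read
    # as an ordinary multi-member gzip stream.
    with gzip.open(path, "rb") as fh:
        if fh.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not look like a BAM file")
        (l_text,) = struct.unpack("<i", fh.read(4))  # length of the header text
        text = fh.read(l_text).decode("ascii", errors="replace").rstrip("\x00")
    for line in text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def htseq_count_command(bam_path, gtf_path):
    """Build an htseq-count argument list whose -r value matches the BAM header."""
    order_flag = {"coordinate": "pos", "queryname": "name"}
    sort_order = bam_sort_order(bam_path)
    if sort_order not in order_flag:
        raise ValueError(
            f"{bam_path} reports sort order '{sort_order}'; sort it with samtools first"
        )
    return ["htseq-count", "-f", "bam", "-r", order_flag[sort_order], bam_path, gtf_path]


# Hypothetical usage (file names are placeholders):
#   cmd = htseq_count_command("accepted_hits.bam", "genes.gtf")
#   print(" ".join(cmd))
```

The command is only built, not run, so you can inspect it before passing it to subprocess and redirecting the output to a counts file.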


What about using GenomicAlignments to create the count matrix with "summarizeOverlaps" instead of htseq-count? Does the BAM need to be sorted?

As far as I know, GenomicAlignments implements a counting algorithm similar to htseq-count's, but it has no equivalent of the -r option. Since htseq-count's default is name-sorted input, to be on the safe side I sort the BAM by name before feeding it to "summarizeOverlaps". Is this general practice, or is it unnecessary to sort the BAM?


Hi Wei, sorry, I have never used summarizeOverlaps to generate count data, so I don't know much about it. But as Ian mentioned in his comment, htseq-count may get confused if it can't find both partners of a pair, so I would guess that sorting by name won't hurt for paired-end data, except that it may add to the running time of the pipeline.


Just to add some experience to AP's answer: if the BAM files given to htseq-count are not sorted by name, it sometimes gets confused and cannot find both partners of a pair. Of course, there are different ways of counting reads into genes, so the required sort order may differ.
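For anyone who wants to name-sort first, here is a minimal sketch of the "sort -n" step from the question title, wrapped in Python. It assumes samtools 1.3 or later, where the output file is given with -o (older releases took an output prefix instead); the function names and file names are placeholders, not part of any tool.

```python
import shutil
import subprocess


def name_sort_command(in_bam, out_bam):
    """Build the samtools call that sorts a BAM by read name (-n)."""
    # Assumes samtools >= 1.3 syntax: samtools sort -n -o OUT.bam IN.bam
    return ["samtools", "sort", "-n", "-o", out_bam, in_bam]


def name_sort(in_bam, out_bam):
    """Run the name sort, failing early if samtools is not installed."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    subprocess.run(name_sort_command(in_bam, out_bam), check=True)


# Hypothetical usage (file names are placeholders):
#   name_sort("accepted_hits.bam", "accepted_hits.namesorted.bam")
# The name-sorted file can then go to htseq-count with -r name (its default).
```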
