Sort BAM file by read mates
2.0 years ago

Hi all,

I'm trying to re-sort my BAM files so that read mates are kept together, but the common solutions I've found online do not produce what I need (an example of the desired layout is below). This post builds on a similar past question that, to my knowledge, went unresolved: Keeping paired reads together when sorting BAM file by name

FS10002072:15:BSE39216-1017:1:1101:1780:1000    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGTCCTGCGGCGGGTCGCCTGCCCTGCCCCCGAACCCCGCCTGGGGGCCGCGGTCGGCCCGGCGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF AS:i:275    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:16A21A19A17G74 YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:1780:1000    147 x   305 44  147M4S  =   194 -262    GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCTGGCCTTTCAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGTGCGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTACT FFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF,FFFFF::,::FFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:281    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:47A43C55   YS:i:275    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCATGGAGGCCGCGGTCGGCTCGGCGCTTCTCAGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFF::FFF:FFFF:FFFFF:FFFFFF,FFFF AS:i:274    XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4  MD:Z:54C17C3G7C66   YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000    147 x   305 44  147M4S  =   194 -262    GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCAGGCCTTTAAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGCGTGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTGCC ,FFF,FFF,FFFFFFF::F:,F:F:FFFFF,F,FFFFFFF,FFF:FFFFFF:FF:FFF:FF,:FF:FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF,FFFFFFFF AS:i:281    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:55C37C53   YS:i:274    YT:Z:CP

So far I've tried:

1) samtools sort -n

The output seems to be sorted by ascending read names, which has the effect of separating the mates. Example below:

FS10002072:15:BSE39216-1017:1:1101:1000:3240    99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGTGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:295    XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:110A40 YS:i:294    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3630   99  x   194 44  48M1D103M   =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   AS:i:287    XN:i:0  XM:i:1  XO:i:1  XG:i:1  NM:i:2  MD:Z:48^C47C55  YS:i:294    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3870   99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGATCCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:288    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:47A48C54   YS:i:281    YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10080:3790   99  x   194 44  151M    =   305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:302    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:151    YS:i:274    YT:Z:CP
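One way to confirm whether mates really are separated, rather than judging by eye, is to scan the records and check that each read name occurs in a single consecutive run of at least two records. The sketch below assumes the alignments have been dumped as SAM text (for example with `samtools view`); the function names and the shortened records in the usage note are hypothetical.

```python
from itertools import groupby


def qname_runs(sam_lines):
    """Return (QNAME, run_length) for each run of consecutive records sharing a name."""
    names = [
        line.split(None, 1)[0]          # QNAME is the first whitespace-delimited field
        for line in sam_lines
        if line.strip() and not line.startswith("@")  # skip blank and header lines
    ]
    return [(name, sum(1 for _ in group)) for name, group in groupby(names)]


def mates_are_adjacent(sam_lines):
    """True if every read name forms exactly one run holding at least two records."""
    runs = qname_runs(sam_lines)
    seen = [name for name, _ in runs]
    one_run_per_name = len(seen) == len(set(seen))
    return one_run_per_name and all(count >= 2 for _, count in runs)
```

For instance, `mates_are_adjacent(["a 99", "a 147", "b 99", "b 147"])` is true, while a listing that contains only the first mate of each pair, like the one above, makes it return false.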

2) picard-tools SortSam sorting by queryname

This produced the same result as example 1.

3) Rsubread's repair utility.

Strangely, this changed the MAPQ and CIGAR values of many reads to "0" and "*", respectively.


Your samtools example looks like it is sorted by position, not by name. Are you sure it is the output of a samtools sort -n command?
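A quick way to answer that question is to look at the `SO:` tag on the `@HD` header line, which the SAM format uses to record the sort order (`coordinate`, `queryname`, `unsorted` or `unknown`). The sketch below assumes the header has already been read as text lines, for example from `samtools view -H`; the function name is hypothetical.

```python
def header_sort_order(header_lines):
    """Return the SO: value from the @HD line, or None if it is not present."""
    for line in header_lines:
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs after the record type.
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Example header lines (made up for illustration).
header = ["@HD\tVN:1.6\tSO:queryname", "@SQ\tSN:x\tLN:1000"]
print(header_sort_order(header))
```

A name-sorted file would be expected to report `queryname` here, and a position-sorted one `coordinate`.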

An alternative to using samtools sort to group by name is samtools collate, though this does not guarantee the sort order between groups.


That doesn't look like sort -n as it's position sorted. Are you sure?

Also, I'd recommend samtools collate as a far faster way of grouping mates together, unless there is a specific reason why the names need to be in sorted order (rather than simply grouped together), or unless you need to randomise the position order (collate output is still position-correlated), e.g. when analysing insert size by sampling the first X reads.
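To illustrate the difference between "grouped" and "sorted", here is a minimal sketch that brings records with the same read name together without ordering the names themselves. It is not how samtools collate is implemented, only a toy model of the output property; the function name and the shortened records are hypothetical. On the command line, the equivalent step would be something like `samtools collate -o grouped.bam input.bam` (file names assumed).

```python
def group_by_name(records):
    """Place records sharing a QNAME next to each other, keeping first-seen name order."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for rec in records:
        qname = rec.split(None, 1)[0]   # QNAME is the first whitespace-delimited field
        groups.setdefault(qname, []).append(rec)
    return [rec for recs in groups.values() for rec in recs]


# Position-ordered toy input: mates of r2 and r1 are interleaved.
records = ["r2 99", "r1 99", "r2 147", "r1 147"]
grouped = group_by_name(records)
```

Here `grouped` keeps `r2` ahead of `r1`, so the names are not sorted, but each read now sits next to its mate.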


What is your samtools version? Does that happen with the latest one?


ATpoint, I'm using samtools version 1.15.1.

I updated to the latest version (1.16.1), and the sorting behavior was unchanged from the previous version.
