Hi all,
I'm aiming to re-sort my BAM files so that read mates stay together, but the common solutions I've seen online do not produce what I need (example below). This post builds on a similar past inquiry that (to my knowledge) went unresolved: "Keeping paired reads together when sorting BAM file by name"
FS10002072:15:BSE39216-1017:1:1101:1780:1000 99 x 194 44 151M = 305 262 TTCGCCCCTCCCGGGGTCCTGCGGCGGGTCGCCTGCCCTGCCCCCGAACCCCGCCTGGGGGCCGCGGTCGGCCCGGCGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF AS:i:275 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:16A21A19A17G74 YS:i:281 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:1780:1000 147 x 305 44 147M4S = 194 -262 GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCTGGCCTTTCAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGTGCGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTACT FFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFF,FFFFF::,::FFFFFFFF:FFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:281 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:47A43C55 YS:i:275 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000 99 x 194 44 151M = 305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCATGGAGGCCGCGGTCGGCTCGGCGCTTCTCAGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFF::FFF:FFFF:FFFFF:FFFFFF,FFFF AS:i:274 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:54C17C3G7C66 YS:i:281 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:3950:1000 147 x 305 44 147M4S = 194 -262 GTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCGAGGTTCAGGCCTTTAAGGCCGCAGGAAGAGGAACGGAGCGAGTCCCCGCGCGTGGCGCGATTCCCTGAGCTGTGGGACGTGCACCCAGGACTCGGCTCACACATGCTGCC ,FFF,FFF,FFFFFFF::F:,F:F:FFFFF,F,FFFFFFF,FFF:FFFFFF:FF:FFF:FF,:FF:FFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF,FFFFFFFF AS:i:281 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:55C37C53 YS:i:274 YT:Z:CP
So far I've tried:
1) samtools sort -n
The output seems to be sorted by ascending read names, which has the effect of separating the mates. Example below:
FS10002072:15:BSE39216-1017:1:1101:1000:3240 99 x 194 44 151M = 305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGTGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:295 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:110A40 YS:i:294 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3630 99 x 194 44 48M1D103M = 305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:287 XN:i:0 XM:i:1 XO:i:1 XG:i:1 NM:i:2 MD:Z:48^C47C55 YS:i:294 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10040:3870 99 x 194 44 151M = 305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGATCCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCAATGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:288 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:47A48C54 YS:i:281 YT:Z:CP
FS10002072:15:BSE39216-1017:1:1101:10080:3790 99 x 194 44 151M = 305 262 TTCGCCCCTCCCGGGGACCTGCGGCGGGTCGCCTGCCCAGCCCCCGAACCCCGCCTGGAGGCCGCGGTCGGCCCGGGGCTTCTCCGGAGGCACCCACTGCCACCGCGAAGAGTTGGGCTCTGTCAGCCGCGGGTCTCTCGGGGGCGAGGGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF AS:i:302 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:151 YS:i:274 YT:Z:CP
2) picard-tools SortSam
sorting by queryname
This produced the same result as example 1.
3) Rsubread's repair utility.
Strangely, this changed the MAPQ and CIGAR values of many reads to "0" and "*", respectively.
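A quick way to check whether a given output really keeps mates together is to test that each read name forms one contiguous run of records. Below is a minimal Python sketch of that check. It assumes the records are available as tab-separated SAM text lines (for example, the output of samtools view). The function name mates_adjacent and the shortened records are made up for illustration.

```python
from itertools import groupby


def mates_adjacent(sam_lines):
    """Return True if every read name occupies one contiguous run of records."""
    # QNAME is the first tab-separated field; skip header lines (they start
    # with "@", which the SAM spec does not allow in a QNAME) and blank lines.
    names = (
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    )
    seen = set()
    # groupby yields one key per run of identical consecutive names, so a
    # name that shows up as a key twice has been split across the file.
    for name, _run in groupby(names):
        if name in seen:
            return False
        seen.add(name)
    return True


if __name__ == "__main__":
    # Hypothetical, heavily shortened records (QNAME, FLAG, RNAME, POS only).
    together = ["r1\t99\tx\t194", "r1\t147\tx\t305",
                "r2\t99\tx\t194", "r2\t147\tx\t305"]
    separated = ["r1\t99\tx\t194", "r2\t99\tx\t194",
                 "r1\t147\tx\t305", "r2\t147\tx\t305"]
    print(mates_adjacent(together))   # True
    print(mates_adjacent(separated))  # False
```

This only checks adjacency. It says nothing about whether the names themselves are in sorted order.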
Your samtools example looks like it is sorted by position, not by name. Are you sure it is the output of a samtools sort -n command?
An alternative to using samtools sort to group by name is samtools collate, though this does not guarantee the sort order between groups.
That doesn't look like sort -n output, as it's position sorted. Are you sure?
Also, I'd recommend samtools collate as a far faster way of grouping mates together, unless there is a specific reason why the names need to be in sorted order (rather than simply grouped together) or unless you need to randomise position order (as collate is still position correlated), e.g. when doing analysis of insert size via sampling the first X reads.
What is your samtools version? Does that happen with the latest one?
ATpoint I'm using Samtools version 1.15.1
I updated to the latest (1.16.1) and the sorting behavior was unchanged from the previous version.
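On the collate suggestion above, the difference between "grouped by name" and "sorted by name" can be illustrated with a small Python sketch. This is not how samtools implements collate; it is an in-memory toy that assumes tab-separated SAM text lines as input. The function name group_by_name is hypothetical.

```python
def group_by_name(sam_lines):
    """Bring records with the same read name together without sorting names.

    Groups are emitted in order of first appearance, so mates end up
    adjacent but the order between groups is not a name sort.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for line in sam_lines:
        # Skip blank lines and header lines (which start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        name = line.split("\t", 1)[0]  # QNAME is the first field
        groups.setdefault(name, []).append(line)
    return [record for records in groups.values() for record in records]


if __name__ == "__main__":
    # Hypothetical shortened records: mates of "b" and "a" are interleaved.
    records = ["b\t99\tx\t194", "a\t99\tx\t194",
               "b\t147\tx\t305", "a\t147\tx\t305"]
    for record in group_by_name(records):
        print(record)
```

In this toy input, the two "b" records come out before the two "a" records: mates are together, but the groups are not in name order. That is the trade-off described in the replies.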