Is the Picard SortSam queryname or coordinate equivalent to samtools sort -n?
3
4
Entering edit mode
10.6 years ago
jon.brate ▴ 310

I don't quite understand what is meant by "coordinate" and "queryname" in the Picard SortSam, and which of them are equivalent to the samtools sort -n option?

samtools picard sort • 16k views
ADD COMMENT
11
Entering edit mode
10.6 years ago

Coordinate is the same as samtools sort

Queryname is equivalent to samtools sort -n

BTW, as pointed out by Dariober, queryname sorting just guarantees that alignments from the same reads (or read pairs) will be grouped together. Picard/samtool/sort have slightly different methods of sorting strings, so the exact ordering of the groups may differ between them.

ADD COMMENT
6
Entering edit mode
8.8 years ago
John 13k

Just as a bit of colour for this post, the three ways query name can be sorted as pointed out by dariober are:

samtools -n: Uses a natural sort. The check in python would be:

def nat(qname): return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', qname)]
it = iter(qnames); it.next()
all(nat(b) >= nat(a) for a, b in itertools.izip(qnames, it))

picard SortSam SORT_ORDER=queryname: Uses an ASCII sort:

it = iter(qnames); it.next()
all(b >= a for a, b in itertools.izip(qnames, it))     # returns True or False accordingly

unix sort: Uses a "dictionary sort" which is unix for "ASCII sort for only alphanumeric values":

def alphanumeric(qname): [ s for s in qnames[0] if s.isalnum() ]
it = iter(qnames); it.next() # always 1 ahead
all(alphanumeric(b) >= alphanumeric(a) for a, b in itertools.izip(qnames, it))

When sorting by position, fortunately samtools and Picard are exactly the same. This is because they don't try and sort the chromosomal order (using the same incompatible method as the qname), but instead use whatever is already in the head of the SAM/BAM file. Since neither tool lets you sort without a header, we get compatibility here :) Also, I think most mappers return the chromosomes in natural-sort order (chr1, chr2, ... chr10), which is nice. Maybe thats just how the genome.fa comes.

With unix's sort -k3,3 -k4,4n however, which you'll see scattered around the place for sorting SAM files, you'll get chromosomes in the same dictionary-sort order as unix sort gives the qnames (chr1, chr10, ... chr2). This of course is a really really bad idea.

ADD COMMENT
1
Entering edit mode

samtools -n: Uses an ASCII sort.

No, this is certainly not true, see e.g. https://github.com/samtools/hts-specs/issues/5

Samtools does a natural sort per subfield (which are separated by ':'), i.e. lexicographic sort for string-subfields, and numeric sort for integer sub-field. (Very annoying BTW, as it means you can't use Unix tools like comm and join).

I believe picardtools sort -n do lexigraphic sort, like Unix. But I'll check it first.

ADD REPLY
0
Entering edit mode

Sorry, you're absolutely right - even though I wrote a tool to detect sort order, I got samtools and picard the wrong way around in the post.

unix sort being different from picard and samtools is still true though, as is everything about the headers, etc.

enter image description here

enter image description here

ADD REPLY
2
Entering edit mode

Hi, I got confused a bit myself regarding the difference between picard sort order and Unix sort order. It turns out to be slightly nasty. Picard sort uses ASCII sort order, whereas the Unix sort uses 'lexicographic' order, which depends on the locale. The nasty bit is that in many locales, including the very commonly used en_US (and derivatives), the ':' character sorts before the numbers 0-9, whereas in ASCII, the ':' sorts after the numbers 0-9. In the context of SAM read identifier fields, this is clearly a very significant difference (which you, incidentally, may not notice in the first few hundred lines when comparing visually). What to do? Well simply switch your locale to ASCII, by doing

export LC_ALL=C; # now sort/join/compare stuff here

or, temporarily, along the lines of

samtools view file.bam | env LC_ALL=C sort | samtools view -b > file-sorted.bam

If you do it like this (and you also use the LC_ALL=C setting when using join (1) or comm (1), things seem to work.

ADD REPLY
0
Entering edit mode

Ay, you're right again - I thought unix sort does by default a dictionary sort (-d), because when i sort files with unix sort on linux I get back something that fails a ASCII sort test, but passes a dictionary sort test. But actually it seems the default is as you say, an ASCII sort that just so happened to fail a python ASCII sort (due to locale) and pass a dictionary sort, because it removes the ambiguous characters like ':'

So i should change my code to check for ASCII sorts of multiple locales. Thank you plijnzaad :) That would have been a very difficult one to debug without your help :)

EDIT: Also, I still think you should never-ever-ever do a unix sort :P (or even a sort on a SAM file at all).

ADD REPLY
3
Entering edit mode
10.6 years ago

queryname seems to be kind of equivalent to samtools sort -n Or I'm missing something?

java -jar SortSam.jar INPUT=test.bam OUTPUT=test.qpic.bam SO=queryname
samtools sort -n test.bam test.nsam

samtools output

samtools view test.nsam.bam | cut -f1 | head
D00408:131:1:1101:1268:49755#TAAGGCGA#NNCTCGGC
D00408:131:1:1101:1270:20060#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1270:20060#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1291:59756#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1291:59756#TAAGGCGA#NNGTCGGC
D00408:131:1:1101:1314:9343#TAAGGCGA#NCCTCCGC
D00408:131:1:1101:1314:9343#TAAGGCGA#NCCTCCGC
D00408:131:1:1101:1323:88558#TAAGGCGA#NCGTCGGC
D00408:131:1:1101:1323:88558#TAAGGCGA#NCGTCGGC
D00408:131:1:1101:1338:23494#TAAGGCGA#NCGTCCGC

picard output

samtools view test.qpic.bam | cut -f1 | head
D00408:131:1:1101:10087:85495#TAAGGCGA#TCCTCGGC
D00408:131:1:1101:10087:85495#TAAGGCGA#TCCTCGGC
D00408:131:1:1101:10092:35208#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10092:35208#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10104:60166#TAAGGCGA#TCGTCGGC
D00408:131:1:1101:10104:60166#TAAGGCGA#TCGTCGGC
D00408:131:1:1101:10120:16442#TACGGCGA#TCCTCGGC
D00408:131:1:1101:10120:16442#TACGGCGA#TCCTCGGC
D00408:131:1:1101:10125:56648#TAAGGCGA#TCCTCCGC
D00408:131:1:1101:10169:64089#TAAGGCGA#TCCTCGGC

And neither passes the unix sort check, at least on my system settings:

samtools view test.qpic.bam | cut -f1 | sort -c
sort: -:398: disorder: D00408:131:1:1101:1414:61894#TAAGGCGA#NCGTCGGC
samtools view test.nsam.bam | cut -f1 | sort -c
sort: -:332: disorder: D00408:131:1:1101:5242:78722#TAAGGCGA#TCCTCCGC
ADD COMMENT
1
Entering edit mode

The read is the query, so "queryname" literally means "read name". This is also why the read name field in a SAM/BAM file is labeled qname, as opposed to rname, which is the "reference name" aka the chromosome/contig name.

ADD REPLY
0
Entering edit mode

Mmm... Sure... In any case SortSam.jar SO=queryname and samtools sort -n should sort by read name, but in my example above they do it in a different way, which is also different from unix sort

ADD REPLY
1
Entering edit mode

I just updated my own answer to mention this, in fact. There are a lot of ways that one can think to order strings. Samtools uses strnum_cmp, while sort uses strcmp, and apparently picard uses yet another method. The point of sorting by queryname is just to group alignments from the same reads/read pairs, so any of these methods will work, even if they do give somewhat different results.

ADD REPLY
0
Entering edit mode

If you're ever interested, have a look at this thread from the samtools email list.

ADD REPLY
0
Entering edit mode

BTW, the difference in sorting is due to the many different ways of sorting strings of variable length.

ADD REPLY

Login before adding your answer.

Traffic: 3461 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6