Chromosome sort order for bedops?
8.0 years ago
ariel.balter ▴ 260

I'm getting an error from bedops telling me that my BED files are not properly sorted by the chromosome column. The documentation examples use chromosome names with the chr prefix (chr1, etc.), as mine do, and my file looks properly sorted to me. Does bedops actually want a lexicographic sort?

balter$ bedops -i --header /home/peaks/mock_PROT_A_peaks.bed /home/peaks/mock_PROT_B_peaks.bed > test
May use bedops --help for more help.

Error: in /home/peaks/mock_PROT_A_peaks.bed
Bed file not properly sorted by first column.
See row: 30306

However,

balter$ awk 'FNR>30300 && FNR < 30310' /home/peaks/mock_PROT_A_peaks.bed
chr9    134907600   134907999   22  11  25.4418777792   2.18579500325   0.000323032637977   0.000563316911961
chr9    135992800   135993699   45  11  34.6380493632   4.47094432483   1.32335307676e-16   1.90501094493e-15
chr9    137355100   137355799   35  22  29.6971256766   1.73870057077   0.000897204802733   0.00140794992995
chr9    137845000   137846499   106 83  104.265196268   1.39574861653   0.000447638610513   0.000754290268942
chr9    138231400   138233499   92  91  84.425639067    1.10490736428   0.155735020989  0.164059393875
chr10   9900    10499   173 139 75.4529239821   1.36022494807   4.74667469274e-05   0.000101396628238
chr10   986300  988799  53  5   40.0327338617   11.5847135172   2.23116300204e-38   1.26287874397e-36
chr10   3468000 3468899 31  8   27.783851556    4.2349778188    1.48954143838e-11   1.17880566911e-10
chr10   3780600 3786299 219 45  188.650887904   5.31876784125   1.16948486655e-84   1.97787366782e-82

and

balter$ cut -f1 /home/peaks/mock_BRD4_A_peaks.bed | uniq
# Chromosome
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY

This thread seems to match your question: http://seqanswers.com/forums/showthread.php?t=43409

8.0 years ago

Your chromosomes are not in lexicographic order. In your example, elements on chr10 should come before elements on chr9, though that might seem a little counterintuitive at first. The reason for doing things this way is that, under the hood, it is generally easier and faster to compare two strings lexicographically.
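To see what lexicographic order does to chromosome names, here is a small Python sketch. It assumes a plain character-by-character string comparison, which is how Python's default string sort behaves; it illustrates the ordering only and is not the code sort-bed itself runs. The list of names is a shortened example, not the full file.

```python
# A few chromosome names in the "natural" order shown in the question.
chroms = ["chr1", "chr2", "chr9", "chr10", "chr11", "chr20", "chrX", "chrY"]

# Python's default string sort compares character by character,
# which gives a lexicographic ordering.
lexicographic = sorted(chroms)
print(lexicographic)

# "chr10" sorts before "chr9" because the first differing
# characters are "1" and "9", and "1" < "9".
print("chr10" < "chr9")
```

In this ordering chr10 and chr11 land directly after chr1, and chr9 comes last among the numbered chromosomes, which is why a file that goes from chr9 to chr10 is reported as unsorted.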

You can use BEDOPS sort-bed to prepare your BED files:

$ sort-bed peaks.unsorted.bed > peaks.bed

There are other sorting tools out there, like GNU sort, but they don't sort BED data as fast as sort-bed, which can be an issue for whole-genome scale files.

You only need to sort once, as BEDOPS tools read and write sorted data.


Thanks! I ended up using sort-bed just to give me an idea of what bedops would want, and found exactly what you have pointed out. Is there any explanation for why a program would want to operate on genomic data in an order different than 1) numerically logical and 2) order of the actual genome? That's what threw me. As a novice, I would not have guessed that anyone used alphanumeric sorting except by accident.


What comes "naturally" to humans is not always a natural or efficient process for a computer. Alphanumeric sorting usually involves comparing strings one character at a time, which is generally faster than splitting strings into tokens, converting the numeric tokens into integers, and then running comparisons on each of the subcomponents. In the general case, a numerical sort of the kind that humans do naturally would require regular expressions or pattern matching to split a string into tokens like chr and a string of numerical characters that has to be turned into an integer, before further comparison tests.
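As a rough illustration of the extra work described above, here is a Python sketch of a "natural" sort key. The function name natural_key is hypothetical and is not part of BEDOPS; it only shows the tokenizing and integer conversion that a human-style ordering needs and that a plain string comparison skips.

```python
import re

def natural_key(name):
    """Hypothetical helper: split 'chr10' into text and integer tokens."""
    # re.split with a capturing group keeps the digit runs as tokens.
    parts = [p for p in re.split(r"(\d+)", name) if p]
    # Tag each token so that text and numbers never get compared directly:
    # text tokens become (0, 0, text), numeric tokens become (1, value, "").
    return [(1, int(p), "") if p.isdigit() else (0, 0, p) for p in parts]

names = ["chr10", "chr9", "chrX", "chr1", "chr2", "chrY"]

# Natural order: needs a regex split and an int() conversion per name.
natural = sorted(names, key=natural_key)

# Lexicographic order: a plain string comparison, no parsing at all.
lexicographic = sorted(names)

print(natural)
print(lexicographic)
```

Every comparison in the natural sort depends on a key that first had to be parsed, while the lexicographic sort works on the raw strings as they are.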
