Is there any tools or commands to sort bed format file? The correct order is chr1, chr2, chr3, chr4,..., chr22, chrX, chrY.
Is there any tools or commands to sort bed format file? The correct order is chr1, chr2, chr3, chr4,..., chr22, chrX, chrY.
GNU sort can now sort in alpha-numeric order. See option -V, --version-sort
of the following version "sort (GNU coreutils) 8.17".
Here is an example:
$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC"
chr10 1 2 A
chr2 1 2 B
chr1 1 2 C
And the result is:
$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC" | sort -k1,1V
chr1 1 2 C
chr2 1 2 B
chr10 1 2 A
The easiest what I am using is :
sort -V -k1,1 -k2,2 test.bed
And you have the order of chrom:
1 2 3 4 5 6 7 8 9 10 ....
Example Bed file:
chr3 1 20
chr4 20 30
chr1 2 10
chr3 10 15
chrX 20 30
chrY 20 40
chrX 5 25
chr3 10 40
chr1 1 9
chr10 20 30
chr22 12 24
chr11 10 90
Sort Command:
sort -k1,1V -k2,2n -k3,3n "file"
-k1,1V
: Sort 1st field alphabetically, recognising the 10 (NB -V
is available in GNU coreutils >8.17 as pointed out by @tflutre above)-k2,2n
: Sort 2nd field numerically, loci which start first in a chromosome come first-k3,3n
: Sort 3rd field numerically, loci which end first come first when they have the same start position.Result:
chr1 1 9
chr1 2 10
chr3 1 20
chr3 10 15
chr3 10 40
chr4 20 30
chr10 20 30
chr11 10 90
chr22 12 24
chrX 5 25
chrX 20 30
chrY 20 40
To make this command available as a short hand e.g. sortbed
in your environment, edit ~/.bashrc
and add:
# sortbed (or some other description)
alias sortbed="sort -k1,1V -k2,2n -k3,3n"
and run source ~/.bashrc
invoking:
sortbed "file"
should now have the same effect as the full command.
make
./sort -k1,1N -k2,2n unsrt.bed > srt.bed
It sorts chromosome names to the alpha-numeric order.
EDIT: -k1,1N
is NOT a typo. It is a new sorting order - alphanumeric order - only available in my version of sort but not in the standard Unix sort. If you run:
echo chr10 chr5 chrX chr2 | tr " " "\n" | ./sort -N
You will get:
chr2
chr5
chr10
chrX
Let's say you have the following data:
$ more foo.bed
chr1 1 2
chr4 7 8
chrX 100 101
chr11 9 100
chr11 9 99
chr20 11 12
chr2 3 4
chr3 5 6
You'd probably want to sort on the first column, then sort on both the second and third columns:
$ sort -k1,1 -k2,2n -k3,3n foo.bed
chr1 1 2
chr2 3 4
chr3 5 6
chr4 7 8
chr11 9 99
chr11 9 100
chr20 11 12
chrX 100 101
Sorting only on the first and second columns may not guarantee expected ordering, where the start coordinates are equal. To demonstrate, take a look at the chr11
records below, where the stop coordinates are presumably out-of-order (100 is greater than 99, but is printed first with this use of UNIX sort
):
$ sort -k1,1 -k2,2n foo.bed
chr1 1 2
chr2 3 4
chr3 5 6
chr4 7 8
chr11 9 100
chr11 9 99
chr20 11 12
chrX 100 101
In any case, note that neither of these two orderings would be the usual "correct" sort order that would go into BEDOPS tools (or now perhaps bedtools' merge operation), which work most efficiently and correctly with sort-bed
ordering:
$ sort-bed foo.bed
chr1 1 2
chr11 9 99
chr11 9 100
chr2 3 4
chr20 11 12
chr3 5 6
chr4 7 8
chrX 100 101
Which sort you use depends on what tool or pipeline consumes the sorted output, and what ordering is expected.
Use sed
(with option -f
) to transform your chromosome names into a set of sortable strings, sort and retransform the names:
the original bed file:
$ cat input.bed
chr1 1 2
chr9 1 2
chr1 1 2
chr10 1 2
chr1 1 2
chr2 1 2
chr12 1 2
chr3 1 2
the sed file transforming the chrom to a sortable field (chr1
-> 01
; chr10
-> 10
... )
$ cat sed1.sed
s/^chr1 /01 /
s/^chr10 /10 /
s/^chr12 /12 /
s/^chr2 /02 /
s/^chr3 /03 /
s/^chr9 /09 /
the sed file for the reverse process: 01
-> chr1
; 10
-> chr10
$ cat sed2.sed
s/^01 /chr1 /
s/^10 /chr10 /
s/^12 /chr12 /
s/^02 /chr2 /
s/^03 /chr3 /
s/^09 /chr9 /
all in one:
$ sed -f sed1.sed < input.bed | sort -t ' ' -k1,1 -k2,12n | sed -f sed2.sed
chr1 1 2
chr1 1 2
chr1 1 2
chr2 1 2
chr3 1 2
chr9 1 2
chr10 1 2
chr12 1 2
I think JoinX does what you are requesting. It is a tool that The Genome Institute at Washington University has released to perform set operations on BED files and maybe other chr/pos formatted files.
Here's an example:
$ cat input.bed
chr1 1 2
chr9 1 2
chr1 1 2
chr10 1 2
chrX 1 2
chr1 1 2
chr2 1 2
chr12 1 2
chr3 1 2
$ joinx sort -i input.bed
chr1 1 2
chr1 1 2
chr1 1 2
chr2 1 2
chr3 1 2
chr9 1 2
chr10 1 2
chr12 1 2
chrX 1 2
try this function. It was written it in R for Arabidopsis' so the number of chromosomes are 5' but you can change it.
order.data = function (bed) #BED like data
{
o = order (bed[, "chr"])
bed = bed[o,]
#order data: od
od = 1:ncol (bed)
for (i in 1:5)
{
chr.bed = bed[bed[, "chr"]==i,]
o = order (as.integer(as.character(chr.bed[,"start"])))
od = rbind (od, chr.bed[o, ])
}
od = od [-1,]
colnames (od)= colnames(bed)
order.data = od
}
More simply way
sort -Vk1,2
With the OSX standard sort command, it probably does not work with the V option, so please use GNU sort.
I am using simply:
sort -k 1V,1 -k 2n,2 input.bed
To make a bed file match any genome hosted by ucsc, download or make the chom.sizes file for that genome. (e.g. Kent tools fetchChromSizes script or e.g. https://genome.ucsc.edu/goldenpath/help/hg19.chrom.sizes )
Then:
/me/tools/htslib/1.1/bin/bgzip --stdout ${inputBase}.noheader.noOverlap.bed >${inputBase}.noheader.noOverlap.bed.gz
/me/tools/htslib/1.1/bin/tabix -p vcf ${inputBase}.noheader.noOverlap.bed.gz
cat ${chromsizes} | cut -f1 | xargs /me/tools/htslib/1.1/bin/tabix -h ${inputBase}.noheader.noOverlap.bed.gz > ${outputBase}.bed
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
thank you so much for this information !
what a nice side-effect
what kind of side effect?