Question

How To Sort Bed Format File

10

Entering edit mode

12.2 years ago

lyz10302012 ▴ 480

Is there any tools or commands to sort bed format file? The correct order is chr1, chr2, chr3, chr4,..., chr22, chrX, chrY.

bed • 67k views

ADD COMMENT • link updated 2.6 years ago by Pierre Lindenbaum 166k • written 12.2 years ago by lyz10302012 ▴ 480

16

Entering edit mode

10.3 years ago

Korsocius ▴ 270

The easiest what I am using is :

sort -V -k1,1 -k2,2 test.bed

And you have the order of chrom:

1 2 3 4 5 6 7 8 9 10 ....

ADD COMMENT • link updated 5.8 years ago by zx8754 12k • written 10.3 years ago by Korsocius ▴ 270

0

Entering edit mode

what does k1 and k2 means?

ADD REPLY • link 2.6 years ago by LDT ▴ 340

0

Entering edit mode

https://man7.org/linux/man-pages/man1/sort.1.html

ADD REPLY • link 2.6 years ago by Pierre Lindenbaum 166k

11

Entering edit mode

8.0 years ago

rjactonspsfcf ▴ 180

Example Bed file:

chr3    1   20
chr4    20  30
chr1    2   10
chr3    10  15
chrX    20  30
chrY    20  40
chrX    5   25
chr3    10  40
chr1    1   9
chr10   20  30
chr22   12  24
chr11   10  90

Sort Command:

sort -k1,1V -k2,2n -k3,3n "file"

-k1,1V: Sort 1st field alphabetically, recognising the 10 (NB -V is available in GNU coreutils >8.17 as pointed out by @tflutre above)
-k2,2n: Sort 2nd field numerically, loci which start first in a chromosome come first
-k3,3n: Sort 3rd field numerically, loci which end first come first when they have the same start position.

Result:

chr1    1   9
chr1    2   10
chr3    1   20
chr3    10  15
chr3    10  40
chr4    20  30
chr10   20  30
chr11   10  90
chr22   12  24
chrX    5   25
chrX    20  30
chrY    20  40

To make this command available as a short hand e.g. sortbed in your environment, edit ~/.bashrc and add:

# sortbed (or some other description)
alias sortbed="sort -k1,1V -k2,2n -k3,3n"

and run source ~/.bashrc

invoking:

sortbed "file"

should now have the same effect as the full command.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 8.0 years ago by rjactonspsfcf ▴ 180

0

Entering edit mode

I think it is not necessary to sort by -k3,3n... It is logical from definition of bed file.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 8.0 years ago by Korsocius ▴ 270

0

Entering edit mode

This only crops up when you have a bed file with overlapping features e.g.

chr1 1 10
chr1 1 5

would not be ordered as:

chr1 1 5
chr1 1 10

without sorting on -k3,3n

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 7.9 years ago by rjactonspsfcf ▴ 180

0

Entering edit mode

Yep, see: A: How To Sort Bed Format File

ADD REPLY • link 7.9 years ago by Alex Reynolds 36k

10

Entering edit mode

12.2 years ago

lh3 33k

Use my version of GNU sort:

make
./sort -k1,1N -k2,2n unsrt.bed > srt.bed

It sorts chromosome names to the alpha-numeric order.

EDIT: -k1,1N is NOT a typo. It is a new sorting order - alphanumeric order - only available in my version of sort but not in the standard Unix sort. If you run:

echo chr10 chr5 chrX chr2 | tr " " "\n" | ./sort -N

You will get:

chr2
chr5
chr10
chrX

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 12.2 years ago by lh3 33k

0

Entering edit mode

This returns the error

sort: stray character in field spec: invalid field specification ‘1,1N’

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 12.2 years ago by Alex Reynolds 36k

1

Entering edit mode

isn't this just a "N"->"n" typo?

ADD REPLY • link 12.2 years ago by daiefa123 ▴ 150

0

Entering edit mode

You need to use ./sort or specify the full path (you can also rename it). Don't use the system sort. "N" is a new feature which is not implemented in the standard GNU sort.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 12.2 years ago by lh3 33k

8

Entering edit mode

12.2 years ago

Alex Reynolds 36k

Let's say you have the following data:

$ more foo.bed
chr1    1       2
chr4    7       8
chrX    100     101
chr11   9       100
chr11   9       99
chr20   11      12
chr2    3       4
chr3    5       6

You'd probably want to sort on the first column, then sort on both the second and third columns:

$ sort -k1,1 -k2,2n -k3,3n foo.bed
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    99
chr11    9    100
chr20    11    12
chrX    100    101

Sorting only on the first and second columns may not guarantee expected ordering, where the start coordinates are equal. To demonstrate, take a look at the chr11 records below, where the stop coordinates are presumably out-of-order (100 is greater than 99, but is printed first with this use of UNIX sort):

$ sort -k1,1 -k2,2n foo.bed
chr1    1    2
chr2    3    4
chr3    5    6
chr4    7    8
chr11    9    100
chr11    9    99
chr20    11    12
chrX    100    101

In any case, note that neither of these two orderings would be the usual "correct" sort order that would go into BEDOPS tools (or now perhaps bedtools' merge operation), which work most efficiently and correctly with sort-bed ordering:

$ sort-bed foo.bed
chr1    1    2
chr11    9    99
chr11    9    100
chr2    3    4
chr20    11    12
chr3    5    6
chr4    7    8
chrX    100    101

Which sort you use depends on what tool or pipeline consumes the sorted output, and what ordering is expected.

ADD COMMENT • link 3.9 years ago by Alex Reynolds 36k

3

Entering edit mode

12.2 years ago

Pierre Lindenbaum 166k

Use sed (with option -f) to transform your chromosome names into a set of sortable strings, sort and retransform the names:

the original bed file:

$ cat input.bed
chr1    1    2
chr9    1    2
chr1    1    2
chr10    1    2
chr1    1    2
chr2    1    2
chr12    1    2
chr3    1    2

the sed file transforming the chrom to a sortable field (chr1 -> 01 ; chr10 -> 10 ... )

$ cat sed1.sed 
s/^chr1    /01    /
s/^chr10    /10    /
s/^chr12    /12    /
s/^chr2    /02    /
s/^chr3    /03    /
s/^chr9    /09    /

the sed file for the reverse process: 01 -> chr1 ; 10 -> chr10

$ cat sed2.sed 
s/^01    /chr1    /
s/^10    /chr10    /
s/^12    /chr12    /
s/^02    /chr2    /
s/^03    /chr3    /
s/^09    /chr9    /

all in one:

$ sed -f sed1.sed < input.bed | sort  -t '     ' -k1,1 -k2,12n | sed -f sed2.sed 
chr1    1    2
chr1    1    2
chr1    1    2
chr2    1    2
chr3    1    2
chr9    1    2
chr10    1    2
chr12    1    2

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 12.2 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

12.2 years ago

nnutter ▴ 210

I think JoinX does what you are requesting. It is a tool that The Genome Institute at Washington University has released to perform set operations on BED files and maybe other chr/pos formatted files.

Here's an example:

$ cat input.bed 
chr1    1       2
chr9    1       2
chr1    1       2
chr10   1       2
chrX    1       2
chr1    1       2
chr2    1       2
chr12   1       2
chr3    1       2
$ joinx sort -i input.bed                                                       
chr1    1       2
chr1    1       2
chr1    1       2
chr2    1       2
chr3    1       2
chr9    1       2
chr10   1       2
chr12   1       2
chrX    1       2

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 12.2 years ago by nnutter ▴ 210

1

Entering edit mode

9.4 years ago

nurithec ▴ 20

try this function. It was written it in R for Arabidopsis' so the number of chromosomes are 5' but you can change it.

order.data = function (bed) #BED like data
{
  o = order (bed[, "chr"])
  bed = bed[o,]
  #order data: od
  od = 1:ncol (bed)
  for (i in 1:5)
  {
    chr.bed = bed[bed[, "chr"]==i,]
    o = order (as.integer(as.character(chr.bed[,"start"])))
    od = rbind (od, chr.bed[o, ])
  }
  od = od [-1,]
  colnames (od)= colnames(bed)
  order.data = od
}

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.4 years ago by nurithec ▴ 20

1

Entering edit mode

7.7 years ago

chikaharu ▴ 10

More simply way

sort -Vk1,2

With the OSX standard sort command, it probably does not work with the V option, so please use GNU sort.

ADD COMMENT • link 7.7 years ago by chikaharu ▴ 10

1

Entering edit mode

7.6 years ago

Paul ★ 1.5k

I am using simply:

sort -k 1V,1 -k 2n,2 input.bed

ADD COMMENT • link 7.6 years ago by Paul ★ 1.5k

0

Entering edit mode

9.5 years ago

rhileighalmgren ▴ 20

To make a bed file match any genome hosted by ucsc, download or make the chom.sizes file for that genome. (e.g. Kent tools fetchChromSizes script or e.g. https://genome.ucsc.edu/goldenpath/help/hg19.chrom.sizes )

Then:

/me/tools/htslib/1.1/bin/bgzip --stdout ${inputBase}.noheader.noOverlap.bed >${inputBase}.noheader.noOverlap.bed.gz
/me/tools/htslib/1.1/bin/tabix -p vcf ${inputBase}.noheader.noOverlap.bed.gz
cat ${chromsizes} | cut -f1 | xargs /me/tools/htslib/1.1/bin/tabix -h ${inputBase}.noheader.noOverlap.bed.gz > ${outputBase}.bed

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by rhileighalmgren ▴ 20

Ram · Accepted Answer · 2013-09-16

17

Entering edit mode

11.7 years ago

tflutre ▴ 590

GNU sort can now sort in alpha-numeric order. See option -V, --version-sort of the following version "sort (GNU coreutils) 8.17".

Here is an example:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC"
chr10   1       2       A
chr2    1       2       B
chr1    1       2       C

And the result is:

$ echo -e "chr10\t1\t2\tA\nchr2\t1\t2\tB\nchr1\t1\t2\tC" | sort -k1,1V
chr1    1       2       C
chr2    1       2       B
chr10   1       2       A

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 11.7 years ago by tflutre ▴ 590

0

Entering edit mode

thank you so much for this information !

ADD REPLY • link 11.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

what a nice side-effect

ADD REPLY • link 9.0 years ago by Michi ▴ 990

0

Entering edit mode

what kind of side effect?

ADD REPLY • link 8.6 years ago by Dataman ▴ 380