awk one line command
4
1
Entering edit mode
6.9 years ago
prasoonagar ▴ 10

I want to break my bed file into the chromosome-specific files containing the data of respective chromosome so I used the following command

 for chr in {1..22}; do awk -v chr=$chr '$1 == chr' data.txt > data_chr$chr.txt; done

The issue is that the files created do not contain any information.

My bed file has six columns and is having the following format

chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +
chr1    10174   10223   378_525_47  5   +
chr1    10174   10224   94_430_1386 6   +
chr1    10217   10279   4_218_1107  5   +
chr1    10224   10267   138_1957_415    59  -
chr1    10242   10272   638_1023_1664   4   +

Any suggestions will be highly appreciated

awk • 5.5k views
ADD COMMENT
6
Entering edit mode
6.9 years ago

You need to prefix the chr variable with the string literal chr, as right now you are asking awk to match the first field on numbers (1 through 22) and not chromosome names (chr1 through chr22).

$ for chr in {1..22}; do awk -v chr="chr${chr}" ...

But BEDOPS offers a much faster and simpler way of splitting BED files with bedextract, which uses a binary search approach on sorted BED files to jump to the start of a chromosome:

$ fn=intervals.bed
$ for chr in `bedextract --list-chr ${fn}`; do bedextract ${chr} ${fn} > ${fn}.${chr}.bed; done

With awk or grep, you need to re-read through the file linearly on every chromosome, which wastes time, especially on very large, whole-genome scale BED files. Use BEDOPS if your time is valuable to you.

ADD COMMENT
2
Entering edit mode
6.9 years ago

Using awk (same thing can be done using parallel. Please look at the end of the post for parallel):

$ awk -F "\t" '{print >$1".bed"}' test.bed

I modified OP bed as follows to split: input:

$ cat test.bed
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +
chr2    10174   10223   378_525_47  5   +
chr2    10174   10224   94_430_1386 6   +
chr3    10217   10279   4_218_1107  5   +
chr3    10224   10267   138_1957_415    59  -
chr3    10242   10272   638_1023_1664   4   +

output:

$ ls chr*
chr1.bed  chr2.bed  chr3.bed

post splitting, chr1.bed:

$ cat chr1.bed 
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr1    10165   10215   382_570_1514    9   +
chr1    10170   10196   664_789_434 1   +
chr1    10171   10221   439_96_1067 10  +

using parallel:

$ seq 22 | parallel 'grep  chr{} input.bed > chr{}.bed'
ADD COMMENT
2
Entering edit mode

Note that grep chr1 input.bed will return records for chr1, chr10, chr11, etc. and grep chr2 input.bed will return records for chr2, chr20, and chr21, so the output for those search keys will contain more intervals than probably expected and is likely incorrect.

ADD REPLY
0
Entering edit mode

That is correct. But it can be improved with -w option:

$ seq 22 | parallel "grep  -w chr{} test.bed > chr{}.bed"

Example bed:

 $ cat test.bed
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
chr2    10165   10215   382_570_1514    9   +
chr2    10170   10196   664_789_434 1   +
chr3    10171   10221   439_96_1067 10  +
chr3    10174   10223   378_525_47  5   +
chr10   10174   10224   94_430_1386 6   +
chr10   10217   10279   4_218_1107  5   +
chr20   10224   10267   138_1957_415    59  -
chr20   10242   10272   638_1023_1664   4   +

After execution:

$ cat chr1.bed 
chr1    10094   10144   563_280_1408    1   +
chr1    10095   10144   421_286_567 0   +
$ cat chr10.bed
chr10   10174   10224   94_430_1386 6   +
chr10   10217   10279   4_218_1107  5   +
$ cat chr2
chr20.bed  chr21.bed  chr22.bed  chr2.bed   
$ cat chr2.bed 
chr2    10165   10215   382_570_1514    9   +
chr2    10170   10196   664_789_434 1   +
$ cat chr20.bed 
chr20   10224   10267   138_1957_415    59  -
chr20   10242   10272   638_1023_1664   4   +
ADD REPLY
0
Entering edit mode

This perfectly working thanks but if there are multiple files then how can parallel be used where each file need to be labeled with original file name and the chromosome name.

ADD REPLY
0
Entering edit mode

If original file name is "test.bed", and you want to append this to every chromosome file to be created, then command would be:

$ seq 22 | parallel "grep  -w chr{} test.bed > test.bed.chr{}.bed"

This would create multiple files with original name and the chromosome name.

ADD REPLY
1
Entering edit mode
6.9 years ago
GenoMax 147k

Have you tried grep instead of awk? grep -w "chr1" yourfile > chr1.bed and grep -w "chr11" yourfile > chr11.bed etc.

ADD COMMENT
1
Entering edit mode
6.9 years ago
Charles Plessy ★ 2.9k

To split the BED file in multiple files, with a single awk command, without for loops:

$ cat toto.bed 
chr1    1   2
chr1    1   3
chr2    1   4
chr3    4   5
$ awk '
  NR == FNR {chr[$1] ; next}
  {for (c in chr) {
    if ($1 == c) {
      print >> c".bed"
      close(c".bed")
      next
  }}}' toto.bed toto.bed
$ head chr*
==> chr1.bed <==
chr1    1   2
chr1    1   3

==> chr2.bed <==
chr2    1   4

==> chr3.bed <==
chr3    4   5

With this syntax, your file only needs to be read twice. Big thanks to https://stackoverflow.com/questions/8363718/grep-to-multiple-output-files/8363776#8363776.

ADD COMMENT

Login before adding your answer.

Traffic: 2018 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6