Question

Getting error with awk when using parallel processing in bedtools

1


6.1 years ago

nastaran.esfahani ▴ 10

I have 44 .tsv files in one folder and I want to count the intersections for every pair of files with the intersect command of bedtools. Each output has 4 columns, and I only need to save the sum of column 4 for each pair. This is easy when I run the pairs one at a time, but when I use parallel to run the whole process at once I get a syntax error.

Here is the code and the result when I run a single pair manually:

$ bedtools intersect -a p1.tsv -b p2.tsv -c

chr1 1 5 1

chr1 8 12 1

chr1 18 20 1

chr1 21 25 0

$ bedtools intersect -a p1.tsv -b p2.tsv -c | awk '{sum+=$4} END {print sum}'

3

Here is the code and the error when I use parallel:

$ parallel "bedtools intersect -a {1} -b {2} -c |awk '{sum+=$4} END {print sum}'> {1}.{2}.intersect" ::: `ls *.tsv` ::: `ls *.tsv`

awk: cmd. line:1:{sum+=} END {print sum}
awk: cmd. line:1:            ^ syntax error
awk: cmd. line:1:{sum+=} END {print sum}
awk: cmd. line:1:            ^ syntax error
awk: cmd. line:1:{sum+=} END {print sum}
awk: cmd. line:1:            ^ syntax error
awk: cmd. line:1:{sum+=} END {print sum}
awk: cmd. line:1:            ^ syntax error

bedtools intersect parallel • 2.5k views

updated 6.1 years ago by ATpoint 89k • written 6.1 years ago by nastaran.esfahani ▴ 10

0


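Within double quotes, $ is interpreted by the shell, so $4 is expanded as a shell variable (empty here) before awk ever sees it. That is why the error message shows {sum+=}. The $ must be escaped: https://unix.stackexchange.com/questions/162476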
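
A minimal sketch of the difference, assuming a POSIX-like shell. The variable names `unescaped`, `escaped` and `total` are only for illustration, and the input rows are copied from the example output in the question. The last comment shows how the same escape might be applied to the original command; it is left unexecuted because it needs parallel and bedtools.

```shell
set --   # clear positional parameters so that $4 is empty, as in a fresh interactive shell

# Unescaped: the calling shell expands $4 inside the double quotes before awk runs.
unescaped="awk '{sum+=$4} END {print sum}'"
printf '%s\n' "$unescaped"   # prints: awk '{sum+=} END {print sum}'

# Escaped: \$4 survives the double quotes and reaches awk as a literal $4.
escaped="awk '{sum+=\$4} END {print sum}'"
printf '%s\n' "$escaped"     # prints: awk '{sum+=$4} END {print sum}'

# Feed the escaped program the four rows from the question's example.
total=$(printf 'chr1\t1\t5\t1\nchr1\t8\t12\t1\nchr1\t18\t20\t1\nchr1\t21\t25\t0\n' | sh -c "$escaped")
printf '%s\n' "$total"       # prints: 3

# The same escape applied to the original command (not run here):
# parallel "bedtools intersect -a {1} -b {2} -c | awk '{sum+=\$4} END {print sum}' > {1}.{2}.intersect" ::: *.tsv ::: *.tsv
```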

6.1 years ago by Pierre Lindenbaum 166k

0


I have found combining GNU parallel and awk tricky because quotes and special characters have to be escaped. How many cores does your computer have? If it is only two, I would just go with a non-parallel bash for loop.
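
A rough sketch of such a loop, assuming it is run from the folder that holds the .tsv files. The function name `pairwise_sums` is hypothetical, and the output naming (for example p1.p2.intersect) is just one possible choice. Because awk is called directly, its program can stay in single quotes and nothing needs escaping.

```shell
# Hypothetical helper: for every ordered pair of distinct .tsv files,
# sum column 4 of `bedtools intersect -c` and save it to <a>.<b>.intersect.
pairwise_sums() {
  for a in *.tsv; do
    for b in *.tsv; do
      # Skip self-comparisons, and the unexpanded pattern when no .tsv files exist.
      [ -e "$a" ] && [ -e "$b" ] && [ "$a" != "$b" ] || continue
      bedtools intersect -a "$a" -b "$b" -c \
        | awk '{sum+=$4} END {print sum}' > "${a%.*}.${b%.*}.intersect"
    done
  done
}
```

Calling `pairwise_sums` in the data folder runs the pairs one after another. With 44 files that is 44 x 43 = 1892 bedtools runs.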

6.1 years ago by jean.elbers ★ 1.7k

score 4 · Answer 1 · 2019-07-02

It is probably the quoting that breaks things. For simplicity, it is better to put the part you want to parallelize into a function and then run that function with parallel:

function PL {

  ## Skip the job if both input files are the same:
  if [[ "$1" == "$2" ]]; then return 0; fi

  ## Intersect and sum column 4; the file extensions are stripped from the output name:
  bedtools intersect -a "${1}" -b "${2}" -c | awk '{sum+=$4} END {print sum}' > "${1%.*}.${2%.*}.intersect"
}; export -f PL

parallel "PL" ::: *.tsv ::: *.tsv