Is it possible to run haplotypecaller in gnu parallel ?
3
3
Entering edit mode
3.9 years ago
DareDevil ★ 4.3k

I have a bash script about HaplotypeCaller.sh. Is it possible to run this in gnu parallel ?

cat HaplotypeCaller.sh | parallel

#!/bin/bash

for i in *.bam; do 
gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I $i -O $i.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list; done
gnu parallel HaplotypeCaller parallel gatk • 3.5k views
ADD COMMENT
1
Entering edit mode

Something like this should work

find . -name '*.bam' | rev | cut -c4- | parallel -I{} gatk --java-options "-Xmx40g" HaplotypeCaller \
    -R /media/gatk/Homo_sapiens_assembly38.fasta \
    -I {}.bam \
    -O {}.vcf.gz \
    --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz \
    -L /media/gatk/target/ag_V6_list.interval_list
ADD REPLY
0
Entering edit mode

I get following error while running above command

Error was: .detros_1C100/..bam with exception: Cannot read non-existent file:
ADD REPLY
0
Entering edit mode

@ 4galaxy77 rev | cut -c4- removes the .bam extension. But {.} in parallel removes the extension.

ADD REPLY
1
Entering edit mode

see if following prints exact commands you want to run and then remove --dry-run option

parallel --dry-run 'gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list' ::: *.bam

make sure that system has enough resources to execute parallel commands. Other wise, limit memory / threads/ cpus.

ADD REPLY
0
Entering edit mode

use a workflow manager.

ADD REPLY
4
Entering edit mode
3.9 years ago
ole.tange ★ 4.5k
#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$1".vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file. Then:
parallel do_one ::: *.bam

If you do not want the output to be named foo.bam.vcf.gz but instead foo.vcf.gz:

#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$2" --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file:
#   do_one foo.bam foo.vcf.gz
# Then:
parallel do_one {} {.} ::: *.bam
ADD COMMENT
2
Entering edit mode

As written above, the output files will have .bam.vcf.gz extension, where retaining the .bam was probably not desirable, and is why the OP's pipeline includes | cut -c4-.

I would not use a bash function in this case, and would rather write:

parallel gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list ::: *.bam

Taking advantage of parallel's {.} variable which is the file without its last extension (in this case, trimming the trailing .bam).

ADD REPLY
0
Entering edit mode

Cheers !! this works!!

ADD REPLY
0
Entering edit mode

This gives me an error export: Illegal option -f

ADD REPLY
0
Entering edit mode

What is your shell? Something other than bash? In any case, see my other comment. You don't need a function at all.

ADD REPLY
0
Entering edit mode

Are you writing exactly what I wrote? Because if you change #!/bin/bash to #!/bin/sh then it will not work.

ADD REPLY
0
Entering edit mode

Ya I have taken the same code provided by you. It gives an error export: Illegal option -f

ADD REPLY
0
Entering edit mode

Update: When I executed your code one by one it started running. But When I saved your code in a file as parallel_haplo.sh it does not work.

sh parallel_haplo.sh It gives the error: parallel_haplo.sh : 6: export: Illegal option -f

It works when I run as bash parallel_haplo.sh. I don't know what made the difference

ADD REPLY
3
Entering edit mode
3.9 years ago

nextflow (not tested)

params.ref="/media/gatk/Homo_sapiens_assembly38.fasta"
params.bams=""
params.dbsnp="/media/gatk/dbsnp_138.hg38.vcf.gz"
params.intervals="/media/gatk/target/ag_V6_list.interval_list"

Channel.fromPath(params.bams).splitCsv(header: false,sep:',',strip:true).map{T->file(T[0])}.set{bamfiles}

process hapCaller {
tag "${bam.name}"
memory "40g"
input:
   file bam from bamfiles
output:
   file("${bam.getBaseName()}.vcf.gz") into vcf
script:
"""
gatk --java-options " -Xmx${task.memory.giga}g -Djava.io.tmpdir=." HaplotypeCaller \
    -R ${params.ref} \
    -I {bam.toRealPath()} \
    -O "${bam.getBaseName()}.vcf.gz" \
    --dbsnp "${params.dbsnp}" \
    -L "${params.intervals}"
"""
}
ADD COMMENT
2
Entering edit mode
3.9 years ago
DareDevil ★ 4.3k

The answer given by Malcolm.Cook, ole.tange and cpad0112 works well

Method 1:

parallel gatk --java-options "-Xmx40g" HaplotypeCaller \
-R /media/gatk/Homo_sapiens_assembly38.fasta -I {} -O {.}.vcf.gz \
 --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz \
-L /media/gatk/target/ag_V6_list.interval_list ::: *.bam

Method 2:

#!/bin/bash

do_one() {
  gatk --java-options "-Xmx40g" HaplotypeCaller -R /media/gatk/Homo_sapiens_assembly38.fasta -I "$1" -O "$1".vcf.gz --dbsnp /media/gatk/dbsnp_138.hg38.vcf.gz -L /media/gatk/target/ag_V6_list.interval_list
}
export -f do_one
# test that do_one works on a single bam file. Then:
parallel do_one ::: *.bam
ADD COMMENT

Login before adding your answer.

Traffic: 1652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6