3.6 years ago
Vic
I would like to know how to use Spark within GATK for multi-threaded analysis. Unfortunately, the Broad Institute's cluster/Spark tutorial documentation is still in progress. I have been using HaplotypeCaller, which has worked fine, but I now have some pooled-seq samples that take much longer to process, so I would like to spread the workload. This is an example of my usage:
gatk HaplotypeCaller -I my_pooled_sample.bam -I another_pooled_sample.bam -L a_chromosome -R ref_genome.fna -O my_out_file.g.vcf -ploidy 10 -- --spark-master local[2]
I took the Spark argument above from this example, but it didn't work. I checked the help output and got this:
> gatk forwards commands to GATK and adds some sugar for submitting spark jobs
> --spark-runner <target> controls how spark tools are run
> valid targets are:
> LOCAL: run using the in-memory spark runner
> SPARK: run using spark-submit on an existing cluster
> --spark-master must be specified
> --spark-submit-command may be specified to control the Spark submit command
> arguments to spark-submit may optionally be specified after --
> GCS: run using Google cloud dataproc
> commands after the -- will be passed to dataproc
> --cluster <your-cluster> must be specified after the --
> spark properties and some common spark-submit parameters will be translated
> to dataproc equivalents
I then tried using:
--spark-runner local[2]
That also didn't work. I would appreciate some guidance. Many thanks.
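Going by the help text quoted above, `--spark-runner` takes one of the targets LOCAL, SPARK, or GCS rather than a master URL such as local[2], and it controls how Spark tools are run. Below is a minimal sketch of how the pieces might fit together. It assumes that the beta HaplotypeCallerSpark tool accepts the same tool arguments as HaplotypeCaller (including multiple -I inputs and -ploidy), which should be checked with `gatk HaplotypeCallerSpark --help`. The names CORES, SPARK_MASTER, and DRY_RUN are hypothetical and used only for illustration. The Spark arguments are placed after the bare `--`, following the layout of the command above.

```shell
#!/bin/sh
# Sketch only, not a verified recipe.
# Assumption: HaplotypeCallerSpark (beta) accepts the same tool arguments
# as HaplotypeCaller; confirm with `gatk HaplotypeCallerSpark --help`.

CORES=2                          # hypothetical: local cores to give Spark
SPARK_MASTER="local[${CORES}]"   # quoted so the shell never globs the [N]

# Tool arguments come before the bare "--"; Spark/wrapper arguments after it.
set -- gatk HaplotypeCallerSpark \
    -I my_pooled_sample.bam \
    -I another_pooled_sample.bam \
    -L a_chromosome \
    -R ref_genome.fna \
    -O my_out_file.g.vcf \
    -ploidy 10 \
    -- \
    --spark-runner LOCAL \
    --spark-master "$SPARK_MASTER"

# By default only print the command; set DRY_RUN=0 to actually run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
else
    "$@"
fi
```

Quoting the master URL matters: an unquoted local[2] is a glob pattern to the shell and would be replaced by a matching filename (for example local2) if one exists in the working directory.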
Cross-posted: https://stackoverflow.com/questions/67074318
I am sorry, I didn't realise that wasn't allowed, I have deleted the other post.
bio_vincent, did you ever find a solution to this?