GATK variant calling error
1
0
Entering edit mode
4.2 years ago
endretoth ▴ 40

Hi,

I have a trouble with variant calling and need help of someone who knows GATK. In a population study, I have mapped, with BWA-MEM, several loci onto a reference, which is one long fasta sequence (not chromosomes). After, I sorted and converted the SAM files so now I have BAM files, for all my samples.

I would like to call variants, only SNPs, from my samples using GATK. I would like to end up with a VCF with SNPs from all samples (I hope that GATK can match SNP positions among multiple samples).

So, I installed GATK 4.1.9.0, and created my command (it is very simple):

./gatk-4.1.9.0/gatk --java-options HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐O ./snps_output.vcf

I also tried:

./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐o ./snps_output.vcf

with know luck...

Error of the first command:

Error: Could not find or load main class HaplotypeCaller
Caused by: java.lang.ClassNotFoundException: HaplotypeCaller

Error of the second command:

A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{./samples/AL1-1_sorted.bam{./samples/BU1-1_sorted.bam{‐o{./snps_output.vcf}' but no positional argument is defined for this tool.

Since this is my first time with GATK please be gentle :D Thanks!

GATK SNP variant calling • 3.7k views
ADD COMMENT
0
Entering edit mode
4.2 years ago

in

./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐o ./snps_output.vcf

‐I ./samples/*.bam won't work. You need to put a -I before each bam -I in1.bam -I in2.bam -I in3.bam

or put the path to your bams in a file with the suffix .list

find ./samples/ -type f -name "*.bam" > in.list
  ./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I in.list ‐o ./snps_output.vcf ....
ADD COMMENT
0
Entering edit mode

Dear Pierre, I have tried your second suggestion, because I need the same snp positions in all samples(this is a population genetic study). Unfortunately, it didn't work.

My command was:

find ./samples/ -type f -name "*.bam" > in.list
./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I in.list ‐o ./snps_output.vcf

My result:

***********************************************************************

A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{in.list{‐o{./snps_output.vcf}' but no positional argument is defined for this tool.

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
ADD REPLY
0
Entering edit mode

Can HC take multiple inputs? Not a GATK user here but I think it can only take a single sample per run to produce an intermediate gVCF which then has to be postprocess for joint variant calling, no? What Pierre said :)

ADD REPLY
0
Entering edit mode

Can HC take multiple inputs?

yes. One can generate a VCF (not gvcf) directly from a small set of BAMs

ADD REPLY
0
Entering edit mode

you want option -O not -o

ADD REPLY
0
Entering edit mode

Hi Pierre,

I have changed -o to -O but the error still present:

***********************************************************************
A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{in.list{‐O{./snps_output.vcf}' but no positional argument is defined for this tool.
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Also, I have tried Freebayes for the same purpose but that cannot use multiple populations/samples

freebayes -f reference.fasta --bam-list in.list >var.vcf

this creates a vcf, but samples are combined...(I have checked in IGV viewer and there I cannot see multiple samples) I have no idea why :( It seems I stuck completely :(

ADD REPLY
0
Entering edit mode

it works on my machine:

 gatk --java-options "-Xmx500m" HaplotypeCaller -R rotavirus_rf.fa -I S1.bam -I S2.bam -O ~/jeter.vcf.gz
(...)
16:39:57.296 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 2.00 sec
16:39:57.297 INFO  HaplotypeCaller - Shutting down engine
[October 13, 2020 4:39:57 PM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.13 minutes.
Runtime.totalMemory()=508559360

check your files exist.

check your really have whitespaces between your arguments. For example put a echo in front of the command line and test with file - . MUST be ASCII text.

echo gatk --java-options "-Xmx500m" HaplotypeCaller -R rotavirus_rf.fa -I S1.bam -I S2.bam -O ~/jeter.vcf.gz | file -
/dev/stdin: ASCII text
ADD REPLY
0
Entering edit mode

Hi Pierre,

Unfortunately, the command still doesn't work. However, I think I found the source of the problem.

My reference is built from a library of EST sequences which were Blasted on reference species genome to recover the "full" corresponding gene. These DNA sequences were filtered to have a proper length, larger than the sequences what we want to map (paired-end reads from RADseq, ~300bp length). Reconstructed gene sequences, used as a reference, were also concatenated to have one long sequence for mapping. After, I used BWA-MEM to map about 300,000 RAD loci on this reference. I did not recognize in time that reference has some duplicated sequences, although when we Blasted, at the beginning, we kept only the very best hits.

This probably the source of error, because I recently found (What Does The Zero Mapping Quality Mean For The Bwa Mapper?) that "region is duplicated then all the reads mapping there will also map elsewhere" and this result in "Zero mapping quality" and "Variant callers ignore reads with mapping quality zero". I have checked my BAM files in IGV and almost all have zero quality...probably GATK cannot call SNPs from zero quality mapping.

Now, probably, I will go back to the Blast results and filter sequences that are duplicated. Long job. May I kindly ask your opinion? The original plan was to filter SNPs from those RAD loci that are positioned in specific genes (drought stress responsive genes: so map RAD loci on genes and after call SNPs, in every individual and populations).

Please check this IGV printscreen

https://freeimage.host/i/2bcMyN

ADD REPLY
0
Entering edit mode

that wouldn't explain the error message: A USER ERROR has occurred: Illegal argument value: Positional arguments were provided

ADD REPLY
0
Entering edit mode

You are right but at the moment I have no idea what else can cause the problem. I was thinking maybe, if there are only very few acceptable loci, GATK cannot hadle this exception, but I don't know :( Also, in IGV I see some good quality loci and some, very few, SNPs in it... so I don't know why.

May I kindly ask you to look at my data (it is very small)?

ADD REPLY

Login before adding your answer.

Traffic: 1700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6