Question

Error while running variant calling for DNA Sequence

0

Entering edit mode

8.2 years ago

sarthak • 0

I am trying to run variant calling for input BAM file as follows:

java -Xmx5120m  -jar /data/spark/tools/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R /data/spark/ref/ucsc.hg19.fasta -I /data/sarthaksharma/region0-3.bam -o region0.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 -L /data/sarthaksharma/bed0.intervals --no_cmdline_in_header --disable_auto_index_creation_and_locking_when_reading_rods

I am getting the following error:

##### ERROR
##### ERROR MESSAGE: Badly formed genome loc: Contig 'chrY      150797  151037' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
##### ERROR ------------------------------------------------------------------------------------------

I tried to look in the answers here: Snp Calling Error By Gatk Unifiedgenotyper Program but it didn't work. When I change the format of the .intervals file to .bed, it ran but gave an empty vcf file.

java -Xmx5120m  -jar /data/spark/tools/GenomeAnalysisTK.jar -T HaplotypeCaller -nct 4 -R /data/spark/ref/ucsc.hg19.fasta -I /data/sarthaksharma/region0-3.bam -o region0.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 -L /data/sarthaksharma/bed0.bed --no_cmdline_in_header --disable_auto_index_creation_and_locking_when_reading_rods

The log for this command:

Total compute time in PairHMM computeLikelihoods() : 0.0
INFO  10:39:24,505 HaplotypeCaller - Ran local assembly on 0 active regions
INFO  10:39:24,532 ProgressMeter -            done        1.76e+06    2.0 s        1.0 s    100.0%         2.0 s     0.0 s
INFO  10:39:24,534 ProgressMeter - Total runtime 2.03 secs, 0.03 min, 0.00 hours
INFO  10:39:24,535 MicroScheduler - 488 reads were filtered out during the traversal out of approximately 507 total reads (96.25%)
INFO  10:39:24,536 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  10:39:24,537 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  10:39:24,538 MicroScheduler -   -> 488 reads (96.25% of total) failing HCMappingQualityFilter
INFO  10:39:24,538 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  10:39:24,539 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  10:39:24,539 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  10:39:24,540 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO  10:39:26,141 GATKRunReport - Uploaded run statistics report to AWS S3

Could someone please tell any possible cause of this?

EDIT This is the snippet of the .bed file (tab separated):

chrY    150797  151037
chrY    155397  155496
chrY    157258  157446
chrY    158123  158363
chrY    249337  249469
chrY    251487  251707
chrY    252086  252146

gatk vcf haplotypeCaller • 2.4k views

ADD COMMENT • link 8.2 years ago by sarthak • 0

0

Entering edit mode

The Reference sequence used to generate the BAM is not the same one you are using. Check with

samtools view -H region0-3.bam | less

to know which Reference was used to generate BAM.

ADD REPLY • link 8.2 years ago by venu 7.1k

0

Entering edit mode

Can you tell me what the result is of running without the -L argument? How is your bed file separated?

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

Entering edit mode

The bed file is tab separated...I have als added the snippet in the question above. And when I run without the -L argument, it ran and printed the following three entries in the vcf file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
chrY    10011648        .       T       G       30.74   .       AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=43.32;MQ0=0;QD=15.37      GT:AD:DP:GQ:PL  1/1:0,2:2:6:58,6,0
chrY    10011673        .       G       A       62.74   .       AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=43.32;MQ0=0;QD=31.37      GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0
chrY    10011677        .       G       A       62.74   .       AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=43.32;MQ0=0;QD=31.37      GT:AD:DP:GQ:PL  1/1:0,2:2:6:90,6,0

ADD REPLY • link 8.2 years ago by sarthak • 0

0

Entering edit mode

So variantcalling works and there is a problem with the interval file... You are sure that it's tab separated and not mixed with spaces or something?

Did you notice in the log 488 reads were filtered out during the traversal out of approximately 507 total reads (96.25%)? So that explains why you only get three variants.

With the -L argument, would those 3 variants be filtered out? Essentially, are those 3 variants included in the bed file intervals?

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

Entering edit mode

I have checked it is a tab separated file. Here is how I am making it:

bedtools intersect -a gcat_set_025.bed -b tmp0.bed -header > bed0.bed

Would it do filtering if the intervals file is not in correct format.

ADD REPLY • link 8.2 years ago by sarthak • 0

0

Entering edit mode

Besides the problem with the interval file I would worry more about the bam file, given that only 19 reads are considered good enough by HaplotypeCaller.

Could you run sed 's/ \+ /\t/g' bed0.bed > newbed0.bed and check if that solves your problem?

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks...but this too didn't help

ADD REPLY • link 8.2 years ago by sarthak • 0

0

Entering edit mode

Strange. You ran HaplotypeCaller with -L newbed0.bed right? The error was the same?

Can you show me the output of

grep '>' /data/spark/ref/ucsc.hg19.fasta

and

samtools view -H region0-3.bam | less

(as already suggested by venu)

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

Entering edit mode

When I am running with the newbed0.bed, I get the same log as I have shown in the question, no error but the vcf file is empty. Here is the snippet of the grep (not complete result, there are lot of lines)

grep '>' /data/spark/ref/ucsc.hg19.fasta

>chrM
>chr1
>chr2
>chr3
>chr4
>chr5
...
>chr21
>chr22
>chrX
>chrY
>chr1_gl000191_random
>chr1_gl000192_random

And for:

samtools view -H region0-3.bam | less


@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     SN:chr14        LN:107349540
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     SN:chrY LN:59373566

ADD REPLY • link 8.2 years ago by sarthak • 0

0

Entering edit mode

Right, so, connecting the evidence:

your contigs in bam file and reference file are okay. No problem there
your bed file wasn't properly separated, the sed command took care of that and it's fixed in newbed0.bed. Therefore it ran without error this time.
there are no variants in the regions specified by the bed file (since the obtained vcf is empty). Please confirm this by checking whether the 3 variants which were found without -L argument are in the interval(s) specified by the bed file
you really need to have a look why so many reads in your bam file fail (mapping quality)