Question

GATK Haplotype Caller

0

Entering edit mode

8.8 years ago

plink_9857 ▴ 50

Hi -

I am encountering errors when using GATK haplotypecaller with UCSC annotated VCF (bcftools), and BAM files with defined read groups. I also tried using GATK haplotype caller with a standard UCSC HG38 VCF and got the same error message which can be seen below.

My question is, what should I do to avoid this error message: no overlapping contigs found Here is my code:

Creation of contig header: awk '{printf("##contig=<id=%s,length=%d>\n",$1,$2);}' ucsc.hg38.fasta.fai > contigs_header_ucsc_hg38.txt

BCFtools code: margaret@SII-T7500-01:~/Data/UCSC/hg38$ bcftools annotate -h contigs_header_ucsc_hg38.txt All_20151104.vcf.gz -O v -o /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf Warning: The index file is older than the data file: All_20151104.vcf.gz.tbi

GATK haplotype caller output (error message):

margaret@SII-T7500-01:~/Programs$ java -jar GenomeAnalysisTK.jar -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,908 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,911 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 INFO 11:43:39,911 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 11:43:39,911 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk INFO 11:43:39,912 HelpFormatter - [Wed Nov 02 11:43:39 MST 2016] Executing on Linux 4.4.0-45-generic amd64 INFO 11:43:39,912 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13 JdkDeflater INFO 11:43:39,915 HelpFormatter - Program Args: -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,919 HelpFormatter - Executing as margaret@SII-T7500-01 on Linux 4.4.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13. INFO 11:43:39,920 HelpFormatter - Date/Time: 2016/11/02 11:43:39 INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,950 GenomeAnalysisEngine - Strictness is SILENT INFO 11:43:40,029 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 11:43:40,035 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:43:40,060 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 11:43:40,068 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

ERROR ------------------------------------------------------------------------------------------

ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):

ERROR

ERROR This means that one or more arguments or inputs in your command are incorrect.

ERROR The error message below tells you what is the problem.

ERROR

ERROR If the problem is an invalid argument, please check the online documentation guide

ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.

ERROR

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions https://www.broadinstitute.org/gatk

ERROR

ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.

ERROR

ERROR MESSAGE: Input files /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63for more information. Error details: No overlapping contigs found.

ERROR /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]

ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]

ERROR ------------------------------------------------------------------------------------------

next-gen GATK haplotype caller contigs • 6.0k views

ADD COMMENT • link updated 8.6 years ago by Biostar 20 • written 8.8 years ago by plink_9857 ▴ 50

3

Entering edit mode

Closing until you demonstrate that you've actually read the error message.

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

1

Entering edit mode

In all fairness to the user why cant the tool just say one thing:

Your chromosome names do no match, you are calling them 1,2,3 the reference is chr1, chr2...

why does this all need to be fished out from dozens of lines?

come to think of it, why not just recognize this error and run it. It is such a little thing to fix. Load it up as chr and be done with it.

ADD REPLY • link 8.8 years ago by Istvan Albert 103k

2

Entering edit mode

The ways of GATK will always remain a mystery to me. I agree on simply handling with/without "chr", if we can do that in deepTools then GATK should be able to do that with their actual professional developers.

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

0

Entering edit mode

Can't we just start over with this entire chromosome nomenclature thing and globally agree on names?

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

0

Entering edit mode

I actually have a theory of why the naming of 1,2,3 for chromosome is around and seems so difficult to stamp out.

I think it is perpetuated by the (lazy) programmers that would rather generate chromosome names with a construct like range(1,22)

ADD REPLY • link 8.8 years ago by Istvan Albert 103k

0

Entering edit mode

I'll take the opposing view that the Ensembl nomenclature is the sensible way and that needlessly tacking chr onto things (regardless of whether they're chromosomes) is just a waste of bytes.

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

1

Entering edit mode

It is about usability not waste of bytes. A lot many more resources may be wasted from misunderstandings.

When we see chr1 in any context we know it is a chromosome.

When we see a 1 it looks like a number and suggests a numerical value rather than a label. Hence the confusion.

ADD REPLY • link 8.8 years ago by Istvan Albert 103k

0

Entering edit mode

The waste of storage is probably the most compelling argument for just using 1,2,3; but to me using chr1 makes more sense because the name is a string, not an integer.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

1

Entering edit mode

I think the "chr1" vs. "1" debate is the "vi" vs. "emacs" of bioinformatics :P

ADD REPLY • link 8.8 years ago by Devon Ryan 105k

1

Entering edit mode

However I wouldn't really care which would be the outcome if a consensus was found.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

1

Entering edit mode

Good luck with consensus - I work with several ~~nonhuman~~ model organism biologists who insist on using Roman numerals (which really mucks up the sort order when you get to IX).

ADD REPLY • link 8.8 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

It would also be fine for me if we would use the names of Renaissance painters or Greek philosophers as identifiers. Sir your son has trisomy Michelangelo.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

1

Entering edit mode

Or Socrates syndrome...

Getting punchy from sleep depravation. Signing off!

ADD REPLY • link 8.8 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

I took several steps to get the contigs in the same format chr...(i.e. changing the contig header in the post-bcftools VCF file, aligning my samples to HG38, even changing the VCF file's chrom entries to chr# ) so that this error would not continue to show up. Unfortunately, I kept getting the same error. The point of this post was to get tips on how to successfully change the contig format so that this error no longer appears. There are several bioinformaticists that encounter this error when trying to prepare their data files for various programs.

That's why I think it would be worthwhile to re-open this post.

ADD REPLY • link 8.8 years ago by plink_9857 ▴ 50

1

Entering edit mode

Just use the same reference as -R ref.fasta with HaplotypeCaller as you used when aligning the samples. Avoid manual tampering with headers/vcf files!

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

0

Entering edit mode

Thanks for the suggestion, but I tried that, its just hard to see in the command I used above.

ADD REPLY • link 8.8 years ago by plink_9857 ▴ 50

1

Entering edit mode

Please use ADD REPLY when answering to earlier comments, as such this thread remains logically structured and easy to follow.

And if that's what you did then magically somehow the chromosomes got changed. So that's not what you did. You told us earlier that you manually changed things. I've used haplotypecaller hundreds of times and never had an error like that unless you combine an Ensembl reference for variant calling with reads mapped to a UCSC reference or vice versa. The error message is very clear. The contigs in your vcf are 1, 2, 3,... while those in the reference you supplied are chr1, chr2, chr3.

ADD REPLY • link 8.8 years ago by WouterDeCoster 48k

0

Entering edit mode

therefore, it doesn't matter whether your reads contain "chr1" or "1", as long as the reference contains exactly the same nomenclature...

ADD REPLY • link 8.8 years ago by TriS ★ 4.8k