Hi -
I am encountering errors when using GATK haplotypecaller with UCSC annotated VCF (bcftools), and BAM files with defined read groups. I also tried using GATK haplotype caller with a standard UCSC HG38 VCF and got the same error message which can be seen below.
My question is, what should I do to avoid this error message: no overlapping contigs found Here is my code:
Creation of contig header: awk '{printf("##contig=<id=%s,length=%d>\n",$1,$2);}' ucsc.hg38.fasta.fai > contigs_header_ucsc_hg38.txt
BCFtools code: margaret@SII-T7500-01:~/Data/UCSC/hg38$ bcftools annotate -h contigs_header_ucsc_hg38.txt All_20151104.vcf.gz -O v -o /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf Warning: The index file is older than the data file: All_20151104.vcf.gz.tbi
GATK haplotype caller output (error message):
margaret@SII-T7500-01:~/Programs$ java -jar GenomeAnalysisTK.jar -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,908 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,911 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 INFO 11:43:39,911 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 11:43:39,911 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk INFO 11:43:39,912 HelpFormatter - [Wed Nov 02 11:43:39 MST 2016] Executing on Linux 4.4.0-45-generic amd64 INFO 11:43:39,912 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13 JdkDeflater INFO 11:43:39,915 HelpFormatter - Program Args: -R /home/margaret/Data/UCSC/hg38/ucsc.hg38.fasta -T HaplotypeCaller -I /home/margaret/Data/UCSC/hg38/normal_ucsc_hg38_readGroups.bam -I /home/margaret/Data/UCSC/hg38/tumor_ucsc_hg38_readGroups.bam -D /media/margaret/Data/Data/fastqs/annotated_dbsnp_ucsc_hg38.vcf -o /media/margaret/Data/Data/fastqs/haplotyped_dbsnp_ucsc_hg38.vcf INFO 11:43:39,919 HelpFormatter - Executing as margaret@SII-T7500-01 on Linux 4.4.0-45-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_101-b13. INFO 11:43:39,920 HelpFormatter - Date/Time: 2016/11/02 11:43:39 INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,920 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:43:39,950 GenomeAnalysisEngine - Strictness is SILENT INFO 11:43:40,029 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 11:43:40,035 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:43:40,060 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 INFO 11:43:40,068 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
Closing until you demonstrate that you've actually read the error message.
In all fairness to the user why cant the tool just say one thing:
why does this all need to be fished out from dozens of lines?
come to think of it, why not just recognize this error and run it. It is such a little thing to fix. Load it up as chr and be done with it.
The ways of GATK will always remain a mystery to me. I agree on simply handling with/without "chr", if we can do that in deepTools then GATK should be able to do that with their actual professional developers.
Can't we just start over with this entire chromosome nomenclature thing and globally agree on names?
I actually have a theory of why the naming of 1,2,3 for chromosome is around and seems so difficult to stamp out.
I think it is perpetuated by the (lazy) programmers that would rather generate chromosome names with a construct like
range(1,22)
I'll take the opposing view that the Ensembl nomenclature is the sensible way and that needlessly tacking
chr
onto things (regardless of whether they're chromosomes) is just a waste of bytes.It is about usability not waste of bytes. A lot many more resources may be wasted from misunderstandings.
When we see
chr1
in any context we know it is a chromosome.When we see a
1
it looks like a number and suggests a numerical value rather than a label. Hence the confusion.The waste of storage is probably the most compelling argument for just using 1,2,3; but to me using chr1 makes more sense because the name is a string, not an integer.
I think the "chr1" vs. "1" debate is the "vi" vs. "emacs" of bioinformatics :P
However I wouldn't really care which would be the outcome if a consensus was found.
Good luck with consensus - I work with several
nonhumanmodel organism biologists who insist on using Roman numerals (which really mucks up the sort order when you get to IX).It would also be fine for me if we would use the names of Renaissance painters or Greek philosophers as identifiers. Sir your son has trisomy Michelangelo.
Or Socrates syndrome...
Getting punchy from sleep depravation. Signing off!
I took several steps to get the contigs in the same format chr...(i.e. changing the contig header in the post-bcftools VCF file, aligning my samples to HG38, even changing the VCF file's chrom entries to chr# ) so that this error would not continue to show up. Unfortunately, I kept getting the same error. The point of this post was to get tips on how to successfully change the contig format so that this error no longer appears. There are several bioinformaticists that encounter this error when trying to prepare their data files for various programs.
That's why I think it would be worthwhile to re-open this post.
Just use the same reference as
-R ref.fasta
with HaplotypeCaller as you used when aligning the samples. Avoid manual tampering with headers/vcf files!Thanks for the suggestion, but I tried that, its just hard to see in the command I used above.
Please use
ADD REPLY
when answering to earlier comments, as such this thread remains logically structured and easy to follow.And if that's what you did then magically somehow the chromosomes got changed. So that's not what you did. You told us earlier that you manually changed things. I've used haplotypecaller hundreds of times and never had an error like that unless you combine an Ensembl reference for variant calling with reads mapped to a UCSC reference or vice versa. The error message is very clear. The contigs in your vcf are 1, 2, 3,... while those in the reference you supplied are chr1, chr2, chr3.
therefore, it doesn't matter whether your reads contain "chr1" or "1", as long as the reference contains exactly the same nomenclature...