MD5 differs between BAM contig and reference; happens with both UnifiedGenotyper and HaplotypeCaller
4.6 years ago
j.lunger18 ▴ 30

I'm trying to call variants for a large number of BAM files, but I keep getting the following error:

##### ERROR   contig  reads is named chr9 with length 138394717 and MD5 6c198acf68b5af7b9d676dfdd531b5de
##### ERROR   contig  reference is named chr9 with length 138394717 and MD5 addd2795560986b7491c40b1faa3978a.

I haven't seen this case in any other posts: the contig length is the same but the MD5 is different. These BAM files came from TCGA, so in theory they were aligned to hg38, and I used an hg38 reference for variant calling.

Any help?

GATK VCF MD5
4.6 years ago

One thing that could differ is that one hg38 reference might use degenerate (IUPAC) base codes at some positions while another just uses 'N'. I also don't know whether upper/lower case affects the MD5 checksum...

You can always fool GATK by replacing the MD5 in the .dict file...
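
To see which of these differences actually changes a contig checksum, here is a minimal Python sketch. It assumes the per-contig MD5 follows the M5 definition in the SAM specification as I understand it (MD5 of the sequence in uppercase with whitespace removed); I have not checked how GATK computes it internally. The function name `contig_md5s` and the toy sequences are made up for illustration.

```python
import hashlib


def contig_md5s(fasta_lines):
    """Yield (contig_name, md5_hex) for each record in a FASTA.

    Assumption: the checksum is taken over the uppercased sequence
    with all whitespace removed (the SAM-spec M5 convention).
    """
    name, digest = None, None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, digest.hexdigest()
            # Contig name is the first word after '>'
            name = line[1:].split()[0]
            digest = hashlib.md5()
        elif name is not None:
            digest.update("".join(line.split()).upper().encode("ascii"))
    if name is not None:
        yield name, digest.hexdigest()


# Toy records (hypothetical): same length, different content.
soft_masked = [">chr9 toy", "ACGTnnacgt", "ACGR"]
upper_case = [">chr9", "ACGTNNACGT", "ACGR"]
n_instead_of_r = [">chr9", "ACGTNNACGT", "ACGN"]

for label, rec in [("soft-masked", soft_masked),
                   ("upper-case", upper_case),
                   ("N for R", n_instead_of_r)]:
    for contig, md5 in contig_md5s(rec):
        print(label, contig, md5)
```

Under that assumption, soft-masking (lower case) gives the same MD5 as the upper-case record, while swapping a degenerate base for 'N' gives a different one even though the length is unchanged. For a real reference, pass an open file handle instead of a list, e.g. `contig_md5s(open("reference.fa"))`, where `reference.fa` is a placeholder name.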

4.6 years ago
igor 13k

There is not a single version of hg38. See previous related discussion: BSgenome.Hsapiens.UCSC.hg38 vs BSgenome.Hsapiens.NCBI.GRCh38

Even if the length is the same, you can have masked bases.

Since the question is about TCGA, I assume the files came from GDC. In that case, the reference files are described here: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files


Thanks, I downloaded the reference file from TCGA. Should I be concerned that the MD5 from this file doesn't match either of the MD5s that were in my error?

From your link: GRCh38.d1.vd1.fa.tar.gz md5: 3ffbcfe2d05d43206f57f81ebb251dc9

From my error: contig reads: 6c198acf68b5af7b9d676dfdd531b5de contig reference: addd2795560986b7491c40b1faa3978a.


You can also try checking the BAM file header (samtools view -H file.bam). It may have the file name of the FASTA file. Maybe it is not GRCh38.d1.vd1.fa.
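
If the header has a lot of contigs, a small script can make the comparison easier. This is only a sketch: it takes the text printed by `samtools view -H file.bam` and the text of the reference `.dict` file (which uses the same `@SQ` line format), and lists contigs whose `M5` tags differ. The function names are hypothetical, and the `UR` path in the example is a placeholder; only the chr9 length and MD5 values come from the error message above.

```python
def parse_sq_lines(header_text):
    """Return {contig_name: {tag: value}} for every @SQ line in a SAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields are tab-separated TAG:VALUE pairs; split on the first ':' only,
        # since values such as UR may themselves contain colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields:
            contigs[fields["SN"]] = fields
    return contigs


def md5_mismatches(bam_header_text, dict_text):
    """List (contig, bam_md5, ref_md5) where both sides carry M5 and they differ."""
    bam = parse_sq_lines(bam_header_text)
    ref = parse_sq_lines(dict_text)
    out = []
    for name, tags in bam.items():
        bam_md5 = tags.get("M5")
        ref_md5 = ref.get(name, {}).get("M5")
        if bam_md5 and ref_md5 and bam_md5 != ref_md5:
            out.append((name, bam_md5, ref_md5))
    return out


# Example using the chr9 values from the error; the UR path is a placeholder.
bam_header = ("@HD\tVN:1.5\n"
              "@SQ\tSN:chr9\tLN:138394717\t"
              "M5:6c198acf68b5af7b9d676dfdd531b5de\t"
              "UR:file:/hypothetical/ref.fa\n")
ref_dict = ("@HD\tVN:1.5\n"
            "@SQ\tSN:chr9\tLN:138394717\t"
            "M5:addd2795560986b7491c40b1faa3978a\n")

print(parse_sq_lines(bam_header)["chr9"].get("UR"))
print(md5_mismatches(bam_header, ref_dict))
```

The `UR` tag, when present, is where the FASTA file name or location would show up. Not every BAM header includes `M5` or `UR`, so an empty result does not prove the references match.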
