Hi,
I am building an NGS pipeline from scratch. FASTQ files have been aligned to the hg19 reference with BWA-MEM. Samtools was used for sorting and creating the index. Picard tools was used for marking duplicates and estimating the library complexity.
At this point, I want to run GATK BaseRecalibrator. However, I get this error message:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [NC_000001.10, NT_113878.1, NT_167207.1, NC_000002.11, NC_000003.11, NC_000004.11, NT_113885.1, NT_113888.1, NC_000005.9, NC_000006.11, NC_000007.13, NT_113901.1, NC_000008.10, NT_113909.1, NT_113907.1, NC_000009.11, NT_113914.1, NT_113916.2, NT_113915.1, NT_113911.1, NC_000010.10, NC_000011.9, NT_113921.2, NC_000012.11, NC_000013.10, NC_000014.8, NC_000015.9, NC_000016.9, NC_000017.10, NT_113941.1, NT_113943.1, NT_113930.1, NT_113945.1, NC_000018.9, NT_113947.1, NC_000019.9, NT_113948.1, NT_113949.1, NC_000020.10, NC_000021.8, NT_113950.2, NC_000022.10, NC_000023.10, NC_000024.9, NT_113961.1, NT_113923.1, NT_167208.1, NT_167209.1, NT_167210.1, NT_167211.1, NT_167212.1, NT_113889.1, NT_167213.1, NT_167214.1, NT_167215.1, NT_167216.1, NT_167217.1, NT_167218.1, NT_167219.1, NT_167220.1, NT_167221.1, NT_167222.1, NT_167223.1, NT_167224.1, NT_167225.1, NT_167226.1, NT_167227.1, NT_167228.1, NT_167229.1, NT_167230.1, NT_167231.1, NT_167232.1, NT_167233.1, NT_167234.1, NT_167235.1, NT_167236.1, NT_167237.1, NT_167238.1, NT_167239.1, NT_167240.1, NT_167241.1, NT_167242.1, NT_167243.1, NW_004070864.2, NW_003571030.1, NW_003871056.3, NW_003871055.3, NW_003315905.1, NW_003315906.1, NW_003315907.1, NW_004070863.1, NW_003871057.1, NW_004070865.1, NW_003315903.1, NW_003315904.1, NW_003315908.1, NW_004504299.1, NW_003571032.1, NW_003571033.2, NW_003315909.1, NW_003571031.1, NW_003871060.1, NW_003871059.1, NW_003315910.1, NW_004775426.1, NW_003315911.1, NW_003871058.1, NW_003315912.1, NW_003315913.1, NW_004775427.1, NW_003315915.1, NW_003315916.1, NW_003571035.1, NW_003315914.1, NW_003571034.1, NW_003315920.1, NW_003571036.1, NW_003315917.2, NW_003315918.1, NW_003871061.1, NW_004775428.1, NW_003315919.1, NW_004070866.1, NW_003871063.1, NW_003315921.1, NW_004504300.1, NW_003871062.1, NW_004775429.1, NW_004166862.1, NW_003571039.1, NW_003571038.1, NW_004775430.1, NW_003871064.1, NW_003571041.1, NW_003571037.1, NW_003871065.1, NW_003315922.2, NW_003571040.1, 
NW_003571042.1, NW_004775431.1, NW_003871066.2, NW_003315923.1, NW_003315924.1, NW_003315928.1, NW_003871067.1, NW_003315929.1, NW_003315930.1, NW_003315931.1, NW_004504301.1, NW_004070869.1, NW_003315925.1, NW_004070867.1, NW_004070868.1, NW_003315926.1, NW_003315927.1, NW_003571043.1, NW_003871071.1, NW_003315932.1, NW_003315934.1, NW_003315935.1, NW_003871068.1, NW_004504302.1, NW_003871070.1, NW_004775432.1, NW_003871069.1, NW_003315933.1, NW_004070870.1, NW_003871075.1, NW_003871082.1, NW_003315936.1, NW_003571045.1, NW_003871073.1, NW_003871074.1, NW_003571046.1, NW_004070871.1, NW_003871081.1, NW_003871079.1, NW_003871077.1, NW_003871080.1, NW_003871078.1, NW_003871072.2, NW_003871076.1, NW_003571048.1, NW_003571049.1, NW_003871083.2, NW_003571047.1, NW_003571050.1, NW_003315938.1, NW_003315939.1, NW_003315941.1, NW_003315942.2, NW_004504303.2, NW_003315940.1, NW_003315937.1, NW_003571051.1, NW_004166863.1, NW_003315943.1, NW_003315944.1, NW_003871084.1, NW_003315945.1, NW_003871085.1, NW_003315946.1, NW_004070872.2, NW_003315952.2, NW_003315951.1, NW_003315950.2, NW_004775433.1, NW_003871090.1, NW_004166864.2, NW_003315949.1, NW_003315948.2, NW_003871091.1, NW_003871093.1, NW_003871092.1, NW_003315953.1, NW_003571052.1, NW_003871086.1, NW_003315947.1, NW_003871088.1, NW_003315954.1, NW_003315955.1, NW_003871089.1, NW_003871087.1, NW_003315956.1, NW_003315959.1, NW_003315960.1, NW_003315957.1, NW_003315958.1, NW_003315961.1, NW_003871094.1, NW_003571053.2, NW_003315962.1, NW_003315964.2, NW_003315965.1, NW_003315963.1, NW_004775434.1, NW_004166865.1, NW_003571054.1, NW_003571055.1, NW_003571056.1, NW_003571057.1, NW_003571058.1, NW_003571059.1, NW_003571060.1, NW_003571061.1, NW_003315966.1, NW_003871095.1, NW_004504304.1, NW_003571063.2, NW_003315967.1, NW_003315968.1, NW_003315969.1, NW_003315970.1, NW_004775435.1, NW_004070874.1, NW_004070873.1, NW_004070875.1, NW_003871096.1, NW_003315972.1, NW_003315971.2, NW_004504305.1, NW_004070876.1, NW_003571064.2, 
NW_003871098.1, NW_003871099.1, NW_004070879.1, NW_004166866.1, NW_004070880.2, NW_004070877.1, NW_004070881.1, NW_004070882.1, NW_003871100.1, NW_003871101.3, NW_004070883.1, NW_004070884.1, NW_004070885.1, NW_003871102.1, NW_004070878.1, NW_004070891.1, NW_004070892.1, NW_004070893.1, NW_004070886.1, NW_004070887.1, NW_004070888.1, NW_004070889.1, NW_004070890.2, NW_003871103.3, NT_167244.1, NT_113891.2, NT_167245.1, NT_167246.1, NT_167247.1, NT_167248.1, NT_167249.1, NT_167250.1, NT_167251.1, NC_012920.1]
features contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y]
After running the GATK command the first time, I saw that it needed an additional index file, reference.dict. To create the file, I ran gatk CreateSequenceDictionary -R reference.fasta (as recommended on this page https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format) on the same reference file that was used for all previous analysis steps.
Previously, the reference file had only been processed by the bwa index reference.fasta command. I used the same reference.fasta file for the entire pipeline.
The reference files look fine to me; I assume the error arises due to the chromosome labels (features contigs) in the gnomAD.vcf file used as --known-sites in the command:
gatk BaseRecalibrator -I sample.sorted.bam -R reference.fasta --known-sites gnomad.genomes.r2.1.1.sites.vcf --known-sites gnomad.exomes.r2.1.1.sites.vcf -O recal_data.table
Am I supposed to edit these input files to match the contig labels? Do you recommend using other population VCF files? Any other ideas on how to fix this issue?
Any help would be appreciated.
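One way to confirm which file carries which names is to compare the contig names in reference.dict with those in each --known-sites VCF. The following is a minimal Python sketch, not part of GATK. The helper names dict_contigs and vcf_contigs and the few example lines are made up for illustration; a real run would read the lines from reference.dict and from the VCF files.

```python
import re

def dict_contigs(dict_lines):
    """Collect the SN: values from the @SQ lines of a sequence dictionary."""
    names = {}  # dict used as an ordered set
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.setdefault(field[3:])
    return list(names)

def vcf_contigs(vcf_lines):
    """Collect contig names from ##contig header lines and the CHROM column."""
    names = {}
    for line in vcf_lines:
        if line.startswith("##contig="):
            match = re.search(r"ID=([^,>]+)", line)
            if match:
                names.setdefault(match.group(1))
        elif line.strip() and not line.startswith("#"):
            names.setdefault(line.split("\t", 1)[0])
    return list(names)

# Illustrative lines only (hypothetical excerpts, not copied from the real files).
dict_example = [
    "@HD\tVN:1.5\n",
    "@SQ\tSN:NC_000001.10\tLN:249250621\n",
    "@SQ\tSN:NC_012920.1\tLN:16569\n",
]
vcf_example = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=1,length=249250621>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t10177\t.\tA\tAC\t.\tPASS\t.\n",
    "X\t60020\t.\tT\tA\t.\tPASS\t.\n",
]

reference_names = dict_contigs(dict_example)
feature_names = vcf_contigs(vcf_example)
shared = set(reference_names) & set(feature_names)

print("reference contigs:", reference_names)
print("features contigs:", feature_names)
print("overlap:", sorted(shared))
```

Both functions iterate over lines, so an open file handle such as open("reference.dict") can be passed in place of the example lists. An empty overlap would correspond to the "No overlapping contigs found" message above.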
Thanks for the reply. I did use the same reference file for the entire pipeline. I had a look at the reference.fasta and the reference.dict files, and they look fine to me. Now, I think it is a mismatch between the reference.fasta contig labels and the chromosome nomenclature in the vcf files used for the BaseRecalibrator --known-sites option. I edited my post accordingly.
Yes, that might be the case. How was the VCF obtained? The reference.fasta has to remain constant.
I downloaded them from the official gnomAD download page.
Did you find a solution for this problem? My reference FASTA is from Ensembl and so has Ensembl chromosome naming (e.g. 1, 2, 3) but the VCF file I have contains e.g. chr1, chr2.
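For the chr1 versus 1 case, the renaming itself is mechanical for the primary chromosomes. Below is a minimal Python sketch under that assumption. The function names to_ensembl, to_ucsc and rename_map_lines are hypothetical, and the sketch does not handle unplaced or alternate scaffolds, whose names differ by more than a prefix. Note also that in hg19 the UCSC chrM sequence is not the same as the MT sequence used by Ensembl, so renaming chrM is only safe if the two files were built on the same mitochondrial sequence.

```python
def to_ucsc(name):
    """Ensembl-style name ("1", "X", "MT") -> UCSC-style ("chr1", "chrX", "chrM")."""
    if name.startswith("chr"):
        return name  # already UCSC-style
    return "chrM" if name == "MT" else "chr" + name

def to_ensembl(name):
    """UCSC-style name ("chr1", "chrX", "chrM") -> Ensembl-style ("1", "X", "MT")."""
    if not name.startswith("chr"):
        return name  # already Ensembl-style
    return "MT" if name == "chrM" else name[3:]

def rename_map_lines(names, convert):
    """Build 'old<TAB>new' lines, one per contig, for a rename map file."""
    return ["{}\t{}".format(name, convert(name)) for name in names]

# Primary chromosomes only; this list is an assumption about the VCF content.
vcf_names = ["chr{}".format(n) for n in list(range(1, 23)) + ["X", "Y"]]

for line in rename_map_lines(vcf_names, to_ensembl):
    print(line)
```

The printed two-column "old new" map is the format that bcftools annotate --rename-chrs reads, so the VCF itself would not have to be edited by hand.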
I eventually gave up on using the gnomAD VCF and settled for only using the dbSNP VCF, which used the same reference. You'll probably find a compatible dbSNP file, but converting the gnomAD file may be complicated. Have a look at this post.
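For anyone who still wants to try the conversion, here is a minimal Python sketch of the renaming step for the case in the question, where the reference uses RefSeq accessions and the gnomAD VCF uses 1 to 22, X and Y. The accessions are copied from the "reference contigs" list in the error message. The names short_name, SHORT_TO_REFSEQ and rename_vcf_line are hypothetical, and the sketch assumes the usual RefSeq convention that NC_000023 is X and NC_000024 is Y. It is not a tested recipe for BaseRecalibrator.

```python
import re

# Primary-chromosome accessions as listed in the GATK error message above.
REFSEQ_CHROMS = [
    "NC_000001.10", "NC_000002.11", "NC_000003.11", "NC_000004.11",
    "NC_000005.9", "NC_000006.11", "NC_000007.13", "NC_000008.10",
    "NC_000009.11", "NC_000010.10", "NC_000011.9", "NC_000012.11",
    "NC_000013.10", "NC_000014.8", "NC_000015.9", "NC_000016.9",
    "NC_000017.10", "NC_000018.9", "NC_000019.9", "NC_000020.10",
    "NC_000021.8", "NC_000022.10", "NC_000023.10", "NC_000024.9",
]

def short_name(accession):
    """NC_000001.10 -> "1"; NC_000023.10 -> "X"; NC_000024.9 -> "Y"."""
    number = int(accession.split(".")[0].split("_")[1])
    return {23: "X", 24: "Y"}.get(number, str(number))

SHORT_TO_REFSEQ = {short_name(acc): acc for acc in REFSEQ_CHROMS}

def rename_vcf_line(line, mapping):
    """Rename the contig in a ##contig header line or in a record's CHROM column.

    Other header lines, and names missing from the mapping, are left unchanged.
    """
    if line.startswith("##contig="):
        return re.sub(
            r"ID=([^,>]+)",
            lambda m: "ID=" + mapping.get(m.group(1), m.group(1)),
            line,
            count=1,
        )
    if line.startswith("#") or not line.strip():
        return line
    chrom, sep, rest = line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest

# Illustrative record (hypothetical, not taken from the gnomAD file).
print(rename_vcf_line("1\t10177\t.\tA\tAC\t.\tPASS\t.\n", SHORT_TO_REFSEQ), end="")
```

The same mapping could be written out as "old new" pairs and passed to bcftools annotate --rename-chrs rather than rewriting lines in Python. Either way, the renamed VCF needs a new index before GATK can use it.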