Hello
I have been struggling with GATK 4 and feeling an acute lack of the command line pipelines available in previous versions. I took a look at the Tool Index and tried to make out the commands available from the https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl section
I have RNASeq data and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket and currently I have processed around 600 samples with this same reference and followed till the Split n Cigar step with no problem
In the BaseRecalibrator step I get the following error :
a) GATK version
The Genome Analysis Toolkit (GATK) v4.1.7.0 HTSJDK Version: 2.21.2 Picard Version: 2.21.9
b) Exact GATK commands used
GATK BaseRecalibrator -I input.bam -R reference.fasta --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf --known-sites resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf -O recal_data.table
c) The entire error log if applicable.
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found. reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3
I downloaded these files from the Resource Bundle. Could you please tell me why they aren't working. Since re-mapping at this stage is not preferable and I am sure I am using the same reference as I used for all steps before this.
Please help. Thank you very much
Yep, there you have it. Your reference is NCBI nomenclature (chromosome names) and your bundle is not. Either remap the files with a different reference, or try to rename the chromosomes in every of the BAM files or try to find a way to convert the bundle resources to the nomenclature of the bundle.
So when the error says that the feature contigs are different
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, .....
I thought dbSNP was from NCBI. So atleast that should have worked?
These are your contig names. Don't ask me why NCBI thought that distributing reference genomes with these odd
NT_...
was a good idea, I personally find it terrible, but this is what you have.@ATpoint Thank you for the replies. I understand the issue and I am feeling hopeless. Is there any way at all without remapping? Any tool I can use. As of this moment I don't even know which contig in Reference == which contig in Features.
To be sure, can you from one of the BAM file post the output of
samtools idxstats that.bam
? Don't worry, we find a solution.The conversion table you need is probably this one here https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCh38_RefSeq2UCSC.txt but I am not sure there is a command line tool for this which does not involve custom
awk
or similar. I will ask in our Slack if someone has a good idea.Thank you for your help! I will try to find if I can find a way to replace in BAM files. Thanks a lot!
An "official" NCBI assembly report for
hg38
is located here:ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt
There are two columns in this file that are probably relevant to your question:
RefSeq-Accn
andUCSC-style-name
. You can use these as a key-value pairing for lookups.For example,
NC_000001.11
maps tochr1
,NT_187361.1
maps tochr1_KI270706v1_random
, and so on.You could pass this file into a Python or Perl script to map these names, also passing in your BAM as SAM (text). Reheader or rewrite by a map lookup, and then convert back from SAM to BAM.