Question

GATK BaseRecalibrator Step

0

Entering edit mode

5.0 years ago

Seigfried ▴ 80

Hello

I have been struggling with GATK 4 and feeling an acute lack of the command line pipelines available in previous versions. I took a look at the Tool Index and tried to make out the commands available from the https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl section

I have RNASeq data and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket and currently I have processed around 600 samples with this same reference and followed till the Split n Cigar step with no problem

In the BaseRecalibrator step I get the following error :

a) GATK version

The Genome Analysis Toolkit (GATK) v4.1.7.0 HTSJDK Version: 2.21.2 Picard Version: 2.21.9

b) Exact GATK commands used

GATK BaseRecalibrator -I input.bam -R reference.fasta --known-sites resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf --known-sites resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf --known-sites resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf -O recal_data.table

c) The entire error log if applicable.

A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found. reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3

I downloaded these files from the Resource Bundle. Could you please tell me why they aren't working. Since re-mapping at this stage is not preferable and I am sure I am using the same reference as I used for all steps before this.

Please help. Thank you very much

GATK • 3.8k views

ADD COMMENT • link 5.0 years ago by Seigfried ▴ 80

0

Entering edit mode

and I have used a GrCh38.p13 genomic reference not in the GATK resource bucket

Yep, there you have it. Your reference is NCBI nomenclature (chromosome names) and your bundle is not. Either remap the files with a different reference, or try to rename the chromosomes in every of the BAM files or try to find a way to convert the bundle resources to the nomenclature of the bundle.

ADD REPLY • link 5.0 years ago by ATpoint 88k

0

Entering edit mode

So when the error says that the feature contigs are different

features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, .....

I thought dbSNP was from NCBI. So atleast that should have worked?

ADD REPLY • link 5.0 years ago by Seigfried ▴ 80

0

Entering edit mode

reference contigs = [NC_000001.11, NT_187361.1, NT_187362.1, NT_187363.1, NT_187364.1, NT_187365.1, NT_187366.1, NT_187367.1, NT_187368.1, NT_187369.1, NC_000002.12, NT_187370.1, NT_187371.1, NC_000003.12, NT_167215.1, NC_000004.12, NT_113793.3

These are your contig names. Don't ask me why NCBI thought that distributing reference genomes with these odd NT_... was a good idea, I personally find it terrible, but this is what you have.

ADD REPLY • link 5.0 years ago by ATpoint 88k

0

Entering edit mode

@ATpoint Thank you for the replies. I understand the issue and I am feeling hopeless. Is there any way at all without remapping? Any tool I can use. As of this moment I don't even know which contig in Reference == which contig in Features.

ADD REPLY • link 5.0 years ago by Seigfried ▴ 80

0

Entering edit mode

To be sure, can you from one of the BAM file post the output of samtools idxstats that.bam? Don't worry, we find a solution.

ADD REPLY • link 5.0 years ago by ATpoint 88k

0

Entering edit mode

samtools idxstats SRR7621519_splitncigar.bam
NC_000001.11    248956422       164233  0
NT_187361.1     175055  9       0
NT_187362.1     32032   0       0
NT_187363.1     127682  0       0
...
NT_187686.1     215732  0       0
NT_187687.1     170537  0       0
NT_113949.2     177381  0       0
NC_012920.1     16569   44480   0
*       0       0       0

ADD REPLY • link 5.0 years ago by Seigfried ▴ 80

1

Entering edit mode

The conversion table you need is probably this one here https://github.com/dpryan79/ChromosomeMappings/blob/master/GRCh38_RefSeq2UCSC.txt but I am not sure there is a command line tool for this which does not involve custom awk or similar. I will ask in our Slack if someone has a good idea.

ADD REPLY • link 5.0 years ago by ATpoint 88k

0

Entering edit mode

Thank you for your help! I will try to find if I can find a way to replace in BAM files. Thanks a lot!

ADD REPLY • link 5.0 years ago by Seigfried ▴ 80

1

Entering edit mode

An "official" NCBI assembly report for hg38 is located here:

ftp://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt

There are two columns in this file that are probably relevant to your question: RefSeq-Accn and UCSC-style-name. You can use these as a key-value pairing for lookups.

For example, NC_000001.11 maps to chr1, NT_187361.1 maps to chr1_KI270706v1_random, and so on.

You could pass this file into a Python or Perl script to map these names, also passing in your BAM as SAM (text). Reheader or rewrite by a map lookup, and then convert back from SAM to BAM.

ADD REPLY • link 5.0 years ago by Alex Reynolds 36k