Question

How to use ExAC/gnomAD without chrN_random?

0

Entering edit mode

5.0 years ago

dodausp ▴ 190

Hi,

I am trying to estimate my sample contamination based on gatk's GetPileupSummaries / CalculateContamination best practices. This pipeline requires a .vcf file containing common germline variant sites, which here I am resourcing from the ExAC database (now migrated to gnomAD). However, when using GetPileupSummaries, I get the following error:

A USER ERROR has occurred: Input files reads and features have incompatible contigs: Found contigs with the same name but different lengths:
contig reads = chrM / 16569
contig features = chrM / 16571.

reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
features contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]

I expected this issues to appear, because the samples were aligned to the hg19 build, but without the chrN_random tables. So, my question is: is there any way that one can edit, or avoid, those chrN_random annotations on the ExAC file? (the file is quite large though - ~6.5GB, not feasible to open on a text editor).

Any help is greatly appreciated!

PS: one solution would be to realign my samples to the whole hg19 build. However, I would like to take this as a last resort, once this step is performed by the sequencing pipeline (IonTorrent), which already accounts for technical bias (i.e. in homopolymer regions) on their data pre-processing.

sequencing variant calling SNP ExAc • 1.9k views

ADD COMMENT • link 5.0 years ago by dodausp ▴ 190

0

Entering edit mode

not feasible to open on a text editor

try excel !!! :-D

ADD REPLY • link 5.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

5.0 years ago

Pierre Lindenbaum 164k

This is just a problem of incompatibles dictionaries (using gl*contigs and different version of the MT chromosome ) see problems in hg19 and b37 compatibility .

You could 'cheat' this by removing the 'non 1-22/XY' variants from your vcf file using bcftools and then inject the Gnomad/Exact dictionary into the VCF using : https://broadinstitute.github.io/picard/command-line-overview.html#UpdateVcfSequenceDictionary

ADD COMMENT • link 5.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks a lot, @Pierre Lindenbaum! (: Trying that right now. Will post an update later.

ADD REPLY • link 5.0 years ago by dodausp ▴ 190

0

Entering edit mode

I tried, and unfortunately it did not work @Pierre Lindenbaum. I did come up with something else that solved the problem, though (explained below). Thanks a lot for the suggestion! (:

ADD REPLY • link 5.0 years ago by dodausp ▴ 190

score 1 · Accepted Answer · 2019-12-13

Case solved.

Basically, I liftedOver the ExAC file to the hg19 build used on our sequecing platform (hg19 without those chrN_random annotations), and arranged it so gatk's GetPileupSummaries could interpret it. In case anybody comes across that, here are the steps:

1. LiftingOver ExAC file (picard)

java -jar picard.jar LiftoverVcf I=ExAC.r1.sites.vep.vcf.gz \
O=output_1.vcf.gz \
CHAIN=b37tohg19.chain \
REJECT=rejected_variants.vcf \
R=/genome/reference/file.fasta \
ALLOW_MISSING_FIELDS_IN_HEADER=true MAX_RECORDS_IN_RAM=100000

2. Fixing the header of the file (picard)

java -jar picard.jar FixVcfHeader \
I=output_1.vcf.gz \
O=output_2.vcf.gz

3. Extracting only biallelic entries (gatk)

gatk SelectVariants \
-R /genome/reference.fasta \
-V output_2.vcf.gz \
--restrict-alleles-to BIALLELIC \
-O output_3.vcf.gz

The output_3.vcf.gz is the one to be used on GetPileupSummaries.

In any case, thanks a lot for the suggestion @Pierre Lindenbaum!

And I hope it can help others. (: