weird characters in GATK vcf tables
3.7 years ago
ziv_attia • 0

I created a VCF file with GATK using HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs.

The output I get looks very different from the VCF 4.2 format I am used to.

For example, compare these genotype columns:

0/1:8,3:11:36:36,0,233 from vcftools

0|1:2,4:6:72:0|1:4938136_T_C:162,0,72:4938136 #from GATK
               ^_____________^        ^_____^

0|1:2,4:6:72:0:162,0,72 #how I expected it to look

This format breaks the downstream pipeline I usually work with.

Any idea what it means and how to get rid of it?

Thanks!

genomics • 1.7k views

Please show us the exact GATK commands you used. This looks like a Find & Replace operation gone wrong.

# This is the code for converting the BAM files to g.vcf

# Read one BAM path per line; quote "${file}" so unusual file names survive.
while read -r file; do
/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx24g" HaplotypeCaller \
    -R /home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta \
    -I "${file}" \
    -O "/home/pogoda/GATK_microbiome_95_geno/${file}.g.vcf.gz" \
    -ERC GVCF
# Note: this deletes the input BAM after each run
rm "${file}"
done < RG_bam_list.txt

# This is the code for creating the database

reference=/home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta
int=chr.intervals
DIR=GDBI_96_chr_complete

/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx200g -Xms200g" GenomicsDBImport \
-R $reference \
-V B2-18DNA_0010-18_0955_RG.sorted.bam.g.vcf.gz \
-V E1-18DNA_0005-18_0950_RG.sorted.bam.g.vcf.gz \
-V G6-18DNA_0047-18_0930_RG.sorted.bam.g.vcf.gz \
-V F2-18DNA_0014-18_0959_RG.sorted.bam.g.vcf.gz \
-V C6-18DNA_0043-18_0926_RG.sorted.bam.g.vcf.gz \
--genomicsdb-workspace-path "/data5/nectar/usftp21.novogene.com/raw_data/GATK_nectar/${DIR}" \
--intervals "${int}" \
--reader-threads 66

#this is the code for making the final VCF table

reference=/home/pogoda/Sunflower_sorted_bams/Han412-HO.fasta

/home/pogoda/software/gatk-4.1.6.0/gatk --java-options "-Xmx166g -Xms116g" CombineGVCFs \
-R $reference \
--variant B2-18DNA_0010-18_0955_RG.sorted.bam.g.vcf.gz \
--variant E1-18DNA_0005-18_0950_RG.sorted.bam.g.vcf.gz \
--variant G6-18DNA_0047-18_0930_RG.sorted.bam.g.vcf.gz \
--variant F2-18DNA_0014-18_0959_RG.sorted.bam.g.vcf.gz \
--variant C6-18DNA_0043-18_0926_RG.sorted.bam.g.vcf.gz \
-O CombineGVCFs.g.vcf.gz

Hope this info helps.


Thank you. For the example entries you've shown in your question, can you also show us the FORMAT field from the 2 VCF files for those entries?


GATK FORMAT field: GT:AD:DP:GQ:PGT:PID:PL:PS

vcftools FORMAT field: GT:DP:GL
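
To make the "weird characters" easier to read, here is a minimal sketch that pairs each key of the GATK FORMAT string with the matching value from the sample column quoted in the question. The helper name `pair_fields` is made up for illustration; the two strings are copied from this thread. In GATK output, PGT, PID and PS are phasing-related tags (physical phasing genotype, physical phasing ID and phase set), which is where the extra `0|1`, `4938136_T_C` and `4938136` values come from.

```shell
# FORMAT keys and sample values copied from this thread.
format='GT:AD:DP:GQ:PGT:PID:PL:PS'
sample='0|1:2,4:6:72:0|1:4938136_T_C:162,0,72:4938136'

# pair_fields (hypothetical helper): split both strings on ':' and
# print one KEY=VALUE pair per line.
pair_fields() {
    awk -v keys="$1" -v vals="$2" 'BEGIN {
        n = split(keys, k, ":")
        split(vals, v, ":")
        for (i = 1; i <= n; i++) print k[i] "=" v[i]
    }'
}

pair_fields "$format" "$sample"
```

Read this way, the GATK column is still valid VCF: it simply carries more FORMAT tags than the vcftools one.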


This is probably the reason. How do I reformat the VCF so that it contains only the GT:DP:GL fields?


I don't think GATK giving you more information is necessarily a "problem". You can always extract the info you need from what GATK gives you. You should be able to use bcftools annotate to keep/remove FORMAT fields. Extract a small subset of your GATK VCF file and try processing it with bcftools annotate.
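
As a rough sketch of that suggestion, the snippet below builds a one-record test VCF from the FORMAT and sample strings quoted in this thread and strips the phasing-related tags with `bcftools annotate -x`. The contig name `chr1`, the sample name `sample1`, the QUAL value and the file names are assumptions made up for illustration, and the bcftools step only runs if bcftools is installed.

```shell
# Hypothetical scratch files; replace with a small subset of your own VCF.
workdir=$(mktemp -d)
in_vcf="$workdir/subset.vcf"
out_vcf="$workdir/subset.trimmed.vcf"

# One-record test VCF. FORMAT and sample strings come from this thread;
# contig, sample name and QUAL are invented for the example.
{
    printf '##fileformat=VCFv4.2\n'
    printf '##contig=<ID=chr1,length=10000000>\n'
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    printf '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">\n'
    printf '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">\n'
    printf '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">\n'
    printf '##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype">\n'
    printf '##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID">\n'
    printf '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled likelihoods">\n'
    printf '##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
    printf 'chr1\t4938136\t.\tT\tC\t100\t.\t.\tGT:AD:DP:GQ:PGT:PID:PL:PS\t0|1:2,4:6:72:0|1:4938136_T_C:162,0,72:4938136\n'
} > "$in_vcf"

# Remove the PGT, PID and PS FORMAT tags and keep everything else.
if command -v bcftools >/dev/null 2>&1; then
    bcftools annotate -x FORMAT/PGT,FORMAT/PID,FORMAT/PS -o "$out_vcf" "$in_vcf"
    # Show only the FORMAT and sample columns of the trimmed record.
    grep -v '^#' "$out_vcf" | cut -f 9,10
else
    echo "bcftools not found; skipping the annotate step" >&2
fi
```

To keep only a fixed set of tags instead, the same option accepts a negated list such as `-x '^FORMAT/GT,FORMAT/DP,FORMAT/PL'`. Note that the GATK FORMAT string shown in this thread carries PL rather than GL, so a pipeline that strictly needs GL would require a separate conversion step.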


Thanks a ton! I will go through it and see how it works.


I'm sorry, but what should we see? Why should any output from GATK look like the 'old' vcftools output? What are the weird characters? What is the FORMAT column associated with each output?
