I currently have a VCF with data similar to the below. I can turn this VCF into a new VCF that has genotypes, and I can turn it into a new VCF that has rsIDs, but I can't seem to figure out how to turn this VCF into one that has both.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RGSM
chr10 65878 . C G 6.98 . AC=2;AF=1;AN=2;DP=1;FS=0;MQ=48;QD=6.98;SOR=1.609;FractionInformativeReads=1 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/1:0,1:1:1:2:41,3,0:-4.077,-0.301,0:6.977,3.987,3.978:0,34.77,37.77:0,0,1,0:0,0,1,0
chr10 66040 . CAG C 36.3 . AC=1;AF=0.5;AN=2;DP=4;FS=0;MQ=42.8;MQRankSum=0.736;QD=9.08;ReadPosRankSum=0.736;SOR=2.225;FractionInformativeReads=0.75;R2_5P_bias=10.37 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 0/1:1,2:0.667:3:9:75,0,5:-7.466,0,-0.495:36.3,0.6469,8.596:0,39,42:1,0,0,2:1,0,0,2
chr10 66957 . C T 6.59 . AC=1;AF=0.5;AN=2;DP=3;FS=0;MQ=24.04;MQRankSum=-0.736;QD=2.2;ReadPosRankSum=-0.736;SOR=0.223;FractionInformativeReads=1;R2_5P_bias=-6.11 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 0/1:1,2:0.667:3:7:40,0,28:-4.028,0,-2.834:6.591,1.078,32.42:0,34.77,37.77:0,1,1,1:1,0,1,1
chr10 71282 . A C 13.64 . AC=2;AF=1;AN=2;DP=2;FS=0;MQ=17;QD=6.82;SOR=0.693;FractionInformativeReads=1 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/1:0,2:1:2:4:49,6,0:-4.946,-0.599,0:13.64,4.95,1.959:0,34.77,37.77:0,0,1,1:0,0,1,1
For reference, this is the kind of format I would like the final VCF to be in (variants are not the same across these two example but you catch my drift):
#chrom pos id ref alt qual filter info format GENOTYPE genotype_final
chr1 734462 rs12564807 G A . . . GT 1/1 GG
chr1 752721 rs3131972 A G . . . GT 1/1 AA
chr1 760998 rs148828841 C . . . . GT 0/0 CC
chr1 776546 rs12124819 A . . . . GT 0/0 AA
I've used bcftools annotate command in conjunction with annotate Homo_sapiens_assembly38.dbsnp138.vcf to get rsIDs from my file, and I've used bcftools query command to try and pull the genotypes out of the original VCF file. Both of these things have been successful on their own, but I just can't figure out how to combine the two of them together to get the data into the format shown above. Any help would be very appreciated, I've spent a lot of time on this so far to no avail.
Hello austinkjensen5 ,
just use the output file of
bcftools annotate
as the input file forbcftools query
. You could also chain your commands together with a pipe, e.g.bcftools annotate -Ou ... | bcftools query ...
Please note that your final output is not a valid vcf file.
fin swimmer
Hey fin swimmer, Thank you for the swift help/reply. I'm afraid I may need you to spell this out for me, here is what I just tried and my final output:
etc
I can confirm that rsIDs are in the Homo_sapiens_assembly38.dbsnp138.vcf.gz, but is not in <my-file>.vcf.gz.
Hello again,
you are missing a
-c ID
in yourbcftools annotate
command to tellbcftools
to annotate the ID column. Also make sure that the names of the chromosomes (with or withoutchr
) are the same in your vcf and the annotation vcf file.fin swimmer
As in:
This resulted in no difference in my output, though I am currently piping this output into a head like this:
because the file is too large to print out entirely to the terminal. I'm starting to think that maybe the output is actually gathering some rsIDs for some of the lines, but that I'm only looking at a small portion that might not have known rsIDs. Is this possible or is my command still wrong? I can confirm that the names of the chromosomes are the same between <my-file> and Homo_sapiens_assembly38.dbsnp138.vcf . I am also wondering if it matters if my reference file is not gzipped?
Hey finswimmer, I've come back to this problem again and I've sincerely not been able to get this to work even with your advice, do you think that there is any more help that you may be able to provide?