How to turn current VCF into a VCF with genotype and rsID in same file
0
0
Entering edit mode
5.7 years ago

I currently have a VCF with data similar to the below. I can turn this VCF into a new VCF that has genotypes, and I can turn it into a new VCF that has rsIDs, but I can't seem to figure out how to turn this VCF into one that has both.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RGSM
chr10   65878   .   C   G   6.98    .   AC=2;AF=1;AN=2;DP=1;FS=0;MQ=48;QD=6.98;SOR=1.609;FractionInformativeReads=1 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB   1/1:0,1:1:1:2:41,3,0:-4.077,-0.301,0:6.977,3.987,3.978:0,34.77,37.77:0,0,1,0:0,0,1,0
chr10   66040   .   CAG C   36.3    .   AC=1;AF=0.5;AN=2;DP=4;FS=0;MQ=42.8;MQRankSum=0.736;QD=9.08;ReadPosRankSum=0.736;SOR=2.225;FractionInformativeReads=0.75;R2_5P_bias=10.37    GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB   0/1:1,2:0.667:3:9:75,0,5:-7.466,0,-0.495:36.3,0.6469,8.596:0,39,42:1,0,0,2:1,0,0,2
chr10   66957   .   C   T   6.59    .   AC=1;AF=0.5;AN=2;DP=3;FS=0;MQ=24.04;MQRankSum=-0.736;QD=2.2;ReadPosRankSum=-0.736;SOR=0.223;FractionInformativeReads=1;R2_5P_bias=-6.11 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB   0/1:1,2:0.667:3:7:40,0,28:-4.028,0,-2.834:6.591,1.078,32.42:0,34.77,37.77:0,1,1,1:1,0,1,1
chr10   71282   .   A   C   13.64   .   AC=2;AF=1;AN=2;DP=2;FS=0;MQ=17;QD=6.82;SOR=0.693;FractionInformativeReads=1 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB   1/1:0,2:1:2:4:49,6,0:-4.946,-0.599,0:13.64,4.95,1.959:0,34.77,37.77:0,0,1,1:0,0,1,1

For reference, this is the kind of format I would like the final VCF to be in (variants are not the same across these two example but you catch my drift):

#chrom  pos id  ref alt qual    filter  info    format  GENOTYPE    genotype_final
chr1    734462  rs12564807  G   A   .   .   .   GT  1/1 GG
chr1    752721  rs3131972   A   G   .   .   .   GT  1/1 AA
chr1    760998  rs148828841 C   .   .   .   .   GT  0/0 CC
chr1    776546  rs12124819  A   .   .   .   .   GT  0/0 AA

I've used bcftools annotate command in conjunction with annotate Homo_sapiens_assembly38.dbsnp138.vcf to get rsIDs from my file, and I've used bcftools query command to try and pull the genotypes out of the original VCF file. Both of these things have been successful on their own, but I just can't figure out how to combine the two of them together to get the data into the format shown above. Any help would be very appreciated, I've spent a lot of time on this so far to no avail.

vcf bcftools vcftools • 3.2k views
ADD COMMENT
0
Entering edit mode

Hello austinkjensen5 ,

just use the output file of bcftools annotate as the input file for bcftools query. You could also chain your commands together with a pipe, e.g. bcftools annotate -Ou ... | bcftools query ...

Please note that your final output is not a valid vcf file.

fin swimmer

ADD REPLY
0
Entering edit mode

Hey fin swimmer, Thank you for the swift help/reply. I'm afraid I may need you to spell this out for me, here is what I just tried and my final output:

bcftools annotate -Ou Homo_sapiens_assembly38.dbsnp138.vcf no_chr_<file-name>.vcf.gz | bcftools query -f '%ID\t  GQ:[ %GQ] \t TGT:[ %TGT]\n' <my-file>.vcf.gz

.     GQ: 2      TGT: G/G
.     GQ: 9      TGT: CAG/C
.     GQ: 7      TGT: C/T

etc

I can confirm that rsIDs are in the Homo_sapiens_assembly38.dbsnp138.vcf.gz, but is not in <my-file>.vcf.gz.

ADD REPLY
0
Entering edit mode

Hello again,

you are missing a -c ID in your bcftools annotate command to tell bcftools to annotate the ID column. Also make sure that the names of the chromosomes (with or without chr) are the same in your vcf and the annotation vcf file.

fin swimmer

ADD REPLY
0
Entering edit mode

As in:

bcftools annotate -c ID -Ou Homo_sapiens_assembly38.dbsnp138.vcf <my-file>.vcf.gz | bcftools query -f '%CHROM\t %POS\t  %ID\t  GQ:[ %GQ] \t TGT:[ %TGT]\n' <my-file>.vcf.gz

This resulted in no difference in my output, though I am currently piping this output into a head like this:

bcftools annotate -c ID -Ou Homo_sapiens_assembly38.dbsnp138.vcf <my-file>.vcf.gz | bcftools query -f '%CHROM\t %POS\t  %ID\t  GQ:[ %GQ] \t TGT:[ %TGT]\n' <my-file>.vcf.gz | head -n 3500

because the file is too large to print out entirely to the terminal. I'm starting to think that maybe the output is actually gathering some rsIDs for some of the lines, but that I'm only looking at a small portion that might not have known rsIDs. Is this possible or is my command still wrong? I can confirm that the names of the chromosomes are the same between <my-file> and Homo_sapiens_assembly38.dbsnp138.vcf . I am also wondering if it matters if my reference file is not gzipped?

ADD REPLY
0
Entering edit mode

Hey finswimmer, I've come back to this problem again and I've sincerely not been able to get this to work even with your advice, do you think that there is any more help that you may be able to provide?

ADD REPLY

Login before adding your answer.

Traffic: 1669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6