Hi
We have analyzed some data-sets for genome analysis using a couple of different pipelines : bwa-GATK, GSNAP-Samtools and for few soybean genomes. Although the final set of output from both the pipeline is in vcf-format which is a widely accepted format for further downstream analysis of SNP-effect prediction etc our collaborators want to get the output in different format that they have generated in the past. This is how that output looks like.
chr position REFERENCE IA3023_ALLELE J105_3_4_ALLELE M20_2_5_2_ALLELE CL0J095_4_6_ALLELE CL0J173_6_8_ALLELE
Gm01 975 G G G G G G
Gm01 3135 T H T T T T
Gm01 3727 T T T T T -
Gm01 3745 G G - - - -
Gm01 3748 G H - - G G
Here - (dash) indicates positions at which there was no reads mapped for that particular genotype. H,V,K etc denote ambiguous nucleotides. I am a newbie when it comes to vcf files as I have not perused all the information available at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41 but I would be interested to know How can I generate something similar using the vcf files and what tools exist to do that already if any. I have both single-sample and multi-sample vcf files. vcf-tools has a lot of options for getting info from vcf files but its documentation is rather scanty so I am not sure if it would fit the bill thus I would really appreciate any insights from folks out there who have done something similar
BTW I just found out vcf-to-tab program in vcf-tools gives the output very close to what I want.
vcf-to-tab <input.vcf > out.tab
'#CHROM POS REF HN001 HN002 HN003 HN004 HN005 HN006 HN007 HN008 HN009 HN010 HN011 HN012 HN013 HN014 HN015 HN018 HN019 HN020 HN021 HN022 HN023 HN024 HN026 HN027 Gm01 5 T T/T T/G T/G ./. T/T T/G T/T ./. T/T ./. T/G T/T T/G T/G T/G T/G T/G T/G T/T T/G T/T T/T T/G T/T Gm01 8 T G/G T/G T/G ./. T/T T/G T/T ./. T/G ./. T/G T/G T/G T/G T/G T/G T/G T/G G/G T/G T/T G/G T/G T/G Gm01 13 G G/A G/G G/G G/G G/G G/G G/G ./. G/G A/A G/G G/G G/G G/G G/G G/G G/G G/G G/A G/G G/A G/A G/G G/A '
Yeah Thanks I had figured that much out I don't know how to substitute actual nt's for the allele calls. I am going through the vcf specs rt now to determine that thanks
If you keep the reference and alt columns you can convert easily between the numbers and the actual nucleotides for the allele calls. You may run in to some difficulties when you have multiple alternative alleles at that position though. If you are comfortable coding in Python you may find the PyVCF package easier to work with to convert your vcf files in to the format you want. I use it routinely to parse VCF files into tab delimited variant reports for collaborators.
Thanks for the reply. I was wondering that only way one could have H in the variant call is when there is a triploid call at that loci for that particular genotype. Let me know if I am correct.