7.9 years ago
samuel.lipworth
I have a VCF file with 22 different samples in it. In the example below, I know from my phylogenetic tree that all of the samples marked 2 come from the same lineage. What can I use to query the VCF so that I can group isolates from the same lineage together and find the SNPs that are shared by all isolates in that lineage?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NC000962.fasta.ref 19b0_125_contigs.fa 25c6_159_contigs.fa f0d9_151_contigs.fa 7da4_159_contigs.fa ee25_141_contigs.fa c64b_157_contigs.fa fbd6_161_contigs.fa ed1e_155_contigs.fa 6a63_data_151_contigs.fa 2fec_151_contigs.fa 26af_147_contigs.fa f206_153_contigs.fa e2f1_165_contigs.fa a570_151_contigs.fa df23_151_contigs.fa 1f3c_115_contigs.fa 6213_161_contigs.fa 7ea3_147_contigs.fa 46a56_data_155_contigs.fa dbc3_159_contigs.fa b1fa_149_contigs.fa
NC_000962 72 GTTCAGGCTT.CACCACAGTG C T 40 PASS NA GT 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Can you perhaps expand this request? I am not sure I understood what you need completely.
So, for example, I have 22 bacterial genomes which, based on a SNP phylogeny, cluster into 7 distinct lineages. I can see what these SNPs are in a core genome alignment multi-FASTA file, but that is a very slow way to do it, so I want to get this information from the VCF file, which contains the SNPs from all the samples. I.e. something a bit like VCFtools --diff-site, but working on the sample columns above, which show 1 or 2 depending on whether the sample has the reference nucleotide or the SNP.
(I can think of a way of doing this with Python but just wanted to know if someone had already done it!)
I can't help with a standard solution. If it were me, I would personally go with a lot of awk, sed, grep and their relatives :)
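For anyone who does go the Python route mentioned above, here is a minimal sketch of one way it could look. It is not an existing tool. It assumes the genotype columns hold a single code per sample, with 2 meaning "has the SNP" as in the example row (standard VCF normally uses 0 for the reference and 1 for the first ALT allele, so adjust `alt_code` to match your file). The function name `lineage_snps`, the `lineages` dictionary and the label `lineage_A` are made up for illustration; the sample names are the first eight columns of the header in the question.

```python
def lineage_snps(vcf_lines, lineages, alt_code="2", exclusive=True):
    """Yield (chrom, pos, lineage) for sites where every sample in a
    lineage carries alt_code. With exclusive=True, also require that
    no sample outside the lineage carries it."""
    samples = []
    for line in vcf_lines:
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split()  # splits on tabs or spaces
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        calls = dict(zip(samples, fields[9:]))
        carriers = {s for s, gt in calls.items() if gt == alt_code}
        for name, members in lineages.items():
            members = set(members)
            if not members <= carriers:
                continue  # at least one isolate in the lineage lacks the SNP
            if exclusive and carriers != members:
                continue  # the SNP is also present outside the lineage
            yield fields[0], int(fields[1]), name


# First eight sample columns of the example in the question.
demo_vcf = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NC000962.fasta.ref "
    "19b0_125_contigs.fa 25c6_159_contigs.fa f0d9_151_contigs.fa "
    "7da4_159_contigs.fa ee25_141_contigs.fa c64b_157_contigs.fa "
    "fbd6_161_contigs.fa",
    "NC_000962 72 GTTCAGGCTT.CACCACAGTG C T 40 PASS NA GT 1 1 2 1 1 2 1 2",
]

# Hypothetical lineage assignment taken from the phylogeny.
demo_lineages = {
    "lineage_A": [
        "25c6_159_contigs.fa",
        "ee25_141_contigs.fa",
        "fbd6_161_contigs.fa",
    ],
}

print(list(lineage_snps(demo_vcf, demo_lineages)))
# [('NC_000962', 72, 'lineage_A')]
```

Setting `exclusive=False` reports any site where the whole lineage has the SNP, even if isolates from other lineages share it. For a real file you would pass an open file handle instead of the list, and build `lineages` from however you store the tree clusters.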