Question

Analyse VCF file to find snps common to all samples in lineage

0

Entering edit mode

8.4 years ago

samuel.lipworth ▴ 30

I have a VCF file with 22 different samples in it. Looking at the example below I know that all of the 2s come from the same lineage from looking at my phylo tree. What can I use to query the vcf so that I can group different isolates from the same lineage together and find out when there is a SNP that affects all isolates from that lineage?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NC000962.fasta.ref  19b0_125_contigs.fa 25c6_159_contigs.fa f0d9_151_contigs.fa 7da4_159_contigs.fa ee25_141_contigs.fa c64b_157_contigs.fa fbd6_161_contigs.fa ed1e_155_contigs.fa 6a63_data_151_contigs.fa    2fec_151_contigs.fa 26af_147_contigs.fa f206_153_contigs.fa e2f1_165_contigs.fa a570_151_contigs.fa df23_151_contigs.fa 1f3c_115_contigs.fa 6213_161_contigs.fa 7ea3_147_contigs.fa 46a56_data_155_contigs.fa   dbc3_159_contigs.fa b1fa_149_contigs.fa

NC_000962   72  GTTCAGGCTT.CACCACAGTG   C   T   40  PASS    NA  GT  1   1   2   1   1   2   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1

SNP VCF • 2.3k views

ADD COMMENT • link updated 8.2 years ago by Pierre Lindenbaum 166k • written 8.4 years ago by samuel.lipworth ▴ 30

0

Entering edit mode

when there is a SNP that affects all isolates from that lineage

Can you perhaps expand this request? I am not sure I understood what you need completely.

ADD REPLY • link 8.4 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

so for example I have 22 bacterial genomes which I know from creating a snp based phylogeny tree cluster into 7 distinct lineages. I can see what these snps are on a core genome alignment multi-fasta file but this is obviously a very slow way to do it so I want to try to get this information from the VCF file which contains the snps from all the samples. Ie somthing a bit like VCFtools --diff-site but working on the columns above which show 1 or 2 depending on whether the sample has the ref necleotide or the snp.

(I can think of a way of doing this with Python but just wanted to know if someone had already done it!)

ADD REPLY • link 8.4 years ago by samuel.lipworth ▴ 30

0

Entering edit mode

I can't help with standard solutions, if it was me I would personally go with a lot of awk, sed, grep, and all the other brothers of them :)

ADD REPLY • link 8.4 years ago by Matteo Schiavinato ★ 3.7k

score 0 · Answer 1 · 2017-03-18

(no tested) using vcffilterjs and the following script:

var pop1=["S1","S2","S3"];
var pop2=["S4","S5","S6"];


function accept(v)
    {
    var pop2allele=[null,null];
    for(var i=0;i< 2;++i)
        {
        var population = (i==0?pop1:pop2);
        for(var j=0;j< population.length;++j) {
            var g=v.getGenotype(population[j]);
            if(g==null || g.getAlleles().size()!=1) continue;
            var allele=g.getAlleles().get(0);
            if(allele.isNoCall()) continue;
            if(pop2allele[i]==null)
                {
                pop2allele[i]=allele;
                }
            else if(!pop2allele[i].equals(allele))
                {
                return false;
                }
            }
        }
    if(pop2allele[0]==null ||pop2allele[1]==null) return false;
    return !pop2allele[0].equals(pop2allele[1]);
    }
accept(variant);