Analyse VCF file to find snps common to all samples in lineage
1
0
Entering edit mode
7.9 years ago

I have a VCF file with 22 different samples in it. Looking at the example below I know that all of the 2s come from the same lineage from looking at my phylo tree. What can I use to query the vcf so that I can group different isolates from the same lineage together and find out when there is a SNP that affects all isolates from that lineage?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NC000962.fasta.ref  19b0_125_contigs.fa 25c6_159_contigs.fa f0d9_151_contigs.fa 7da4_159_contigs.fa ee25_141_contigs.fa c64b_157_contigs.fa fbd6_161_contigs.fa ed1e_155_contigs.fa 6a63_data_151_contigs.fa    2fec_151_contigs.fa 26af_147_contigs.fa f206_153_contigs.fa e2f1_165_contigs.fa a570_151_contigs.fa df23_151_contigs.fa 1f3c_115_contigs.fa 6213_161_contigs.fa 7ea3_147_contigs.fa 46a56_data_155_contigs.fa   dbc3_159_contigs.fa b1fa_149_contigs.fa

NC_000962   72  GTTCAGGCTT.CACCACAGTG   C   T   40  PASS    NA  GT  1   1   2   1   1   2   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1
SNP VCF • 2.2k views
ADD COMMENT
0
Entering edit mode

when there is a SNP that affects all isolates from that lineage

Can you perhaps expand this request? I am not sure I understood what you need completely.

ADD REPLY
0
Entering edit mode

so for example I have 22 bacterial genomes which I know from creating a snp based phylogeny tree cluster into 7 distinct lineages. I can see what these snps are on a core genome alignment multi-fasta file but this is obviously a very slow way to do it so I want to try to get this information from the VCF file which contains the snps from all the samples. Ie somthing a bit like VCFtools --diff-site but working on the columns above which show 1 or 2 depending on whether the sample has the ref necleotide or the snp.

(I can think of a way of doing this with Python but just wanted to know if someone had already done it!)

ADD REPLY
0
Entering edit mode

I can't help with standard solutions, if it was me I would personally go with a lot of awk, sed, grep, and all the other brothers of them :)

ADD REPLY
0
Entering edit mode
7.7 years ago

(no tested) using vcffilterjs and the following script:

var pop1=["S1","S2","S3"];
var pop2=["S4","S5","S6"];


function accept(v)
    {
    var pop2allele=[null,null];
    for(var i=0;i< 2;++i)
        {
        var population = (i==0?pop1:pop2);
        for(var j=0;j< population.length;++j) {
            var g=v.getGenotype(population[j]);
            if(g==null || g.getAlleles().size()!=1) continue;
            var allele=g.getAlleles().get(0);
            if(allele.isNoCall()) continue;
            if(pop2allele[i]==null)
                {
                pop2allele[i]=allele;
                }
            else if(!pop2allele[i].equals(allele))
                {
                return false;
                }
            }
        }
    if(pop2allele[0]==null ||pop2allele[1]==null) return false;
    return !pop2allele[0].equals(pop2allele[1]);
    }
accept(variant);
ADD COMMENT

Login before adding your answer.

Traffic: 2291 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6