Dear Biostars community,
My aim is to implement the following pipeline (albeit modified) in order to generate phylogenetic trees from whole-genome sequencing of multiple closely related bacterial samples:
http://userweb.eng.gla.ac.uk/cosmika.goswami/snp_calling/SNPCalling.html
However - I have a problem when I reach the stage of the pipeline where I use vcf2phyloviz.py (https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py). The aim of this script is to generate a pseudo-alignment of called SNPs from a merged VCF file e.g.
>ref
GCGGCNGTGGCGAGTGGCAGG
>Sample1
GCGGCATTTGCTGATGGTAGG
>Sample3
TTAATAGTGATGAGCTACGAA
>Sample2
GTAGCNGAGATGAGCTACGAA
The problem I am having is that logic in the vcf2phyloviz script seems to indicate that a SNP base position will not be appended to the alignment unless all sample sequences (contained in the VCF) have a nucleotide differing from the reference (i.e. all samples contain a SNP - though not all necessarily the same SNP).
This would force a psuedosequence of SNPs in which all nucleotides in the sample sequence would necessarily differ from the reference nucleotides
I am not sure if this is the intended behaviour of the script, or I am just running the script incorrectly. I have tried for many hours to identify the potential error that I could have made, but I can't seem to identify any.
The command that I run is as follows;
python2.7 vcf2phyloviz.py --metadata metadata.txt --output_prefix snps example.vcf
The first record of my input file looks like this (remaining header information omitted):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-1-1.A.fq.sam.bam.variant8 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-1-2.A.fq.sam.bam.variant4 bam/Escher
ichia_coli.HUSEC2011CHR1.dna.root-1-1.A.fq.sam.bam.variant10 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-2-1.A.fq.sam.bam.variant12 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-2-2.A.fq.sam.bam.variant11 bam/
Escherichia_coli.HUSEC2011CHR1.dna.root-1-2.A.fq.sam.bam.variant9 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1.A.fq.sam.bam.variant14 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-1-1.A.fq.sam.bam.variant3 bam/Escherichi
a_coli.HUSEC2011CHR1.dna.root-2-1-2.A.fq.sam.bam.variant6 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-1.A.fq.sam.bam.variant15 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-2-1.A.fq.sam.bam.variant5 bam/Escherichia_coli.HUS
EC2011CHR1.dna.root-2-2-2.A.fq.sam.bam.variant2 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-2.A.fq.sam.bam.variant13 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2.A.fq.sam.bam.variant7 bam/Escherichia_coli.HUSEC2011CHR1
.dna.root.A.fq.sam.bam.variant
Chromosome 191 . T G 212.68 . AB=0;ABP=0;AC=1;AF=1.00;AN=1;AO=10;CIGAR=1X;SDP=10;SDPB=10;SDPRA=0;EPP=3.87889;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=48.972;PAIRED=1;PAIRE
DR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=282;QR=0;RO=0;RPL=4;RPP=3.87889;RPPR=0;RPR=6;RUN=1;SAF=10;SAP=24.725;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp;set=variant3 GT:AO:SDP:SDPR:PL:QA:QR:RO .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:
.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. 1:10:10:10,10:257,0:282:0:0 .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:.
So for, example, this record would be omitted for the alignment because not all samples contains a SNPs (i.e. a nucleotide differing from the reference)
The logic in the vcf2phyloviz.py script which I think is casing the omission of these records are as follows:
for record in vcf_records:
samples_to_use = [sample for sample in record.samples if sample.sample not in ignore_samples]
bases = [sample.gt_bases for sample in samples_to_use if sample.gt_bases is not None]
if len(bases) != len(samples_to_use):
continue
The problem seems to be that if there is no entry for a particular sample of a given record - in my particular case - sample.gt_bases does not return any nucleotides - as a result, the subsequent test for a full base list fails.
So my question is as follows:
Is this the intended behaviour of the vcf2phyloviz.py script, or am I making some form of mistake?
Thanks,
NB: I have at the example case given in the URL I have given at the top of this post, and their example output they give for use of vcf2phyloviz indicates that this is not the expected behaviour of this script i.e. they produce alignments containing nucleotides which match the reference sequence