I slightly modified the script provided by finswimmer. Now it's compatible with python 2.X, writes output to the text file instead of STDOUT. Besides, ID of each SNP is present in output file as separate field. Probably, some additional optimisation and syntax polishing are required, but even this version works fine for me.
from __future__ import print_function
import pysam
f = open("SNP_CDS.txt", "w")
vcf = pysam.VariantFile("test.vcf")
genome = pysam.FastaFile("genome.fasta")
L=[]
flank=10
for record in vcf:
seq = genome.fetch(record.chrom, record.pos-1-flank, record.pos-1+len(record.ref)+flank)
t=(seq, record.chrom, str(record.pos), record.id, record.ref, record.alts[0])
L.append(t)
for i in L:
seq,chr,pos,id,ref,alt=i
sequence='{}[{}/{}]{}'.format(seq[:flank],ref,alt, seq[flank+len(ref):])
line='{}\t{}\t{}\t{}'.format(chr, pos, id, sequence)
f.write(line+'\n')
f.close()
@finswimmer Hi! Many thanks. I need some time for the code understanding.
I added some comments to the code. If something isn't clear just ask.
Hi swimmer,
thanks so much for this solution! I used it in my thesis project and would like to reference it properly and cite the source. Just wondering if you have a GitHub repository or something similar that I could refer to? many thanks Laura