import csv
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.Seq import MutableSeq
from sys import argv
import sys
script, reference, changes, out = argv
changes_=open(changes, 'rb')
change = csv.reader(changes_, delimiter='\t')
for line in change:
pos = line[0]
base = line[1]
with open(out, 'w') as output:
with open(reference, 'rb') as fasta_file:
for seq_record in SeqIO.parse(fasta_file, "fasta"):
seq_record[pos]=base
SeqIO.write(fasta_file, output, 'fasta')
I know this is not quite right but not sure how to make it work
You don't specify how many sequences are in your file, or how to correlate the positions/changes to each sequence. I suggest adding header information, if there is more than one sequence to your position table. The code below assumes multiple sequences, but will work with only one as well.
First, change this 1 T 2 C 10 T 50 G
to this (tab-delimited, pos.txt):
head1 1 T
head1 2 C
head1 10 T
head1 50 G
....
head2 ....
Linearize fasta (a preference I have. You can still read in the sequences with Biopython).
#!/usr/bin/env python
from collections import defaultdict
with open('pos.txt', 'r') as f:
pos = defaultdict(list)
for line in f:
pos[line.strip().split('\t')[0]].append((int(line.strip().split('\t')[1]), line.strip().split('\t')[2]))
with open('input.fasta', 'r') as fasta:
with open('input_corr.fasta', 'w') as out:
for line in fasta:
if line.startswith(">"):
h = line.strip().split('>')[1]
s = list(next(fasta).strip())
if h in pos:
for n in pos[h]:
s[n[0]-1] = n[1]
out.write('>' + h + '\n' + ''.join(s))
This was a super helpful fix and the only one I got to work for the same issue of needing to change SNP bp in a fasta based on position.txt in each entry of a multiline fasta file. So thank you! But needed a minor fix to make sure the >header line for each was followed by the seq string on separate lines: the final line should read: out.write('>' + h + '\n' + ''.join(s) + '\n')
It appears that the original question was a bit vague. The code I provided will replace the first entry in the FASTA file. I'll update the Gist with a bit more information and adapt it to work on a multiFASTA file.
post what have you tried in biopython.
see below, thanks...
I know this is not quite right but not sure how to make it work
you don't need to import so many modules. Also, you need to subtract 1 from your positions, because python starts at 0.
This is the solution you need but you need to work a little bit to prepare input files
Introducing Known Mutations (From A Vcf) Into A Fasta File