change specific bases in a fasta file
3
0
Entering edit mode
7.3 years ago

Hi,

I have a list of positions I would like to change in a fasta file with the base I would like to change to tab separated as follows:

1 T 2 C 10 T 50 G

etc

so that eg: ATGTGTC...

becomes TCGTGTC...

There must be a way to do this but I can't do it - have tried to code it in biopython but can't manage.. any ideas?

fasta biopython • 7.0k views
ADD COMMENT
0
Entering edit mode

post what have you tried in biopython.

ADD REPLY
0
Entering edit mode

see below, thanks...

ADD REPLY
0
Entering edit mode
import csv
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.Seq import MutableSeq
from sys import argv
import sys
script, reference, changes, out = argv
changes_=open(changes, 'rb')
change = csv.reader(changes_, delimiter='\t')
for line in change:
        pos = line[0]
        base = line[1]
        with open(out, 'w') as output:
                with open(reference, 'rb') as fasta_file:
                        for seq_record in SeqIO.parse(fasta_file, "fasta"):
                                seq_record[pos]=base
                                SeqIO.write(fasta_file, output, 'fasta')

I know this is not quite right but not sure how to make it work

ADD REPLY
0
Entering edit mode

you don't need to import so many modules. Also, you need to subtract 1 from your positions, because python starts at 0.

ADD REPLY
0
Entering edit mode

This is the solution you need but you need to work a little bit to prepare input files

Introducing Known Mutations (From A Vcf) Into A Fasta File

ADD REPLY
3
Entering edit mode
7.3 years ago
st.ph.n ★ 2.7k

You don't specify how many sequences are in your file, or how to correlate the positions/changes to each sequence. I suggest adding header information, if there is more than one sequence to your position table. The code below assumes multiple sequences, but will work with only one as well.

First, change this 1 T 2 C 10 T 50 G

to this (tab-delimited, pos.txt):

head1 1 T
head1 2 C 
head1 10 T 
head1 50 G
....
head2 ....

Linearize fasta (a preference I have. You can still read in the sequences with Biopython).

#!/usr/bin/env python
from collections import defaultdict

with open('pos.txt', 'r') as f:
        pos = defaultdict(list)
        for line in f:
                pos[line.strip().split('\t')[0]].append((int(line.strip().split('\t')[1]), line.strip().split('\t')[2]))

with open('input.fasta', 'r') as fasta:
        with open('input_corr.fasta', 'w') as out:
                for line in fasta:
                        if line.startswith(">"):
                                h = line.strip().split('>')[1]
                                s = list(next(fasta).strip())
                                if h in pos:
                                        for n in pos[h]:
                                                s[n[0]-1] = n[1]
                                        out.write('>' + h + '\n' + ''.join(s))
ADD COMMENT
0
Entering edit mode

This was a super helpful fix and the only one I got to work for the same issue of needing to change SNP bp in a fasta based on position.txt in each entry of a multiline fasta file. So thank you! But needed a minor fix to make sure the >header line for each was followed by the seq string on separate lines: the final line should read: out.write('>' + h + '\n' + ''.join(s) + '\n')

ADD REPLY
2
Entering edit mode
7.3 years ago

You can use the MutableFastaRecord in pyfaidx for this exact purpose.

The mutable Fasta instance will modify your file in-place (commits changes directly back to disk) so be careful which file you're editing.

ADD COMMENT
0
Entering edit mode

I know that it has been a while.

This is a very nice solution. But this has one problem. It does not include chromosomes. So, how can it actually find the right position to replace?

ADD REPLY
0
Entering edit mode

It appears that the original question was a bit vague. The code I provided will replace the first entry in the FASTA file. I'll update the Gist with a bit more information and adapt it to work on a multiFASTA file.

ADD REPLY
0
Entering edit mode
7.3 years ago

Thanks for the help!

I also managed to write my own version which seems to work to!

https://github.com/samlipworth/base_changer

ADD COMMENT
0
Entering edit mode

and for the same position in different chromossomes?

ADD REPLY

Login before adding your answer.

Traffic: 1928 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6