change specific bases in a fasta file

0


7.8 years ago

samuel.lipworth ▴ 30

Hi,

I have a tab-separated list of positions in a FASTA file, each paired with the base I would like to change it to, as follows:

1 T
2 C
10 T
50 G

etc

so that eg: ATGTGTC...

becomes TCGTGTC...

There must be a way to do this, but I can't work it out. I have tried to code it in Biopython without success. Any ideas?

fasta biopython • 7.7k views

ADD COMMENT • link updated 2.5 years ago by erdagnese • 0 • written 7.8 years ago by samuel.lipworth ▴ 30

0


Please post what you have tried in Biopython.

ADD REPLY • link 7.8 years ago by GouthamAtla 12k

0


see below, thanks...

ADD REPLY • link 7.8 years ago by samuel.lipworth ▴ 30

0


import csv
from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio.Seq import MutableSeq
from sys import argv
import sys
script, reference, changes, out = argv
changes_=open(changes, 'rb')
change = csv.reader(changes_, delimiter='\t')
for line in change:
        pos = line[0]
        base = line[1]
        with open(out, 'w') as output:
                with open(reference, 'rb') as fasta_file:
                        for seq_record in SeqIO.parse(fasta_file, "fasta"):
                                seq_record[pos]=base
                                SeqIO.write(fasta_file, output, 'fasta')

I know this is not quite right, but I'm not sure how to make it work.

ADD REPLY • link 7.8 years ago by samuel.lipworth ▴ 30

0


You don't need to import so many modules. Also, you need to subtract 1 from your positions, because Python indexing starts at 0.
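
The off-by-one adjustment can be illustrated without Biopython. The following is a minimal sketch, not a corrected version of the script above: `apply_changes` is a hypothetical helper name, and it assumes the positions in the table are 1-based, as the example in the question suggests.

```python
def apply_changes(sequence, changes):
    """Return a copy of `sequence` with the listed bases replaced.

    `changes` is an iterable of (position, base) pairs. Positions are
    assumed to be 1-based, matching the table in the question.
    """
    bases = list(sequence)  # strings are immutable, so edit a list instead
    for position, base in changes:
        if not 1 <= position <= len(bases):
            raise ValueError(f"position {position} is outside the sequence")
        bases[position - 1] = base  # convert the 1-based position to a 0-based index
    return "".join(bases)


if __name__ == "__main__":
    # Example from the question: positions 1 and 2 of ATGTGTC become T and C.
    print(apply_changes("ATGTGTC", [(1, "T"), (2, "C")]))  # prints TCGTGTC
```

Without the `- 1`, the same table would change the second and third bases instead of the first and second.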

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0


This is the solution you need, but you will have to do a little work to prepare the input files:

Introducing Known Mutations (From A Vcf) Into A Fasta File

ADD REPLY • link 7.8 years ago by venu 7.1k

3


7.8 years ago

st.ph.n ★ 2.7k

You don't specify how many sequences are in your file, or how to correlate the positions/changes to each sequence. If there is more than one sequence, I suggest adding header information to your position table. The code below assumes multiple sequences, but will work with only one as well.

First, change this:

1 T
2 C
10 T
50 G

to this (tab-delimited, pos.txt):

head1 1 T
head1 2 C
head1 10 T
head1 50 G
....
head2 ....

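Linearize the FASTA first (a preference of mine; you can still read in the sequences with Biopython). The script below expects each sequence to sit on a single line directly after its header.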
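
Linearizing can itself be done in a few lines of Python. This is a minimal sketch that assumes a standard FASTA layout (a `>` header line followed by one or more sequence lines); `linearize_fasta` and `write_linearized` are hypothetical names and are not part of the original answer.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


def write_linearized(in_path, out_path):
    """Write each record of in_path as one header line and one sequence line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, sequence in linearize_fasta(src):
            dst.write(f">{header}\n{sequence}\n")
```

Calling `write_linearized('wrapped.fasta', 'input.fasta')` (file names are placeholders) would produce a file in the one-line-per-sequence form.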

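#!/usr/bin/env python
from collections import defaultdict

# Map each header to a list of (1-based position, new base) tuples.
pos = defaultdict(list)
with open('pos.txt', 'r') as f:
        for line in f:
                fields = line.strip().split('\t')
                if len(fields) < 3:
                        continue  # skip blank or incomplete lines
                pos[fields[0]].append((int(fields[1]), fields[2]))

# Assumes a linearized FASTA: each header line is followed by one sequence line.
with open('input.fasta', 'r') as fasta:
        with open('input_corr.fasta', 'w') as out:
                for line in fasta:
                        if line.startswith(">"):
                                h = line.strip().split('>')[1]
                                s = list(next(fasta).strip())
                                # Only sequences listed in pos.txt are written out.
                                if h in pos:
                                        for n in pos[h]:
                                                s[n[0]-1] = n[1]  # 1-based to 0-based
                                        out.write('>' + h + '\n' + ''.join(s) + '\n')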

ADD COMMENT • link 7.8 years ago by st.ph.n ★ 2.7k

0


This was a super helpful fix, and the only one I got to work for the same problem: changing SNP bases in a FASTA file, based on a position.txt, for each entry of a multi-entry FASTA file. So thank you! It needed one minor fix to make sure each >header line and its sequence string end up on separate lines. The final line should read: out.write('>' + h + '\n' + ''.join(s) + '\n')

ADD REPLY • link 2.5 years ago by erdagnese • 0

2


7.8 years ago

Matt Shirley 10k

You can use the MutableFastaRecord in pyfaidx for this exact purpose.

	from pyfaidx import Fasta

	# positions.txt:
	# chr1 1 T
	# chr1 100 C
	# chr2 10 G
	# ...

	with open('positions.txt') as mut_table:
	    # mutable Fasta modifies input file in-place
	    # make sure you're editing a copy of the original file
	    with Fasta('input.fasta', mutable=True) as fasta:
	        for line in mut_table:
	            rname, pos, base = line.rstrip().split()
	            # convert 1-based to 0-based coordinates
	            fasta[rname][int(pos) - 1] = base


The mutable Fasta instance will modify your file in-place (commits changes directly back to disk) so be careful which file you're editing.
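
One cautious way to follow that advice is to make the copy programmatically before opening anything in mutable mode. The sketch below uses only the standard library; `make_working_copy` and the `_copy` naming scheme are assumptions for illustration, not part of pyfaidx.

```python
import shutil
from pathlib import Path


def make_working_copy(original):
    """Copy `original` to a sibling file named <stem>_copy<suffix>.

    Returns the path of the copy. Refuses to overwrite an existing file,
    so an earlier edited copy is never silently replaced.
    """
    source = Path(original)
    copy_path = source.with_name(source.stem + "_copy" + source.suffix)
    if copy_path.exists():
        raise FileExistsError(f"{copy_path} already exists")
    shutil.copyfile(source, copy_path)
    return copy_path
```

The returned path (for example `input_copy.fasta`), rather than the original file, would then be the one opened with `mutable=True`.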

ADD COMMENT • link 7.8 years ago by Matt Shirley 10k

0


I know that it has been a while.

This is a very nice solution. But this has one problem. It does not include chromosomes. So, how can it actually find the right position to replace?

ADD REPLY • link 5.3 years ago by mgdias.jose ▴ 20

0


It appears that the original question was a bit vague. The code I provided will replace the first entry in the FASTA file. I'll update the Gist with a bit more information and adapt it to work on a multiFASTA file.
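
On the chromosome question: keying each change by sequence name as well as position keeps identical positions on different chromosomes apart. The following is a plain-Python sketch of that idea, independent of pyfaidx, and it is not the updated Gist. `parse_changes` and `apply_to_records` are hypothetical names, the table is assumed to be whitespace-separated `name position base` rows, and positions are assumed to be 1-based.

```python
from collections import defaultdict


def parse_changes(lines):
    """Group 'name position base' rows into {name: {position: base}}."""
    changes = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or line.startswith("#"):
            continue  # skip blank, malformed, or comment lines
        name, position, base = fields
        changes[name][int(position)] = base
    return changes


def apply_to_records(records, changes):
    """Return a new {name: sequence} dict with each record's own changes applied."""
    edited = {}
    for name, sequence in records.items():
        bases = list(sequence)
        for position, base in changes.get(name, {}).items():
            bases[position - 1] = base  # 1-based position to 0-based index
        edited[name] = "".join(bases)
    return edited
```

With rows `chr1 2 T` and `chr2 2 G`, position 2 is edited separately in each record, and records with no rows are returned unchanged.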

ADD REPLY • link 5.2 years ago by Matt Shirley 10k

0


7.8 years ago

samuel.lipworth ▴ 30

Thanks for the help!

I also managed to write my own version, which seems to work too!

https://github.com/samlipworth/base_changer

ADD COMMENT • link 7.8 years ago by samuel.lipworth ▴ 30

0


And what about the same position in different chromosomes?

ADD REPLY • link 5.4 years ago by mgdias.jose ▴ 20
