How To Generate Mutations In Sequences Of A Fasta File?
4
3
Entering edit mode
13.3 years ago
Geparada ★ 1.5k

Hi!

For example, if I have a file with 3 sequences:

>1
gtccgatgat
>2
gatggatcggacgttagcaa
>3
acatttgaga

I want to generate sequences with 10% of the nucleotides randomly mutated.

>1
gtcTgatgat
>2
gaAggatcggacgtGagcaa
>3
acatCtgaga

Does anybody know a program or code to do that?

And if you know a way to do this in python would be great!

Thanks!

fasta python mutation sequence • 16k views
ADD COMMENT
9
Entering edit mode
13.3 years ago
brentp 24k

call the program below like:

python mutate.py some.fasta 0.01 > mutated.fasta

And it will change 1% (0.01) of the bases.

from random import random, choice
import sys
from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq

def main(fasta_name, mutation_freq):

    for header, seq in fasta_iter(fasta_name):
        seq = list(seq)
        for i, s in enumerate(seq):
            val = random()
            if val < mutation_freq:
                # choose a random nucleotide that's different.
                seq[i] = choice([x for x in "ACTG" if x != s.upper()])
        print ">%s\n%s" % (header, "".join(seq))

if __name__ == "__main__":
    main(sys.argv[1], float(sys.argv[2]))
ADD COMMENT
2
Entering edit mode

on 1) i dont disagree but favor simplicity over performance unless it proves a problem. For 2) it probably depends on the mutation rate which is more efficient. And for 3) I guess that depends on what "mutation rate" means.

ADD REPLY
1
Entering edit mode

actually, you were right. I changed it to s.upper() :) ... thanks.

ADD REPLY
0
Entering edit mode

nice! just one few suggestions, 1)choice is really slower than random, as we only want to get one element the corresponding line could be changed using 'ATGC'[int(random()4)] 2) seq can be let as string, seq = seq[i:] + 'ATGC'[int(random()4)] + seq[:i+1]... it is perhaps more efficient 3)why should the mutation result in a different nucleotide?

ADD REPLY
0
Entering edit mode

Nice code brentp! thanks !!!

ADD REPLY
0
Entering edit mode

I've learned a lot from this code... you really help me brentp :)

ADD REPLY
0
Entering edit mode

Am I mistaken to believe this will only give a mutation rate equivalent to 3/4 of the one you specify? Namely, 1 out of 4 of your mutations will be mutations to the same nucleotide, thus not being a mutation? Maybe a while loop would correct for that easily. (while new == old: pick new) or something.

ADD REPLY
0
Entering edit mode

@Eric, yes, you are mistaken. See @fransua's comment. When the threshold is exceeded. It changes the base. That's the part about: choice([x for x in "ACTG" if not x == s.lower()])

ADD REPLY
0
Entering edit mode

I was not sure I understood your list comprehension at first either. Now I see. Thanks

ADD REPLY
0
Entering edit mode

Ups... I didn't saw the error when I analyzed the script. Thanks for the correction!!

ADD REPLY
0
Entering edit mode

This is great. Sorry, Im new to phyton ... how would you go about to modify the script to directly introduce, say, 2 mutations? (I mean, not depending on a mutation frequency, but telling the script to introduce 2 mutations in each given sequence regardless of size). Thanks!

ADD REPLY
7
Entering edit mode
13.3 years ago

The EMBOSS program msbar will do this. E.g. to introduce 10 point mutations

msbar -count 10 -point 4 seq.fasta

The argument '4' allows only base changes, but msbar can produce other types of mutation. There is also an argument to ensure that the mutation does not create another specific sequence (e.g. the one you started with). You can use infoseq to obtain the sequence length so that you know how many mutations to request.

ADD COMMENT
0
Entering edit mode

EMBOSS (the package that msbar is a part of) takes quite a bit of time to compile. Does anyone know of a place to get binaries?

ADD REPLY
0
Entering edit mode

What about Bioconda?

conda install emboss

https://bioconda.github.io/recipes/emboss/README.html

ADD REPLY
0
Entering edit mode

It seems that msbar couldn't work with multiple sequences in the same fasta file, do you have any idea how we could mutate multiple seqs with msbar

ADD REPLY
4
Entering edit mode
13.3 years ago

If this effort is to support research on human genetics, for example, I would add statements that the random variation mimic known frequencies of transitions (T <-> C, or A <-> G) and transversions (A -> C or T, or G -> C or T, C -> G or A, T -> G or A) in the human genome. The most common single nucleotide change in human is C <-> T. Thus, if T is picked as the base to change, then it should change more often to a C than to another base. This would be important if the project wants to align with a biological situation.

The frequency with which a given nucleotide, selected at random by one of the scripts above, changes to any of the other three nucleotides can be taken by scanning the literature for such a table or by parsing dbSNP. I just don't have number on hand.

ADD COMMENT
0
Entering edit mode

+1 if you put more time/realism/thought in, you'll likely get more useful stuff out.

ADD REPLY
2
Entering edit mode
13.3 years ago

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i test.fna | mutate_seq -p 1 | write_fasta -x

There is also indel_seq for insertions and deletions.

http://code.google.com/p/biopieces/wiki/mutate_seq

http://code.google.com/p/biopieces/wiki/indel_seq

Cheers,

Martin

ADD COMMENT
0
Entering edit mode

This looks nice, but I can't figure out how to install biopieces. It looks like homebrew has dropped support for it. :/

ADD REPLY

Login before adding your answer.

Traffic: 1024 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6