Question

Any ideas on converting IUPAC ambiguity codes in a fasta file

0

Entering edit mode

7.1 years ago

SaltedPork ▴ 170

Hi, I have fasta files that contain ambiguity codes. And I would like to convert the sequences so that it's just ACGT and create a new entry for each possibility. i.e. Y = C or T

>sampleID
AGCTAGYACG

into

>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG

I was thinking about writing a Perl script that does that using the transliterate function (tr). Before I embark on this, has anyone already written or used some code/software that does this?

fasta iupac regex perl bash • 4.8k views

ADD COMMENT • link updated 3.1 years ago by schagas • 0 • written 7.1 years ago by SaltedPork ▴ 170

score 3 · Accepted Answer · 2018-02-14

3

Entering edit mode

7.1 years ago

Pierre Lindenbaum 165k

see all posible sequences from consensus

$ gcc biostars282490.c
$ ./a.out AGCTAGYACG
AGCTAGCACG
AGCTAGTACG

ADD COMMENT • link 7.1 years ago by Pierre Lindenbaum 165k

score 2 · Accepted Answer · 2018-02-14

2

Entering edit mode

7.1 years ago

GenoMax 149k

https://stackoverflow.com/questions/27551921/how-to-extend-ambiguous-dna-sequence

ADD COMMENT • link 7.1 years ago by GenoMax 149k

score 2 · Accepted Answer · 2018-02-14

2

Entering edit mode

7.1 years ago

st.ph.n ★ 2.7k

This python code should work if you FASTA is single-line. If not you can linearize it.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                if 'Y' in seq:
                    out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
                    out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
                else:
                    out.write(header + '\n' + seq)

Save as convert.py, run as python convert.py input.fasta output.fasta

EDIT

IUPAC_ambiguity_codes.txt

M       A       C
R       A       G
W       A       T
S       C       G
Y       C       T
K       G       T
V       A       C       G
H       A       C       T
D       A       G       T
B       C       G       T
N       G       A       T       C

#!/usr/bin/env python
import sys
from collections import defaultdict

with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
                for line in f:
                        codes[line.strip().split('\t')[0]] = line.strip().split('\t')

with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                for c in codes:
                        if c in seq:
                                for n in range(len(codes[c])):
                                        out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))

Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.

ADD COMMENT • link 7.1 years ago by st.ph.n ★ 2.7k

1

Entering edit mode

I am reasonably sure OP wants this to work for all valid IUPAC codes.

ADD REPLY • link 7.1 years ago by GenoMax 149k

0

Entering edit mode

Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?

ADD REPLY • link 7.1 years ago by SaltedPork ▴ 170

1

Entering edit mode

You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.

ADD REPLY • link 7.1 years ago by GenoMax 149k

1

Entering edit mode

It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.

ADD REPLY • link 7.1 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

As my lab has the limitation to work only with the Sanger sequencer, I needed to get the two versions (ignoring the infinity of combinations/possibilities) from sequences with ambiguity code just to visualize the nucleotide diversity in heterozygous samples. I tested the code above and made some changes.

#!/usr/bin/env python

import subprocess
import sys
import os
from collections import defaultdict

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

def read_fasta(fasta):
    with open(fasta, 'r') as fast:
        headers, sequences = [], []
        for line in fast:
            if line.startswith('>'):
                head = line.replace('>','').strip()
                headers.append(head)
                sequences.append('')
            else :
                seq = line.strip()
                if len(seq) > 0:
                    sequences[-1] += seq
    return (headers, sequences, fast)


def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')

install('biopython')
print (sys.path)
from Bio.SeqIO import FastaIO







with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
            codes[line.strip().split()[0]] = line.strip().split()

try:
    os.makedirs('./msa-wo', exist_ok=True)
    print("Directory ./msa-wo created successfully" )
except OSError as error:
    print("Directory ./msa-wo can not be created")

fasta = read_fasta("big.fasta")
new_fasta = write_fasta(fasta[0], fasta[1],'new_fasta_nobreaklines.fasta')

with open('new_fasta_nobreaklines.fasta', 'r') as f:
    with open("./msa-wo/msa_wo_amb.fasta", 'w') as out:
        for line in f:
            if line.startswith(">"):
                        header = line.strip()
                        seq = next(f).strip()
                        seq1 = seq
                        seq2 = seq
            for c in codes:
                        if c in seq1:
                            seq1 = codes[c][1].join(seq1.split(c))
                        if c in seq2:
                            seq2 = codes[c][2].join(seq2.split(c))
            out.write(header + '.1' + '\n' + seq1 + '\n\n')
            out.write(header + '.2' + '\n' + seq2 + '\n\n') 

print("Output: msa_wo_amb.fasta, check msa-wo folder at convert.py directory")

Input:

>1f
ACATTA

>2f
WCCCRA

Result:

>1f.1
ACATTA

>1f.2
ACATTA

>2f.1
ACCCAA

>2f.2
TCCCGA

ADD REPLY • link 3.1 years ago by schagas • 0