Any ideas on converting IUPAC ambiguity codes in a fasta file
3
0
Entering edit mode
6.8 years ago
SaltedPork ▴ 170

Hi, I have fasta files that contain ambiguity codes. And I would like to convert the sequences so that it's just ACGT and create a new entry for each possibility. i.e. Y = C or T

>sampleID
AGCTAGYACG

into

>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG

I was thinking about writing a Perl script that does that using the transliterate function (tr). Before I embark on this, has anyone already written or used some code/software that does this?

fasta iupac regex perl bash • 4.5k views
ADD COMMENT
3
Entering edit mode
6.8 years ago

see all posible sequences from consensus

$ gcc biostars282490.c
$ ./a.out AGCTAGYACG
AGCTAGCACG
AGCTAGTACG
ADD COMMENT
2
Entering edit mode
6.8 years ago
st.ph.n ★ 2.7k

This python code should work if you FASTA is single-line. If not you can linearize it.

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                if 'Y' in seq:
                    out.write(header + '.1' + '\n' + 'C'.join(seq.split('Y')))
                    out.write(header + '.2' + '\n' + 'T'.join(seq.split('Y')))
                else:
                    out.write(header + '\n' + seq)

Save as convert.py, run as python convert.py input.fasta output.fasta

EDIT

IUPAC_ambiguity_codes.txt

M       A       C
R       A       G
W       A       T
S       C       G
Y       C       T
K       G       T
V       A       C       G
H       A       C       T
D       A       G       T
B       C       G       T
N       G       A       T       C
  
#!/usr/bin/env python
import sys
from collections import defaultdict

with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
                for line in f:
                        codes[line.strip().split('\t')[0]] = line.strip().split('\t')

with open(sys.argv[1], 'r') as f:
    with open(sys.argv[2], 'w') as out:
        for line in f:
            if line.startswith(">"):
                header = line.strip()
                seq = next(f).strip()
                for c in codes:
                        if c in seq:
                                for n in range(len(codes[c])):
                                        out.write(header + '.' + str(n+1) + '\n' + codes[c][n].join(seq.split(c)))

Untested, but should output a new numbered sequence header for each replacement required to make for each ambiguity code encountered.

ADD COMMENT
1
Entering edit mode

I am reasonably sure OP wants this to work for all valid IUPAC codes.

ADD REPLY
0
Entering edit mode

Yep, that would be the idea, I suppose I could take this and have many if statements, but then how do I account for sequences which have multiple IUPAC codes in them?

ADD REPLY
1
Entering edit mode

You sure can. If you want something that works now then use one of the stackoverflow or @Pierre's solution below.

ADD REPLY
1
Entering edit mode

It appears as though you want a new sequence for each ambiguity code encountered. See edited post for reference. Otherwise if you want a new sequence and ALL codes changed at once, it will be slightly different.

ADD REPLY
0
Entering edit mode

As my lab has the limitation to work only with the Sanger sequencer, I needed to get the two versions (ignoring the infinity of combinations/possibilities) from sequences with ambiguity code just to visualize the nucleotide diversity in heterozygous samples. I tested the code above and made some changes.

#!/usr/bin/env python

import subprocess
import sys
import os
from collections import defaultdict

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

def read_fasta(fasta):
    with open(fasta, 'r') as fast:
        headers, sequences = [], []
        for line in fast:
            if line.startswith('>'):
                head = line.replace('>','').strip()
                headers.append(head)
                sequences.append('')
            else :
                seq = line.strip()
                if len(seq) > 0:
                    sequences[-1] += seq
    return (headers, sequences, fast)


def write_fasta(headers, sequences, fasta):
    with open(fasta, 'w') as fast:
        for i in range(len(headers)):
            fast.write('>' + headers[i] + '\n' + sequences[i] + '\n')

install('biopython')
print (sys.path)
from Bio.SeqIO import FastaIO







with open('IUPAC_ambiguity_codes.txt', 'r') as amb:
        codes = defaultdict(list)
        for line in amb:
            codes[line.strip().split()[0]] = line.strip().split()

try:
    os.makedirs('./msa-wo', exist_ok=True)
    print("Directory ./msa-wo created successfully" )
except OSError as error:
    print("Directory ./msa-wo can not be created")

fasta = read_fasta("big.fasta")
new_fasta = write_fasta(fasta[0], fasta[1],'new_fasta_nobreaklines.fasta')

with open('new_fasta_nobreaklines.fasta', 'r') as f:
    with open("./msa-wo/msa_wo_amb.fasta", 'w') as out:
        for line in f:
            if line.startswith(">"):
                        header = line.strip()
                        seq = next(f).strip()
                        seq1 = seq
                        seq2 = seq
            for c in codes:
                        if c in seq1:
                            seq1 = codes[c][1].join(seq1.split(c))
                        if c in seq2:
                            seq2 = codes[c][2].join(seq2.split(c))
            out.write(header + '.1' + '\n' + seq1 + '\n\n')
            out.write(header + '.2' + '\n' + seq2 + '\n\n') 

print("Output: msa_wo_amb.fasta, check msa-wo folder at convert.py directory") 

Input:

>1f
ACATTA

>2f
WCCCRA

Result:

>1f.1
ACATTA

>1f.2
ACATTA

>2f.1
ACCCAA

>2f.2
TCCCGA
ADD REPLY

Login before adding your answer.

Traffic: 1843 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6