Hi, I have FASTA files that contain ambiguity codes. I would like to convert the sequences so that they contain only A, C, G and T, creating a new entry for each possibility. For example, since Y = C or T, I want to turn
>sampleID
AGCTAGYACG
into
>sampleID.1
AGCTAGCACG
>sampleID.2
AGCTAGTACG
I was thinking about writing a Perl script that does this using the transliteration operator (tr). Before I embark on this, has anyone already written or used some code/software that does this?
I am reasonably sure OP wants this to work for all valid IUPAC codes.
Yep, that would be the idea. I suppose I could take this and add many if statements, but then how do I account for sequences that have multiple IUPAC codes in them?
You sure can. If you want something that works now, then use one of the Stack Overflow solutions or @Pierre's solution below.
It appears as though you want a new sequence for each ambiguity code encountered. See the edited post for reference. Otherwise, if you want a new sequence with ALL codes changed at once, the approach will be slightly different.
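On the question of sequences with several IUPAC codes: one way to handle them without a chain of if statements is to treat each position as a set of allowed bases and take the Cartesian product over all positions. The following is only a minimal sketch of that idea in Python (not the Perl tr approach mentioned in the question, and not any of the solutions referred to in this thread). The function names expand, read_fasta and expand_fasta are made up for illustration, and the sketch assumes the input contains only the 15 standard IUPAC nucleotide letters (no gaps, no U).

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(seq):
    """Yield every unambiguous sequence encoded by seq (hypothetical helper)."""
    # One string of allowed bases per position; unknown letters raise KeyError.
    choices = [IUPAC[base] for base in seq.upper()]
    # product() walks through all combinations, one base per position.
    for combo in product(*choices):
        yield "".join(combo)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def expand_fasta(lines):
    """Yield FASTA lines with one numbered record per expanded sequence."""
    for header, seq in read_fasta(lines):
        for i, variant in enumerate(expand(seq), start=1):
            yield f">{header}.{i}"
            yield variant


if __name__ == "__main__":
    for out_line in expand_fasta([">sampleID", "AGCTAGYACG"]):
        print(out_line)
```

Run on the example from the question, this prints the sampleID.1 and sampleID.2 records shown there. Note that the number of output sequences is the product of the number of options at each position, so a sequence with many ambiguity codes (especially N) expands very quickly.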
Because my lab can only work with a Sanger sequencer, I needed to get just the two versions of each sequence containing ambiguity codes (ignoring the huge number of possible combinations), simply to visualize the nucleotide diversity in heterozygous samples. I tested the code above and made some changes.
Input:
Result: