I have two multi-FASTA files. One contains nucleotide sequences; a subset of it, the nucleotide sequence for ORF1, is:
>ORF1_BnaA03g18710D_S45:82:1509
ATGGCCGCCGCAGTTTCCACCGTCGGTGCCATCAACAGAGCTCCGTTGAGCTTGAACGGG
TCAGGAGCAGGAGCTGCTTCAGTCCCAGCTACGACCTTCTTGGGAAAGAAAGTTGTAACC
GCGTCGAGATTCACACAGAGCAACAACAAGAAGAGCAACGGATCATTCAAAGTGGTCGCT
GTCAAAGAAGACAAACAAACCGATGGAGACAGATGGAGGGGACTTGCCTACGACACGTCT
GATGATCAACAAGACATCACCAGAGGCAAAGGTATGGTTGACTCTGTCTTCCAAGCTCCC
ATGGGAACCGGAACTCACAATGCCGTTCTTAGCTCCTATGAGTACATTAGCCAAGGTCTT
AAGCAGTACAACTTGGACAACATGATGGATGGGCTTTACATTGCTCCTGCATTCATGGAC
AAGCTTGTTGTTCACATCACCAAGAACTTCTTGACTTTACCTAACATCAAGGTTCCACTT
ATTTTGGGTATTTGGGGAGGCAAAGGTCAAGGTAAATCCTTCCAGTGTGAGCTTGTCATG
GCCAAGATGGGCATTAACCCAATCATGATGAGTGCTGGAGAGCTTGAGAGTGGAAACGCA
GGAGAACCAGCCAAGCTGATCCGTCAAAGGTACCGTGAAGCAGCAGACATGATCAAAAAG
GGAAAAATGTGTTGTCTATTCATCAACGATCTCGACGCTGGTGCTGGTCGTATGGGTGGT
ACTACYYAGTACACAGTCAACAACCWGATGGTTAACGCAACCYTCATGAACMTTGCTGAT
AACCCAACCAACGTCCAGCTCCCGGGAATGTACAACAAGGAAGAAAACGCACGTGTCCCC
ATCATCGTCACCGGTAACGATTTCTCCACTCTCTACGCACCTCTCATCCGTGACGGGCGT
ATGGAGAARTTCTACTGGGCACCCACACGTGAGGACCKTATTGGTGTCTGCAAGGGTATC
TTCAGGACTGATAACGTTAAGGATGAAGACATTGTCACGCTTGTTGACCAGTTCCCTGGA
CAATCTATCGATTTCTTTGGTGCATTGAGGGCGAGAGTGTACGATGATGAAGTGAGGAAG
TTCGTTGAGGGACTTGGAGTKGAGAAGATAGGAAAGAGGCTGGTGAACTCTAGGGAAGGT
CCTCCAGTGTTCGAGCAACCAGCGATGACTCTTGAGAAGCTTATGGAGTACGGAAACATG
CTTGTGATGGAGCAAGAGAACGTCAAGAGAGTCCAACTTGCTGACCAATACCTTAACGAG
GCTGCCTTGGGAGACGCAAACGCGGACGCCATTGGCCGCGGAACTTTCTATGGGAAAGCA
GCACAGCAAGTGAACCTTCCTGTTCCAGAAGGGTGTACTGATCCTCAAGCCGACAACTTT
GATCCAACAGCTAGAAGTGATGATGGAACTTGTGTCTACAACTTTTGA
The second file contains the corresponding amino acid sequences; a subset is:
>ORF1_BnaA03g18710D_S45:82:1509
MAAAVSTVGAINRAPLSLNGSGAGAASVPATTFLGKKVVTASRFTQSNNKKSNGSFKVVA
VKEDKQTDGDRWRGLAYDTSDDQQDITRGKGMVDSVFQAPMGTGTHNAVLSSYEYISQGL
KQYNLDNMMDGLYIAPAFMDKLVVHITKNFLTLPNIKVPLILGIWGGKGQGKSFQCELVM
AKMGINPIMMSAGELESGNAGEPAKLIRQRYREAADMIKKGKMCCLFINDLDAGAGRMGG
TTXYTVNNXMVNATXMNJADNPTNVQLPGMYNKEENARVPIIVTGNDFSTLYAPLIRDGR
MEKFYWAPTREDXIGVCKGIFRTDNVKDEDIVTLVDQFPGQSIDFFGALRARVYDDEVRK
FVEGLGVEKIGKRLVNSREGPPVFEQPAMTLEKLMEYGNMLVMEQENVKRVQLADQYLNE
AALGDANADAIGRGTFYGKAAQQVNLPVPEGCTDPQADNFDPTARSDDGTCVYNF
The problem is that some positions in the protein contain X instead of an amino acid (the sequence above contains four X's). For each of these ambiguous positions I would like to look up the corresponding codon in the nucleotide sequence, enumerate the possible combinations for the ambiguous nucleotides, and replace the X with the correct amino acid using the codon table.
For example, the first X is at position 243 of the protein sequence (TTXYTV). The codon for this X is YAG, where Y stands for C or T, so the possible codons are CAG and TAG. CAG codes for Gln (Q), and TAG is a stop codon.
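The YAG example can be checked with a few lines of Python. This is only a sketch: it assumes the standard genetic code (NCBI translation table 1) and the usual IUPAC ambiguity codes, and the name expand_codon is made up for illustration.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (assumed: NCBI table 1), with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_codon(codon):
    """Map every concrete reading of an ambiguous codon to its amino acid."""
    options = [IUPAC[base] for base in codon.upper()]
    return {"".join(c): CODON_TABLE["".join(c)] for c in product(*options)}


print(expand_codon("YAG"))  # {'CAG': 'Q', 'TAG': '*'}
```

The stop codon is reported as *, so a position such as YAG stays visibly unresolved rather than being silently assigned one amino acid.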
The output may be saved in Excel format, something like the following image, or in any other suitable format.
Any help will be highly appreciated.
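One possible shape for such a table is a CSV file, which Excel opens directly. The sketch below assumes the protein is the frame-1 translation of the nucleotide sequence (so residue i comes from bases 3i to 3i+2), uses the standard genetic code, and only looks at X; the function and column names are hypothetical.

```python
import csv
import io
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (assumed: NCBI table 1), with bases ordered T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def ambiguous_positions(nucleotides, protein):
    """Yield (1-based position, codon, possible codons, possible residues) per X."""
    for i, residue in enumerate(protein):
        if residue != "X":
            continue
        codon = nucleotides[3 * i : 3 * i + 3].upper()
        readings = ["".join(c) for c in product(*(IUPAC[b] for b in codon))]
        residues = "/".join(CODON_TABLE[r] for r in readings)
        yield i + 1, codon, "/".join(readings), residues


def write_report(nucleotides, protein, handle):
    """Write one CSV row per X position to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["position", "codon", "possible_codons", "possible_amino_acids"])
    writer.writerows(ambiguous_positions(nucleotides, protein))


# Short fragment around the first X of the ORF1 example (TTXYTV).
buffer = io.StringIO()
write_report("ACTACYYAGTACACAGTC", "TTXYTV", buffer)
print(buffer.getvalue())
```

On the fragment the X is reported at position 3; run on the full ORF1 record the same residue is at position 243. Passing a file opened with newline="" instead of the StringIO buffer would write the table to disk.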
Let's say the possible combinations allow for 2 different amino acids. How would you pick the "right" one?
I would like to keep both combinations: in sequence a, YAG becomes CAG and codes for Q, while in sequence b, YAG becomes TAG and gives a stop codon (*). The outputs would look like these:
I think this would be a better alternative than the Excel format.
You should look for a way to expand regexes: some piece of code that can take [bcr]at as input and give you bat cat rat. Once you have that, you should linearize your nucleotide FASTA and replace all ambiguous codes with their corresponding regexes (Y would become [CT], for example). Then apply the regex expansion algorithm to the sequence field of your linearized FASTA and you'd get multiple sequences per header. Use a custom awk script to write each resultant sequence with a suffixed header and you'll be all set. The key is the regex expansion algorithm though; that's what you'll need to find here.
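For character classes like [CT] the expansion step is just a Cartesian product over the choices at each position, so a rough Python equivalent (swapped in here for the regex-plus-awk route, not a general regex expander) can lean on itertools.product. The function names and the _v1, _v2 header suffixes are placeholders.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_sequence(seq):
    """Yield every unambiguous sequence consistent with an IUPAC-coded one."""
    for combo in product(*(IUPAC[base] for base in seq.upper())):
        yield "".join(combo)


def expand_fasta_record(header, seq):
    """Yield (suffixed header, sequence) pairs, one per expanded variant."""
    for n, variant in enumerate(expand_sequence(seq), start=1):
        yield f"{header}_v{n}", variant


# Six bases from the ORF1 example with two ambiguous positions (ACY YAG).
for name, variant in expand_fasta_record("ORF1_example", "ACYYAG"):
    print(f">{name}\n{variant}")
```

The number of variants multiplies with every ambiguous base (each two-way code such as Y doubles it), so a sequence with many ambiguity codes yields a large output; the generator at least avoids holding all variants in memory at once.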