Convert amino acid sequences into nucleotide sequences
3
0
Entering edit mode
13 months ago
sil_bioinfo ▴ 50

How to convert amino acid sequences (big fasta files) into nucleotide sequences, any software tool? I'm using a mac. In the fasta files I have the frames too. Here's a little example:

>abc_frame=-1
SEETQLVPLGWPR*W*PWCLSPSRKTSLDLWHSNTQQCLQAAHSVHLESQFCWKCLSRY*
TCSLMNLCRMYIQ*ISFQSTPVLFLQAV*SNLCSSHQENKRPDR*WSDVDLAAQSQRSAV
STVHPSHMIQLPTAAELQETWFVLNLTCCE
>def_frame=-2
QRKHSWSLWGGRGDGDHGAFPPVVKTPIDSQYWHSNTQQCLQAAHSVHLESQFCWKCLSR
Y*TCSLMNLCSMKLQ*ISFQSTPVLFLQAV*SKL*SSHQENKRPDR*WSDVDLAAQSQRS
AVSTDHPSHMIQLPTAAELQETWFVLNLTCC
>ghi_frame=3
SQHVRFSTNHVSCSSAAVGSWIXCEG*TVDTADLCDCAARSTSDHHLSGLLFSW*LLXXX
DQTACRKRTGVDWNEIYWSFILQRFIKEQVQYRLRHFQQNCDSKWTECAA*RHCCVLLCQ
PTGRGIXGFRLLGKRHTGNSVISHPKGTNCVSS

Additional info:

I have a fasta file with multiple nucleotide sequences, which I then translated into amino acids with seqkit. From those amino acid sequences, I filtered a few, based on some parameters, and I would like to have in another fasta file the amino acid sequences that were discarded (nonprod_seq.fasta) and convert them back to nucleotides.

Here's my code:

seqkit translate nucleotide.fasta --frame 6 --append-frame -o protein.fasta
pip install biopython 

from Bio import SeqIO

seqs = SeqIO.parse(open('protein.fasta'), 'fasta')

def is_productive(s):
    return 'M' in s.seq[:] and (s.count('*')<2)

with open('prod_seq.fasta', 'w') as fw1, open('nonprod_seq.fasta', 'w') as fw2:
    for s in seqs:
        SeqIO.write(s, fw1 if is_productive(s) else fw2, 'fasta')

Is there any way to recover these sequences as nucleotides from the initial nucleotide.fasta file? Maybe from the SeqID? The problem is that the SeqID is not the same, as long as, in the protein.fasta file has the frame added at the end, and there are no blanks in the SeqIDs.

Example of SeqIDs in both files:

nucleotide.fasta

>GAACACGAAGGACGC|PRCONS=Primer_Read1_Rod_Cmu|SEQORIENT=F|CREGION=Inner_Scomax_Cmu|CONSCOUNT=19|DUPCOUNT=8

protein.fasta

>GAACACGAAGGACGC|PRCONS=Primer_Read1_Rod_Cmu|SEQORIENT=F|CREGION=Inner_Scomax_Cmu|CONSCOUNT=19|DUPCOUNT=8_frame=1

Thank you in advance

nucleotide protein • 5.0k views
ADD COMMENT
0
Entering edit mode

The sequences in your example look like amino acid sequences already. In general, you should look for the process called translation, and you will need only two tools for performing the task: either transeq or getOrf. They are part of EMBOSS and can be installed on a Mac.

ADD REPLY
0
Entering edit mode

Hello, I made a mistake. I just edited the question, sorry. I need to convert amino acid sequences with frames into nucleotide sequences. Thank you

ADD REPLY
5
Entering edit mode
13 months ago
Joe 21k

You can't really do this, because of the codon redundancy there is no single correct answer.

That said, I did write some code to do this for fun once:

https://github.com/jrjhealey/bioinfo-tools/blob/master/backtranslate.py

Disclaimer: I haven't run/tested this code in a long time, and you'll have to do some work with the output to have it in a usable form.

A simple usage would be:

python backtranslate.py simple  ../Sandbox/TestData/Proteins.fasta fasta

Input:

>abc_frame=-1
SEETQLVPLGWPRWPWCLSPSRKTSLDLWHSNTQQCLQAAHSVHLESQFCWKCLSRYTCSLMNLCRMYIQISFQSTPVLFLQAVSNLCSSHQENKRPDRWSDVDLAAQSQRSAV

(Truncated) Output:

abc_frame=-1
AGU GAA GAA ACU CAA UUA GUU CCU UUA GGU UGG CCU CGU UGG CCU UGG UGU UUA AGU CCU AGU CGU AAA ACU AGU UUA GAU UUA UGG CAU AGU AAU ACU CAA CAA UGU UUA CAA GCU GCU CAU AGU GUU CAU UUA GAA AGU CAA UUU UGU UGG AAA UGU UUA AGU CGU UAU ACU UGU AGU UUA AUG AAU UUA UGU CGU AUG UAU AUU CAA AUU AGU UUU CAA AGU ACU CCU GUU UUA UUU UUA CAA GCU GUU AGU AAU UUA UGU AGU AGU CAU CAA GAA AAU AAA CGU CCU GAU 
AGC GAG GAG ACC CAG UUG GUC CCC UUG GGC     CCC CGC     CCC     UGC UUG AGC CCC AGC CGC AAG ACC AGC UUG GAC UUG     CAC AGC AAC ACC CAG CAG UGC UUG CAG GCC GCC CAC AGC GUC CAC UUG GAG AGC CAG UUC UGC     AAG UGC UUG AGC CGC UAC ACC UGC AGC UUG     AAC UUG UGC CGC     UAC AUC CAG AUC AGC UUC CAG AGC ACC CCC GUC UUG UUC UUG CAG GCC GUC AGC AAC UUG UGC AGC AGC CAC CAG GAG AAC AAG CGC CCC GAC 
UCU         ACA     CUU GUA CCA CUU GGA     CCA CGA     CCA         CUU UCU CCA UCU CGA     ACA UCU CUU     CUU         UCU     ACA             CUU     GCA GCA     UCU GUA     CUU     UCU                         CUU UCU CGA     ACA     UCU CUU         CUU     CGA         AUA     AUA UCU         UCU ACA CCA GUA CUU     CUU     GCA GUA UCU     CUU     UCU UCU                     CGA CCA     
UCC         ACG     CUC GUG CCG CUC GGG     CCG CGG     CCG         CUC UCC CCG UCC CGG     ACG UCC CUC     CUC         UCC     ACG             CUC     GCG GCG     UCC GUG     CUC     UCC                         CUC UCC CGG     ACG     UCC CUC         CUC     CGG                     UCC         UCC ACG CCG GUG CUC     CUC     GCG GUG UCC     CUC     UCC UCC                     CGG CCG     
UCA                 CUA         CUA             AGA                 CUA UCA     UCA AGA         UCA CUA     CUA         UCA                     CUA                 UCA         CUA     UCA                         CUA UCA AGA             UCA CUA         CUA     AGA                     UCA         UCA             CUA     CUA             UCA     CUA     UCA UCA                     AGA         
UCG                 CUG         CUG             AGG                 CUG UCG     UCG AGG         UCG CUG     CUG         UCG                     CUG                 UCG         CUG     UCG                         CUG UCG AGG             UCG CUG         CUG     AGG                     UCG         UCG             CUG     CUG             UCG     CUG     UCG UCG                     AGG         

The tool has 2 other output styles/formats. The above is simple, the others are pretty:

abc_frame=-1
AGU                 UUA         UUA             CGU                 UUA AGU     AGU CGU         AGU UUA     UUA         AGU                     UUA                 AGU         UUA     AGU                         UUA AGU CGU             AGU UUA         UUA     CGU                     AGU         AGU             UUA     UUA             AGU     UUA     AGU AGU                     CGU         
AGC         ACU     UUG GUU CCU UUG GGU     CCU CGC     CCU         UUG AGC CCU AGC CGC     ACU AGC UUG     UUG         AGC     ACU             UUG     GCU GCU     AGC GUU     UUG     AGC                         UUG AGC CGC     ACU     AGC UUG         UUG     CGC                     AGC         AGC ACU CCU GUU UUG     UUG     GCU GUU AGC     UUG     AGC AGC                     CGC CCU     
UCU GAA GAA ACC CAA CUU GUC CCC CUU GGC     CCC CGA     CCC     UGU CUU UCU CCC UCU CGA AAA ACC UCU CUU GAU CUU     CAU UCU AAU ACC CAA CAA UGU CUU CAA GCC GCC CAU UCU GUC CAU CUU GAA UCU CAA UUU UGU     AAA UGU CUU UCU CGA UAU ACC UGU UCU CUU     AAU CUU UGU CGA     UAU AUU CAA AUU UCU UUU CAA UCU ACC CCC GUC CUU UUU CUU CAA GCC GUC UCU AAU CUU UGU UCU UCU CAU CAA GAA AAU AAA CGA CCC GAU 
UCC GAG GAG ACA CAG CUC GUA CCA CUC GGA UGG CCA CGG UGG CCA UGG UGC CUC UCC CCA UCC CGG AAG ACA UCC CUC GAC CUC UGG CAC UCC AAC ACA CAG CAG UGC CUC CAG GCA GCA CAC UCC GUA CAC CUC GAG UCC CAG UUC UGC UGG AAG UGC CUC UCC CGG UAC ACA UGC UCC CUC AUG AAC CUC UGC CGG AUG UAC AUC CAG AUC UCC UUC CAG UCC ACA CCA GUA CUC UUC CUC CAG GCA GUA UCC AAC CUC UGC UCC UCC CAC CAG GAG AAC AAG CGG CCA GAC 
UCA         ACG     CUA GUG CCG CUA GGG     CCG AGA     CCG         CUA UCA CCG UCA AGA     ACG UCA CUA     CUA         UCA     ACG             CUA     GCG GCG     UCA GUG     CUA     UCA                         CUA UCA AGA     ACG     UCA CUA         CUA     AGA         AUA     AUA UCA         UCA ACG CCG GUG CUA     CUA     GCG GUG UCA     CUA     UCA UCA                     AGA CCG     
UCG                 CUG         CUG             AGG                 CUG UCG     UCG AGG         UCG CUG     CUG         UCG                     CUG                 UCG         CUG     UCG                         CUG UCG AGG             UCG CUG         CUG     AGG                     UCG         UCG             CUG     CUG             UCG     CUG     UCG UCG                     AGG         

And probably the nicest but least useful, boxed (had to use a screenshot since it used characters unsupported by the forum markdown).

enter image description here

ADD COMMENT
0
Entering edit mode

oh, I understand. I was afraid of that... In my case, I have a fasta file with multiple nucleotide sequences, which I then translated into amino acids with seqkit. From those amino acid sequences, I filtered a few, based on some parameters, and I would like to have in another fasta file the amino acid sequences that were discarded and convert them back to nucleotides. Is there any way to recover these sequences as nucleotides from the initial fasta file? Maybe from the sequence name or something? Thanks

ADD REPLY
0
Entering edit mode

If you have the nucleotides sequences, just filter the proteins, extract a list of the IDs that pass your filter, then just re-extract the nuc sequences from the original file based on their ID.

The approach you're trying won't work.

Also, I would question these protein sequences - they are full of stop codons, what are you planning to do with them?

ADD REPLY
0
Entering edit mode

sorry, how can I filter proteins without translating nucleotides into proteins? Could you give me some code sketch for this you say please? I'm still very new in programming

ADD REPLY
1
Entering edit mode

The steps would be as follows:

  1. Have nucleotide sequence fasta
  2. Translate nucleotide seqs to protein sequence fasta
  3. Apply filters as desired to the protein file
  4. Extract the IDs of the translated sequences which pass filter
  5. Use those IDs as keys to extract the original corresponding nucleotide sequence from the file in step 1.

I'm not an expert with it but AFAIK seqkit can do all of this. As long as you don't alter the IDs during the process, you can always match up the DNA to protein sequence.

ADD REPLY
0
Entering edit mode

okay, I will try it. Thank you so much for your help!

ADD REPLY
0
Entering edit mode

Is there a reason you are doing it this way rather than just taking sequences from NCBI or something that are the product of a more rigorous annotation?

ADD REPLY
0
Entering edit mode

I'm working with data that is not yet public... I'm just following instructions

ADD REPLY
0
Entering edit mode

I would still suggest probably using a proper annotation pipeline rather than simple translation if you need realistic gene/protein sequences.

ADD REPLY
0
Entering edit mode

I translated all the nucleotides sequences in all 6 frames. And then, I filtered the ones with the complete reading frame (with Metionine in the sequence and 1 or 2 as maximum stop codons). Well, the example fasta sequences are non productives, which I would like to convert them to nucleotides. I will work with the productive ones but my boss would like to have the non productive ones as nucleotides (I don't know why)

ADD REPLY
0
Entering edit mode

I'd like to note that I feel your definition of 'productive' is a bit problematic (not totally, longer ORFs are more likely to be translated). For example, it has been shown by Ribo-seq that small upstream ORFs of longer coding genes can be really translated. I'll find the publications if you give me some time.

ADD REPLY
0
Entering edit mode

oh yes of course, I would like to read that, thank you!

ADD REPLY
1
Entering edit mode

uORFs are regulatory elements, here are some recent references

Lin Y, May GE, Kready H, Nazzaro L, Mao M, Spealman P, Creeger Y, McManus CJ. Impacts of uORF codon identity and position on translation regulation. Nucleic Acids Res. 2019 Sep 26;47(17):9358-9367. doi: 10.1093/nar/gkz681. PMID: 31392980; PMCID: PMC6755093.

Zhang, H., Wang, Y., Wu, X. et al. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun 12, 1076 (2021). https://doi.org/10.1038/s41467-021-21394-y

I think this is the first one, or one of the earliest to introduce the concept:

Calvo SE, Pagliarini DJ, Mootha VK. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc Natl Acad Sci U S A. 2009 May 5;106(18):7507-12. doi: 10.1073/pnas.0810916106. Epub 2009 Apr 16. PMID: 19372376; PMCID: PMC2669787.

ADD REPLY
0
Entering edit mode

the data I'm working with is from a specie of fish.. it's RNA-Seq data.. to find which V and J segments are used etc..

I will check that references though, thank you so much!

ADD REPLY
2
Entering edit mode

@sperezilva: I think, you should pause here for a moment and reconsider your approach and options.

It seems that you lost the big picture over all the nitty-gritty details. What are the research questions you are working with? Did you or your lab generate that RNA-seq data yourself, or is it published? Is there a reference genome of that species or a closely related one?

I am asking because you mentioned V and J segments. That type of analysis is to my best knowledge rarely done with standard RNA-seq, but rather with dedicated sequencing methods like AIRR-seq. I vaguely recall that fish immunology has also many surprises in store - cod has no MHC II and deep-sea anglerfishes also feature odd adaptions. Therefore, I recommend reading the method sections of those or similar papers to learn about their approaches first and carefully check for mentioned pitfalls, rather than devising your own?

ADD REPLY
0
Entering edit mode

yes, the RNA-Seq was generated in our laboratory. Not published. Yes, there is a reference genome of the specie. I'm following instructions, don't worry. I'm still new in programming and I just want to be sure all I'm doing is okay, since I don't have any help of other bioinformatics in our lab.

ADD REPLY
3
Entering edit mode

I'm still new in programming and I just want to be sure all I'm doing is okay, since I don't have any help of other bioinformatics in our lab.

Please heed the warning from @Matthias. Bioinformatics is not simply running programs and typing command lines. It is way too easy to end up with spurious results because programs ran and produced output.

ADD REPLY
0
Entering edit mode

Yeah, I know, thank you. Just please understand since I'm working with data not published I can't talk a lot about it.. I know what I have to do but I just need help with code

ADD REPLY
0
Entering edit mode

Is there a published or draft genome assembly available for the species? Wouldn't it make more sense to map the RNAseq data to the genome and look at which genes have transcriptional signal?

ADD REPLY
0
Entering edit mode

No, there isn't. I don't think I can give a lot of info about it too sorry

ADD REPLY
0
Entering edit mode

But you say here you have a reference genome:

Yes, there is a reference genome of the specie.

ADD REPLY
0
Entering edit mode

the data I'm working with is from a specie of fish.. it's RNA-Seq data.

How did you get to this set of translated protein sequences? Did you assemble the original data? Did you do any QC on assembled transcripts?

ADD REPLY
0
Entering edit mode

laboratory experiments, sequencing, data analysis with some tools, seqkit,...

ADD REPLY
0
Entering edit mode

data analysis with some tools, seqkit,...

If you don't provide complete/accurate information then you are not likely to get usable help.

ADD REPLY
0
Entering edit mode

sorry, preprocessing sequences with pRESTO (Immcantation tool)

ADD REPLY
0
Entering edit mode

If you are already working with pRESTO, why are you trying to come up with your own custom filtering instead of using the tool that is part of the suite to filter the productive ones?

ADD REPLY
0
Entering edit mode

I just checked the link.. I think the tool ParseDb.py will be useful to me later, because after filter these "productive" sequences I will need to use IgBLAST etc. I will save the link, thank you!

ADD REPLY
2
Entering edit mode
13 months ago
Michael 55k

There is backtranseq, also in EMBOSS. You can choose a codon model to deal with the ambiguities in the genetic code. However, that will not get you the original sequences back. If you need the original DNA sequence, you have to retrieve it from your original input file using the identifier.

If you want to do this as part of making an expression vector though, let the provider/company perform the codon optimization. Just select the correct expression system.

ADD COMMENT
0
Entering edit mode

Hello, how could I do this "If you need the original DNA sequence, you have to retrieve it from your original input file using the identifier". I was thinking of a way to retrieve the nucleotide sequences from the original fasta file based on the sequence names (SeqID) and comparing with the nonproductive fasta file (with amino acid seqs)... but I don't know how to put that into code

ADD REPLY
1
Entering edit mode

Yes, that is what Michael and I are saying - you have to use a list of keys and the sequence names/IDs.

look at how to use seqkit grep

ADD REPLY
0
Entering edit mode

Hello, I used:

seqkit seq nonprod_seq.fasta -n -o nonproductiveIDs.txt

to extract the IDs. And then, I used:

seqkit grep -f nonproductiveIDs.txt nucleotide.fasta -o nonproductivenucl.fasta

to extract the file with the nucleotide sequences, but I just get this message on the command line:

[INFO] 608703 patterns loaded from file

The number of sequences is correct, but the output file "nonproductivenucl.fasta" is empty, could you help me please? Thank you

ADD REPLY
0
Entering edit mode

I am guessing but based on this (https://bioinf.shenwei.me/seqkit/usage/#grep) there is no -o flag in the documentation (but I concede there are in the examples).

Try the > redirect example instead and see if that works. Did any sequences print to screen?

ADD REPLY
0
Entering edit mode

Mod note:

apologies, I edited the wrong comment. Please re-type the comment.

ADD REPLY
0
Entering edit mode

I tried without the -o parameter and I still got the same message as before and no sequences in the screen

ADD REPLY
0
Entering edit mode

You will need to share an exerpt of your fasta file and ID file. Most likely the IDs and the sequence file don't match for some reason, or extra work is needed with seqkit to pattern match so you're getting no hits.

ADD REPLY
0
Entering edit mode

I have a problem. The SeqID from the nucleotide.fasta doesn't match with the SeqID from the protein.fasta because in the protein.fasta the frame is added at the end. Example:

nucleotide.fasta

> GAACACGAAGGACGC|PRCONS=Primer_Read1_Rod_Cmu|SEQORIENT=F|CREGION=Inner_Scomax_Cmu|CONSCOUNT=19|DUPCOUNT=8

protein.fasta

>GAACACGAAGGACGC|PRCONS=Primer_Read1_Rod_Cmu|SEQORIENT=F|CREGION=Inner_Scomax_Cmu|CONSCOUNT=19|DUPCOUNT=8_frame=1

There are no blanks in the SeqID

ADD REPLY
1
Entering edit mode

You will need to look at seqkit's regex and other advanced matching tools in that case, since you can't use simple string matches.

  -r, --use-regexp             patterns are regular expression
ADD REPLY
2
Entering edit mode
13 months ago

BBTools has a tool for this:

translate6frames.sh in=proteins.faa out=nucleotides.fna aain aaout=f

The default behavior goes the other way, but sometimes this is useful too, mainly for experimentation.

ADD COMMENT
0
Entering edit mode

I will check it, thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6