I'm using the EMBOSS transeq tool to translate the first ORF of 26,000 sequences. Since the tool is pretty slow (takes ~1 minute per short sequence) and to make the process parallel, I split the fasta file into smaller files (down to one sequence per file) and then run transeq on each file. However, the sequences are renamed in the following format: EMBOSS_001_1.
How could I prevent transeq from renaming these sequences?
If nothing else works, I'll create a script that manages the translation of each individual sequence and makes sure to rename it after it has been translated.
From your code I am guessing that the issue is that you don't want the reading frame suffix (e.g. "_1") to be added to the sequence identifier by EMBOSS transeq?
Assuming that that is the case, it might be worth asking if there is a way to suppress the suffix addition on the EMBOSS mailing lists (see http://emboss.open-bio.org/html/use/ch03s04.html), so the developers can have a look.
Sorry, that is not what I am looking for. My sequences have names and those are completely removed and replaced by EMBOSS_001_1. I would like to retain the original names.
So it is the FASTA sequence ID that is being renamed to EMBOSS_001_1? I have not seen that happen before. For example, my sequence:
>seq1
becomes
>seq1_1
What do your IDs look like?
My sequences were named after the info from annotating them with Maker2. Here are some examples:
That's interesting; on my machine, using those IDs causes transeq to hang with a high CPU load. Removing the pipe symbol from the ID fixes that issue. So I guess the pipes are the problem and maybe you have a newer version of EMBOSS which deals with this by renaming.
The presence of the '|' triggers identifier parsing to extract the database name, accession, entry name, etc. In the EMBOSS 6.6.0 release there was a bug that made this parsing behave strangely for an identifier with two pipe-separated fields (two fields separated by a colon are fine, as are three or more pipe-separated fields). This was fixed after release, and the fix should be available in the post-release patches (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/).
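If you want to check whether your headers fall into the affected case, counting the pipe-separated fields in each ID is a quick test. This is only a rough sketch: the function name and the example IDs are made up for illustration, and it mimics the "first non-whitespace token" idea rather than EMBOSS's actual parser.

```python
def two_field_pipe_ids(fasta_lines):
    """Return IDs whose first header token has exactly two '|'-separated fields.

    Hypothetical helper, not part of EMBOSS. The ID is taken to be the
    first whitespace-delimited token after '>'.
    """
    flagged = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header, nothing to inspect
        seq_id = tokens[0]
        if len(seq_id.split("|")) == 2:
            flagged.append(seq_id)
    return flagged


if __name__ == "__main__":
    # Made-up headers, purely for illustration.
    example = [">geneA|mRNA-1 some description", "ATGAAATGA", ">geneB", ">db|acc|name"]
    print(two_field_pipe_ids(example))
```

Any ID it reports has exactly one pipe, which is the pattern described above as triggering the 6.6.0 bug.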
In that case, since you are using fasta formatted input sequences and want to preserve the identifiers as provided in the headers, you will need to use the 'pearson' format explicitly rather than using 'fasta' or format auto-detection. The 'pearson' format treats the identifier as the first non-whitespace token on the header line, and does not attempt to parse structured identifiers. So add '-sformat pearson' to your command line, and your output should have the expected identifiers with the addition of the '_1' frame suffix.
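Since you are already scripting one transeq run per file, here is a minimal sketch of how the call could look from Python with the extra qualifier. The function names and file names are placeholders I made up, and it assumes transeq is on your PATH; the only part taken from the advice above is '-sformat pearson'.

```python
import subprocess


def transeq_command(in_fasta, out_fasta):
    """Build a transeq command line that reads the input as 'pearson' format.

    Hypothetical helper; in_fasta and out_fasta are placeholder paths.
    """
    return [
        "transeq",
        "-sequence", in_fasta,
        "-outseq", out_fasta,
        "-sformat", "pearson",  # keep header IDs as-is, no structured-ID parsing
    ]


def run_transeq(in_fasta, out_fasta):
    """Run transeq and raise CalledProcessError if it exits with an error."""
    subprocess.run(transeq_command(in_fasta, out_fasta), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so this works without EMBOSS installed.
    print(" ".join(transeq_command("input.fasta", "output.pep")))
```

The command is built as a list rather than a single string so that IDs or paths with odd characters never pass through a shell.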
Hi hpmcwill,
Could you please post this as an answer so that I can accept it? Thanks to you and Neilfws for helping find the bug. It both makes the process 1000 times faster (I was wondering how it could take one minute to translate a short sequence...) and retains the sequence name.
Consider it done :-)