9.3 years ago
seta
Hi everybody,
I used TransDecoder to translate the assembled transcriptome, and there are asterisk characters (*) in the translated sequences indicating stop codons. I plan to run InterProScan on this assembly, and the * causes an error. Could you please let me know how I can remove these characters from the FASTA file? Is removing them the right approach, or do they have to be replaced with a stop codon, and if so, which one? Thanks for any help.
At first I thought you were trolling the question-poster, but it turns out that sed (at least as implemented in Cygwin) will interpret '*' as a literal asterisk. However, it might be safer to do
sed -i 's/\*//g' filename.fasta
just to make it crystal clear to the interpreter to treat '*' as a literal '*'.
Indeed, sed can be confusing if one doesn't escape things. Compare
echo "fooo*{1}" | sed "s/o*//g"
with
echo "fooo*{1}" | sed "s/o*{1}//g"
and
echo "fooo*{1}" | sed "s/*{1}//g"
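For anyone who wants to see the difference without typing the commands, here is a small sketch that runs the three substitutions on the same input and prints what is left. It assumes a sed that uses basic regular expressions by default (GNU or BSD sed without -E or -r); the variable names are only for illustration.

```shell
# Same input for all three substitutions.
input='fooo*{1}'

# o* means "zero or more o", so only the run of o's is removed.
a=$(printf '%s\n' "$input" | sed 's/o*//g')

# In basic regex {1} is literal text, so this removes "{1}" preceded by zero or more o's.
b=$(printf '%s\n' "$input" | sed 's/o*{1}//g')

# A * at the very start of the pattern is a literal asterisk, so "*{1}" is removed.
c=$(printf '%s\n' "$input" | sed 's/*{1}//g')

printf '%s\n' "$a" "$b" "$c"
```

With a basic-regex sed this should print f*{1}, fooo* and fooo, one per line.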
This is an old post, but for the benefit of future googlers: a
sed 's/*//g'
solution is absolutely safe in terms of escaping (I would not do it with the -i option though), so there is no reason to panic.
tr -d '*'
would be more elegant though. BUT: nothing can replace a format-aware utility, because in the general case stops can appear not only at the end of a protein sequence but also in the middle (which is not expected for TransDecoder though), and asterisks are allowed to appear in headers. The gold standard is EMBOSS's transeq, which has the -trim and -clean options to remove trailing stops or to replace every stop with X, respectively.
man tr http://linux.die.net/man/1/tr
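Since asterisks can legitimately appear in FASTA headers, a middle ground between a blind sed or tr and a full format-aware tool is to touch only the sequence lines. The sketch below is one way to do that with awk; the function name strip_stops and the file names in the usage line are made up for illustration and do not come from TransDecoder or InterProScan.

```shell
# Remove '*' from sequence lines only; header lines (starting with '>') pass through untouched.
# Reads the files given as arguments, or standard input if there are none.
strip_stops() {
  awk '/^>/ { print; next }          # header line: print unchanged
       { gsub(/\*/, ""); print }     # sequence line: delete every asterisk
  ' "$@"
}

# Hypothetical usage, writing to a new file rather than editing in place:
# strip_stops translated.pep > translated.nostop.pep
```

Note that this deletes internal stops as well as terminal ones, which silently joins the residues on either side, so it may be worth checking whether any sequence has a * in the middle before running it.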