Do you know a way to remove the stop codons from an alignment and replace the spaces with hyphens?
See the thread How To Do Alignment, Stop Codon Removal And Dn/Ds Calulation In One Go? for two suggestions.
The first is to use PAL2NAL, the second, ReplaceStopWithRefCodonGaps.pl.
Hello,
the MACSE_V2 toolkit provides several tools to deal with nucleotide coding sequences. It includes a subprogram specifically designed for this task (exportAlignment), which lets you specify the codon (three letters of your choice) that will replace the stop codons. You can even provide two different codons: one for stops appearing within the sequence (unexpected except in pseudogenes) and one for stop codons appearing at the end of the sequences. While there are several options (e.g. to specify the output file name and the genetic code to use), the basic usage is quite straightforward:
java -jar macse.jar -prog exportAlignment -align align.fasta -codonForFinalStop --- -codonForInternalStop NNN
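If you have many alignment files, the same call can be generated once per file. Below is a minimal Python sketch that only assembles the command line shown above. The helper name build_macse_command and the *.fasta glob are illustrative assumptions, and it uses no MACSE options other than the ones in that command, so output naming is left to the MACSE defaults.

```python
from pathlib import Path


def build_macse_command(alignment, jar="macse.jar",
                        final_stop="---", internal_stop="NNN"):
    """Return the exportAlignment call as an argument list.

    Hypothetical helper: it mirrors the flags from the command above
    and adds nothing else.
    """
    return [
        "java", "-jar", str(jar),
        "-prog", "exportAlignment",
        "-align", str(alignment),
        "-codonForFinalStop", final_stop,
        "-codonForInternalStop", internal_stop,
    ]


if __name__ == "__main__":
    # Print one command per alignment in the current directory.
    # To actually run them, pass each list to subprocess.run(cmd, check=True).
    for fasta in sorted(Path(".").glob("*.fasta")):
        print(" ".join(build_macse_command(fasta)))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.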
Since you have tagged this with sed, you have some idea of how to do this. E.g. sed 's/-/ /g' should replace hyphens with spaces (switch the terms if you want the opposite).
What do you mean by remove stop codons? Are they represented by * or as real nucleotide triplets?
Hi, the thing is that I do not have any idea how to use sed for this purpose. But I was reading, and maybe sed is a good tool for this task. I have a FASTA nucleotide alignment, and I want to remove all the stop codons in all my alignments and/or replace them with hyphens.
I just have found this example:
sed 's/TAA//g' < inputfilename > outputfilename
However, this would be dangerous because the sed command would remove every 'TAA' globally, regardless of the reading frame.
I also have found this answer from another post:
The best approach is to search each sequence in a sliding window of 3 bp for a string match to 'TAA', 'TGA' or 'TAG'. Unless you already have trimmed ORFs, you can simply trim off the last 3bp of each sequence.
Do you have any idea how to do this in sed?
Thank you!
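A frame-aware replacement is awkward in sed, so here is a minimal Python sketch of the codon-window idea quoted above. It assumes the alignment is codon-aligned (the reading frame starts at column 1 and gaps come in multiples of three) and that the standard genetic code applies; the function names read_fasta and mask_stop_codons are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mask_stop_codons(aligned_seq, replacement="---"):
    """Replace in-frame stop codons, stepping through columns 3 at a time."""
    codons = []
    for i in range(0, len(aligned_seq), 3):
        codon = aligned_seq[i:i + 3]
        if codon.upper() in STOP_CODONS:
            codon = replacement
        codons.append(codon)
    return "".join(codons)


if __name__ == "__main__":
    example = [
        ">seq1",
        "ATGAAATAAGGC---TGA",
        ">seq2",
        "ATGATAAGC------TAG",  # the TAA at columns 5-7 is out of frame
    ]
    for header, seq in read_fasta(example):
        print(header)
        print(mask_stop_codons(seq))
```

Unlike a global sed substitution, the out-of-frame TAA inside ATA-AGC in seq2 is left alone, and the alignment length is preserved because each stop is replaced by three characters rather than deleted.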
Hello imda,
you should show us some examples of your alignments. This will help us to help you. Do you know anything about the reading frame? Do all the sequences start with a start codon and end with a stop codon?
fin swimmer
Hello finswimmer, I am leaving you an alignment example. Not all the alignments have start and stop codons.
Please use the formatting bar (especially the code option) to present your post better. I don't know what the original data looks like so I don't want to make the change myself.
Just to be clear, the example above is an aligned fasta format sequence like below?
Yes, very sorry for my bad answer. But yes, that's the ID and the alignment.
I posted just one example, but I have multi-FASTA alignment files.
Perhaps I am missing something but how do we know what the reading frame is in an aligned (multi-)fasta file like that and thus decide which triplet to change?
Can you clarify what exactly you are trying to do with this file?
This alignment corresponds to proteins that were aligned to create gene families. I then used these protein alignments to create CDS alignments. With these CDS alignments, I am trying to run a program called HyPhy; in particular, I want to detect positive selection. The problem is that when I try to run the program, it tells me that my alignments have stop codons and that it cannot run with those stop codons. So I want to remove all the stop codons in the alignment, and that's all.
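Because the CDS alignments were built from protein alignments, the codon columns should line up in frame, which makes it possible to check where the reported stop codons actually sit before changing anything. The following is a rough Python sketch under that assumption (frame starting at column 1, standard genetic code); find_stop_codons is a hypothetical helper, not part of any tool mentioned here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_stop_codons(aligned_seq):
    """List in-frame stops as (first_column, codon, kind) tuples.

    Columns are 1-based. kind is "final" when the stop is the last
    codon that is not entirely gaps, otherwise "internal".
    """
    codons = [aligned_seq[i:i + 3].upper()
              for i in range(0, len(aligned_seq), 3)]
    # Index of the last codon containing at least one non-gap character.
    last = max((i for i, c in enumerate(codons) if c.strip("-")), default=-1)
    hits = []
    for i, codon in enumerate(codons):
        if codon in STOP_CODONS:
            kind = "final" if i == last else "internal"
            hits.append((i * 3 + 1, codon, kind))
    return hits


if __name__ == "__main__":
    row = "ATGTAAGGC---TGA---"
    if len(row) % 3 != 0:
        print("warning: length is not a multiple of 3, frame may be off")
    for column, codon, kind in find_stop_codons(row):
        print(column, codon, kind)
```

A final stop is usually just the natural end of the gene, whereas an internal one may point to a frameshift in the CDS alignment or a pseudogene, so it can be worth looking at those sequences before masking them.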
Did you give the software that thing with all the dashes in it? Are you 100% sure that's the kind of input it's expecting? I assume you googled and found this: https://github.com/veg/hyphy/issues/279 Did that not work for you?
Yes, I have already read that. It's a nightmare trying to generate the batch files for HyPhy.
They are real nucleotide triplets