Do you know a way to remove the stop codons from an alignment and replace the spaces with hyphens?
5.6 years ago
imda ▴ 10

genome alignment stop codons sed • 4.7k views

Since you have tagged this with sed, you presumably have some idea of how to do this. For example, sed 's/-/ /g' should replace hyphens with spaces (swap the two terms if you want the opposite).
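
One caveat with a plain substitution over a whole FASTA file is that it also rewrites matching characters in the '>' header lines. A minimal Python sketch that limits the swap to sequence lines might look like this; the function name swap_in_sequences is made up for illustration and the default direction (spaces to hyphens) is an assumption about what is wanted.

```python
def swap_in_sequences(fasta_text: str, old: str = " ", new: str = "-") -> str:
    """Replace `old` with `new` on sequence lines only; '>' headers are untouched."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)                    # header line: keep as is
        else:
            out.append(line.replace(old, new))  # sequence line: swap characters
    return "\n".join(out) + "\n"


example = ">seq 1\natg   gca\n"
print(swap_in_sequences(example), end="")
```

Calling it with old="-" and new=" " gives the opposite direction.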

What do you mean by remove stop codons? Are they represented by * or as real nucleotide triplets?


Hi, the thing is that I have no idea how to use sed for this purpose, but from what I have read, sed may be a good tool for the task. I have a FASTA nucleotide alignment and I want to remove all the stop codons in all my alignments and/or replace them with hyphens.

I have just found this example: sed 's/TAA//g' < inputfilename > outputfilename

However, this would be dangerous because the sed command would remove EVERY occurrence of 'TAA', regardless of the reading frame.

I also found this answer in another post:

The best approach is to search each sequence in a sliding window of 3 bp and look for a string match for 'TAA', 'TGA', or 'TAG'. Unless you already have trimmed ORFs, you can simply trim off the last 3bp of each sequence.

Do you have any idea how to do this in sed?

Thank you!
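
The codon-wise search quoted above is awkward in sed but could be sketched in Python. The sketch below is not a true sliding window: it steps through non-overlapping triplets from the first column, which assumes the alignment is a codon alignment in frame 0 and uses the standard genetic code. Both are assumptions, and the function names are hypothetical. Sequence lines are joined before scanning because a wrapped FASTA line can break in the middle of a codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only (assumption)


def mask_stops(seq: str, replacement: str = "---") -> str:
    """Replace in-frame (frame 0) stop codons with `replacement`."""
    codons = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        # A trailing partial codon is never a stop and is kept unchanged.
        codons.append(replacement if codon.upper() in STOP_CODONS else codon)
    return "".join(codons)


def read_fasta(text: str):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mask_fasta(text: str) -> str:
    """Mask in-frame stops in every record; each sequence is written on one line."""
    return "".join(f"{h}\n{mask_stops(s)}\n" for h, s in read_fasta(text))


print(mask_stops("atgtaagcataa"))  # atg---gca---
print(mask_stops("ataagcatg"))     # ataagcatg ('taa' is out of frame, so it stays)
```

Unlike sed 's/TAA//g', this keeps every sequence the same length, so the alignment columns stay in register.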


Hello imda,

You should show us some examples of your alignments. This will help us to help you. Do you know anything about the reading frame? Do the sequences all start with a start codon and end with a stop codon?

fin swimmer


Hello finswimmer, here is an example alignment. Not all of the alignments have start and stop codons.

>cag25822
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------atggca---gag---------------------------------------------------gtgaaa---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gttcatggt---------------------atttttgcagggccatttaataaa---agagtagaatta------------------gccttgaaactgaagggggtagaatatgaatatattgaagaagataggtcg---aataagagtgctgaacttgtaaagtataatcctatatat------aaacaagttcca------------------------------------gtgcttgtgcat------aatggaaagccaatatgtgagtcactcataattcttgaatatattgatgagacttgggaaagt---gct---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gctcctctcttgcctaatgatccatatcagaga---tccatc---------gctcgtttctgtgctaacttaattgat------------gataag---------------------------ttaatgggcgca------------atgtacaaagtttgttatggcaaa---ggagaagaaaaggagaaaggc---cttgatgaagtttctgaggtcctaaaatatcttgacaatgaa---ctt------caagac---aag------aaa------ttctttggagga---gacaacattggatttctcgacatcgttgccagttacatagctctctggtttggagcaattcaagaagcaatagggatg---ga
actattg---accaaa---caaaagtttaccaagttgagcaaatggattgatgagttcttgtgctgtggaatagtcatggaacat---ctccctactaga---gaatca------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------tta---------------------------------------gtgcctctatacata------------------------------------------------------------------------gctcaatttgaagca---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gca-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Please use the formatting bar (especially the code option) to present your post better.

I don't know what the original data looks like so I don't want to make the change myself.

Just to be clear, is the example above an aligned FASTA sequence like the one below?

>cag25822
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------atggca---gag---------------------------------------------------gtgaaa---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gttcatggt---------------------atttttgcagggccatttaataaa---agagtagaatta------------------gccttgaaactgaagggggtagaatatgaatatattgaagaagataggtcg---aataagagtgctgaacttgtaaagtataatcctatatat------aaacaagttcca------------------------------------gtgcttgtgcat------aatggaaagccaatatgtgagtcactcataattcttgaatatattgatgagacttgggaaagt---gct---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gctcctctcttgcctaatgatccatatcagaga---tccatc---------gctcgtttctgtgctaacttaattgat------------gataag---------------------------ttaatgggcgca------------atgtacaaagtttgttatggcaaa---ggagaagaaaaggagaaaggc---cttgatgaagtttctgaggtcctaaaatatcttgacaatgaa---ctt------caagac---aag------aaa------ttctttggagga---gacaacattggatttctcgacatcgttgccagttacatagctctctggtttggagcaattcaagaagcaatagggatg---ga
actattg---accaaa---caaaagtttaccaagttgagcaaatggattgatgagttcttgtgctgtggaatagtcatggaacat---ctccctactaga---gaatca------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------tta---------------------------------------gtgcctctatacata------------------------------------------------------------------------gctcaatttgaagca---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------gca-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Yes, very sorry for my bad answer, but yes, that's the ID and the alignment.


I posted just one example, but I have multi-FASTA alignment files.


Perhaps I am missing something but how do we know what the reading frame is in an aligned (multi-)fasta file like that and thus decide which triplet to change?

Can you clarify what exactly you are trying to do with this file?
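
The file itself does not record a reading frame, but one rough heuristic can be checked. If the alignment was built codon by codon in frame 0 (an assumption), every triplet should be either all gaps or contain no gaps, and the aligned length should be a multiple of three. The Python sketch below tests that; the function names are hypothetical, and passing the check does not prove the frame is correct.

```python
def codon_gap_problems(seq: str, gap: str = "-"):
    """Return 0-based start positions of triplets that mix gaps and bases."""
    problems = []
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon here
    for i in range(0, usable, 3):
        n_gaps = seq[i:i + 3].count(gap)
        if 0 < n_gaps < 3:
            problems.append(i)
    return problems


def looks_codon_aligned(seq: str, gap: str = "-") -> bool:
    """True if the length is a multiple of 3 and gaps only occur as whole triplets."""
    return len(seq) % 3 == 0 and not codon_gap_problems(seq, gap)


print(looks_codon_aligned("---atggca---gag"))  # True
print(codon_gap_problems("-atggca--"))         # [0, 6]
```

Wrapped sequence lines need to be joined into one string per record before running the check.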


This alignment corresponds to proteins that were aligned to create gene families. I then used these protein alignments to create CDS alignments. With these CDS alignments, I am trying to run a program called HyPhy; in particular, I want to detect positive selection. The problem is that when I try to run the program, it tells me that my alignments contain stop codons and that it cannot run with them. So, I want to remove all the stop codons from the alignments, and that's all.


Did you give the software that thing with all the dashes in it? Are you 100% sure that's the kind of input it's expecting? I assume you googled and found this: https://github.com/veg/hyphy/issues/279 Did that not work for you?


Yes, I have already read that. It's a nightmare trying to generate the batch files for HyPhy.


They are real nucleotide triplets

5.6 years ago
h.mon 35k

See the thread "How To Do Alignment, Stop Codon Removal And Dn/Ds Calulation In One Go?" for two suggestions.

The first is to use PAL2NAL, the second, ReplaceStopWithRefCodonGaps.pl.


Thank you! I am trying PAL2NAL...

4.9 years ago

Hello,

The MACSE_V2 toolkit provides several tools for working with nucleotide coding sequences. It includes a subprogram designed specifically for this task (exportAlignment), which lets you specify the codon (three letters of your choice) that will replace the stop codons. You can even provide two different codons: one for stops appearing within the sequence (unexpected except in pseudogenes) and one for stop codons at the end of the sequences. While there are several options (e.g. to specify the output file name and the genetic code to use), the basic usage is quite straightforward:

java -jar macse.jar -prog exportAlignment -align align.fasta -codonForFinalStop --- -codonForInternalStop NNN
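
To make the distinction between final and internal stops concrete, here is a plain Python illustration of the idea. It is not MACSE's implementation. It assumes a frame-0 codon alignment and the standard genetic code, treats the last triplet that is not entirely gaps as the "final" codon, and uses a hypothetical function name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only (assumption)


def replace_stops(seq: str, final: str = "---", internal: str = "NNN",
                  gap: str = "-") -> str:
    """Replace a terminal stop with `final` and any other in-frame stop with `internal`."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Index of the last codon that is not made up entirely of gap characters.
    last = max((i for i, c in enumerate(codons) if c.strip(gap)), default=-1)
    out = []
    for i, codon in enumerate(codons):
        if codon.upper() in STOP_CODONS:
            out.append(final if i == last else internal)
        else:
            out.append(codon)
    return "".join(out)


print(replace_stops("atgtgagcataa---"))  # atgNNNgca------
```

Here the internal tga becomes NNN, while the taa that ends the sequence (before the trailing gap triplet) becomes ---.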