I have fasta sequence of different length. But I want to make number of nucleotide in fasta sequences multiple of three by adding AA or -- at the end. Like if the sequence length is 149 then i want to make it 150 by adding -- or any nucleotides.
This looks like an XY Problem. Why do you want to do this multiple of 3 thing? The 3 part tells me if probably has to do with codon usage, but in no scenario can I imagine needing this functionality. Can you explain your end goal please?
Hello, if you can program in Java, you may use the SEDA API (https://github.com/sing-group/seda) in order to create this operation that you need. You can even use the new Java 9 shell to easily do this as I show in this post:
Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. Your reaction was moved but as you can see it's not optimal.
Adding an answer should only be used for providing a solution to the question asked.
If you are planning on using PAML then you'll need to pairwise align those sequences anyway. So the more cleaver thing to do is to "codon-aware" align the nucleotide sequence and strip off anything that is not aligned (== will automatically result in modulo3 'sequences')
Wouter literally just asked you to not add answers. Please be more careful. This should be a comment or comment reply. I've moved it to a comment on the top level post, but it ideally belongs elsewhere. Please read the posts under http://biostars.org/t/how-to to structure your posts better.
Beside the correct question from Ram why you want to do this, here is a python solution:
from Bio import SeqIO
original_file ="./original.fasta"
extended_file ="./extended.fasta"
with open(original_file) as original, open(extended_file, 'w') as extended:
records = SeqIO.parse(original_file, 'fasta')for record in records:
if len(record.seq)%3 == 1:
record.seq = record.seq + "--"elif len(record.seq)%3 == 2:
record.seq = record.seq + "-"
SeqIO.write(record, extended, 'fasta')
why are you still insisting in fiddling about with your input sequences? while the solution offered by finswimmer will work perfectly , several other people have pointed out to you that this is not the way you need to go!
Actually i am working on transposons. And i have to calculate insertion time for LTR pairs present in LTR elements. So basically these are the nucleotide sequences.
Why did we have to wait 20 hours and so many comments before you tell us what you are working on? Don't you think all of this thread is a big waste of time?
anyway the point remains: you should not add bases to get to modulo3 sequences, You first have to align them and subsequently remove unaligned parts (rather than adding bases!!).
Moreover have you looked into the literature on estimating transposon insertion time? There will be software around that specifically does this kind of analysis (though I can not think of one from the top of my head)
"You first have to align them and subsequently remove unaligned parts (rather than adding bases!!" I would be interested in doing this how to do this? Since i was trying to do this "If you're intending to run it through PAML to do dNdS " but again the multiple of 3 errors comes.
It can be done pretty easily with BioPython, as FinSwimmer has shown, particularly with the MutableSeq objects.
If you're intending to run it through PAML to do dNdS etc, then a far more common approach, rather than hacking at your sequences, is just to use a dedicated codon-aware aligner like the imaginatively named CodonAlign
EDIT:
For clarity from the comments we've all made, this should be your workflow:
I'd strongly suggest creating files with all pairwise comparison sequences you're interested in. If you have many sequences, this may rapidly become untenable though, so think very carefully about what you're trying to show/achieve.
Its an n choose k problem so, all pairwise comparisons of even just 100 sequences is 4950.
CRUCIALLY, use a codon-aware aligner as we've mentioned. Do not try to hack your sequences to a multiple of 3 by adding random characters as this will affect your results.
Santise your alignments if necessary. This you may need to do manually.
Can You describe the pipe line. I have thousand of sequences in pair wise alignment. I want to calculate dn/ds of each pairwise alignment in paml. If any script you can suggest, it would be very helpful for me.
To my knowledge (limited as it is in this instance), you can't do this with pairwise alignments, they will need to be multiple sequence alignments, else there isn't sufficient information for the dNdS rates to be meaningful. I could be totally wrong here though.
As for scripts, no I don't know I'm afraid. But CodonAlign, and tools like it, are self contained executables, so you shouldn't need much if any scripting in order to use them.
If I am wrong, then it will probably be as simple as concatenating all your sequence pairs in to files of their own, and then running CodonAlign and PAML for each file in a loop or parallel.
I stand corrected on the pairwise in that case (I've never done it myself), but this is absolutely the workflow, so we all agree there!
I'm curious though, if you're just using pairwise sequences, how can you decide which sequence is 'right' relative to the other? You'd have no other information about what the most common codon in that position is to know whether a particular sequence is drifting/being selected for or against?
Indeed, in a pairwise context you can't tell, but what you are referring to is already step 2 of this kind of procedure. step 1 is to calculate (estimate rather) the dS and dN for each pair. We're not looking for the most common codon or such yet, we're simply looking at the divergence on nucleotide level between 2 genes that once shared a common ancestor (origin).
In step 2 we can start looking at the ration of dS over dN which (coming to think of it) we can actually also do in a pairwise manner. The thing we are looking for is to see whether there are more non-synonymous or more synonymous substitutions happened dN/dS > 1 is positive (driving) selection , = 1 is neutral, < 1 is purifying selection . There should in "theory" not be any bias between codons used , we assume that mutations are happening at a constant rate (kuch ;) ) and thus we are only interested to see if the mutations we observe are more dN or are more dS
This looks like an XY Problem. Why do you want to do this multiple of 3 thing? The
3
part tells me if probably has to do with codon usage, but in no scenario can I imagine needing this functionality. Can you explain your end goal please?Hello, if you can program in Java, you may use the SEDA API (https://github.com/sing-group/seda) in order to create this operation that you need. You can even use the new Java 9 shell to easily do this as I show in this post:
Regards.
Thanks for you reply. I downloaded this tools. But the video is not much clear. which option i should use to make multiple of three nucleotides.
Actually I have to calculate dn/ds using paml. And it does not take any sequences which is not multiplle of three.
Then I strongly advise against adding random nucleotides.
You should have said so in the original question to avoid half the comments you received here :-).
Please use
ADD COMMENT
orADD REPLY
to answer to previous reactions, as such this thread remains logically structured and easy to follow. Your reaction was moved but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.If you are planning on using PAML then you'll need to pairwise align those sequences anyway. So the more cleaver thing to do is to "codon-aware" align the nucleotide sequence and strip off anything that is not aligned (== will automatically result in modulo3 'sequences')
Ok.. I got it..is there anything to do this in paml. I am aligning those sequences using maft. How should i trim those sequences.
Start by searching the forum for MAFTT posts and if you don't find an appropriate post, ask a new question.
ah, no you will have to do this prior to providing them to PAML.
In my lab we have custom script to do this (not helping much :-) ), but here is a tool that does it as well : trimAl and there will be others around
make sure that you do codon-align in mafft (this is crucial for dN/dS estimations!!)
Specifically
"AA"
or any nucleotide?Thank You very much..it worked..
Wouter literally just asked you to not add answers. Please be more careful. This should be a comment or comment reply. I've moved it to a comment on the top level post, but it ideally belongs elsewhere. Please read the posts under http://biostars.org/t/how-to to structure your posts better.