I want to strip predicted nucleotide protein coding sequences of the last 1 or 2 bases which represent incomplete codon.
The input is multifasta consisting of nucleotide protein-coding sequences in reading frame 1. Some of them are trimmed at various positions at the 3'end resulting in 1-2 3'end nucleotides that does not form a codon (these are ignored when I translate the nucleotide sequences to protein sequences; the extra nucleotides are then source of problems in my further workflow).
I tried to extract the longest ORF from the nucleotide sequences to overcome this, however, it does not work in all cases (in some of the input sequences there are in-frame stop codons; other drawback was that using e.g. hmmer2go I could not restrict the search for ORFs only in frame +1 and thus for some of the fastas, alternative ORFs in reverse sense, +2 or +3 reading frames were retrieved as the longest ORFs)
I'd like to ideally use tools such as awk
(and its parameter %
- modulo) to trim down the sequences from 3'end to the longest sequence divisible by 3 - however, I did not manage to make anything to work. Any other solutions are, however, most welcomed as well.
This is perfect. I will toggle this answer as accepted to provide closure to this thread.