How to trim nucleotide sequences to length divisible by 3
1
0
Entering edit mode
6.9 years ago
al-ash ▴ 210

I want to strip predicted nucleotide protein coding sequences of the last 1 or 2 bases which represent incomplete codon.

The input is multifasta consisting of nucleotide protein-coding sequences in reading frame 1. Some of them are trimmed at various positions at the 3'end resulting in 1-2 3'end nucleotides that does not form a codon (these are ignored when I translate the nucleotide sequences to protein sequences; the extra nucleotides are then source of problems in my further workflow).

I tried to extract the longest ORF from the nucleotide sequences to overcome this, however, it does not work in all cases (in some of the input sequences there are in-frame stop codons; other drawback was that using e.g. hmmer2go I could not restrict the search for ORFs only in frame +1 and thus for some of the fastas, alternative ORFs in reverse sense, +2 or +3 reading frames were retrieved as the longest ORFs)

I'd like to ideally use tools such as awk (and its parameter % - modulo) to trim down the sequences from 3'end to the longest sequence divisible by 3 - however, I did not manage to make anything to work. Any other solutions are, however, most welcomed as well.

awk incomplete codon • 3.2k views
ADD COMMENT
4
Entering edit mode
6.9 years ago
al-ash ▴ 210

I'm not sure what is the correct protocol for answering my own question.

In the end, my input were multiple fasta files each including one sequence. I used bioawk and the awk modulo function (%) to create 3 cathegories of sequences: with sequence length divisible by 3, divisible by 3 with remainder 1 and with remainder 2. Sequences divisible with remainder were trimmed from the end by one respectively two characters.

#!/bin/bash
workdir=mydir

for file in ${workdir}/nt_renamed/* 
do filebasename=$(basename "$file" )
# if modulo 3 of the sequence length is 0, print the fasta header and the sequence and output it to file 
bioawk -c fastx 'length($seq) % 3 == 0  {print ">"$name; print $seq}' $file  >  ${workdir}/nt_renamed_bioawked/${filebasename}
# if modulo 3 of the sequence length is 1 or 2, print the fasta header and the sequence, trim lines which does not start with ">" from its end by one resp. 2 characters
bioawk -c fastx 'length($seq) % 3 == 1  {print ">"$name; print $seq}' $file | sed '/^>/! s/.$//'  >> ${workdir}/nt_renamed_bioawked/${filebasename} 
bioawk -c fastx 'length($seq) % 3 == 2  {print ">"$name; print $seq}' $file | sed '/^>/! s/..$//'  >> ${workdir}/nt_renamed_bioawked/${filebasename} 
done
ADD COMMENT
0
Entering edit mode

This is perfect. I will toggle this answer as accepted to provide closure to this thread.

ADD REPLY

Login before adding your answer.

Traffic: 2434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6