Hello everybody,
I am trying to calculate per-sequence GC3% of the cds I obtained from my RNA-seq assemblies .. I thought it could be a quick & trivial task, but it does not appear to be so straight-forward!
One way of doing that could be to extract every codon position from my cds multifasta files and then calculate GC% for each sequence. I tried to extract the 3rd codon positions for my cds .fasta (using the one-liner below), but it actually takes ages!
while read line; do if echo $line | grep -v ">"; then echo $line | sed "s/(..)./\1/g"; else echo $line > 3rd_postion.fasta; fi; done < cds.fasta
Any suggestion?
Thanks!
Please use dedicated tools for 3/6 frame translation such as ORF finder: https://www.ncbi.nlm.nih.gov/orffinder/ or CLI tools as mentioned here: Best command line ORF finder . If you have multifasta file and for each sequence, if you want calculate GC%, you can use tools such seqkit. fx2tab function output per sequence GC% in tabular format.
Please post example data and expected output from next time. It would help addressing the issue faster @ forni.giobbe
A blast from the past but codonw calculates per sequence GC3 and many other stats directly from the CDS fasta file http://codonw.sourceforge.net/#Downloading%20and%20Installation .
Not really an answer but the reason your command is slow is that it launches and terminates both
grep
andsed
for every single line of the .fasta file. Insed
you can add a regex to switch on the replacement, so:sed '/^>/!s/\(..\)./\1/g' cds.fasta
Where
/^>/!
means match all lines beginning with>
but then invert the match (with!
acting likegrep -v
).This only starts a single instance of
sed
and will be much faster. Still not sure it will get you the right answer for GC3%, though, especially if the FASTA file is line-wrapped at a length not a multiple of 3.