Guys, I've got a .fasta file , with the usual scheme
>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA
I wrote a small script to get the GC content of each contig sequence, and so on, to convert the "cov." value expressed into the nucleotidic coverage. Here's the code
#! /bin/bash
##
## This script needs to do few things:
##
## First of all, conversion of cov with nucleotidic one
# Parameters options
display_help() {
echo "Usage: $0 [file.fasta] [option..]" >&2
echo
echo " -k, --kmer kmer last value"
echo " -l, --lenght read lenght"
echo
echo " example: $0 file.fasta -k 77 -l 150"
echo
exit 1
}
case "$1" in
-h | --help )
display_help
exit 0
;;
esac
case "$2" in
-k | --kmer )
KMER=($3)
;;
esac
case "$4" in
-l | --lenght)
READLENGHT=($5)
;;
esac
# Check if the format is correct
if [[ $1 != *.fasta ]]; then
echo "The input file $1 it's not in fasta format!"
echo ""
echo "Usage: ./nomescript.sh NO_HIT.fasta"
exit 1
fi
# calculate the coefficient used to convert coverage. Final is Y
# calculate GC content of each sequence
K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)
while read odd; do
echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: "
echo "scale=4;${odd##*_} / $Y" | bc
read even
echo "${even##}" &&
ACOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "A")
GCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "G")
CCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "C")
TCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "T")
TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
GCCONT=$(($GCOUNT+$CCOUNT))
printf "GC_CONT: "
echo "scale=2;$GCCONT / $TOTALBASES *100" | bc
done < "$1"
It actually does the work, stamping a file like:
>NODE
nucleotide_cov:
ACGACGAATCATACGACATAA
GC_CONT:
>NODE
nucleotide_cov
GCAGCATACATCAGATACGATAC
GC_CONT
My problem is that is extremely slow when it works against .fasta > 200Mb.
I'm looking for a way to increase its speed, or doing the same work with other kind of script. Maybe.. python?
A: How To Calculate A Region‘ Gc Content In Reference Genome??