Question

Faster way to get GC content and Coverage from .fasta

0

Entering edit mode

7.2 years ago

Shred ★ 1.6k

Guys, I've got a .fasta file , with the usual scheme

>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA

I wrote a small script to get the GC content of each contig sequence, and so on, to convert the "cov." value expressed into the nucleotidic coverage. Here's the code

    #! /bin/bash

## 
## This script needs to do few things: 
##
## First of all, conversion of cov with nucleotidic one

# Parameters options

display_help() {
    echo "Usage: $0 [file.fasta] [option..]" >&2
    echo
    echo "  -k, --kmer  kmer last value"
    echo "  -l, --lenght    read lenght"
    echo 
    echo "  example: $0 file.fasta -k 77 -l 150"
    echo 
    exit 1
}



case "$1" in 
    -h | --help )
        display_help
        exit 0
        ;;
esac 

case "$2" in
    -k | --kmer )
        KMER=($3)
        ;;
esac 

case "$4" in            
    -l | --lenght)
        READLENGHT=($5)
        ;;
esac 



# Check if the format is correct
if [[ $1 != *.fasta ]]; then
    echo "The input file $1 it's not in fasta format!"
    echo ""    
    echo "Usage: ./nomescript.sh NO_HIT.fasta"
  exit 1
fi



# calculate the coefficient used to convert coverage. Final is Y 
# calculate GC content of each sequence 

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
    echo "scale=4;${odd##*_} / $Y" | bc 
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")  
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: " 
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc  
done < "$1"

It actually does the work, stamping a file like:

>NODE 
nucleotide_cov:
ACGACGAATCATACGACATAA
GC_CONT: 
>NODE
nucleotide_cov
GCAGCATACATCAGATACGATAC
GC_CONT

My problem is that is extremely slow when it works against .fasta > 200Mb.

I'm looking for a way to increase its speed, or doing the same work with other kind of script. Maybe.. python?

sequencing next-gen genome bash fasta • 18k views

ADD COMMENT • link 3.3 years ago by Shred ★ 1.6k

0

Entering edit mode

A: How To Calculate A Region‘ Gc Content In Reference Genome？？

ADD REPLY • link 7.2 years ago by Alex Reynolds 36k

4

Entering edit mode

7.2 years ago

cpad0112 21k

output (with seqkit):

$ seqkit fx2tab  --name  --gc test.fa 
NODE XX _ lenght XX_cov.xx.xx           48.28
NODE YY_lenght YY_cov.YY.YY         44.00

output with infoseq from emboss: (there is a space in fasta header. Hence emboss program is output is different. If there are no spaces, then name would print entire header). If your sequence has Ns, emboss infoseq is not useful in calculating %GC.

$ infoseq test.fa  -only -desc -name -length -pgc
Display basic information about sequences
Name           Length %GC    Description 
NODE           29     48.28  XX _ lenght XX_cov.xx.xx
NODE           25     44.00  YY_lenght YY_cov.YY.YY

input:

$ cat test.fa 
>NODE XX _ lenght XX_cov.xx.xx
ACGACGACGACACTCTACAGATACGACAT
>NODE YY_lenght YY_cov.YY.YY
CGACGACATCAGCATACAGATACTA

ADD COMMENT • link 7.2 years ago by cpad0112 21k

0

Entering edit mode

Man, please explain me the reason why you wrote this answer, except to promote your sample code.

ADD REPLY • link 7.2 years ago by Shred ★ 1.6k

1

Entering edit mode

When you ask a question it can have multiple answers and all of them can be correct. There is no one (or only) way of doing things in bioinformatics. This just happens to be another option for someone else looking for a similar need in future.

ADD REPLY • link 7.2 years ago by GenoMax 151k

1

Entering edit mode

7.2 years ago

Selenocysteine ▴ 620

I can suggest you this script from one of my colleagues ;) https://github.com/oXis/gtool

ADD COMMENT • link 7.2 years ago by Selenocysteine ▴ 620

1

Entering edit mode

4.5 years ago

mendozamejian ▴ 10

Hello,

Based on Shared's answer, using awk you can do:

awk '!/^>/{gc+=gsub(/[gGcC]/,""); at+=gsub(/[aAtT]/,"");} END{ printf "%.2f%%\n", (gc*100)/(gc+at) }' file.fasta

I recommend checking out biobash github there is a script that does this. It is a tool set for common bioinformatics tasks we are working on, and it is written in bash.

ADD COMMENT • link 4.5 years ago by mendozamejian ▴ 10

1

Entering edit mode

Now that this has been revisited, here's a perl one-liner that would print the overall GC content too:

perl -lane 'unless (/^>/) { $l += length(); $gc++ while /[GC]/ig } END { print $gc/$l}' file.fasta

If GC content by contig is required, then a little bit of code is needed to print GC content after reading each one:

perl -lane 'if (/^>(.+)/) {
 printf "%s %d %.2f\n", $n, $l, $gc/$l if $l;
 $n = $1; $l = 0; $gc = 0;
} else {
 $l += length($_);
 $gc++ while /[GC]/ig;
} END {
 printf "%s %d %.2f\n", $n, $l, $gc/$l if $l;
}' file.fasta

ADD REPLY • link 4.5 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks for writing something sensible.

ADD REPLY • link 4.2 years ago by ryandward • 0

score 3 · Accepted Answer · 2018-03-13

3

Entering edit mode

7.2 years ago

Shred ★ 1.6k

I've found the solution by using just awk, instead of creating a while read loop. Here's the work-around:

awk -F '_' -v Y="$Y" '{ if(NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,""); 
    z=gsub(/[GC]/,""); 
    printf "GC_CONT : %.2f%%\n", (z*100)/(x+z)
    }
 }' large_file

Explanation: 1) " NR%2==1 " is used to use just odd lines. By using underscore as a field separator, it's easy to call each part of the line, by using $1, $2 , etc.. "%s" is used, as expected, to keep parts of interest into each odd line. 2) "x" and "z" are variables defined into the awk subshell: they define the number of match of both the letters inside parenthesis. I think this is the faster way (absolutely, also in comparison with other language) to count the GC content and the AT content of a sequence.

Credits to @amisax, from unix.stackexchange.com

ADD COMMENT • link 3.3 years ago by Shred ★ 1.6k

0

Entering edit mode

I tried running the above command, but I am getting this error

"awk: cmd. line:2: (FILENAME=10s.fasta FNR=1) fatal: division by zero attempted"

ADD REPLY • link 5.8 years ago by singh.jyotika ▴ 10

0

Entering edit mode

Probably because the fasta file used by the author have a very sepcific formatting of the tilte line.

ADD REPLY • link 5.6 years ago by Kristoffer Vitting-Seerup ★ 4.2k