Counting GC content per row, excluding N bases
9.5 years ago
Korsocius ▴ 260

Dear all,

I have a simple script for counting GC content per row, and there is one condition it does not handle yet:

awk 'NR>1{n=length($1); gc=gsub("[gcGC]", "", $1); print gc/n}' $i

How can I count the length of a row without the N characters?

For example:

Input:  ACAGCTTGCNNNN   => length = 9   GC content = 5/9

The output format is not important, only how to compute it.

Thanks a lot


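I think I could do it by counting the N characters in each row

awk -F N '{print NF-1}' file

and subtracting that count from the length. Because the count differs from row to row, both steps have to happen in the same awk call:

awk 'NR>1{n=length($1); nc=gsub(/N/, "", $1); gc=gsub(/[gcGC]/, "", $1); if (n>nc) print gc/(n-nc)}' "$i"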
9.5 years ago
iraun 6.2k

This command should work, but it is written in bash rather than awk.

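while IFS= read -r p; do
       # length of the row with every N removed
       len=$(echo "$p" | sed 's/N//g' | tr -d '\n' | wc -c)
       # number of G/C characters, either case
       cnt=$(echo "$p" | grep -o '[GCgc]' | tr -d '\n' | wc -c)
       # skip rows that have no non-N bases, to avoid dividing by zero
       [ "$len" -gt 0 ] || continue
       gc=$(awk "BEGIN {printf \"%.2f\",${cnt}/${len}}")
       echo length:$len --- GC:$gc
done < file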
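
As a variation on the loop above, both counts can be taken with bash parameter expansion instead of sed, grep and wc, which avoids starting several processes per row. This is only a sketch: the function name gc_row and the "NA" placeholder are made up for illustration, and it assumes one sequence per call with N as the only symbol to exclude.

```shell
#!/usr/bin/env bash
# gc_row is a hypothetical helper name, not part of the original answer.
# Prints "length:<non-N length> --- GC:<fraction>" for one sequence string.
gc_row() {
    local seq=$1
    local no_n=${seq//[Nn]/}          # drop N (either case)
    local no_gc=${no_n//[GCgc]/}      # additionally drop G and C
    local len=${#no_n}
    local cnt=$(( len - ${#no_gc} ))  # number of G/C characters removed
    if (( len == 0 )); then
        echo "length:0 --- GC:NA"     # nothing left to divide by
        return
    fi
    local gc
    gc=$(awk -v c="$cnt" -v l="$len" 'BEGIN {printf "%.2f", c/l}')
    echo "length:$len --- GC:$gc"
}

gc_row ACAGCTTGCNNNN
```

To process a file, the function could be called inside the same kind of loop, for example: while IFS= read -r p; do gc_row "$p"; done < file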

Bash is fine too; I solved it using awk and bash together. This result works well, with one limitation: the value is rounded to hundredths. Thank you.
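
The rounding comes from the "%.2f" format in the awk printf call, so more decimal places are a matter of changing that format. A minimal sketch, assuming the counts are already available as in the loop; gc_fraction and its digits argument are hypothetical names introduced here:

```shell
# gc_fraction is a hypothetical helper; cnt, len and digits are illustrative names.
# Usage: gc_fraction <gc_count> <length_without_N> [decimal_places]
gc_fraction() {
    local cnt=$1 len=$2 digits=${3:-2}
    awk -v c="$cnt" -v l="$len" -v d="$digits" 'BEGIN {
        fmt = "%." d "f\n"    # build the format string, e.g. "%.4f\n"
        printf fmt, c / l
    }'
}

gc_fraction 5 9 4
```

Building the format string inside awk keeps the number of decimals as a plain argument instead of hard-coding it in the command.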


Glad to help :).

9.5 years ago
tomc ▴ 90

GC content per row, normalized by the row length without N:

awk '{gsub("N","");t=length();gsub(/[GC]/,"");print int((t-length())/t*100)/100}'

or, assuming the sequence symbols are strictly ACTGN:

awk '{gsub("N","");t=length();gsub(/[AT]/,"");print int(length()/t*100)/100}'
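
Both one-liners match uppercase symbols only, and they divide by zero on a row that is empty or consists solely of N. A sketch that guards against both, assuming one sequence per line on standard input; the wrapper name gc_per_row and the "NA" placeholder are illustrative choices, not part of the answer:

```shell
# gc_per_row is a hypothetical wrapper name; "NA" is an arbitrary placeholder.
gc_per_row() {
    awk '{
        seq = toupper($0)             # treat lowercase bases like uppercase
        gsub(/N/, "", seq)            # ignore N when measuring length
        t = length(seq)
        gc = gsub(/[GC]/, "", seq)    # gsub returns the number of replacements
        if (t == 0) print "NA"        # no non-N bases, nothing to divide by
        else        printf "%.4f\n", gc / t
    }'
}

printf 'ACAGCTTGCNNNN\nnnnn\nacgt\n' | gc_per_row
```

Note that printf rounds the value, whereas int(x*100)/100 in the one-liners truncates it, so the last digit can differ between the two.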