Question

Multiple FASTA file: calculate % composition of amino acids

6.6 years ago

Biogeek ▴ 480

I've come across ExPASy ProtParam, but it's limited in that you can only copy/paste one sequence at a time.

Is there an alternative command-line approach to calculate the % composition of each residue for every >sequence in a multi-FASTA file? I have never ventured into Biopython; I mostly use R and the command line. All help appreciated. Thanks.

amino acids composition calculate • 2.3k views

updated 6.6 years ago by Pierre Lindenbaum 166k • written 6.6 years ago by Biogeek ▴ 480

score 1 · Answer 1 · 2018-12-19

6.6 years ago

GenoMax 152k

You can use pepstats from EMBOSS (see the pepstats documentation). You will need to download and install EMBOSS first.

score 1 · Answer 2 · 2018-12-19

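Using awk (save the following as script.awk; for each sequence it prints every residue, its count, and its fraction of the sequence length):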

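# Print each residue, its count and its fraction of the sequence length.
# "i" is declared as an extra parameter so that it stays local to the function.
# Note: "for (i in arr)" visits the residues in an unspecified order.
function dump(arr, n,    i)
    {
    for(i in arr)
        {
        printf("%s %d %f\n",i,arr[i],arr[i]/n);
        }
    }
# Header line: report the previous record, echo the header, reset the counters.
/^>/ {dump(array,N);print;delete array;N=0;next;}
    # Sequence line: count every character on the line.
    {
    for(i=1;i<=length($0);i++) { array[substr($0,i,1)]++;N++;}
    }
# End of input: report the last record.
END {
    dump(array,N);
    }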

Usage:

$ wget -O - -q "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=CAA68495.1,CAA64262.1,CAA46742.1&rettype=fasta"   |\
awk -f script.awk 


>CAA68495.1 unnamed protein product [Rotavirus]
N 35 0.088161
A 27 0.068010
C 3 0.007557
P 20 0.050378
Q 17 0.042821
D 19 0.047859
E 19 0.047859
R 25 0.062972
F 25 0.062972
S 28 0.070529
G 19 0.047859
T 30 0.075567
H 7 0.017632
I 27 0.068010
V 26 0.065491
W 5 0.012594
K 8 0.020151
Y 11 0.027708
L 34 0.085642
M 12 0.030227
>CAA64262.1 NSP2 [Rotavirus]
N 24 0.075710
A 17 0.053628
P 10 0.031546
C 5 0.015773
Q 11 0.034700
D 13 0.041009
R 14 0.044164
E 22 0.069401
S 20 0.063091
F 15 0.047319
G 11 0.034700
T 16 0.050473
H 10 0.031546
V 23 0.072555
I 22 0.069401
W 4 0.012618
K 29 0.091483
Y 14 0.044164
L 30 0.094637
M 7 0.022082
>CAA46742.1 viral non structural protein NS5 [Rotavirus]
N 15 0.048077
A 21 0.067308
P 9 0.028846
C 5 0.016026
Q 9 0.028846
D 22 0.070513
R 15 0.048077
E 18 0.057692
S 20 0.064103
F 14 0.044872
G 10 0.032051
T 25 0.080128
H 8 0.025641
I 17 0.054487
V 23 0.073718
W 3 0.009615
K 28 0.089744
Y 14 0.044872
L 27 0.086538
M 9 0.028846
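
The output above lists fractions in whatever order awk happens to iterate the array. If you would rather have percentages in a fixed alphabetical order, the same counting logic can be sketched in plain Python, with no Biopython needed. This is only a minimal sketch: the function names `read_fasta` and `composition` and the file name `aa_composition.py` are my own choices for illustration, and it assumes a well-formed FASTA file passed as the first argument.

```python
import sys
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def composition(seq):
    """Return {residue: (count, percent)} with residues in alphabetical order."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


# Hypothetical command-line entry point: python aa_composition.py proteins.fasta
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for header, seq in read_fasta(handle):
            print(">" + header)
            for aa, (n, pct) in composition(seq).items():
                print(f"{aa}\t{n}\t{pct:.2f}")
```

Each record is printed as its header followed by tab-separated residue, count and percentage, so the result can be read into R with `read.table` after splitting on the header lines. Unlike the awk script, this version upper-cases the sequence and strips surrounding whitespace before counting.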