Question

Map a protein domain information to a sequence

0

Entering edit mode

10.1 years ago

venu 7.1k

Hello all,

We have done some analysis on a protein domain and want to map those to a particular protein sequence. Following is the result of hmmscan against the Pfam db. I just want to know the position numbers in domain, as well as sequence that contains the inserts/gaps (. & - respectively).

             GGGSTTTCT.SCSSSE..EEEEEETTTTEEEEEEECSSSTS...SSSBSSHHHHHHHHS CS
 domain-1  1 vCslpkdeG.pCnase..tryyynsetgtCesflyggcggN...aNnFetkeeCesaCl 53
             +C++p+d+  +C+++e  +r y+++ +g C+sf++  c ++   a++++++++C +aC+
 protein-1 4 LCIKPRDWIdECDSNEggERAYFRNGKGGCDSFWI--CPEDhtgADYYSSYRDCFNACI 60
             6********88***999******************..88884455*************5 PP

domain-1 (v-1, C-2, s-3......l-53) and protein-1 (L-4, C-5, I-6.....I-60).

I just need the position numbers containing . or -. For the given example the output is expected like

10 17 18 42 43 44 - domain-1
37 38 - protein-1

Format doesn't matter (comma separated or column like values of domain and sequence).

EDIT: I got something with grep, but it only returning the required output when we input individual sequence.

grep -aob '\.' domain.txt | grep --color=never | \grep '[0-9]+'

Here domain.txt contains only the sequence domain-1. Any help to extend this to the given problem will be greatly appreciated.

pfam hmmscan • 3.1k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by venu 7.1k

0

Entering edit mode

10.1 years ago

alolex ▴ 960

I've found the following loop very helpful in bash scripting. Put all your sequences in one file, one per line, and play around with this:

cat input.txt | while read line
do
  ##process $line to get desired output, write to a file using >> to append and not overwrite content.  
done

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by alolex ▴ 960

0

Entering edit mode

It loops on every line but I only need lines domain-1 and protein-1. Differentiating the output will be very difficult if we loop on every line (and sometimes line length goes more than one line). If I get the ids of lines (like domain-1 and protein-1) it would be more easy.

ADD REPLY • link 10.1 years ago by venu 7.1k

Ram · Accepted Answer · 2015-06-01

2

Entering edit mode

10.1 years ago

Sam ★ 4.8k

Are you sure for protein-1, it is 37 38?

If not then the following code should work (It will give you 36, 37 for your protein-1 input):

awk '{
  if(NF==4) {
    i=0;
    prev = 0;
    while(index(substr($3,i),".")!=0 && I < length($3)) {
      j=index(substr($3,i),".");
      printf j+prev" ";
      i = j+1+prev;
      prev = prev+j;
    }
    i=0;
    prev = 0;
    while(index(substr($3,i),"-")!=0 && I < length($3)) {
      j=index(substr($3,i),"-");
      printf j+prev" ";
      i = j+1+prev;
      prev = prev+j;
    }
    printf " - "$1"\n";
  }
}' domain.txt

A problem with the code though is that it doesn't differentiate different lines. I might need an example of multi-line input to write a small program to deal with that.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k

0

Entering edit mode

The most complicated output of hmmscan will be like the following: As given problem, here I need positions of PF00962 and 1ADD:A

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.0 (March 2010); http://hmmer.org/
# Copyright (C) 2010 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:             2fasta.txt
# target HMM database:             Pfam-A.hmm
# output directed to file:         57_scan
# prefer accessions over names:    yes
# max ASCII text line length:      unlimited
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       1ADD:A  [L=349]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
   3.5e-106  354.8   0.0   3.9e-106  354.6   0.0    1.0  1  PF00962  Adenosine/AMP deaminase

Domain annotation for each model (and alignments):
>> PF00962  Adenosine/AMP deaminase
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  354.6   0.0  2.7e-110  3.9e-106       1     330 [.       5     342 ..       5     343 .. 0.99

  Alignments for each domain:
  == domain 1    score: 354.6 bits;  conditional E-value: 2.7e-110
              TS-EEEEEEECCCSS-HHHHHHHHHHHCHT--SS-SCCCCCCCCHCHCCHHHHCCHCCCHHCCCHHHTS-ECCCCCHHHHHHCCCCCHCTTCCCCCCCCSHHHHH..........HH-GCSTCCHHHHCCCCCCCCCCHCCT.EEEEEECCETT.TCCGHCCCHHHHHHCCCCCCCCCCCCC.TTSTTGGG...HHHHHCCCCHTT-EEEEEESSSS-CHHHHHHHHCS.-SCECE-GGGGGSHHHHHHHHHCT-EEEE-HHHHCCCSST-TTTT-CHHHHHHTT-EEEE--BSCCCCT-SHHHHHHHHHHHHT--HHHHHHHHHHHHHCSSS-HHHHHHHCH CS
  PF00962   1 nlpkadlHvHlsalmnpktllrlakkkgikeldekvvkeveknltlkekfdslkefledlsidvlvvhaerlaqavrryavealeelaadgvvylevrfdplfea..........tlegqlpelhvvlkvkdgfdevereskigaklilskirkkpeewleevaelakkyrdytvanldllgdEekepssl...flklraeagketlkltpHaGEaggaesvveallllgaerigHGialakdprllyllaerqipievcPlSNvaLgavaeyaehPlkeflaaglpvslstDDplqfgatlseeYaiaaqvlklseedlkelarNsvksSglsdeeKaalla 330
              n+pk++lHvHl+++++p+t+l+++kk+gi+   ++++ e+ +n++ ++k+ sl  fl+++++++ v++   +++a++r+a+e++e++a++gvvy+evr++p+++a          + eg+++ ++vv+ v++g++e+e++++i++++il+++r+ +++w+ ev el+kky ++tv+++dl+gdE++e+ssl   ++++++ a+k+++++t+HaGE+g++e+v+ea+++l++er+gHG+++++d+ l+++l ++++++evcP+S++ +ga+   + h++++f +++ ++sl+tDDpl+f++tl+++Y++++++++++ee++k+l++N+ ksS+l++eeK++ll+
   1ADD:A   5 NKPKVELHVHLDGAIKPETILYFGKKRGIA--LPADTVEELRNIIGMDKPLSLPGFLAKFDYYMPVIA--GCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLAnskvdpmpwnQTEGDVTPDDVVDLVNQGLQEGEQAFGIKVRSILCCMRH-QPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLfpgHVEAYEGAVKNGIHRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHAVVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLE 342
              78**************************99..888899999***************************..*********************************************99999**********************************.*****************************************************************************************************************************************************************************************997 PP


Internal pipeline statistics summary:
-------------------------------------
Query sequence(s):                         1  (349 residues)
Target model(s):                       14831  (2610332 nodes)
Passed MSV filter:                       347  (0.0233969); expected 296.6 (0.02)
Passed bias filter:                      298  (0.020093); expected 296.6 (0.02)
Passed Vit filter:                        16  (0.00107882); expected 14.8 (0.001)
Passed Fwd filter:                         1  (6.74263e-05); expected 0.1 (1e-05)
Initial search space (Z):              14831  [actual number of targets]
Domain search space  (domZ):               1  [number of targets reported over threshold]
# CPU time: 0.25u 0.09s 00:00:00.34 Elapsed: 00:00:00.13
# Mc/sec: 7007.74
//

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by venu 7.1k

1

Entering edit mode

All the alignment information seems to be on the same line though. If that is the case, my script will work. Maybe you can give it a try? As long as there isn't any new line character breaking the alignment into multiple line, then my script will work.

FYI, for this example provided, the out put is:

106 107 108 109 110 111 112 113 114 115 192 193 194  - PF00962
31 32 69 70 155  - 1ADD:A

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k

0

Entering edit mode

Thank you. Its a very good script. If I want to write the name at the starting of position numbers as follows.

PF00962 - 106 107 108 109 110 111 112 113 114 115 192 193 194
1ADD:A - 31 32 69 70 155

where should I change the script

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by venu 7.1k

1

Entering edit mode

awk ' { if(NF==4){i=0;prev = 0;
  printf $1" - ";
  while(index(substr($3,i),".")!=0 && I < length($3)) {
    j=index(substr($3,i),".");
    printf j+prev" ";
    i = j+1+prev;
    prev = prev+j;
  }
  i=0; prev = 0;
  while(index(substr($3,i),"-")!=0 && I < length($3)){
    j=index(substr($3,i),"-");
    printf j+prev" ";
    i = j+1+prev;
    prev = prev+j;
  }

  printf "\n" 
}
}' domain.txt

This should work. I was copying your desired output before so the the script delayed the output of the protein name

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by Sam ★ 4.8k

0

Entering edit mode

Excellent. Thank you, it saves a lot of time.

ADD REPLY • link 10.1 years ago by venu 7.1k