Map a protein domain information to a sequence
2
0
Entering edit mode
9.5 years ago
venu 7.1k

Hello all,

We have done some analysis on a protein domain and want to map those to a particular protein sequence. Following is the result of hmmscan against the Pfam db. I just want to know the position numbers in domain, as well as sequence that contains the inserts/gaps (. & - respectively).

             GGGSTTTCT.SCSSSE..EEEEEETTTTEEEEEEECSSSTS...SSSBSSHHHHHHHHS CS
 domain-1  1 vCslpkdeG.pCnase..tryyynsetgtCesflyggcggN...aNnFetkeeCesaCl 53
             +C++p+d+  +C+++e  +r y+++ +g C+sf++  c ++   a++++++++C +aC+
 protein-1 4 LCIKPRDWIdECDSNEggERAYFRNGKGGCDSFWI--CPEDhtgADYYSSYRDCFNACI 60
             6********88***999******************..88884455*************5 PP

domain-1 (v-1, C-2, s-3......l-53) and protein-1 (L-4, C-5, I-6.....I-60).

I just need the position numbers containing . or -. For the given example the output is expected like

10 17 18 42 43 44 - domain-1
37 38 - protein-1

Format doesn't matter (comma separated or column like values of domain and sequence).

EDIT: I got something with grep, but it only returning the required output when we input individual sequence.

grep -aob '\.' domain.txt | grep --color=never | \grep '[0-9]+'

Here domain.txt contains only the sequence domain-1. Any help to extend this to the given problem will be greatly appreciated.

pfam hmmscan • 2.7k views
ADD COMMENT
2
Entering edit mode
9.5 years ago
Sam ★ 4.8k

Are you sure for protein-1, it is 37 38?

If not then the following code should work (It will give you 36, 37 for your protein-1 input):

awk '{
  if(NF==4) {
    i=0;
    prev = 0;
    while(index(substr($3,i),".")!=0 && I < length($3)) {
      j=index(substr($3,i),".");
      printf j+prev" ";
      i = j+1+prev;
      prev = prev+j;
    }
    i=0;
    prev = 0;
    while(index(substr($3,i),"-")!=0 && I < length($3)) {
      j=index(substr($3,i),"-");
      printf j+prev" ";
      i = j+1+prev;
      prev = prev+j;
    }
    printf " - "$1"\n";
  }
}' domain.txt

A problem with the code though is that it doesn't differentiate different lines. I might need an example of multi-line input to write a small program to deal with that.

ADD COMMENT
0
Entering edit mode

The most complicated output of hmmscan will be like the following: As given problem, here I need positions of PF00962 and 1ADD:A

# hmmscan :: search sequence(s) against a profile database
# HMMER 3.0 (March 2010); http://hmmer.org/
# Copyright (C) 2010 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query sequence file:             2fasta.txt
# target HMM database:             Pfam-A.hmm
# output directed to file:         57_scan
# prefer accessions over names:    yes
# max ASCII text line length:      unlimited
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       1ADD:A  [L=349]
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model    Description
    ------- ------ -----    ------- ------ -----   ---- --  -------- -----------
   3.5e-106  354.8   0.0   3.9e-106  354.6   0.0    1.0  1  PF00962  Adenosine/AMP deaminase

Domain annotation for each model (and alignments):
>> PF00962  Adenosine/AMP deaminase
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  354.6   0.0  2.7e-110  3.9e-106       1     330 [.       5     342 ..       5     343 .. 0.99

  Alignments for each domain:
  == domain 1    score: 354.6 bits;  conditional E-value: 2.7e-110
              TS-EEEEEEECCCSS-HHHHHHHHHHHCHT--SS-SCCCCCCCCHCHCCHHHHCCHCCCHHCCCHHHTS-ECCCCCHHHHHHCCCCCHCTTCCCCCCCCSHHHHH..........HH-GCSTCCHHHHCCCCCCCCCCHCCT.EEEEEECCETT.TCCGHCCCHHHHHHCCCCCCCCCCCCC.TTSTTGGG...HHHHHCCCCHTT-EEEEEESSSS-CHHHHHHHHCS.-SCECE-GGGGGSHHHHHHHHHCT-EEEE-HHHHCCCSST-TTTT-CHHHHHHTT-EEEE--BSCCCCT-SHHHHHHHHHHHHT--HHHHHHHHHHHHHCSSS-HHHHHHHCH CS
  PF00962   1 nlpkadlHvHlsalmnpktllrlakkkgikeldekvvkeveknltlkekfdslkefledlsidvlvvhaerlaqavrryavealeelaadgvvylevrfdplfea..........tlegqlpelhvvlkvkdgfdevereskigaklilskirkkpeewleevaelakkyrdytvanldllgdEekepssl...flklraeagketlkltpHaGEaggaesvveallllgaerigHGialakdprllyllaerqipievcPlSNvaLgavaeyaehPlkeflaaglpvslstDDplqfgatlseeYaiaaqvlklseedlkelarNsvksSglsdeeKaalla 330
              n+pk++lHvHl+++++p+t+l+++kk+gi+   ++++ e+ +n++ ++k+ sl  fl+++++++ v++   +++a++r+a+e++e++a++gvvy+evr++p+++a          + eg+++ ++vv+ v++g++e+e++++i++++il+++r+ +++w+ ev el+kky ++tv+++dl+gdE++e+ssl   ++++++ a+k+++++t+HaGE+g++e+v+ea+++l++er+gHG+++++d+ l+++l ++++++evcP+S++ +ga+   + h++++f +++ ++sl+tDDpl+f++tl+++Y++++++++++ee++k+l++N+ ksS+l++eeK++ll+
   1ADD:A   5 NKPKVELHVHLDGAIKPETILYFGKKRGIA--LPADTVEELRNIIGMDKPLSLPGFLAKFDYYMPVIA--GCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLAnskvdpmpwnQTEGDVTPDDVVDLVNQGLQEGEQAFGIKVRSILCCMRH-QPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLfpgHVEAYEGAVKNGIHRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHAVVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLE 342
              78**************************99..888899999***************************..*********************************************99999**********************************.*****************************************************************************************************************************************************************************************997 PP


Internal pipeline statistics summary:
-------------------------------------
Query sequence(s):                         1  (349 residues)
Target model(s):                       14831  (2610332 nodes)
Passed MSV filter:                       347  (0.0233969); expected 296.6 (0.02)
Passed bias filter:                      298  (0.020093); expected 296.6 (0.02)
Passed Vit filter:                        16  (0.00107882); expected 14.8 (0.001)
Passed Fwd filter:                         1  (6.74263e-05); expected 0.1 (1e-05)
Initial search space (Z):              14831  [actual number of targets]
Domain search space  (domZ):               1  [number of targets reported over threshold]
# CPU time: 0.25u 0.09s 00:00:00.34 Elapsed: 00:00:00.13
# Mc/sec: 7007.74
//
ADD REPLY
1
Entering edit mode

All the alignment information seems to be on the same line though. If that is the case, my script will work. Maybe you can give it a try? As long as there isn't any new line character breaking the alignment into multiple line, then my script will work.

FYI, for this example provided, the out put is:

106 107 108 109 110 111 112 113 114 115 192 193 194  - PF00962
31 32 69 70 155  - 1ADD:A
ADD REPLY
0
Entering edit mode

Thank you. Its a very good script. If I want to write the name at the starting of position numbers as follows.

PF00962 - 106 107 108 109 110 111 112 113 114 115 192 193 194
1ADD:A - 31 32 69 70 155

where should I change the script

ADD REPLY
1
Entering edit mode
awk ' { if(NF==4){i=0;prev = 0;
  printf $1" - ";
  while(index(substr($3,i),".")!=0 && I < length($3)) {
    j=index(substr($3,i),".");
    printf j+prev" ";
    i = j+1+prev;
    prev = prev+j;
  }
  i=0; prev = 0;
  while(index(substr($3,i),"-")!=0 && I < length($3)){
    j=index(substr($3,i),"-");
    printf j+prev" ";
    i = j+1+prev;
    prev = prev+j;
  }

  printf "\n" 
}
}' domain.txt

This should work. I was copying your desired output before so the the script delayed the output of the protein name

ADD REPLY
0
Entering edit mode

Excellent. Thank you, it saves a lot of time.

ADD REPLY
0
Entering edit mode
9.5 years ago
alolex ▴ 960

I've found the following loop very helpful in bash scripting. Put all your sequences in one file, one per line, and play around with this:

cat input.txt | while read line
do
  ##process $line to get desired output, write to a file using >> to append and not overwrite content.  
done
ADD COMMENT
0
Entering edit mode

It loops on every line but I only need lines domain-1 and protein-1. Differentiating the output will be very difficult if we loop on every line (and sometimes line length goes more than one line). If I get the ids of lines (like domain-1 and protein-1) it would be more easy.

ADD REPLY

Login before adding your answer.

Traffic: 2796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6