Hello all,
We have done some analysis on a protein domain and want to map those to a particular protein sequence. Following is the result of hmmscan
against the Pfam db. I just want to know the position numbers in domain, as well as sequence that contains the inserts/gaps (.
& -
respectively).
GGGSTTTCT.SCSSSE..EEEEEETTTTEEEEEEECSSSTS...SSSBSSHHHHHHHHS CS
domain-1 1 vCslpkdeG.pCnase..tryyynsetgtCesflyggcggN...aNnFetkeeCesaCl 53
+C++p+d+ +C+++e +r y+++ +g C+sf++ c ++ a++++++++C +aC+
protein-1 4 LCIKPRDWIdECDSNEggERAYFRNGKGGCDSFWI--CPEDhtgADYYSSYRDCFNACI 60
6********88***999******************..88884455*************5 PP
domain-1 (v-1, C-2, s-3......l-53
) and protein-1 (L-4, C-5, I-6.....I-60
).
I just need the position numbers containing .
or -
. For the given example the output is expected like
10 17 18 42 43 44 - domain-1
37 38 - protein-1
Format doesn't matter (comma separated or column like values of domain and sequence).
EDIT: I got something with grep
, but it only returning the required output when we input individual sequence.
grep -aob '\.' domain.txt | grep --color=never | \grep '[0-9]+'
Here domain.txt
contains only the sequence domain-1
. Any help to extend this to the given problem will be greatly appreciated.
The most complicated output of hmmscan will be like the following: As given problem, here I need positions of PF00962 and 1ADD:A
All the alignment information seems to be on the same line though. If that is the case, my script will work. Maybe you can give it a try? As long as there isn't any new line character breaking the alignment into multiple line, then my script will work.
FYI, for this example provided, the out put is:
Thank you. Its a very good script. If I want to write the name at the starting of position numbers as follows.
where should I change the script
This should work. I was copying your desired output before so the the script delayed the output of the protein name
Excellent. Thank you, it saves a lot of time.