How To Extract Target Domains/Sequences From InterProScan TSV Results
10.8 years ago
chevivien ▴ 90

Hello, how can one extract target domains/sequences from InterProScan TSV results obtained by running the standalone version of InterProScan v5? I got results that look more or less like the output below. I am only interested in sequences with specific domains, for example NBS-ARC or LRR domains. I want to get the sequences that contain those particular domains in FASTA format.

lcl|Os11t0689100-02    2a4f1393b2213ae4aa22b96f5026095c    748    PRINTS    PR00364    Disease resistance protein signature    208    223    1.100000299E-8    T    27-01-2014
lcl|Os11t0689100-02    2a4f1393b2213ae4aa22b96f5026095c    748    PRINTS    PR00364    Disease resistance protein signature    448    462    1.100000299E-8    T    27-01-2014
lcl|Os11t0689100-02    2a4f1393b2213ae4aa22b96f5026095c    748    PRINTS    PR00364    Disease resistance protein signature    703    719    1.100000299E-8    T    27-01-2014
lcl|Os11t0689100-02    2a4f1393b2213ae4aa22b96f5026095c    748    SUPERFAMILY    SSF52058        579    734    0.0    T    27-01-2014
lcl|Os11t0689100-02    2a4f1393b2213ae4aa22b96f5026095c    748    Pfam    PF00931    NB-ARC domain    187    542    7.7E-49    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    ProSitePatterns    PS00107    Protein kinases ATP-binding region signature.    92    114    -    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    Gene3D    G3DSA:1.10.510.10        148    352    3.6E-44    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    PANTHER    PTHR24420        1    401    0.0    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    PANTHER    PTHR24420:SF341        1    401    0.0    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    Gene3D    G3DSA:3.30.200.20        100    147    1.3E-27    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    Pfam    PF00069    Protein kinase domain    87    351    1.8E-49    T    27-01-2014
lcl|Os05t0125400-00    c5d9599c9c686de6d015f7627340955d    440    ProSitePatterns    PS00108    Serine/Threonine protein kinases active-site signature.    208    220    -    T
10.8 years ago
Sujai Kumar ▴ 270
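  1. Grep the lines in the TSV file with the search term of interest. In the output above the Pfam description is "NB-ARC domain" (accession PF00931), so search for NB-ARC rather than NBS-ARC:

    grep NB-ARC interpro.tsv >NB-ARC.tsv
    
  2. Extract the sequence ID, start, stop columns. InterProScan coordinates are 1-based and inclusive, while BED starts are 0-based, so subtract 1 from the start:

    awk -F'\t' -v OFS='\t' '{print $1, $7-1, $8}' NB-ARC.tsv >NB-ARC.bed
    
  3. Use bedtools getfasta to extract the fasta sequences with those coordinates (the IDs in column 1 must match the headers in the FASTA file):

    bedtools getfasta -fi proteins.fa -bed NB-ARC.bed -fo NB-ARC.fasta
    

Or, you can do it all in one line:

    grep NB-ARC interpro.tsv | awk -F'\t' -v OFS='\t' '{print $1, $7-1, $8}' | bedtools getfasta -fi proteins.fa -bed - -fo NB-ARC.fasta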
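
The commands above return only the domain regions. If the full-length proteins that carry the domain are wanted instead, something like the following sketch might work. It assumes the standard InterProScan TSV layout shown in the question (protein ID in column 1, signature accession in column 5, description in column 6) and that the first word of each FASTA header is the same ID. The function name `extract_by_domain` and the file names are made up for illustration.

```shell
# Hypothetical helper (not part of InterProScan or bedtools):
# print the complete FASTA record of every protein that has a matching signature.
# Usage: extract_by_domain PATTERN interpro.tsv proteins.fa
extract_by_domain() {
    awk -v pat="$1" -v tsv="$2" '
        # TSV pass: remember IDs whose accession (col 5) or description (col 6) matches
        FILENAME == tsv {
            split($0, f, "\t")
            if (f[5] ~ pat || f[6] ~ pat) keep[f[1]] = 1
            next
        }
        # FASTA pass: each header switches printing on or off for its record
        /^>/ {
            split(substr($0, 2), h, /[ \t]/)
            printing = (h[1] in keep)
        }
        printing
    ' "$2" "$3"
}
```

For example, `extract_by_domain PF00931 interpro.tsv proteins.fa >PF00931_proteins.fasta` would collect the proteins with a Pfam NB-ARC hit. Matching on the accession is usually safer than matching on the description, since descriptions can be empty for some analyses (as in the SUPERFAMILY and Gene3D lines above).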

Thanks a lot, this worked well for me.
