I have a question about how to search for and download the protein information I need from InterPro (https://www.ebi.ac.uk/interpro/).
First, I want to obtain information on all proteins containing an RRM (RNA recognition motif).
Second, for each of those proteins, I also want to know every other domain it contains besides the RRM.
To obtain the above information, I used the following method: I searched for proteins with an RRM using the CATH-Gene3D accession G3DSA:3.30.70.330 (the RRM accession) or the InterPro entry IPR000504. I then downloaded the .fasta and .tsv files from this link: https://www.ebi.ac.uk:443/interpro/wwwapi//protein/reviewed/entry/cathgene3d/G3DSA:3.30.70.330/?extra_fields=counters
The partial contents of RRM.fasta and RRM.tsv are as follows.
RRM.fasta
>A0A0A0LLY1|reviewed|Small RNA binding protein 1|taxID:3659
MASSSVEFRCFVGGLAWATDSNSLEKAFSVYGEIVEAKIVSDRETGRSRGFGFVTFLEEEAMRSAIEAMNGHILDGRNIT
VNEAQQRGGGGGGGYNRGGGYGGRRDGGGFSRGGGGGYGGGGGGGYGGGRDRGYGGGGGYGGGRDSRGSGGGGSEGGWRN
>A0A0D1C8Z4|reviewed|Glycine-rich RNA-binding protein 1|taxID:237631
MAAKVYVGNLSWNTTDDSLAHAFSTYGQLTDYIVMKDRETGRSRGFGFVTFATQAEADAAIAALNEQELDGRRIRVNMAN
SRPAGGMGGGYGGVTGQYGANAYGAQGGYGGYGGQPGGFQQPGGFQQQGGYPQQGGYGGYQQPGFQPQQGGYGAPQQGYG
APQQGGYGGYNGQSQ
>A0A0D1DWZ5|reviewed|RNA-binding protein RRM4|taxID:237631
MSDSIYAPHNKHKLEAARAADAAADDAATVSALVEPTDSTAQASHAAEQTIDAHQQAGDVEPERCHPHLTRPLLYLSGVD
ATMTDKELAGLVFDQVLPVRLKIDRTVGEGQTASGTVEFQTLDKAEKAYATVRPPIQLRINQDASIREPHPSAKPRLVKQ
LPPTSDDAFVYDLFRPFGPLRRAQCLLTNPAGIHTGFKGMAVLEFYSEQDAQRAESEMHCSEVGGKSISVAIDTATRKVS
AAAAEFRPSAAAFVPAGSMSPSAPSFDPYPAGSRSVSTGSAASIYATSGAAPTHDTRNGAQKGARVPLQYSSQASTYVDP
CNLFIKNLDPNMESNDLFDTFKRFGHIVSARVMRDDNGKSREFGFVSFTTPDEAQQALQAMDNAKLGTKKIIVRLHEPKT
MRQEKLAARYNAANADNSDMSSNSPPTEARKADKRQSRSYFKAGVPSDASGLVDEEQLRSLSTVVRNELLSGEFTRRIPK
VSSVTEAQLDDVVGELLSLKLADAVEALNNPISLIQRISDAREQLAQKSASTLTAPSPAPLSAEHPAMLGIQAQRSVSSA
SSTGEGGASVKERERLLKAVISVTESGAPVEDITDMIASLPKKDRALALFNPEFLKQKVDEAKDILDITDESGEDLSPPR
ASSGSAPVPLSVQTPASAIFKDASNGQSSISPGAAEAYTLSTLAALPAAEIVRLANSQSSSGLPLPKADPATVKATDDFI
DSLQGKAAHDQKQKLGDQLFKKIRTFGVKGAPKLTIHLLDSEDLRALAHLMNSYEDVLKEKVQHKVAAGLNK
RRM.tsv
Accession	Source Database	Name	Tax ID	Tax Name	Length	Entry Accession	Matches
A0A0A0LLY1	reviewed	Small RNA binding protein 1	3659	Cucumis sativus	160	G3DSA:3.30.70.330	2..113
A0A0D1C8Z4	reviewed	Glycine-rich RNA-binding protein 1	237631	Ustilago maydis (strain 521 / FGSC 9021)	175	G3DSA:3.30.70.330	1..98
A0A0D1DWZ5	reviewed	RNA-binding protein RRM4	237631	Ustilago maydis (strain 521 / FGSC 9021)	792	G3DSA:3.30.70.330	153..251,290..411
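As a starting point for any automation, the accession column of this file could be collected with a few lines of Python. This is only a minimal sketch: it assumes the downloaded file is tab-separated and has a header column literally named "Accession", and the function name `read_accessions` is hypothetical.

```python
import csv


def read_accessions(lines):
    """Return the unique protein accessions, in file order.

    `lines` is any iterable of text lines (an open file works).
    Assumes tab-separated input with an 'Accession' header column.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    accessions = []
    for row in reader:
        acc = (row.get("Accession") or "").strip()
        # Skip blanks and repeated accessions.
        if acc and acc not in accessions:
            accessions.append(acc)
    return accessions


# Usage (not run here): the file name is the one used in this post.
# with open("RRM.tsv", newline="") as handle:
#     accessions = read_accessions(handle)
```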
Using this approach, I could only achieve my first goal.
To accomplish the second goal, I searched each accession code on InterPro.
For example, A0A0D1DWZ5 has not only the RRM but also a poly(A)-binding domain (CATH 1.10.1900.10).
Therefore, I searched A0A0D1DWZ5 on InterPro and downloaded the corresponding TSV file.
The partial contents of A0A0D1DWZ5.tsv are as follows.
A0A0D1DWZ5.tsv
Accession	Name	Source Database	Type	Integrated Into	Integrated Signatures	GO Terms	Protein Accession	Protein Length	Matches
cd00590	RNA recognition motif (RRM) superfamily	cdd	domain				a0a0d1dwz5	792	158..230
G3DSA:1.10.1900.10	c-terminal domain of poly(a) binding protein	cathgene3d	homologous_superfamily				a0a0d1dwz5	792	573..627,712..791
G3DSA:3.30.70.330		cathgene3d	homologous_superfamily	IPR012677			a0a0d1dwz5	792	153..251,290..411
From A0A0D1DWZ5.tsv, I now have 4 domain matches:
G3DSA:1.10.1900.10 573..627,712..791
G3DSA:3.30.70.330 153..251,290..411
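To make the intended result concrete, here is a minimal sketch of the filtering step I have in mind: parse each "Matches" string into start/end pairs and drop the RRM signatures. The function names are hypothetical, and treating both G3DSA:3.30.70.330 and cd00590 as "RRM" is my own assumption based on the rows shown above.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int]

# Accessions treated as "RRM" here (an assumption; both appear in the rows above).
RRM_ACCESSIONS = frozenset({"G3DSA:3.30.70.330", "cd00590"})


def parse_matches(matches: str) -> List[Range]:
    """Turn '153..251,290..411' into [(153, 251), (290, 411)]."""
    ranges = []
    for part in matches.split(","):
        part = part.strip()
        if not part:
            continue
        start, end = part.split("..")
        ranges.append((int(start), int(end)))
    return ranges


def non_rrm_domains(
    rows: Iterable[Tuple[str, str]],
    exclude=RRM_ACCESSIONS,
) -> Dict[str, List[Range]]:
    """Map each non-excluded accession to its list of (start, end) ranges."""
    kept: Dict[str, List[Range]] = {}
    for accession, matches in rows:
        if accession in exclude:
            continue
        kept.setdefault(accession, []).extend(parse_matches(matches))
    return kept


# (accession, matches) pairs copied from A0A0D1DWZ5.tsv above.
rows = [
    ("cd00590", "158..230"),
    ("G3DSA:1.10.1900.10", "573..627,712..791"),
    ("G3DSA:3.30.70.330", "153..251,290..411"),
]
print(non_rrm_domains(rows))
# {'G3DSA:1.10.1900.10': [(573, 627), (712, 791)]}
```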
However, manually searching and downloading TSV files for each accession code on InterPro is too time-consuming and labor-intensive.
Is there a more efficient way to achieve the second goal?
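One direction I have not verified: the download link above suggests the website is backed by a web API, so the per-accession lookup could perhaps be scripted. The sketch below is untested against the live service. It assumes a base URL of https://www.ebi.ac.uk/interpro/api, an endpoint of the form /entry/all/protein/uniprot/<accession>/, and a JSON layout with `results`, `metadata`, `proteins`, `entry_protein_locations`, `fragments`, and a `next` link for paging. All of these are assumptions to check against the InterPro API documentation, and the function names are hypothetical.

```python
import json
import time
from urllib.request import Request, urlopen

# Assumed base URL; the link in this post uses ".../interpro/wwwapi/" instead.
API_ROOT = "https://www.ebi.ac.uk/interpro/api"


def entries_url(accession, page_size=200):
    """Build the (assumed) URL listing all entries matched by one protein."""
    return f"{API_ROOT}/entry/all/protein/uniprot/{accession}/?page_size={page_size}"


def extract_domains(payload, exclude=frozenset()):
    """Flatten one JSON page into (entry accession, database, start, end) rows.

    The key names used here reflect the assumed response layout.
    """
    rows = []
    for result in payload.get("results", []):
        meta = result.get("metadata", {})
        acc = meta.get("accession")
        if acc in exclude:
            continue
        for protein in result.get("proteins", []):
            for location in protein.get("entry_protein_locations") or []:
                for frag in location.get("fragments", []):
                    rows.append(
                        (acc, meta.get("source_database"), frag["start"], frag["end"])
                    )
    return rows


def fetch_domains(accession, exclude=frozenset(), pause=0.5):
    """Follow the paged results for one protein and collect its domain rows."""
    url = entries_url(accession)
    rows = []
    while url:
        request = Request(url, headers={"Accept": "application/json"})
        with urlopen(request) as response:
            body = response.read()
        if not body:  # an empty body is taken to mean "no matches"
            break
        payload = json.loads(body)
        rows.extend(extract_domains(payload, exclude))
        url = payload.get("next")
        time.sleep(pause)  # be gentle with the server
    return rows


# Usage (not run here; needs network access):
# for acc in ["A0A0A0LLY1", "A0A0D1C8Z4", "A0A0D1DWZ5"]:
#     print(acc, fetch_domains(acc, exclude={"G3DSA:3.30.70.330", "cd00590"}))
```

If something like this works, looping over the accessions in RRM.tsv would replace the manual search and download for each protein.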