Question

How to assign a massive number of proteins to Pfam using R

4.5 years ago

zhangchi2015012290 • 0

Hi! I'm doing bacterial pan-genome research involving thousands of genomes. I'm trying to assign every protein in all of the genomes to Pfam. I know there are tools like the NCBI CDD database, but I don't know how to run a scripted search, since the website only lets you search 4000 proteins at a time. Is there an R package for this job, or any other convenient method?

R pfam pan-genome • 1.3k views

ADD COMMENT • link updated 4.5 years ago by Jean-Karim Heriche 27k • written 4.5 years ago by zhangchi2015012290 • 0

score 1 · Accepted Answer · 2020-06-12

4.5 years ago

Jean-Karim Heriche 27k

There's the pfam_scan.pl Perl script for this. Here is a quick tutorial.

ADD COMMENT • link 4.5 years ago by Jean-Karim Heriche 27k

Thank you for your answer. This method seems too slow for my dataset; maybe I should cluster my proteins first.
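As a cheap first step before real clustering, identical sequences can be collapsed so that each unique protein is scanned only once. This is a minimal sketch of exact-duplicate collapsing, not clustering in the usual sense (it will not merge near-identical sequences). The function names `read_fasta` and `collapse_identical` are made up for illustration, and the FASTA IDs are assumed to be the first whitespace-delimited token of each header.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (protein_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Group proteins with exactly the same sequence.

    Returns (representatives, members):
      representatives maps one protein ID per unique sequence to that sequence,
      members maps the same ID to every protein ID sharing the sequence.
    """
    groups = defaultdict(list)
    for protein_id, seq in records:
        # Normalize case and drop a trailing stop-codon marker, if present.
        groups[seq.upper().rstrip("*")].append(protein_id)
    representatives = {ids[0]: seq for seq, ids in groups.items()}
    members = {ids[0]: ids for ids in groups.values()}
    return representatives, members
```

Only the representatives need to be annotated; the `members` mapping can then be used to copy each Pfam assignment back to every protein with the same sequence.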

ADD REPLY • link 4.5 years ago by zhangchi2015012290 • 0

Have you considered parallelizing? If you cluster the sequences, you could then derive a profile HMM for each cluster and use something like HHsearch to compare these to the Pfam profiles.
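To illustrate the parallelizing idea, here is a rough sketch that splits the proteins into chunks and runs one pfam_scan.pl process per chunk. It assumes pfam_scan.pl is on the PATH, that `pfam_dir` points to a directory with the pressed Pfam-A HMM files, and that the output files do not already exist. The helper names (`split_records`, `write_fasta`, `pfam_scan_command`, `run_parallel`) and the chunk file names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_records(records, n_chunks):
    """Distribute (protein_id, sequence) records round-robin into chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]  # drop empty chunks


def write_fasta(records, path):
    """Write (protein_id, sequence) records to a FASTA file."""
    with open(path, "w") as handle:
        for protein_id, seq in records:
            handle.write(f">{protein_id}\n{seq}\n")


def pfam_scan_command(fasta_path, pfam_dir, out_path):
    """Build the pfam_scan.pl command line for one chunk (one CPU each)."""
    return ["pfam_scan.pl", "-fasta", str(fasta_path), "-dir", str(pfam_dir),
            "-outfile", str(out_path), "-cpu", "1"]


def run_parallel(records, pfam_dir, work_dir, n_chunks=8):
    """Write one FASTA file per chunk and scan all chunks concurrently."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, chunk in enumerate(split_records(records, n_chunks)):
        fasta_path = work_dir / f"chunk_{i}.fasta"
        write_fasta(chunk, fasta_path)
        out_path = work_dir / f"chunk_{i}.pfam.txt"
        commands.append(pfam_scan_command(fasta_path, pfam_dir, out_path))
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands
```

On a cluster, the same chunking would more likely feed a job array than a local thread pool; the per-chunk outputs can simply be concatenated afterwards.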

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k

I just realized that I can download Pfam and COG information from the IMG database, where it has already been assigned to the proteins. I haven't tried your method yet, but it sounds plausible. I'll accept your answer.

ADD REPLY • link 4.5 years ago by zhangchi2015012290 • 0