4.5 years ago
Hi! I'm doing bacterial pan-genome research that involves thousands of genomes. I'm trying to assign every protein in all of the genomes to a Pfam family. I know there are tools like the NCBI CDD database, but I don't know how to run a scripted search, since the website only lets you search 4000 proteins at a time. Is there an R package for this job, or any other convenient method?
Thank you for your answer. This method seems too slow for my data set; maybe I should cluster my proteins first.
Have you considered parallelizing? If you cluster the sequences then you could derive a profile HMM for each cluster and use something like HHsearch to compare these to the Pfam profiles.
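To make the parallelizing suggestion concrete, here is a minimal Python sketch that splits a protein FASTA file into chunks and annotates the chunks concurrently. It assumes the search tool is HMMER's `hmmscan` run against a Pfam-A.hmm database already prepared with `hmmpress`; that tool choice, the file names, and the helper functions (`read_fasta`, `split_records`, `build_command`, `annotate`) are illustrative assumptions, not something stated in this thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def build_command(chunk_path, pfam_db, out_path):
    """Build one search command (assumed tool: HMMER's hmmscan)."""
    return [
        "hmmscan",
        "--cpu", "1",                    # one core per job; jobs run side by side
        "--cut_ga",                      # use Pfam's curated gathering thresholds
        "--domtblout", str(out_path),    # per-domain hits as a parseable table
        "-o", "/dev/null",               # discard the verbose text report
        str(pfam_db),
        str(chunk_path),
    ]


def annotate(fasta_path, pfam_db, workdir, n_jobs=4):
    """Split fasta_path into n_jobs chunks and search them concurrently."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Reads the whole file into memory; fine for a sketch, not for huge inputs.
    records = read_fasta(Path(fasta_path).read_text())
    commands = []
    for i, chunk in enumerate(split_records(records, n_jobs)):
        chunk_path = workdir / f"chunk_{i}.faa"
        chunk_path.write_text("".join(f">{h}\n{s}\n" for h, s in chunk))
        out_path = workdir / f"chunk_{i}.domtbl"
        commands.append(build_command(chunk_path, pfam_db, out_path))
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
```

A call such as `annotate("proteins.faa", "Pfam-A.hmm", "pfam_out", n_jobs=8)` would leave one `.domtbl` table per chunk in `pfam_out`, which can then be concatenated. On a cluster, the same chunking works with one scheduler job per chunk instead of a thread pool.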
I just realized that I can download Pfam and COG information from the IMG database, where it has already been assigned to the proteins. I haven't tried your method yet, but it sounds plausible. I'll accept your answer.