Question

Finding Proteins that have NO known Domains

1

10.5 years ago

ddofer ▴ 30

I want to extract a list of proteins for a given organism that have no (high confidence) predicted domains by PFAM or the like.

(Alternatively, getting a list of predictions for a list of proteins would also be good).

I know that HMMER, Pfam and the like (CCD-Hit) offer various tools for searching for domains, but I don't know how to work with the emailed output files, and I'm specifically interested in finding which proteins DON'T have predicted domains.

Is there an easy/simple way to do this? Even a tool whose output I can copy-paste into a text editor or Excel and then filter by column would do.

Thanks!

sequence pfam domain batch protein • 3.0k views

updated 3.0 years ago by Ram 44k • written 10.5 years ago by ddofer ▴ 30

0

What is the emailed file output? I think that after you BLAST against a domain database, all sequences with no hits can be considered sequences without domains. Am I missing something?

updated 4.9 years ago by Ram 44k • written 10.5 years ago by Prakki Rama ★ 2.7k

0

I was working then with the HMMER and/or PFAM search results, which are returned as a plaintext email. Yuch.

That said, even with the offline tool, I don't know how to parse the command-line output properly; it just prints everything on screen.

updated 3.0 years ago by Ram 44k • written 10.3 years ago by ddofer ▴ 30
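
For the offline route, here is a minimal sketch of one way to get the no-domain list. HMMER's `hmmscan` can write a whitespace-delimited per-sequence table with `--tblout`, which is much easier to parse than the on-screen report. The Python below assumes the table came from something like `hmmscan --tblout hits.tbl Pfam-A.hmm proteins.fasta` (file names are placeholders, and `Pfam-A.hmm` needs to have been prepared with `hmmpress`), that the query name is the third column and the full-sequence E-value the fifth, and that an E-value cutoff of 1e-5 is what "high confidence" means here. The function names and the cutoff are illustrative choices, not part of HMMER.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def domains_by_query(
    tblout_lines: Iterable[str], max_evalue: float = 1e-5
) -> Dict[str, List[Tuple[str, float]]]:
    """Map each query protein to its (model name, E-value) hits.

    Assumes hmmscan --tblout layout: column 1 is the model (target) name,
    column 3 the query name, column 5 the full-sequence E-value.
    """
    hits = defaultdict(list)
    for line in tblout_lines:
        # Skip blank lines and the '#' header/footer comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        model, query, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= max_evalue:
            hits[query].append((model, evalue))
    return dict(hits)


def proteins_without_domains(
    fasta_lines: Iterable[str],
    tblout_lines: Iterable[str],
    max_evalue: float = 1e-5,
) -> List[str]:
    """Return IDs present in the FASTA input but lacking any accepted hit."""
    annotated = set(domains_by_query(tblout_lines, max_evalue))
    return sorted(fasta_ids(fasta_lines) - annotated)


# Example usage with the placeholder file names from above:
# with open("proteins.fasta") as fa, open("hits.tbl") as tbl:
#     for protein_id in proteins_without_domains(fa, tbl):
#         print(protein_id)
```

`domains_by_query` on its own gives the per-protein list of predictions, and the printed IDs (one per line) can be pasted straight into a text editor or spreadsheet.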

Ram · Answer 1 · 2014-06-12

2

10.5 years ago

Elisabeth Gasteiger ★ 2.4k

You could query the UniProt Knowledgebase for proteins with no cross-references to InterPro:

active:yes not database:interpro

http://www.uniprot.org/uniprot/?query=+active%3Ayes+not+database%3Ainterpro&sort=score

10.5 years ago by Elisabeth Gasteiger ★ 2.4k

0

InterPro has many annotations though, not just domains...

(And I'm working on offline sequences which aren't necessarily in UniProt, or even NCBI.

As for your approach on a database, wouldn't it make more sense to just search for proteins with "NOT domain:*"? Your query has proteins with annotated domains right on the first page of results :P)

updated 3.0 years ago by Ram 44k • written 10.3 years ago by ddofer ▴ 30