Finding Single Domain Proteins
13.2 years ago
Fernando ▴ 30

Hi,

I am just beginning to find my way through protein science, and I have a question: I want a list of all single-domain proteins in the PDB. Is there a list like that?

I tried to work with both CATH and SCOP, but I am not getting anywhere. Does someone have a list of all the single-domain proteins? It does not matter whether they are all-alpha or mixed; I just need a list of them.

What I mean is: let's say a domain is defined as in SCOP (or CATH); I just want a list of single-domain proteins.

Thanks, Fernando

domain • 4.5k views

Can you clarify what you mean by single-domain proteins? (It might help to state what your research question is.)

  1. Peptides which only have 1 functional domain, ignoring overlaps. These would be identifiable by Pfam or RPS-BLAST search against the PDBAA sequence database for domain architecture.

  2. 3D structures that show only 1 domain, ignoring small ligands.

  3. Something else?


For example, lysozyme is a single-domain protein, so I define a single domain as something that cannot be further divided (unlike hemoglobin, which has 4 domains).

Actually, is there a way to find all small globular proteins? These are usually single domains (~100-150 residues).

I am sorry if these questions sound trivial!

I am looking to compare the structures of different small globular proteins (not using RMSD or FASTA, just visual comparison in PyMOL).

13.2 years ago

The following Java program scans UniProt (Swiss-Prot) for entries that have a PDB cross-reference and exactly one PROSITE cross-reference.

First generate the JAXB classes for the UniProt XML schema with:

 xjc -d . "http://www.uniprot.org/docs/uniprot.xsd"

Then compile (javac Biostar14046.java) and run (java Biostar14046) the following program:

import java.net.URL;
import java.util.zip.GZIPInputStream;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

import org.uniprot.uniprot.DbReferenceType;
import org.uniprot.uniprot.Entry;

public class Biostar14046
    {
    void run() throws Exception
        {
        // JAXB context for the classes generated by xjc from uniprot.xsd
        JAXBContext jc = JAXBContext.newInstance("org.uniprot.uniprot");
        Unmarshaller u=jc.createUnmarshaller();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // Stream Swiss-Prot from the UniProt FTP site, decompressing on the fly
        XMLEventReader r= factory.createXMLEventReader(new GZIPInputStream(new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz").openStream()));
        while(r.hasNext())
            {
            // Skip events until the reader is positioned on the next <entry> start tag
            XMLEvent evt=r.peek();
            if(!(evt.isStartElement() && evt.asStartElement().getName().getLocalPart().equals("entry")))
                {
                r.next();
                continue;
                }
            // Unmarshal one <entry> at a time so the whole file is never held in memory
            Entry entry=(Entry)u.unmarshal(r);
            int countprosite=0;
            String pdb=null;
            for(DbReferenceType ref:entry.getDbReference())
                {
                if(ref.getType().equals("PDB") && ref.getId()!=null)
                    {
                    // Only the last PDB ID listed for the entry is kept
                    pdb=ref.getId();
                    }
                else if(ref.getType().equals("PROSITE"))
                    {
                    countprosite++;
                    }
                }
            // Keep entries with exactly one PROSITE cross-reference and at least one PDB ID
            if(countprosite!=1 || pdb==null) continue;

            // getAccession() returns a list, hence the bracketed accessions in the output
            System.out.println(entry.getAccession()+"\t"+pdb);
            }
        }
    public static void main(String[] args) throws Exception
        {
        new Biostar14046().run();
        }
    }

Result:

[Q58097]    2Z61
[P49777, Q9URU7]    1IUF
[P02718]    1OLK
[Q08AH3, B3KTT9, O75202]    3GPC
[P26276]    3C04
[Q9ZCD3]    3MX6
[O35381, P97437]    2JQD
[Q9NQW6, Q5CZ78, Q6NSK5, Q9H8Y4, Q9NVN9, Q9NVP0]    2Y7B
[O43747, O75709, O75842, Q9UG09, Q9Y3U4]    1IU1
[P53068, D6VV95]    1GQP
[P07741, Q3KP55, Q68DF9]    1ZN9
[O50202]    2WFW
[P63590, Q48ZH6, Q9A0E5]    2OCZ
[P0AC38, P04422, P78140, Q2M6G5]    1JSW
[P0ABB8, P39168, Q2M665]    3GWI
[P33447]    1BW0
[P56547]    1RKR
[Q9X108]    1UP7
[P52664]    1HZO
[P0C2P0, P78986, Q0CGS9]    2Z3J
[P14315]    3LK4
[P57730, A2RRF8]    1DGN
[A5JTM5]    1NZY
[Q28960]    1N5D
[P80075, A0AV77, P78388]    1ESR
[P18181, Q545K2]    2PTV
[P31997, O60399, Q16574]    2DKS
[P30429, Q5BHI5]    3LQR
[P36222, B2R7B0, P30923, Q8IVA4, Q96HI7]    1NWU
[Q5PXQ6]    1TMX
[P01524]    1GIB
[Q96LI5, Q9UF92]    3NGQ
[Q9DBL7, A2BFA8, Q3TVZ2, Q8K3Y4]    2F6R
[P49347]    1CNV
[P02526, A2TJU8]    4GCR
[P32081, P41017, Q45690]    2I5M
[P01443]    1KBT
[Q6F495, Q3MV17]    2D04
(...)

xjc -d . "http://www.uniprot.org/support/docs/uniprot.xsd"

When I run this, I get the following error:

[ERROR] schema_reference.4: Failed to read schema document 'http://www.uniprot.org/support/docs/uniprot.xsd' because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>

Could you please fix this?

xjc -d . "http://www.uniprot.org/docs/uniprot.xsd"

Here is a link to a screenshot. When I run the program with java Biostar14046, I get these exceptions.


Compile *all* the classes generated by xjc, not just Biostar14046.java. Something like `find ./ -type f -name "*.java" | xargs javac` should work.


I tried that, but I could not get any results.

$ xjc -d tmp/WS http://www.uniprot.org/docs/uniprot.xsd
parsing a schema...
compiling a schema...
org/uniprot/uniprot/CitationType.java
org/uniprot/uniprot/CofactorType.java
org/uniprot/uniprot/CommentType.java
org/uniprot/uniprot/ConsortiumType.java
org/uniprot/uniprot/DbReferenceType.java
org/uniprot/uniprot/Entry.java
org/uniprot/uniprot/EventType.java
org/uniprot/uniprot/EvidenceType.java
org/uniprot/uniprot/EvidencedStringType.java
org/uniprot/uniprot/FeatureType.java
org/uniprot/uniprot/GeneLocationType.java
org/uniprot/uniprot/GeneNameType.java
org/uniprot/uniprot/GeneType.java
org/uniprot/uniprot/ImportedFromType.java
org/uniprot/uniprot/InteractantType.java
org/uniprot/uniprot/IsoformType.java
org/uniprot/uniprot/KeywordType.java
org/uniprot/uniprot/LocationType.java
org/uniprot/uniprot/MoleculeType.java
org/uniprot/uniprot/NameListType.java
org/uniprot/uniprot/ObjectFactory.java
org/uniprot/uniprot/OrganismNameType.java
org/uniprot/uniprot/OrganismType.java
org/uniprot/uniprot/PersonType.java
org/uniprot/uniprot/PositionType.java
org/uniprot/uniprot/PropertyType.java
org/uniprot/uniprot/ProteinExistenceType.java
org/uniprot/uniprot/ProteinType.java
org/uniprot/uniprot/ReferenceType.java
org/uniprot/uniprot/SequenceType.java
org/uniprot/uniprot/SourceDataType.java
org/uniprot/uniprot/SourceType.java
org/uniprot/uniprot/StatusType.java
org/uniprot/uniprot/SubcellularLocationType.java
org/uniprot/uniprot/Uniprot.java
org/uniprot/uniprot/package-info.java

$ javac -sourcepath tmp/WS Biostar14046.java  tmp/WS/org/uniprot/uniprot/*

$ java -cp tmp/WS:. Biostar14046
[P01386]    1TXB
[Q8QGR0, P80970, Q9PRZ5]    3NEQ
[P17174, B2R6R7, B7Z7E9, Q5VW80]    3II0
[P08874]    2RO4
[Q8N6N7, A6NCI2, B3KTG8]    3EPY
[Q9SWS1, Q42137]    4O7G
[P25984, Q9R559]    3FTN
[Q9Y4W6, Q6P1L0]    2LNA

Hello, it now works with https instead of http. It also does not work with Java versions newer than 8. It is slow, though; is there a faster method available? Thank you, Shashank
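One possible way around both points is to drop JAXB and read the XML with the StAX cursor API alone, which ships with the JDK on Java 8 and later. The sketch below is only an illustration, not the original author's program: the class name `SinglePrositeScan` and the `scan` helper are made up here, it prints only the first (primary) accession rather than the full accession list, and it assumes `uniprot_sprot.xml.gz` has already been downloaded to a local path passed as the first argument. I have not benchmarked it, so treat "faster" as a hope rather than a measurement.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical JAXB-free variant of the scan; names are illustrative only.
class SinglePrositeScan {

    /** Returns "accession<TAB>pdbId" for entries with a PDB ID and exactly one PROSITE cross-reference. */
    static List<String> scan(InputStream in) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> hits = new ArrayList<>();
        String accession = null;
        String pdb = null;
        int prosite = 0;
        int depth = 0;        // number of currently open elements
        int entryDepth = -1;  // depth of the open <entry>, or -1 when outside an entry
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                String name = r.getLocalName();
                if (entryDepth < 0 && name.equals("entry")) {
                    entryDepth = depth;
                    accession = null;
                    pdb = null;
                    prosite = 0;
                } else if (entryDepth >= 0 && depth == entryDepth + 1) {
                    // Only direct children of <entry> are inspected, so
                    // dbReference elements nested in citations are ignored.
                    if (name.equals("accession") && accession == null) {
                        accession = r.getElementText(); // leaves the cursor on </accession>
                        depth--;                        // so account for that end tag here
                    } else if (name.equals("dbReference")) {
                        String type = r.getAttributeValue(null, "type");
                        String id = r.getAttributeValue(null, "id");
                        if ("PDB".equals(type) && id != null) {
                            pdb = id; // like the original, keeps the last PDB ID
                        } else if ("PROSITE".equals(type)) {
                            prosite++;
                        }
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (depth == entryDepth) { // closing tag of the current <entry>
                    if (prosite == 1 && pdb != null) {
                        hits.add(accession + "\t" + pdb);
                    }
                    entryDepth = -1;
                }
                depth--;
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a local copy of uniprot_sprot.xml.gz (assumed to exist)
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(args[0])))) {
            for (String line : scan(in)) {
                System.out.println(line);
            }
        }
    }
}
```

Because nothing here depends on `javax.xml.bind`, there is no xjc step and no generated classes to compile. Reading a local copy also avoids streaming the file over the network on every run.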

13.2 years ago
Fernando ▴ 30

For example, lysozyme is a single-domain protein, so I define a single domain as something that cannot be further divided (unlike hemoglobin, which has 4 domains).

Actually, is there a way to find all small globular proteins? These are usually single domains (~100-150 residues).

I am sorry if these questions sound trivial!

I am looking to compare the structures of different small globular proteins (not using RMSD or FASTA, just visual comparison in PyMOL).


You should update your original question rather than adding a new answer. This one will be deleted soon.

13.2 years ago
Eric T. ★ 2.8k

Some combination of these strategies might do:

  1. Filter NCBI-PDBAA for sequences with length less than, say, 600.

  2. Fetch the FASTA records or PDB files for those matching sequences. Use Biopython to filter for proteins that have (a) one sequence in the FASTA record (via Bio.SeqIO), or (b) one chain in the structure (via Bio.PDB). This will lose a lot of PDB entries where the biological unit is monomeric but the crystal was solved with multiple identical chains -- but I think that's OK for your purposes.

  3. Run RPS-BLAST or HMMer on the PDBAA database, and use a script to filter for sequences that only have one distinct domain. Use a somewhat stringent e-value cutoff to reduce the number of overlapping hits you get. (The possibility of overlapping hits and multiple profile matches for a single domain can make this tricky.)
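Steps 1 and 3 above could be sketched roughly as follows. This is written in Java only to match the program earlier in the thread (the answer itself suggests Biopython), and the class and method names are made up for illustration. The hit table is assumed to be tab-separated lines of `sequenceId`, `domainId`, and e-value; that layout is a stand-in, not the actual output format of RPS-BLAST or HMMER, so the parsing would need adapting. The sketch also does not try to merge overlapping hits from different profiles that describe the same domain, which is the tricky part mentioned in step 3.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; names and input formats are assumptions.
class SingleDomainFilter {

    /** Step 1: IDs of FASTA records whose sequence is shorter than maxLength residues. */
    static List<String> shortSequenceIds(List<String> fastaLines, int maxLength) {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        String id = null;
        for (String raw : fastaLines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // The ID is the first whitespace-delimited token of the header
                id = line.substring(1).split("\\s+")[0];
                lengths.put(id, 0);
            } else if (id != null) {
                // Sequence lines of one record are summed
                lengths.merge(id, line.length(), Integer::sum);
            }
        }
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Integer> e : lengths.entrySet()) {
            if (e.getValue() < maxLength) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    /**
     * Step 3: IDs of sequences that hit exactly one distinct domain at or
     * below the e-value cutoff. Each line: sequenceId TAB domainId TAB evalue.
     */
    static List<String> singleDomainIds(List<String> hitLines, double maxEvalue) {
        Map<String, Set<String>> domains = new LinkedHashMap<>();
        for (String line : hitLines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip malformed lines
            }
            if (Double.parseDouble(fields[2]) > maxEvalue) {
                continue; // hit is too weak
            }
            domains.computeIfAbsent(fields[0], k -> new HashSet<>()).add(fields[1]);
        }
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : domains.entrySet()) {
            if (e.getValue().size() == 1) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}
```

Intersecting the two returned lists gives short sequences with a single detected domain. Sequences with no hit below the cutoff are left out of the second list entirely, which may or may not be what you want for uncharacterized small proteins.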
