The following Java program scans UniProt and searches for entries that have a cross-reference in PDB and exactly one cross-reference in PROSITE.
First generate the JAXB classes for the XML unmarshaller with:
xjc -d . "http://www.uniprot.org/docs/uniprot.xsd"
Then compile (javac Biostar14046.java) and run (java Biostar14046) the following program:
// Note: javax.xml.bind (JAXB) and xjc ship with the JDK only up to Java 10;
// on later versions they must be added as a separate dependency.
import java.net.URL;
import java.util.zip.GZIPInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;
import org.uniprot.uniprot.DbReferenceType;
import org.uniprot.uniprot.Entry;

public class Biostar14046
    {
    void run() throws Exception
        {
        // Unmarshaller for the classes generated by xjc from uniprot.xsd
        JAXBContext jc = JAXBContext.newInstance("org.uniprot.uniprot");
        Unmarshaller u = jc.createUnmarshaller();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        // Stream the gzipped Swiss-Prot XML so the whole file is never held in memory
        XMLEventReader r = factory.createXMLEventReader(new GZIPInputStream(new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz").openStream()));
        while(r.hasNext())
            {
            // Skip every event until the next <entry> start tag
            XMLEvent evt = r.peek();
            if(!(evt.isStartElement() && evt.asStartElement().getName().getLocalPart().equals("entry")))
                {
                r.next();
                continue;
                }
            // Unmarshal one <entry> element; this advances the reader past it
            Entry entry = (Entry)u.unmarshal(r);
            int countprosite = 0;
            String pdb = null;
            for(DbReferenceType ref : entry.getDbReference())
                {
                if(ref.getType().equals("PDB") && ref.getId() != null)
                    {
                    // If there are several PDB references, the last one is kept
                    pdb = ref.getId();
                    }
                else if(ref.getType().equals("PROSITE"))
                    {
                    countprosite++;
                    }
                }
            // Keep only entries with a PDB id and exactly one PROSITE reference
            if(countprosite != 1 || pdb == null) continue;
            System.out.println(entry.getAccession() + "\t" + pdb);
            }
        }

    public static void main(String[] args) throws Exception
        {
        new Biostar14046().run();
        }
    }
Result:
[Q58097] 2Z61
[P49777, Q9URU7] 1IUF
[P02718] 1OLK
[Q08AH3, B3KTT9, O75202] 3GPC
[P26276] 3C04
[Q9ZCD3] 3MX6
[O35381, P97437] 2JQD
[Q9NQW6, Q5CZ78, Q6NSK5, Q9H8Y4, Q9NVN9, Q9NVP0] 2Y7B
[O43747, O75709, O75842, Q9UG09, Q9Y3U4] 1IU1
[P53068, D6VV95] 1GQP
[P07741, Q3KP55, Q68DF9] 1ZN9
[O50202] 2WFW
[P63590, Q48ZH6, Q9A0E5] 2OCZ
[P0AC38, P04422, P78140, Q2M6G5] 1JSW
[P0ABB8, P39168, Q2M665] 3GWI
[P33447] 1BW0
[P56547] 1RKR
[Q9X108] 1UP7
[P52664] 1HZO
[P0C2P0, P78986, Q0CGS9] 2Z3J
[P14315] 3LK4
[P57730, A2RRF8] 1DGN
[A5JTM5] 1NZY
[Q28960] 1N5D
[P80075, A0AV77, P78388] 1ESR
[P18181, Q545K2] 2PTV
[P31997, O60399, Q16574] 2DKS
[P30429, Q5BHI5] 3LQR
[P36222, B2R7B0, P30923, Q8IVA4, Q96HI7] 1NWU
[Q5PXQ6] 1TMX
[P01524] 1GIB
[Q96LI5, Q9UF92] 3NGQ
[Q9DBL7, A2BFA8, Q3TVZ2, Q8K3Y4] 2F6R
[P49347] 1CNV
[P02526, A2TJU8] 4GCR
[P32081, P41017, Q45690] 2I5M
[P01443] 1KBT
[Q6F495, Q3MV17] 2D04
(...)
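The selection rule used in the loop above (a PDB cross-reference plus exactly one PROSITE cross-reference) can be tried without downloading UniProt. This is only a minimal sketch and not part of the original program: the class FilterSketch, the method selectPdb, the {type, id} string pairs standing in for DbReferenceType, and the placeholder ids are all hypothetical. Like the loop above, it keeps the last PDB id it sees.

```java
import java.util.Arrays;
import java.util.List;

public class FilterSketch {
    // Each reference is a hypothetical {type, id} pair standing in for DbReferenceType.
    // Returns the last PDB id if there is exactly one PROSITE reference, otherwise null.
    public static String selectPdb(List<String[]> refs) {
        int countProsite = 0;
        String pdb = null;
        for (String[] ref : refs) {
            String type = ref[0];
            String id = ref[1];
            if ("PDB".equals(type) && id != null) {
                pdb = id;
            } else if ("PROSITE".equals(type)) {
                countProsite++;
            }
        }
        return (countProsite == 1) ? pdb : null;
    }

    public static void main(String[] args) {
        // Placeholder ids, not real database records.
        List<String[]> kept = Arrays.asList(
            new String[]{"PDB", "PDB_X"},
            new String[]{"PROSITE", "PS_A"});
        List<String[]> dropped = Arrays.asList(
            new String[]{"PDB", "PDB_Y"},
            new String[]{"PROSITE", "PS_A"},
            new String[]{"PROSITE", "PS_B"});
        System.out.println(selectPdb(kept));    // one PROSITE reference: entry is kept
        System.out.println(selectPdb(dropped)); // two PROSITE references: entry is skipped
    }
}
```

Keeping the rule in a separate method makes it easy to change the criterion (for example, to report every PDB id rather than the last one) without touching the XML streaming code.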
Can you clarify what you mean by single-domain proteins? (It might help to state what your research question is.)
Peptides which only have 1 functional domain, ignoring overlaps. These would be identifiable by Pfam or RPS-BLAST search against the PDBAA sequence database for domain architecture.
3D structures that show only 1 domain, ignoring small ligands.
Something else?
For example, lysozyme is a single-domain protein, so I define a single domain as something that cannot be further divided (unlike hemoglobin, which has 4 domains).
Actually, is there a way to find all small globular proteins? These are usually single domains (~100-150 residues).
I am sorry if these questions sound trivial!
I am looking to compare the structures of different small globular proteins (not using RMSD or FASTA, just visual comparison using PyMOL).
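For the small-protein question, one rough first pass is to filter on sequence length. The following is a minimal sketch, assuming the ~100-150 residue window mentioned above; the class SmallProteinFilter and the methods isSmall and poly are hypothetical. Length alone does not show that a protein is globular or single-domain, so this only narrows the candidate list. In the UniProt program the sequence would come from the unmarshalled entry; the exact getter depends on the classes xjc generates, so it is not shown here.

```java
public class SmallProteinFilter {
    // Assumed window, taken from the "~100-150 residues" estimate in the comment above.
    static final int MIN_RESIDUES = 100;
    static final int MAX_RESIDUES = 150;

    // True if the sequence length, ignoring whitespace, falls inside the window.
    public static boolean isSmall(String sequence) {
        if (sequence == null) {
            return false;
        }
        // Sequences are often wrapped over several lines, so strip whitespace first.
        int length = sequence.replaceAll("\\s+", "").length();
        return length >= MIN_RESIDUES && length <= MAX_RESIDUES;
    }

    // Builds a dummy poly-alanine sequence of n residues for the demo.
    public static String poly(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('A');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isSmall(poly(129))); // inside the window
        System.out.println(isSmall(poly(574))); // outside the window
    }
}
```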