How To Retrieve Human Protein Sequences Containing A Given Domain

14.2 years ago

Fred Fleche 4.3k

I would like to know the best way to retrieve the human protein sequences that contain a given domain (e.g. FYVE). Thanks in advance for sharing your approach(es).

human protein protein sequence fasta • 8.0k views

ADD COMMENT • link updated 14.2 years ago by Khader Shameer 18k • written 14.2 years ago by Fred Fleche 4.3k

14.2 years ago

Pierre Lindenbaum 166k

If you have a "type" or a "description" defined in UniProt (I don't know if it is a controlled vocabulary), here is my Java solution.

	import java.io.InputStream;
	import java.net.URL;
	import java.util.zip.GZIPInputStream;

	import javax.xml.bind.JAXBContext;
	import javax.xml.bind.Unmarshaller;
	import javax.xml.namespace.QName;
	import javax.xml.stream.XMLEventReader;
	import javax.xml.stream.XMLInputFactory;
	import javax.xml.stream.events.XMLEvent;
	import org.uniprot.uniprot.Entry;
	import org.uniprot.uniprot.FeatureType;
	import org.uniprot.uniprot.LocationType;
	import org.uniprot.uniprot.PositionType;

	/* Generate the org.uniprot.uniprot classes first:
	   xjc "http://www.uniprot.org/support/docs/uniprot.xsd"
	   JAXB (javax.xml.bind) is bundled with the JDK only up to Java 10. */
	public class Biostar5862
	{
		/* feature description to match (option -d), e.g. "FYVE-type" */
		private String description=null;
		/* feature type to match (option -t), e.g. "transmembrane region" */
		private String type=null;

		private void run() throws Exception
		{
			JAXBContext jc = JAXBContext.newInstance("org.uniprot.uniprot");
			Unmarshaller decoder=jc.createUnmarshaller();

			/* stream the whole Swiss-Prot XML release, one <entry> at a time */
			URL url=new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz");
			InputStream in=new GZIPInputStream(url.openStream());
			XMLInputFactory factory = XMLInputFactory.newInstance();
			factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
			factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
			factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
			factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
			XMLEventReader r= factory.createXMLEventReader(in);
			while(r.hasNext())
			{
				/* skip everything until the next <entry> start tag */
				XMLEvent evt=r.peek();
				if(!evt.isStartElement()) { r.next(); continue;}
				QName qName=evt.asStartElement().getName();
				if(!qName.getLocalPart().equals("entry")) { r.next(); continue;}

				Entry entry= decoder.unmarshal(r,Entry.class).getValue();

				for(FeatureType featureType:entry.getFeature())
				{
					/* keep the feature if its description or its type matches */
					boolean ok=false;
					if(this.description!=null &&
						this.description.equalsIgnoreCase(featureType.getDescription()))
					{
						ok=true;
					}
					if(this.type!=null &&
						this.type.equalsIgnoreCase(featureType.getType()))
					{
						ok=true;
					}
					if(!ok) continue;
					LocationType locType=featureType.getLocation();
					if(locType==null) continue;
					PositionType begin=locType.getBegin();
					PositionType end=locType.getEnd();
					PositionType pos=locType.getPosition();
					if(pos!=null)
					{
						/* single residue: UniProt positions are 1-based */
						if(pos.getPosition()==null) continue;
						int n=pos.getPosition().intValue();
						System.out.println(">"+entry.getName()+"|"+pos.getPosition());
						System.out.println(entry.getSequence().getValue().substring(n-1,n));
					}
					else if(end!=null && begin!=null)
					{
						/* range: skip features whose bounds are unknown */
						if(begin.getPosition()==null || end.getPosition()==null) continue;
						int n1=begin.getPosition().intValue()-1;
						int n2=end.getPosition().intValue()-1;
						System.out.println(">"+entry.getName()+"|"+begin.getPosition()+"-"+end.getPosition());
						System.out.println(entry.getSequence().getValue().substring(n1,n2+1));
					}
				}
			}
			r.close();
			in.close();
		}

		public static void main(String[] args)
		{
			try {
				Biostar5862 app=new Biostar5862();
				int optind=0;
				while(optind<args.length)
				{
					if(args[optind].equals("-h"))
					{
						System.err.println("Usage: java Biostar5862 [-d description] [-t type]");
						return;
					}
					else if(args[optind].equals("-d"))
					{
						app.description=args[++optind];
					}
					else if(args[optind].equals("-t"))
					{
						app.type=args[++optind];
					}
					else if(args[optind].equals("--"))
					{
						optind++;
						break;
					}
					else if(args[optind].startsWith("-"))
					{
						System.err.println("Unknown option: "+args[optind]);
						return;
					}
					else
					{
						break;
					}
					++optind;
				}

				app.run();
			}
			catch (Exception e)
			{
				e.printStackTrace();
			}
		}
	}

Compilation:

xjc "http://www.uniprot.org/support/docs/uniprot.xsd"
javac Biostar5862.java org/uniprot/uniprot/*.java

Test with type="transmembrane region"

java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI

Test with description="FYVE-type"

java Biostar5862 -d "FYVE-type" | head -n 20
>[FGD1_MOUSE]|729-789
EKEVTMCMRCQEPFNSITKRRHHCKACGHVVCGKCSEFRARLIYDNNRSNRVCTDCYVALH
>[LST2_DROMO]|965-1025
DGKAPRCMSCQTPFTAFRRRHHCRNCGGVFCGVCSNASAPLPKYGLTKAVRVCRECYVREV
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RNF34_BOVIN]|56-107
EGPNIVCKACGLSFSVFRKKHVCCDCKKDFCSVCSVLQENLRRCSTCHLLQE
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
>[SYTL4_MOUSE]|63-105
CARCQEGLGRLIPKSSTCVGCNHLVCRECRVLESNGSWRCKVC

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Pierre Lindenbaum 166k

Thanks, master Pierre. I would be curious to see the result if you filter on "FYVE" in the section "Sequence similarities: Contains 1 FYVE-type zinc finger" and on "9606" in the section "Taxonomic identifier: 9606 [NCBI]". But I don't know where they are in the XML file.

ADD REPLY • link 14.1 years ago by Fred Fleche 4.3k

uniprot.org/uniprot/Q96K21.xml gives you the answer for the taxonomy: the path is uniprot/entry/organism/dbReference[@type="NCBI Taxonomy" and @id="9606"]. See the generated classes to find out how to get this object.
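
For anyone who wants to try the taxonomy test without the generated classes, here is a minimal sketch using the XPath API that ships with the JDK. The class name `TaxonomyXPath`, the method `hasTaxon` and the trimmed `SAMPLE` document are made up for illustration. The sample omits the default namespace that real UniProt XML declares, and it places the taxonomy `dbReference` inside the `organism` element, which is where it sits in the UniProt schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TaxonomyXPath {
    // Hypothetical, heavily trimmed stand-in for a UniProt entry
    // (no namespace declaration, no features, no sequence).
    static final String SAMPLE =
        "<uniprot><entry>"
        + "<accession>Q96K21</accession>"
        + "<organism><name type=\"scientific\">Homo sapiens</name>"
        + "<dbReference type=\"NCBI Taxonomy\" id=\"9606\"/></organism>"
        + "</entry></uniprot>";

    /** Returns true if an entry in the XML text carries the given NCBI taxonomy id. */
    static boolean hasTaxon(String xml, String taxId) {
        try {
            // The default factory is not namespace-aware, so plain element names are used.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            String expr = "boolean(/uniprot/entry/organism/dbReference"
                + "[@type='NCBI Taxonomy' and @id='" + taxId + "'])";
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasTaxon(SAMPLE, "9606"));   // human
        System.out.println(hasTaxon(SAMPLE, "10090"));  // mouse
    }
}
```

This loads the whole document into memory, so it suits single entries such as Q96K21.xml rather than the full Swiss-Prot dump, which the streaming program above handles entry by entry.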

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 166k

14.2 years ago

Fred Fleche 4.3k

Here is my approach to find "MYDOMAIN":

1) Go to http://smart.embl.de/ in Genomic Mode (this mode should avoid redundancy).

2) In the "Domains detected by SMART" section, type "MYDOMAIN" in the keywords text box and click "Search for keywords".

3) In the card of your domain of interest, click on "Evolution (species in which this domain is found)".

4) Then click on the "Homo sapiens" shortcut to get to the human node.

5) Click on the Homo sapiens node to access the "Proteins in Homo sapiens with MYDOMAIN domain" card.

6) From this page you have access to all the protein sequences related to your domain of interest in FASTA format.

ADD COMMENT • link 14.2 years ago by Fred Fleche 4.3k

14.2 years ago

Jarretinha 3.5k

You can use the BioMart web interface in Ensembl. There's a specific filter for genes with a given domain, and you can use a broad range of cross-reference identifiers (Pfam, InterPro, etc.). I really like this approach because it lets you relate domains to gene structure, get sequence variation, and do a lot of other very useful things. Of course, you can't obtain all kinds of raw data (sequences, structures, etc.).

Have you ever tried it?

ADD COMMENT • link 14.2 years ago by Jarretinha 3.5k

Thanks a lot for the suggestion. Unfortunately, it seems that I cannot use SMART IDs to filter the gene set.

ADD REPLY • link 14.2 years ago by Fred Fleche 4.3k

I've checked it again. SMART IDs are there too. Go to Filters -> Protein Domains -> Limit to genes -> with Protein feature smart IDs.

ADD REPLY • link 14.1 years ago by Jarretinha 3.5k

I agree, but you cannot enter a specific SMART ID to filter on. You can only do that with the select item just below, which does not contain a SMART option.

ADD REPLY • link 14.1 years ago by Fred Fleche 4.3k

That's true! Besides that, Ensembl will return only SMART accessions (erroneously called IDs). But there's a workaround! The BioMart web interface also generates a Perl script. You just need to add a few lines.
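
The "few lines" could also be a post-filter applied to the exported table instead of a change to the generated script. Below is a rough sketch in Java under stated assumptions: the export is tab-separated with the gene ID in the first column and the SMART accession in the second (a hypothetical layout, adjust to your own attribute selection), and SM00064 stands for the FYVE accession, which is worth checking in SMART. The class name `SmartFilter`, the method `genesWithDomain` and the rows are invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SmartFilter {
    /** Keeps the distinct gene IDs (column 1) whose SMART accession (column 2) matches. */
    static Set<String> genesWithDomain(List<String> tsvLines, String smartAccession) {
        Set<String> genes = new LinkedHashSet<>();   // keeps first-seen order, drops repeats
        for (String line : tsvLines) {
            String[] cols = line.split("\t", -1);    // -1 keeps empty trailing columns
            if (cols.length < 2) continue;           // skip malformed rows
            if (cols[1].trim().equals(smartAccession)) {
                genes.add(cols[0].trim());
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for a BioMart export (gene ID, SMART accession)
        List<String> rows = Arrays.asList(
            "GENE_A\tSM00064",
            "GENE_A\tSM00233",
            "GENE_B\tSM00233",
            "GENE_C\tSM00064",
            "GENE_D\t");
        System.out.println(genesWithDomain(rows, "SM00064"));
    }
}
```

A gene with several domains appears on several rows, which is why the result is collected in a set.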

ADD REPLY • link 14.1 years ago by Jarretinha 3.5k

14.2 years ago

Khader Shameer 18k

There are already nice solutions here. If it is for one or two protein domain families, you can get the list of all proteins carrying the domain in an organism using the Species tab (species distribution) in Pfam. Tick the check-box next to your organism of interest, then click Download to get a text file with the sequence accessions, or the sequences in FASTA format. Pfam also provides a list of domain architectures with FYVE in human. Here is the link to access the architectures of 74 sequences with the FYVE domain.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Khader Shameer 18k

Thanks Khader. I tested your approach and it works well. If I restrict the list to the genes that encode the proteins, I get around 40 genes, which match the genes from Marina's method.

ADD REPLY • link 14.1 years ago by Fred Fleche 4.3k

14.2 years ago

Marina Manrique ★ 1.3k

Another way to get them is the advanced search in UniProt. First select the "Domain" option in the field box and type "FYVE". Then click "Add & Search", select "Organism" in the field box and type/select "Homo sapiens".

The other day I tried to upload images to this post and couldn't manage it, so I've written a short post about it here: http://blog.ohnosequences.com/?p=136

Obviously this is not a way to perform searches programmatically, but by combining the options of this advanced search interface you can perform quite complex searches.

ADD COMMENT • link 14.2 years ago by Marina Manrique ★ 1.3k

Your solution works quite well and returns 40 human proteins instead of 28 for mine. The additional proteins do not seem to be false positives, so that is nice.

ADD REPLY • link 14.1 years ago by Fred Fleche 4.3k

I was talking more about the human genes coding for proteins with this domain.
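
To count genes rather than proteins, one option is to collapse the downloaded FASTA file on the gene name. The sketch below assumes UniProt-style FASTA headers, which carry `OX=<taxonomy id>` and `GN=<gene name>` fields. Headers from other sources (Pfam, SMART) are laid out differently and would need another pattern. The class name `GenesFromFasta`, the method `humanGenes` and the sample records are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenesFromFasta {
    // GN=<gene name> and OX=9606 (human) as they appear in UniProt FASTA headers
    private static final Pattern GENE = Pattern.compile("\\bGN=(\\S+)");
    private static final Pattern HUMAN = Pattern.compile("\\bOX=9606\\b");

    /** Collects the distinct gene names found in human header lines. */
    static Set<String> humanGenes(List<String> fastaLines) {
        Set<String> genes = new TreeSet<>();              // sorted, no repeats
        for (String line : fastaLines) {
            if (!line.startsWith(">")) continue;          // sequence line, not a header
            if (!HUMAN.matcher(line).find()) continue;    // another organism
            Matcher m = GENE.matcher(line);
            if (m.find()) genes.add(m.group(1));          // headers without GN= are skipped
        }
        return genes;
    }

    public static void main(String[] args) {
        // Made-up records: two human proteins from the same gene and one mouse protein
        List<String> fasta = Arrays.asList(
            ">sp|X00001|AAA_HUMAN Example protein OS=Homo sapiens OX=9606 GN=GENE1 PE=1 SV=1",
            "MKTAYIAKQR",
            ">tr|X00002|X00002_HUMAN Example fragment OS=Homo sapiens OX=9606 GN=GENE1 PE=4 SV=1",
            "MKTAYIA",
            ">sp|X00003|BBB_MOUSE Example protein OS=Mus musculus OX=10090 GN=Gene2 PE=1 SV=1",
            "MSTNPKPQRK");
        Set<String> genes = humanGenes(fasta);
        System.out.println(genes.size() + " gene(s): " + genes);
    }
}
```

Several protein entries can share one gene name, which is one reason the protein and gene counts from the different methods do not have to agree.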

ADD REPLY • link 14.1 years ago by Fred Fleche 4.3k