I would like to know the best way to retrieve the human protein sequences that contain a given domain (e.g. FYVE). Thanks in advance for sharing your approach(es).
If the domain is identified by a feature "type" or "definition" in UniProt (I don't know whether these come from a controlled vocabulary), here is my Java solution.
Compilation and usage:
xjc "http://www.uniprot.org/support/docs/uniprot.xsd"
javac Biostar5862.java org/uniprot/uniprot/*.java
java Biostar5862 -t "transmembrane region"
>[11011_ASFP4]|26-46
PFGCNMKGLGVLLGLFSLILA
>[11011_ASFP4]|154-174
LTLKQYCLYFIISIAFAGCFV
>[11011_ASFP4]|183-203
LNTTIKLLTLLSILVYLAQPV
>[141R_IIV6]|49-69
YIIYAIVAAILLLLFWLLYKK
>[14KD_RHOSH]|85-102
LGGFASGALLALALAGIF
>[1A29_HUMAN]|309-332
VGIIAGLVLFGAVFAGAVVAAVRW
>[1B01_PANTR]|306-329
GIVAGLAVLVVTVAVVAVVAAVMC
>[1B54_HUMAN]|309-332
VGIVAGLAVLAVVVIGAVVATVMC
>[1C18_HUMAN]|309-333
VGIVAGLAVLVVLAVLGAVVAVVMC
>[34KD_MYCPA]|42-62
IAVVALGFAAYLLNFGPTFTI
java Biostar5862 -d "FYVE-type" | head -n 20
>[FGD1_MOUSE]|729-789
EKEVTMCMRCQEPFNSITKRRHHCKACGHVVCGKCSEFRARLIYDNNRSNRVCTDCYVALH
>[LST2_DROMO]|965-1025
DGKAPRCMSCQTPFTAFRRRHHCRNCGGVFCGVCSNASAPLPKYGLTKAVRVCRECYVREV
>[RFFL_HUMAN]|41-96
TGLEPSCKSCGAHFANTARKQTCLDCKKNFCMTCSSQVGNGPRLCLLCQRFRATAF
>[RNF34_BOVIN]|56-107
EGPNIVCKACGLSFSVFRKKHVCCDCKKDFCSVCSVLQENLRRCSTCHLLQE
>[RUFY1_HUMAN]|642-700
DDEATHCRQCEKEFSISRRKHHCRNCGHIFCNTCSSNELALPSYPKPVRVCDSCHTLLL
>[SYTL4_MOUSE]|63-105
CARCQEGLGRLIPKSSTCVGCNHLVCRECRVLESNGSWRCKVC
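The output above mixes entries from several species, while the question asks for human proteins only. UniProt entry names end with a species mnemonic (_HUMAN for Homo sapiens), so one possible post-filter is sketched below. The class name HumanFastaFilter is hypothetical and not part of Biostar5862; the header layout >[ENTRY_NAME]|begin-end is assumed from the output shown above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: keeps only FASTA records whose entry name ends in _HUMAN.
class HumanFastaFilter {

    static List<String> keepHuman(List<String> fastaLines) {
        List<String> kept = new ArrayList<>();
        boolean keep = false;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                // Assumed header layout: >[ENTRY_NAME]|begin-end
                int close = line.indexOf(']');
                String entry = (line.startsWith(">[") && close > 2)
                        ? line.substring(2, close)
                        : "";
                keep = entry.endsWith("_HUMAN");
            }
            // Sequence lines follow the decision made for their header.
            if (keep) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Read FASTA records from standard input and print the human ones.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = in.lines().collect(Collectors.toList());
        keepHuman(lines).forEach(System.out::println);
    }
}
```

It could be chained after the program above, for example: java Biostar5862 -d "FYVE-type" | java HumanFastaFilter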
Here is my approach to find "MYDOMAIN":
1) Go to http://smart.embl.de/ in Genomic mode (this mode should avoid redundancy).
2) In the "Domains detected by SMART" section, type "MYDOMAIN" in the keywords text box and click "Search for keywords".
3) In the card of your domain of interest, click on "Evolution (species in which this domain is found)".
4) Then click on the "Homo sapiens" shortcut to get to the human node.
5) Click on the Homo sapiens node to get access to the "Proteins in Homo sapiens with MYDOMAIN domain" card.
6) From this page you have access to all the protein sequences containing your domain of interest, in FASTA format.
You can use the BioMart web interface in Ensembl. There is a specific filter for genes with a given domain, and you can use a broad range of cross-reference identifiers (Pfam, InterPro, etc.). I really like this approach because it lets you relate a domain to gene structure, get sequence variation, and do a lot of other very useful things. Of course, you can't obtain all kinds of raw data (sequences, structures, etc.).
Have you ever tried it?
There are already nice solutions here. If it is for one or two protein domain families, you can get the list of all domains in an organism using the Species tab (Species distribution) in Pfam. Click on the check box next to your organism of interest, then click on Download to download a text file with sequence accessions or sequences in FASTA format. Pfam also provides a list of domain architectures with FYVE in human. Here is the link to access the architectures of 74 sequences with the FYVE domain.
Another way to get them is to use the advanced search in UniProt. First select the "Domain" option in the field box and type "FYVE". Then click "Add & Search", select "Organism" in the field box and type/select "Homo sapiens".
The other day I tried to upload images to the post and couldn't, so I've written a short post about this here: http://blog.ohnosequences.com/?p=136
Obviously this is not an approach for performing searches programmatically, but by combining the options in this advanced search interface you can perform quite complex searches.
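If a scripted version of the same two-field search is wanted, one possibility is to express it as a query URL. The sketch below only assembles the URL and does not contact any server. Everything specific in it is an assumption that does not come from this thread and should be checked against the UniProt documentation before use: the endpoint rest.uniprot.org/uniprotkb/search, the field names ft_zn_fing and organism_id, and the format=fasta parameter. The class name UniProtQueryUrl is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a search URL combining a feature keyword and a taxon.
class UniProtQueryUrl {

    // Assumed endpoint of the UniProt REST interface (not taken from the thread).
    static final String BASE = "https://rest.uniprot.org/uniprotkb/search";

    static String build(String featureKeyword, int taxonId) {
        // Assumed field names: ft_zn_fing (zinc finger feature), organism_id (NCBI taxon).
        String query = "(ft_zn_fing:" + featureKeyword + ")"
                + " AND (organism_id:" + taxonId + ")";
        // URL-encode the query so that spaces, colons and brackets are safe.
        return BASE + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&format=fasta";
    }

    public static void main(String[] args) {
        // 9606 is the NCBI taxonomic identifier for Homo sapiens.
        System.out.println(build("FYVE", 9606));
    }
}
```

Keeping the query as a plain string makes it easy to add further clauses, mirroring the way fields are combined in the advanced search form.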
Thanks, master Pierre. I would be curious to see the result if you filter on "FYVE" in the section "Sequence similarities: Contains 1 FYVE-type zinc finger" and on "9606" in the section "Taxonomic identifier: 9606 [NCBI]". But I don't know where they are in the XML file. Here is the web entry I used: http://www.uniprot.org/uniprot/Q96K21
uniprot.org/uniprot/Q96K21.xml gives you the answer for the taxonomy: the path is uniprot/entry/organism/dbReference[@type="NCBI Taxonomy" and @id="9606"]. See the generated classes to find out how to get this object.
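As a rough illustration of that taxonomy test, here is a minimal sketch using the standard javax.xml.xpath API instead of the xjc-generated classes. The class name TaxonomyXPath is hypothetical. The sample document is a simplified, namespace-free stand-in in which dbReference is nested inside an organism element, as in the UniProt entry XML. Real UniProt files declare a default namespace, so the expression would need namespace handling before it could be applied to them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: tests whether an entry is annotated with NCBI taxon 9606.
class TaxonomyXPath {

    static final String IS_HUMAN =
            "boolean(/uniprot/entry/organism/dbReference"
            + "[@type='NCBI Taxonomy' and @id='9606'])";

    static boolean isHuman(String xml) {
        try {
            // Parse the XML text into a DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Evaluate the XPath expression as a boolean.
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(IS_HUMAN, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Simplified sample, not a real UniProt record.
        String sample = "<uniprot><entry><organism>"
                + "<dbReference type=\"NCBI Taxonomy\" id=\"9606\"/>"
                + "</organism></entry></uniprot>";
        System.out.println(isHuman(sample)); // prints true
    }
}
```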