As Tony suggested, you can do this easily in R using biomaRt:
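library(biomaRt)
# Connect to the Ensembl mouse gene dataset
ensembl <- useMart('ensembl', dataset = 'mmusculus_gene_ensembl')
# Get all Affymetrix MG_U74A probe set IDs, just for this example
affys <- getBM(mart = ensembl, attributes = 'affy_mg_u74a')
# Just take the first 100 for this example (column 1 holds the IDs)
affys <- affys[1:100, 1]
# Get the genes for those probe sets that are annotated with GO:0005886 (plasma membrane)
# Attribute names vary between Ensembl releases; confirm them with listAttributes(ensembl)
affy_genes <- getBM(
  mart = ensembl,
  attributes = c('mgi_automatic_gene_symbol', 'mgi_curated_gene_symbol', 'affy_mg_u74a'),
  filters = c('affy_mg_u74a', 'go'),
  values = list(affys, 'GO:0005886')
)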
This is just an example; you should check which attributes and which array set you want to use with the listAttributes() and listFilters() functions. Hope this is useful.
Ian, thanks a lot for the detailed response, but I was looking more for filtering suggestions like the ones Daniel gave. Maybe my question was not well formulated. Anyway, thanks for the code sample.
You're welcome. As you probably noticed, there are a very large number of filtering options available with BioMart, which you could certainly build into detailed analyses. The benefit of using Ensembl is that it essentially syndicates a lot of the data that you might otherwise have to trawl multiple databases (each with different access methods) to find. The pipelines also include many predictive tool outputs that you might not have realised were hidden in there!
If you or someone nearby knows R, I would consider using the biomaRt R package. It could be a straightforward way to do this.
You could also, once you have found a method to translate your identifiers, look for matches with genes that exist within existing membrane protein databases.
A quick Google search suggests that a couple exist and are up to date; I'm sure the NAR database issue would supply half a dozen more.
The former at least offers downloads (the latter does not appear to) which would facilitate bringing the data into whatever package you're using for your array analysis.
I'd be interested to see how the GO term approach intersects with this one.
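To make that comparison concrete, here is a minimal sketch in base R. It assumes you already have two vectors of gene symbols, one from the GO term filter and one from a membrane protein database download. The names `go_hits` and `db_hits`, and the symbols in them, are placeholders invented for illustration, not real results.

```r
# Hypothetical gene symbols, invented purely for illustration.
# go_hits: genes returned by the GO term filter
# db_hits: genes found in a membrane protein database download
go_hits <- c("Egfr", "Itgb1", "Cdh1", "Slc2a1")
db_hits <- c("Egfr", "Cdh1", "Kcnq1")

# Genes supported by both approaches
both <- intersect(go_hits, db_hits)

# Genes found by only one approach (candidates for a closer look)
go_only <- setdiff(go_hits, db_hits)
db_only <- setdiff(db_hits, go_hits)

print(both)
print(go_only)
print(db_only)
```

Genes that appear in only one of the two sets are the interesting ones: they hint at either missing GO annotation or gaps in the database.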
Others have covered the use of BioMart and GO annotations -- seems like a perfectly reasonable route. But I'm not sure how complete the GO annotations for localization are. My understanding is that there are membrane-bound proteins that are not annotated with the corresponding GO term (but I don't have a concrete example unfortunately).
Depending on your tolerance for false negatives, you might also consider downloading Phobius and running it locally against a protein sequence file. For example, you might download the RefSeq protein FASTA file from NCBI's FTP site.
If you have a set of probe sets, just go to the DAVID tool, where there is an option to paste your probe set IDs. You will get the annotation there according to your interest. Hope you will find it useful.
Affymetrix annotations also have TMHMM predictions for all the genes associated with the probe sets. They are in the annotation.csv file, which is updated regularly, so you can just do a quick search on that column.
I think I should probably add that you need to watch out for promiscuous probe sets, i.e. ones that map to more than one gene. This is discussed in another answer on this site: "Is It Possible For Two Different Affymetrix Probe Set Id To Have Common Annotations To Same Gene?"
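One way to spot such probe sets is sketched below in base R. It assumes a long-format probe set to gene table, one row per pair, which is roughly the shape that getBM() returns. The data frame `mapping` and all IDs in it are made up for illustration.

```r
# Hypothetical probe set -> gene mapping (IDs invented for illustration),
# one row per (probe set, gene) pair.
mapping <- data.frame(
  probeset = c("p1_at", "p2_at", "p2_at", "p3_at", "p3_at"),
  gene     = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneD"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows so a repeated pair is not counted twice
mapping <- unique(mapping)

# Count the distinct genes attached to each probe set
genes_per_probeset <- table(mapping$probeset)

# Probe sets that map to more than one gene
promiscuous <- names(genes_per_probeset[genes_per_probeset > 1])

# Keep only the probe sets that map to a single gene
specific <- mapping[!mapping$probeset %in% promiscuous, ]

print(promiscuous)
print(specific)
```

Whether you drop the promiscuous probe sets or just flag them depends on your analysis; the sketch only shows how to find them.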