I have to construct a protein database of a sequenced organism for a proteomics search. Which repository would be the best source of protein sequences for this purpose: GenBank, NCBI RefSeq, or UniProtKB?
Thanks
WoA
UniProtKB contains the richest, most accurate, highest-quality data. GenBank contains raw data; it can be very redundant, and you might have to do a lot of filtering yourself. RefSeq is not as richly annotated, but at least it contains only non-redundant sequences.
So my first choice would be UniProtKB, my second RefSeq, and my third GenBank. But it also depends on whether each resource has sufficient data for the organism you're interested in.
Would you care to share which organism you're interested in?
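To give an idea of the kind of filtering a redundant source might need, here is a minimal Python sketch that collapses exact duplicate sequences in a FASTA file. The function names and the example records are made up for illustration, and it only catches identical sequences, not fragments or near-duplicates, so treat it as a starting point rather than a complete clean-up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Keep the first record seen for each distinct sequence (exact match only)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Made-up records: entry2 has the same sequence as entry1, wrapped differently.
example = """>entry1 hypothetical protein
MKTAYIAK
QRQISFVK
>entry2 same sequence as entry1
MKTAYIAKQRQISFVK
>entry3 another protein
MSDNE
""".splitlines()

unique = deduplicate(read_fasta(example))
for header, seq in unique:
    print(header, len(seq))
```

For a real file you would pass an open file handle to `read_fasta` instead of the example lines.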
You can find the answer to your second question ("what is the difference between Uniprot "Complete Proteome set" and the combined reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries?") on the UniProt homepage:
UniProt really is a combination of two resources: SwissProt and trEMBL.
SwissProt is a high-quality database of real proteins because it is heavily curated. In fact, it is one of the oldest databases we have, and it is maintained by real protein experts.
trEMBL, on the other hand, is not a database of real proteins at all. It is a database of translated nucleotide sequences from EMBL (hence trEMBL). These may well not exist in real biology, or may simply be translated wrongly (a missed exon, for example). The two were combined for practical reasons, but it is good to be aware of the difference.
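If you want to see how much of a downloaded set is reviewed versus unreviewed, the FASTA headers tell you: UniProt headers start with `>sp|` for Swiss-Prot entries and `>tr|` for TrEMBL entries. A rough Python sketch follows; the function names are my own, the TrEMBL accession in the example is made up, and the sequences are placeholders.

```python
from collections import Counter


def entry_source(header_line):
    """Return 'Swiss-Prot', 'TrEMBL' or 'other' for one FASTA header line."""
    db = header_line.lstrip(">").split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(db, "other")


def count_sources(lines):
    """Count header lines by source database, ignoring sequence lines."""
    return Counter(entry_source(line) for line in lines if line.startswith(">"))


# Illustrative input: one reviewed header, one unreviewed header with a
# made-up accession, and placeholder sequence lines.
example = [
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4",
    "MEEPQSDPSV",
    ">tr|X0X0X0|X0X0X0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606",
    "MSDNE",
]

counts = count_sources(example)
print(counts["Swiss-Prot"], counts["TrEMBL"])
```

The same check can be used to keep only the `sp` entries if you decide the unreviewed translations are too noisy for your search.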
When you go to download the FASTA (assuming that is what you are using), e.g. http://www.uniprot.org/uniprot/?query=organism%3a9606+keyword%3a181&format=*, you get a choice to download the canonical sequence data, or canonical and isoform sequence data. The latter presumably includes splice variants as separate protein entries.
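In the "canonical and isoform" download, isoform entries are normally distinguishable by a hyphenated accession (for example `P04637-2`) and a description starting with "Isoform N of". Assuming that convention holds for your file, a small Python sketch like this groups isoforms under their parent accession; the function names are hypothetical.

```python
from collections import defaultdict


def parse_accession(header_line):
    """Split a UniProt-style header into (base accession, isoform number or None)."""
    fields = header_line.lstrip(">").split("|")
    accession = fields[1]
    base, sep, iso = accession.partition("-")
    return base, (int(iso) if sep else None)


def group_by_protein(header_lines):
    """Map each base accession to the isoform numbers seen (None = canonical)."""
    groups = defaultdict(list)
    for line in header_lines:
        if line.startswith(">"):
            base, iso = parse_accession(line)
            groups[base].append(iso)
    return dict(groups)


# Headers written to follow the assumed convention.
example = [
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4",
    ">sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53",
    ">sp|P04637-3|P53_HUMAN Isoform 3 of Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53",
]

groups = group_by_protein(example)
print(groups)
```

This makes it easy to check how many extra entries the isoform option adds before deciding which file to search against.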
What I would like to see is data that can be linked to mRNA isoforms. RefSeq allows this. GenBank would be noisy, as Martijn says. The mRNA isoforms can be important because they are expressed at different levels according to cell type, temporal patterns (circadian, developmental), and responses to stimuli. These points could be quite critical to the design of the experiment whose data you are about to analyze, or to the hypotheses it addresses.
For mass spectrometry–based proteomics, the International Protein Index (IPI, http://www.ebi.ac.uk/IPI/IPIhelp.html) has been a popular choice for common organisms. For some reason they don't have yeast, but the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/) fills in nicely there. However, IPI is closing soon, and they recommend UniProt complete proteome sets (http://www.uniprot.org/faq/15) as a replacement. Overall, UniProt seems to provide good information for pretty much any organism, even if it doesn't have a complete proteome set yet, and it is definitely the most extensive, so I would recommend just going there for everything.
The NCBI nr protein database is explained here: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide
Many thanks! Can somebody tell me what the difference is between the UniProt "Complete Proteome set" and the combined reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries?
For some organisms the difference is negligible, but for others I have seen a difference of around 100 entries so far.
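One way to see exactly which entries account for such a difference is to compare the accessions in the two downloads. This is only a sketch in Python: the function names are mine, the accessions in the example are made up, and it assumes UniProt-style `>db|ACCESSION|NAME` headers (falling back to the first word of the header otherwise).

```python
def accessions(lines):
    """Collect accessions from FASTA header lines."""
    found = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) >= 3:
                found.add(parts[1])               # UniProt-style header
            else:
                found.add(line[1:].split()[0])    # fallback: first token
    return found


def compare(first, second):
    """Summarise which accessions are unique to each set."""
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
        "shared": len(first & second),
    }


# Made-up accessions standing in for the two downloads.
complete_proteome = [">sp|A00001|ONE_TEST", ">tr|A00002|TWO_TEST"]
combined_entries = [">sp|A00001|ONE_TEST", ">tr|A00002|TWO_TEST", ">tr|A00003|THREE_TEST"]

report = compare(accessions(complete_proteome), accessions(combined_entries))
print(report)
```

Looking at the entries that appear in only one set should show whether the extra ones are fragments, redundant translations, or something you actually want in the search database.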