NCBI vs Ensembl - which one to choose - for downloading protein FASTA files
2
0
7.5 years ago
Idit • 0

Hi

[I'm a newbie in bioinformatics, so my apologies for any misuse of terms.]

I need to decide which resource to use to download complete protein FASTA files for many species, so that I can run blastp queries for all human proteins against each species. I would like to download files for most of the eukaryotic species available. I checked some species in the latest releases of both Ensembl and NCBI, and saw that there are big differences between them.

For example, when I downloaded the protein FASTA file of "Otolemur garnettii", the Ensembl FASTA had 19986 proteins, whereas the NCBI FASTA had 26925. When I ran a sample blastp for a human protein sequence against each of these protein files (after running makeblastdb, of course), the highest bitscore was very different between Ensembl and NCBI.
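For anyone who wants to reproduce protein counts like these, a minimal sketch is to count the header lines (those starting with ">") in each FASTA file. The function name and the file names in the usage comment are hypothetical, chosen only for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory demo with made-up records.
demo = [">protein_1\n", "MKTAYIAK\n", ">protein_2\n", "MSLLTEVE\n", "TYVLSIIP\n"]
print(count_fasta_records(demo))

# With real downloads (hypothetical file names), pass an open file handle:
# with open("otolemur_ensembl.pep.fa") as handle:
#     print(count_fasta_records(handle))
```

For gzip-compressed downloads, `gzip.open(path, "rt")` from the standard library can be used in place of `open`.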

Also, when I run blastp for the same species, Ensembl vs NCBI and vice versa, I get > 1000 proteins with %identity < 30, which I understand to be proteins that exist in one resource and not in the other (?)
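One way to make this check explicit, sketched under the assumption that blastp was run with `-outfmt 6` (the default 12-column tabular layout, where column 3 is percent identity and column 12 is bitscore): keep the top-bitscore hit per query, then list queries whose best hit falls below an identity cutoff. Queries with no hit at all produce no line in tabular output, so the full list of query IDs has to be supplied separately. The function names and the 30% cutoff are illustrative, and percent identity alone ignores how much of the protein the alignment covers.

```python
def best_hits(lines):
    """Map each query ID to (subject ID, percent identity, bitscore) of its top-bitscore hit."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


def queries_without_close_match(lines, all_query_ids, min_identity=30.0):
    """Return query IDs with no hit at all, or whose best hit is below min_identity."""
    best = best_hits(lines)
    return sorted(
        q for q in all_query_ids
        if q not in best or best[q][1] < min_identity
    )


# Made-up tabular rows for illustration only.
rows = [
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400\n",
    "q1\ts2\t40.0\t150\t90\t2\t1\t150\t5\t154\t1e-20\t80\n",
    "q2\ts3\t25.0\t50\t37\t1\t10\t59\t3\t52\t1e-3\t35\n",
]
print(queries_without_close_match(rows, ["q1", "q2", "q3"]))
```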

I know they use different gene annotation methods, so it makes sense that there are differences. My question is: do you have experience working with both resources, and do you have any recommendations on which one to choose?

Thanks a lot,

Idit

blast fasta NCBI Ensembl • 8.1k views
2

UniProt has built a database of reference proteomes for most organisms sequenced to date: http://www.uniprot.org/proteomes/

0

Thanks, I downloaded all the Eukaryotes I needed from the UniProt FTP site

5
7.5 years ago

Different resources differ because they do not have the same focus. For example, Ensembl is about annotating genomes, whereas UniProt is about collecting and annotating proteins and thus doesn't have a notion of an underlying genome. If you need data integration at the genome level, e.g. you need to refer to genes at some point, then you're better off working with a well-organized genome annotation resource like Ensembl, which has already integrated plenty of information. Whichever resource you choose, make sure you understand what it is about and how this impacts your work. Also, don't try a mix-and-match approach between resources; this is asking for trouble.

0

For this project I only need the protein sequences and not the genomic annotation, so it looks like I will go for UniProt. Thanks for the mix & match warning, I almost did it.

1
7.5 years ago
Whoknows ▴ 960

Neither!

It is better to download from UniProt. You could also download RefSeq proteins from the NCBI website, but in my experience UniProt gives more information and is more up to date than NCBI and Ensembl.

Another advantage of UniProt is that you can obtain either the manually curated Swiss-Prot entries or the computationally predicted TrEMBL entries.
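In UniProt FASTA files the two sections can be told apart from the header alone: Swiss-Prot entries start with ">sp|" and TrEMBL entries with ">tr|". A rough sketch for tallying how many of each a downloaded proteome contains follows; the function name and the demo records are made up for illustration.

```python
from collections import Counter


def count_by_review_status(lines):
    """Tally FASTA headers by UniProt section, based on the '>sp|' / '>tr|' prefix."""
    counts = Counter()
    for line in lines:
        if line.startswith(">sp|"):
            counts["Swiss-Prot"] += 1
        elif line.startswith(">tr|"):
            counts["TrEMBL"] += 1
        elif line.startswith(">"):
            counts["other"] += 1  # header that is not in UniProt sp/tr style
    return counts


# Made-up records in UniProt header style, for illustration only.
demo = [
    ">sp|X00001|DEMO1_HUMAN Example reviewed protein\n",
    "MKTAYIAKQR\n",
    ">tr|X00002|X00002_HUMAN Example unreviewed protein\n",
    "MSLLTEVETY\n",
    ">tr|X00003|X00003_HUMAN Another unreviewed protein\n",
    "MGGSAAPLLL\n",
]
print(dict(count_by_review_status(demo)))
```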

0

Thanks, this is what I did. It is still interesting to see that for some species UniProt has almost the same set of proteins as NCBI, and for other species it's closer to Ensembl.
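A simple way to quantify this kind of overlap, as a sketch only, is to compare the sets of exact sequences in two FASTA files, since identifiers differ between resources. This assumes that identical proteins have byte-identical sequences after upper-casing and removing a trailing "*"; near-identical isoforms would count as different. All names here are hypothetical.

```python
def read_sequences(lines):
    """Return the set of distinct sequences in FASTA text (upper-cased, trailing '*' removed)."""
    sequences, chunks = set(), []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:  # close the previous record
                sequences.add("".join(chunks).upper().rstrip("*"))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:  # close the last record
        sequences.add("".join(chunks).upper().rstrip("*"))
    return sequences


def overlap_summary(lines_a, lines_b):
    """Count sequences shared by both inputs and those unique to each."""
    a, b = read_sequences(lines_a), read_sequences(lines_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Made-up records for illustration only.
file_a = [">x", "MKT", "AA", ">y", "MMM"]
file_b = [">p", "mktaa", ">q", "GGG*"]
print(overlap_summary(file_a, file_b))
```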

