Limiting number of NCBI search results and downloading sequences
8.2 years ago
jmah ▴ 30

My goal is to download a specific number of sequences (e.g., 1 million) from the NCBI protein database while excluding a specific organism. I can run an Entrez search that excludes the organism, but I can't find a way to limit the number of sequences I download. Downloading all of the results is not feasible either (the result set is too large for the NCBI website).

Is there a particular Entrez command or script that would allow this? There was a similar question posted previously, but no direct answer was found.

Also, is there a more reliable way to download sequences from search results than the 'Send to: File' drop-down menu in the upper right of the screen? When downloading larger numbers of sequences, the connection often breaks.

Thanks for your help! Please point me in the right direction if this question has already been answered.

ncbi search • 1.8k views
8.2 years ago
GenoMax 147k

You could download the FASTA file for the nr sequence database here. Limiting your search to a million sequences would need to be done carefully, since you may bias your dataset toward certain organisms (I am not sure what order NCBI puts those sequences in the file). I would use this solution afterwards: A: Choosing Random Set Of Seqs From Larger Set
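
To illustrate the random-sampling step, here is a minimal sketch using reservoir sampling, which draws a uniform sample of `k` records in one pass without loading the whole file into memory. This is not necessarily the approach in the linked answer; the function names `read_fasta` and `reservoir_sample` are made up for this example, and the file name in the usage comment is hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int,
                     seed: Optional[int] = None) -> List[Record]:
    """Return up to k records chosen uniformly at random (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Record i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


# Usage sketch (hypothetical file name):
# with open("nr.fasta") as handle:
#     sample = reservoir_sample(read_fasta(handle), k=1_000_000, seed=42)
```

Because every record has the same chance of being kept, the sample should not depend on the order in which sequences appear in the file, which addresses the ordering concern above. Note that the sample itself is held in memory.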

You could exclude the species you don't want after sampling, or you could remove them from the original nr file with a solution like this (there are more on Biostars): How To Remove Certain Sequences From A Fasta File
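
As a rough sketch of the removal step (not the linked solution itself), the filter below streams a FASTA file line by line and drops every record whose header mentions the unwanted organism. It assumes the organism name appears in square brackets in the header line, e.g. `[Homo sapiens]`; check a few headers in your own file before relying on that. The function name `drop_organism` and the file names in the usage comment are hypothetical.

```python
from typing import Iterable, Iterator


def drop_organism(lines: Iterable[str], organism: str) -> Iterator[str]:
    """Yield FASTA lines, skipping records whose header names `organism`.

    Assumption: the organism appears as "[Name]" in the header line.
    Matching is case-insensitive.
    """
    needle = f"[{organism.lower()}]"
    keep = False  # any lines before the first header are skipped
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep = needle not in line.lower()
        if keep:
            yield line


# Usage sketch (hypothetical file names):
# with open("nr.fasta") as src, open("nr_filtered.fasta", "w") as dst:
#     dst.writelines(drop_organism(src, "Homo sapiens"))
```

If a single header lists several organisms, this sketch drops the record whenever the unwanted one is among them, which may or may not be what you want.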


Thank you very much! Exactly what I was looking for. Also, thanks for the heads up about bias.
