Hi all,
I downloaded FASTA sequences from the NCBI FTP site using the method described in http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete . Recently, I used my customised database for BLAST and got many of the results I wanted. However, I noticed that some of the sequences have been updated or removed since I last downloaded the whole genome dataset. Some of the removed sequences actually interfered with my results because they prevented the detection of one of my spiked-in organisms. I'm therefore wondering whether there is a simple way to update my genome database using command-line tools such as the EDirect utilities. I would like to avoid re-downloading the whole database, because that is a waste of resources.
To give an example, sequence NC_000521.3 was updated to AL844502.1 and then to AL844502.2, and sequence NW_001850357.1 has been completely removed from the NCBI database.
The example you include is a difference in the minor version of that accession number, so it is not likely to change your BLAST results significantly. As for the sequences that have been removed, you may want to remove them using @Piet's or @Matt Shirley's strategies. This would need some careful tweaking, but once you have created the necessary scripts the process should be reasonably painless.
You have not said what you are doing with this custom database, but it sounds like you need to repeat the analysis fairly often. I am not sure why you are worried about resources (unless you are paying for the bandwidth/storage), since getting the right answer has higher priority. This can simply be considered a cost of doing bioinformatics.
Thank you for all your responses! To answer @genomax2's question, I'm building local copies of all virus, bacteria, and fungi databases to detect organisms in Illumina sequencing runs using local BLAST and RAPSearch2. Minor version changes are fine, but some of the removed sequences really altered my results, because my 100 bp read matched the removed NW_001850357.1 at 100% and the species I spiked in (Candida albicans) at 99%. Since I was only taking the top hit as the organism detected, I missed a lot of the Candida albicans in my report. Bandwidth and storage are not much of a concern, but I would rather not re-download everything for each update, since I'm using a shared computing server and that may slightly affect other people's projects.
I took a similar approach to @piet's. I tried the wget -N command, but it seems that the files are still downloaded and just not stored. So I wrote a script using the Unix command
if [ -f "$report" ]; then echo "$report exists!"; fi
to check for and skip any assembly file that is already in storage. However, I'm not able to check every sequence in each assembly for updates. Do you know whether it's possible to update only a particular sequence? I haven't tried @Matt Shirley's approach yet. I'll give it a shot.
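One way to spot individual sequences that have changed, without re-downloading anything large, might be to compare accession.version identifiers. This is only a minimal sketch. It assumes that each local FASTA header starts with the accession.version (for example ">NC_000521.3 description") and that a current list of accessions, one per line, has been obtained separately. The function names and file names are hypothetical.

```shell
# Print the accession.version of every record in a FASTA file, sorted and unique.
# Assumption: headers look like ">NC_000521.3 description".
list_accessions() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' | LC_ALL=C sort -u
}

# Print accessions that are in the local FASTA but missing from the current list.
# $1: local multi-FASTA file
# $2: file of current accession.version values, one per line (hypothetical input)
stale_accessions() {
    current_sorted=$(mktemp)
    LC_ALL=C sort -u "$2" > "$current_sorted"
    # comm -23 keeps lines that appear only in the first input (the local accessions)
    list_accessions "$1" | LC_ALL=C comm -23 - "$current_sorted"
    rm -f "$current_sorted"
}
```

Something like "stale_accessions local_db.fasta current_accessions.txt > stale.txt" would then list the records that were removed or superseded by a newer version.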
One more thing I noticed is that old versions of genome sequences and removed sequences tend to no longer have a taxon ID assigned. I'm not sure whether this is the case for every sequence. If it is, I can run a taxonomy search for all the sequences and list those that don't have a taxon ID assigned. Then I may have to manually re-download and replace those sequences and rebuild the database. I've written in another post about how to Retrieve a subset of FASTA from large Illumina multi-FASTA file . That approach can also be modified to work for the databases.
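Removing unwanted records from an existing multi-FASTA file before rebuilding the database could be sketched roughly as follows. It assumes that the accession.version is the first token after ">" in each header and that the accessions to drop are in a hypothetical file, one per line. The name drop_sequences is illustrative and not part of any toolkit.

```shell
# Write to stdout every FASTA record whose accession is NOT in the drop list.
# $1: file of accession.version values to remove, one per line (hypothetical input)
# $2: multi-FASTA file
drop_sequences() {
    awk '
        FILENAME == ARGV[1] { drop[$1] = 1; next }  # first file: remember accessions to drop
        /^>/ {                                      # header line: decide the fate of this record
            id = substr($1, 2)                      # strip the leading ">"
            keep = !(id in drop)
        }
        keep                                        # print header and sequence lines of kept records
    ' "$1" "$2"
}
```

For example, "drop_sequences stale.txt local_db.fasta > filtered.fasta" would write the remaining records to a new file, which could then be combined with the re-downloaded sequences and indexed again, for instance with makeblastdb -in filtered.fasta -dbtype nucl.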