Hi,
I work on some protist genomes and because the protists live in symbiosis with bacteria usually I have bacterial contamination in the NGS data.
Previously, I was using a script to decontaminate the genomic data from bacterial hits, and among other things, the script was based on blastn and blastp, and instead of having subset of bacteria+archea database, I was using GI lists to filter the blast against nt and nr (using -gilist -negativegilist parameter)
I did not create custom subsets because I was thinking that is easier to update from time to time the gi-lists and the entire nt and nr database, instead of doing everything mention above and also to update the custom subset databases.
Now, NCBI is phasing out GI numbers, as I am sure everybody knows, and I am quite stuck with my script. I cannot use accession number as a way to filter the results, and also I don't know any way to make a custom subset of the nt or nr database using accession numbers and not GI numbers. I know that with GI numbers you could use the blastdb_aliastool to create a custom database based on your gilist but is there a way to do the same thing with accession numbers?
For my script it is very important somehow to have the filtering done in the blast, or to have custom blast database based on taxonomy, because I use custom outfmt, and one of the methods used in classification of a sequence as "contaminant" are coverage and identity thresholds which are calculated for the entire sequence length, and based on values from the blast results. It would not work if I would not get the results in the same order as blastn or blastp outputs them, which most probably would be the case if I would use some post filtering.
Thank you in advance for any help
Hello,
I think I completely missed this parameter of the blast. I will try to download the accession number for some organism and try if to see how it will work and post back.
Anyway thank you very much for pointing out this parameter.
There's a new version of blast (2.5.0).
Sorry for the late rely. Thank you for the information, but I don't know how well this version will help me. However, as I saw there is still possibility to use the GI numbers, and even now you can still download the GI list. I will try to use eutils to download the accession list and use seqid as trial but I am worried that I will get an error as with gilist. I saw that if the GI list is too big I get an error in blast "Error vector::reserve". So I usually use gilist (if the target seqs are the smaller set) or negative_gilist, if the target GI list is huge. But there is no negative_seqidlist, so I will see if I get the error if the seqidlist is too big.
Hello,
I don't know what's with NCBI but to download the accession list for a group it will take ages. I don't know why the download of the accession list goes with max 1 kb/s but I am able to download the gilist with 250kb/s and if I try to download all sequences as fasta for a group I can download with several MB/s. It took me half a day to download the archea nt accession list, which is roughly 6 MB file. In this way I think it will be faster to download using entrez the entire eukaryote and bacteria fasta files for both nt and nr and concatenate the file and afterwards create the custom database, since the download goes much more faster even if I download tens of GB of data.