Alternative to local BLAST+ for BLASTing 10,000s of sequences against nr, Swiss-Prot and nt
14.1 years ago

Hi,

Here is a scenario I run into quite often: I need to BLAST between 1,000 and 20,000 sequences to find the proteins they code for. The sequences come from fish cDNA libraries, so I expect most of them, although not all, to be protein-coding.

I currently run BLAST+ locally against both Swiss-Prot and nr, but this approach is unsatisfactory for a few reasons:

  1. It is very slow (up to a few days for nr).
  2. I would also like to query nt (I did not succeed; it took far too long).
  3. With a faster method, I would consider BLASTing a few orders of magnitude more sequences.

I looked into the USEARCH set of tools, but the ublast method cannot do the equivalent of a blastx, that is, search translated nucleotide queries against a protein database.

What method would you suggest?

Cheers!

sequence blast software • 7.1k views

Have you tried translating each cDNA sequence into protein and then BLASTing only the longest ORF? That alone should speed things up about threefold.
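
A minimal sketch of the longest-ORF idea, in plain Python with no dependencies. It assumes the standard genetic code and treats an "ORF" as the longest stop-free stretch in any of the six frames (cDNA fragments are often partial, so no start codon is required). The function names are illustrative, not from any particular package.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons not in the table become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (frame, protein) for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            yield strand * (offset + 1), translate(seq[offset:])


def longest_orf(dna):
    """Return (frame, peptide) for the longest stop-free stretch."""
    best = (0, "")
    for frame, protein in six_frames(dna):
        for peptide in protein.split("*"):
            if len(peptide) > len(best[1]):
                best = (frame, peptide)
    return best


print(longest_orf("ATGGCCAAATAG"))
```

The resulting peptides can be written to a FASTA file and searched with blastp instead of blastx. Note that the longest stop-free stretch is not guaranteed to be the true coding frame, especially for short or error-prone reads.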

14.1 years ago
Yannick Wurm ★ 2.5k

Salut Eric,

I would stick with blast if possible. It's the one standard thing everyone (reviewers!) is familiar with.

  1. Get access to a "big" server (there must be some at Laval!). I run this kind of BLAST on a 24-core machine all the time. It makes things a lot faster (and keeps my MacBook from overheating!).
  2. Keep only the top hit: the more you output, the more details BLAST needs to calculate (e.g. I think it optimizes the local alignment only if it is displayed).
  3. Make the e-value cutoff more stringent, i.e. lower the maximum e-value reported (same reason as 2).
  4. Do you need to search against nr? How about "only" Swiss-Prot plus some fish datasets? It's unlikely that the 12th dipteran proteome will add much information you don't already have from the other 11.
  5. Changing the word size has a huge impact on BLAST speed (longer = faster), but you'll also lose some sensitivity.
  6. Do you need to query nt with all of your sequences, or only with those that had no protein database match?
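
For illustration, points 1, 2, 3 and 5 map onto BLAST+ command-line options roughly as follows. This is only a sketch: the file names, database name and numeric values are placeholders, not recommendations. `-num_threads`, `-max_target_seqs`, `-evalue`, `-word_size` and `-outfmt 6` are standard BLAST+ options; the helper function itself is hypothetical.

```python
import shlex


def blastx_command(query, db, out, threads=24, evalue=1e-5,
                   max_hits=1, word_size=None):
    """Build a blastx argument list (all default values are placeholders)."""
    cmd = [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",                     # tabular output, one line per hit
        "-evalue", str(evalue),             # stricter cutoff, fewer hits reported
        "-max_target_seqs", str(max_hits),  # limit the number of reported subjects
        "-num_threads", str(threads),       # use the cores of a big server
    ]
    if word_size is not None:
        # Longer words are faster but less sensitive.
        cmd += ["-word_size", str(word_size)]
    return cmd


cmd = blastx_command("fish_cdna.fasta", "swissprot", "hits.tsv")
print(shlex.join(cmd))
# To actually run it (requires BLAST+ on PATH and a formatted database):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list keeps it easy to reuse for a second pass, for example against nt with only the sequences that had no protein hit (point 6).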

++ y


Hi Yannick. All very sensible suggestions that I'll implement. I'm in the process of gaining access to a new supercomputer we got on campus and may try to use it for this purpose; otherwise I'll use the 48 older cores we have at the Institute. I am currently re-BLASTing everything (even sequences with matches) against nr, but I'll follow your suggestion and mask nr to keep only the vertebrates, at most. Thanks again!
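
One common way to spread a job like this over many cores or cluster nodes is to split the query file into chunks and run one BLAST process per chunk. A rough sketch, assuming plain FASTA input; the function names, the round-robin scheme and the output naming are illustrative choices only.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_round_robin(records, n_chunks):
    """Deal records into n_chunks lists so each gets a similar count."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


def write_chunks(chunks, prefix):
    """Write each non-empty chunk to <prefix>.<index>.fasta; return the paths."""
    paths = []
    for i, chunk in enumerate(chunks):
        if not chunk:
            continue
        path = f"{prefix}.{i:03d}.fasta"
        with open(path, "w") as handle:
            for header, seq in chunk:
                handle.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths


# Tiny in-memory demo with made-up records.
demo = [">a", "ACGT", "AC", ">b", "GG", ">c", "TT"]
chunks = split_round_robin(read_fasta(demo), 2)
print([[header for header, _ in chunk] for chunk in chunks])
```

With a real file, something like `write_chunks(split_round_robin(read_fasta(open("queries.fasta")), 48), "queries")` would produce one file per core; the tabular outputs can simply be concatenated afterwards.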

14.1 years ago
Rm 8.3k

To scale up BLAST runs, you can use TimeLogic "Tera-BLAST" on DeCypher® FPGA Biocomputing Systems.

It's a commercial product, though.

We recently set up one such system in our department, with multiple acceleration cards.

(If you use ublast: translate the nucleotide sequences and then run them against the protein database.)

Adding: FastHMM and FastBLAST: Tools for Analyzing Large Protein Sequence Databases

I haven't tried it.


Hi RaghuM, I would much prefer a free solution, but I'll have a look at the software you propose. Concerning ublast, are you suggesting that I produce all six possible translations of each sequence and then ublast them against my protein database in FASTA format? Cheers


Yes, translate into six frames and search. I have added the FastBLAST link; see if it is useful to you.

14.1 years ago
Darked89 4.7k
  1. Reduce the query set by:

    • filtering, e.g. with seqclean
    • checking for possible retroelements and ribosomal RNA in your EST set
    • clustering, e.g. at 90% identity with uclust, or doing a quick and dirty assembly with e.g. cap3
  2. Reduce the database size (see Yannick's post). Use e.g. UniRef instead of nr, possibly reduced further.

  3. Perform a two-step search: first search against a clustered set of all known fish or vertebrate proteins, set a threshold, then BLAST everything that did not find a strong hit against a larger database. This suits EST sets that are not contaminated by other DNA; I have seen plant(?) ESTs hitting genomic bacterial contigs.
  4. Consider using a cluster and possibly another implementation of BLAST. See here.
  5. Not sure if it works, but according to this page, you may use -q=dnax and -t=prot for blastx-like blat searches.
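
The two-step search in point 3 could be glued together along these lines. The sketch assumes the first search was written in BLAST tabular format (`-outfmt 6`, where column 1 is the query ID, column 11 the e-value and column 12 the bit score). The thresholds and function names are arbitrary placeholders, and the demo lines are made up.

```python
def queries_with_strong_hit(tabular_lines, max_evalue=1e-10, min_bitscore=50.0):
    """Return IDs of queries with at least one hit passing both thresholds."""
    strong = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id = fields[0]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            strong.add(query_id)
    return strong


def needs_second_pass(all_query_ids, tabular_lines, **thresholds):
    """Return query IDs (in input order) that lack a strong first-pass hit."""
    strong = queries_with_strong_hit(tabular_lines, **thresholds)
    return [qid for qid in all_query_ids if qid not in strong]


# Made-up first-pass results: est1 has a strong hit, est2 a weak one, est3 none.
first_pass = [
    "est1\tsp|P1\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-60\t230",
    "est2\tsp|P2\t35.0\t40\t26\t0\t10\t129\t5\t44\t0.5\t28.1",
]
print(needs_second_pass(["est1", "est2", "est3"], first_pass))
```

The returned IDs are the sequences to extract and send to the larger database. BLAST reports the first whitespace-delimited token of the FASTA header as the query ID, so the ID list should be built the same way.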

Edit: reformatted for clarity


Hi darked89. Thanks for the additional info. I'll look into the other blast implementations if the other suggestions are not totally satisfying. Cheers
