Should I use SQL to store my BLAST results?
9.1 years ago
moranr ▴ 290

Hi,

I have no experience with SQL, but I would like to gain some database skills, and I may have an opportunity to do so now.

I have a HUGE BLAST output - almost 1TB. I know it probably depends on what I want to do, but should I put my BLAST results into a database?

What are the advantages of this, and when would one want to do it?

Thanks


As you said, first think over what you want to do with these hits, and then decide whether to keep them in a database.


Dumb question, but which BLAST output format did you use? See https://molevol.mbl.edu/wiki/index.php/BLAST_UNIX_Tutorial . The default output format is very verbose; for large queries the tabular format is much better.

As for saving your Blast results in a database, look up OrthoMCL (orthologous gene search) or Trinotate. Each one blasts a genome (~20000 sequences) against a very large reference database (~1M seqs), and processes the output for scoring (OrthoMCL) or correlation with other sources of information (Trinotate).

For simple scripting with manageable amounts of data, I just use the Linux 'join' command. Configuring a database gets old quickly. You can also look up 'makeblastdb' and 'blastdb', but a BLAST database is, in this sense, just an index/header/sequences triplet of files, which is different from an SQL database.
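If you do want to try a database on tabular output, here is a minimal Python sketch using the standard-library sqlite3 module. It assumes the standard 12-column tabular format (-outfmt 6); the table name `hits`, the index name, and the helper functions are made up for illustration and do not come from any tool mentioned in this thread.

```python
import sqlite3


def parse_rows(lines):
    """Yield one tuple per tabular line, skipping blanks and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Assumed column order: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore (standard -outfmt 6).
        yield (f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))


def load_hits(conn, lines):
    """Stream rows into a (hypothetical) 'hits' table, then index by query id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "qseqid TEXT, sseqid TEXT, pident REAL, length INTEGER, "
        "mismatch INTEGER, gapopen INTEGER, qstart INTEGER, qend INTEGER, "
        "sstart INTEGER, send INTEGER, evalue REAL, bitscore REAL)"
    )
    # executemany consumes the generator lazily, so the input is never
    # held in memory all at once.
    conn.executemany(
        "INSERT INTO hits VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", parse_rows(lines)
    )
    conn.execute("CREATE INDEX IF NOT EXISTS hits_qseqid ON hits(qseqid)")
    conn.commit()


def best_hit_per_query(conn):
    """Return (qseqid, sseqid, bitscore) for the top-scoring hit of each query.

    Relies on SQLite-specific behaviour: with a single MAX() aggregate, the
    other selected columns are taken from the row that holds the maximum.
    """
    return conn.execute(
        "SELECT qseqid, sseqid, MAX(bitscore) FROM hits "
        "GROUP BY qseqid ORDER BY qseqid"
    ).fetchall()
```

To load a real file you would pass an open file handle, e.g. load_hits(conn, open("hits.tsv")), where hits.tsv is a placeholder name. For 1TB of input the load itself will take a long time, so it is worth trying this on a small subset first.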

9.1 years ago

It depends only on your needs. If your queries are 'linear' (e.g. grep), then I don't see the advantage. If you need SQL queries that relate Hits and HSPs to one another, then SQL might be one of the possible choices.

Example using https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2sqlite.xsl (this won't work for a very large XML file):

xsltproc blast2sqlite.xsl  ~/blastn.xml  | sqlite3 db.sqlite3

sqlite3 db.sqlite3

sqlite> select H1.def,H2.def,S1.align_len from Hsp as S1, Hsp as S2 , Hit  as H1, Hit as H2 where S1.hit_id = H1.id and S2.hit_id=H2.id and S1.align_len=S2.align_len and H1.id!=H2.id limit 2 ;
Human rotavirus A strain 0613158-CA NSP3 (NSP3) gene, complete cds >gi|320543002|gb|HQ609571.1| Human rotavirus A isolate 613158 NSP3 (NSP3) gene, complete cds|Human rotavirus segment 7 NSP3 gene, complete cds|1074
Human rotavirus A strain 0613158-CA NSP3 (NSP3) gene, complete cds >gi|320543002|gb|HQ609571.1| Human rotavirus A isolate 613158 NSP3 (NSP3) gene, complete cds|Human rotavirus A strain RVA/Human-tc/AUS/McN13/1980/G3P2A[6] nonstructural protein 3 (NSP3) gene, complete cds|1074
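Since the XSLT route won't work for a very large XML file, one possible alternative is to stream the BLAST XML instead of loading it whole. The sketch below is not part of blast2sqlite.xsl; it assumes the classic BLAST XML element names (Hit, Hit_id, Hit_def, Hsp, Hsp_align-len), and the function name iter_hsps is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_hsps(source):
    """Yield (hit_id, hit_def, align_len) for every HSP, one Hit at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Hit":
            continue
        hit_id = elem.findtext("Hit_id")
        hit_def = elem.findtext("Hit_def")
        for hsp in elem.iter("Hsp"):
            yield hit_id, hit_def, int(hsp.findtext("Hsp_align-len"))
        # Drop this Hit's children so they can be garbage-collected. The
        # emptied Hit element itself stays attached to its parent, so memory
        # still grows slowly on very large files.
        elem.clear()


# Tiny in-memory document, used only to show the call.
demo = io.BytesIO(
    b"<BlastOutput><Iteration><Iteration_hits>"
    b"<Hit><Hit_id>id1</Hit_id><Hit_def>first hit</Hit_def><Hit_hsps>"
    b"<Hsp><Hsp_align-len>1074</Hsp_align-len></Hsp>"
    b"</Hit_hsps></Hit>"
    b"</Iteration_hits></Iteration></BlastOutput>"
)
demo_rows = list(iter_hsps(demo))
```

The tuples could then be passed to sqlite3's executemany to fill Hit/Hsp-style tables, so that a self-join like the one above remains possible without holding the whole XML file in memory.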