How to start with blast and a local database
3
0
Entering edit mode
6.5 years ago
gbl1 ▴ 80

Hi,

I have a "complete" genome of a plant in a fasta file (16Go) I need to find some part of it, so I use blast. My first issue, when creating a database, it only accept 1Go… meaning most of my reads are not considered. How to proceed to get all my reads in a given database. Then, If I blast a sequence from another species in my db, how to get the output in a fasta file? By the way, is it possible to use a local blast through a browser, looking like NCBI, instead of a terminal? Eventually, is it possible to obtain a consensus sequence from those sequences? Or doing a local assembly.

Thanks in advance

Assembly blast genome • 6.9k views
ADD COMMENT
0
Entering edit mode

Can you clarify what you mean with 'Go'?

ADD REPLY
1
Entering edit mode

giga octet?

Is that obsolete word?

ADD REPLY
3
Entering edit mode

I don't think so ... though somewhat french-centric ;)

just not very common (if existing at all) in english language , where it's apparently even translated as 'gigabyte'

ADD REPLY
0
Entering edit mode

I'm guessing he means 'Gb'

ADD REPLY
0
Entering edit mode

Just answering this part:

" is it possible to use a local blast through a browser"

Yes you can.

here is the tool: https://openwetware.org/wiki/Wwwblast

bonus: NCBI workbench but it have more than blast. https://www.ncbi.nlm.nih.gov/tools/gbench/

ADD REPLY
0
Entering edit mode

I have a "complete" genome of a plant in a fasta file (16Go)

Is this "complete" plant genome an assembled genome? Or raw sequencing reads?

ADD REPLY
5
Entering edit mode
6.5 years ago

To answer one part of your question, there is a graphical blast web client which can be easily installed locally using docker. It is still a bit buggy though. It may also be able to index your sequence properly, not sure what your issue is there really.

https://github.com/wurmlab/sequenceserver

ADD COMMENT
0
Entering edit mode

installation failed...

ADD REPLY
1
Entering edit mode

That was not really informative enough to help any further. Just a recommendation anyway.

ADD REPLY
0
Entering edit mode

Thanks @colindaven for recommending sequenceserver as a BLAST interface. Hi @gbl1 - it looks like you struggled with one of the other recommendations as well. If you can provide some details as to how the installation failed, perhaps we can help you get sequenceserver running... Note that the github repository is more for developers and the normal install mechanisms - using ruby or docker are at http://sequenceserver.com Please let us know!

ADD REPLY
1
Entering edit mode
6.5 years ago
gb ★ 2.2k

Lots of questions, maybe you should break it down. There is also an other post that has the same question about the webblast. A: local web BLAST server

If the reference genome is to big (your pc has not enough memory) you could split your reference plant genome and make more smaller databases. To BLAST against everything you can make a .nal file https://www.ncbi.nlm.nih.gov/books/NBK279693/. You can also add all the database in de command like.

blastn -query query.fa -db "part1 part2 part3" -out output

Here you can see how you can get the output in a fasta file: Blast+ results file parsing to fasta file you need to use a custom output and format it in a fasta format afterwards.

Not sure about the consensus part, maybe you should use something like bowtie or BWA to map and use samtools to make a consensus. With blast you only have two sequences the target and the query. For a consensus you need at least 3?

ADD COMMENT
1
Entering edit mode

As complementary to the answer, here is a small article regarding the size of rams and building blast db:

http://etutorials.org/Misc/blast/Part+IV+Industrial-Strength+BLAST/Chapter+11.+BLAST+Databases/11.2+BLAST+Databases/

ADD REPLY
0
Entering edit mode

That is a nice page. I am not a big fan of how ncbi explains BLAST

ADD REPLY
1
Entering edit mode
6.5 years ago
gbl1 ▴ 80

Hi,

So, let me tell you how it went:

-First, I found a small program called "fasta-splitter.pl" that helped me having smaller and suitable fasta files.

-I could make 6 databases, meaning: progress, but the 10 more are still missing…

-making an alias DB failed…

-for getting close enough to fasta: blastn -query file.txt -outfmt '6 qseqid sseqid sseq evalue' -db "db1 db2 db3 …" has worked

I about dead by now…

ADD COMMENT
0
Entering edit mode

Good to hear that you made progress, also you are getting more familiar with blast, tools and the command line. It is always hard to start with new things like this. And yes it is exhausting sometimes. Keep up the work, you can do it =)

ADD REPLY
0
Entering edit mode

I am familiar with commande lines… but… I need to get what are the arguments to use… thats another issue!

blastn -query file.txt -outfmt '6 qseqid sseqid sseq evalue' -db "db1 db2 db3 …" is not what I look for: it only returns the aligned sequence… My aim is to know what just outside! No point in computing what I already know!

Is there any arguments of blast that would return all the full sequences that match?

ADD REPLY
1
Entering edit mode

I do not see that option in the outfmt possibilities, you can write a small script for that. But this page did not helped you? Blast+ results file parsing to fasta file

ADD REPLY
0
Entering edit mode

I read it several time… It does not do what I want for it only gives a fasta with the part of the sequence that are matching… technically, it gives me only what I allready have!

ADD REPLY
0
Entering edit mode

oh I see it now. You could do something like this in python:

from Bio import SeqIO
hits = set()
with open("blastoutput", "r") as blastoutput:
    for hit in blastoutput:
        hits.add(hit.split("\t")[0])

with open("fastafile", "r") as fasta, open("output.fa", "a") as output:
    for record in SeqIO.parse(fasta, "fasta"):
        if strrecord.id) in hits:
            output.write(">"+str(record.description)+"\n")
            output.write(">"+str(record.seq)+"\n")

Think it is not the cleanest way, but just give an example on top of my head

ADD REPLY
0
Entering edit mode

looks cool... how to use that script? where do I tell blast the db to use? and to tell my input file?

ADD REPLY
0
Entering edit mode

Blast will never return anything else than aligned sequences (actually alligned HSPs not even sequences). If you want to get the full sequences you will need to parse the blast result, get the hit IDs and extract those from the blastDB(s) you used.

blastdbcmd -db" db1 db2 db3" -entry <hit ID>

After which you can then do a global aligned of them using a different software

ADD REPLY
0
Entering edit mode

In addition to this:

#!/bin/sh

awk -F "\t"{print $1}' blastoutput.txt  | 
while read line; do
    blastdbcmd  -db" db1 db2 db3" -entry $line
done
ADD REPLY
1
Entering edit mode

or simply use the -entry_batch option, which takes a file with hit IDs as input

ADD REPLY

Login before adding your answer.

Traffic: 1818 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6