Question

Extracting blast hit with upstream and downstream neighbourhood

0

Entering edit mode

5.0 years ago

agata ▴ 10

I would like to implement BLAST module into my bash-based pipeline.

Based on multi-fasta query and a genome of interest I run BLAST (managed with makeblastdb for genome and then blastn with output as a table). The next step is to extract hits with 100bp upstream and downstream and save them as fasta file. As far, I rearrange the output table from BLAST to get new sstart and send positions, as well as the information about strand orientation (3 last columns).

After rearrangement the table looks as follow:

queryid subjectid   identity    sstart  send    strand  new_sstart  new_send
MRP CM008973.1  100 6219401 6219677 plus    6219301 6219777
SRP CM008972.1  100 1943089 1942802 minus   1943189 1942702

As @lieven.sterck suggested I may use now blastdbcmd.

If I understand correctly, it is possible to fetch the table with entry_batch

I modified the table to get format as required by command:

ID  range   strand
CM008973.1  6219301-6219777 plus
CM008972.1  1943189-1942702 minus

But when I use the blastdbcmd -db ./database/database.fna \ -entry_batch ./test.txt I get error:

Error: MRP: OID not found
Error: SRP: OID not found

BLAST bash blast+ • 1.7k views

ADD COMMENT • link 5.0 years ago by agata ▴ 10

0

Entering edit mode

blastdbcmd -entry_batch is not compatible with range. Which is what you are looking to do by providing a start and stop. You will need to extract these entries individually.

 -entry_batch <File_In>
   Input file for batch processing (Format: one entry per line, seq id 
   followed by optional space-delimited specifier(s)
   [range|strand|mask_algo_id]
    * Incompatible with:  entry, **range**, strand, mask_sequence_with, ipg,
   ipg_batch, taxids, taxidlist, info, tax_info, list, recursive,

ADD REPLY • link 5.0 years ago by GenoMax 150k

0

Entering edit mode

Okey, so I understand I would need to put the blastdbcmd command into a loop over the rows of a table? Am I right?

Then my table should looks like this?

6219301-621977|plus|CM008973.1
1943189-1942702|minus|CM008972.1

BTW Primarly I was suggested by this:

entry-batch (T. C10) - Blastdbcmd
Input file for batch processing. The format requires one entry per line; each line should begin with the sequence ID 
followed by any of the following optional specifiers (in any order): range (format: ‘from-to’, inclusive in 1-offsets), 
strand (‘plus’ or ‘minus’), or masking algorithm ID (integer value representing the available masking algorithm). 
Omitting the ending range (e.g.: ‘10-‘) is supported, but there should not be any spaces around the ‘-‘.

And that's why I modified the table to have those 3 columns: ID, range and strand.

ADD REPLY • link 5.0 years ago by agata ▴ 10

0

Entering edit mode

A command like following should work and would be straight forward :

blastdbcmd -db ./database/database.fna -entry CM008973.1 -range 6219301-621977 -strand plus -outfmt %f -out filename

Edit: Based on @Lieven's comment below (and program message you posted above) -entry_batch may be supported in spite of what the in-line help says. Your input file does not seem to be in the right format though.

ADD REPLY • link 5.0 years ago by GenoMax 150k

0

Entering edit mode

as @genomax already mentions, not sure if this will works as you hope.

in any case, your lines need to start with the hit ID (subject ID). The error you report does not fit with the file you described as input (the MRP and SRP seem to be your query IDs, not your hit IDs)

ADD REPLY • link 5.0 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2020-04-14

why would you use two different version of the blastDB formatting? formatdb is deprecated for years now (was part of older blast distributions) ... currently only makeblastdb is to be used.

Moreover, there is no reason to format your query file, you can use that in plain fasta format

Suggestion is to use the outfmt 6 or 7 of blast (== the tabular output formats) , from those you can select the start and stop of a certain blast hit region. Once you have that , you can for instance use blastdbcmd command from blast package, to extract a range from your blastDB.

For the orientation part: you can deduct this from the coordinates blast reports: if they are small -> large :forward strand , large -> small : negative strand