Question

How To Convert Blast Results To Gff

21

Entering edit mode

15.4 years ago

Michael Barton ★ 1.9k

I'd like to visualise the results of a BLAST search in a genome browser. Is there a simple way to get the results in GFF format without having to write a parser myself?

blast gff format • 26k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 15.4 years ago by Michael Barton ★ 1.9k

Ram · Answer 1 · 2010-03-12

Start with tabulated blast output myfile.blast.out. Then check two-liners from:

http://bergman-lab.blogspot.com/2009/12/ncbi-blast-tabular-output-format-fields.html

Few lines tooutput proper gff are missing, but you may either go for minimalistic gff or try to encode everything in column 9. Also you may try validating your gff3 here:

http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

NOTE: The blog linked above does not seem to exist anymore, here is the content of it from the wayback machine:

NCBI Blast Tabular output format fields

Certainly, with the new NCBI Blast+ tools, you won't need this anymore, but as long as we are sticking with the old blastall programm with its horrible documentation, I keep forgetting the format of the BLAST tabular reports. Tabular format is created when you specify -m 8. This is the most useful format for parsing blast yourself without having to learn strange libraries like BioPerl, BioJava, BioPython or BioErlang (doesn't this exist yet, Mike?)

So here is the meaning of the fields:

queryId, subjectId, percIdentity, alnLength, mismatchCount, 
gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore

Parsing is then simple

Python:

for line in open("myfile.blast"):
    (queryId, subjectId, percIdentity, alnLength, mismatchCount, 
    gapOpenCount, queryStart, queryEnd, subjectStart, 
    subjectEnd, eVal, bitScore) = line.split("\t")

Perl:

while (<>) {
   ($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount, 
   $gapOpenCount, $queryStart, $queryEnd, $subjectStart, 
   $subjectEnd, $eVal, $bitScore) = split(/\t/)
}

Ram · Answer 2 · 2010-03-13

http://jperl.googlecode.com/svn-history/r16/trunk/Blast2Gff.pl

You can use the script Pierre found swith a slight modification, actually it is a bit crude and does no real error checking but it works. The error is it does not work if the blast file has a header like this:

# BLASTN 2.2.21 [Jun-14-2009]
# Query: 16383 sequences
# Database: genomedata/GenomeOfDeath.fas
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GeneOfDeath  GenomeOfDeath  100.00  295     0       0       1       295     152626  152920  4e-168   585

So, one should filter out lines beginning with "#" and it does no harm to skip lines which are empty or contain only white spaces. So edit the file Blast2Gff.pl: in line 149 add:

next if (/^\#/ || /^\s*$/); # filter comments and empty lines

Such that this part looks like below, then try again.

while (<BLASTIN>)
{
  next if (/^\#/ || /^\s*$/); # filter comments and empty lines

$HitNum++;

my ($QryId, $SubId, $PID, $Len, 
    $MisMatch, $GapOpen, 
    $QStart,$QEnd, $SStart, $SEnd,
    $EVal, $BitScore) = split(/\t/);

Ram · Answer 3 · 2016-01-04

3

Entering edit mode

9.6 years ago

cmdcolin ★ 4.3k

The tool bp_search2gff.pl can do this fairly well.

Install bioperl (cpanm --notest BioPerl)

Then run blast and the bp_search2gff.pl

blastn -query query.fa -db blast.fa > blast.txt
bp_search2gff.pl --input blast.txt --addid --version 3 --type hit

If you use the --match option it can group HSP too

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 9.6 years ago by cmdcolin ★ 4.3k

Ram · Answer 4 · 2010-03-12

1

Entering edit mode

15.4 years ago

Giovanni M Dall'Olio 28k

Have you tried these scripts??

Maybe the PSL format is better to represent an alignment. You can also look at the BED format so later you can play with BedTools

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 15.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I don't have bioperl on my computer and the documentation states that installing bioperl is "non-trivial" which is not encouraging. Perhaps it would be simpler just to write a parser myself.

ADD REPLY • link 15.4 years ago by Michael Barton ★ 1.9k

Ram · Answer 5 · 2015-12-04

If your genome browser of choice displays psl tracks (as does IGV, wink wink), the save your blast results as XML and convert them in two steps to psl using the Jim Kent utility blastXmlToPsl

Here's an example call:

blastXmlToPsl -pslx -tName=Hit_id -qName=query-def0  blast_hits.xml   blast_hits.psl

Else, take it another step, and generate a .bed, with

pstToBed blast_hits.psl blast_hits.bed

score 0 · Answer 6 · 2013-12-11

0

Entering edit mode

11.7 years ago

nuria • 0

Hi! I tried something similar to what Darked89 explained, but I have a problem:

The subject end is not always larger than the subject start (plus/minus), and bedtools complains:

"Error: malformed BED entry at line 3. Start was greater than end. Exiting."

Does anyone know a way to change start and end values when end is smaller?

Thank you

ADD COMMENT • link 11.7 years ago by nuria • 0

0

Entering edit mode

This should be rather a comment, not an answer to the question, to solve the issue, swap start and stop if start > stop: by some code like: ($start, $stop) = sort ($start, stop)

ADD REPLY • link 11.7 years ago by Michael 56k

Ram · Answer 7 · 2013-12-11

I solved my problem (wasn't too complicate) so I'm posting my solution here.

In case someonle else want to turn a blast tabular output into a simple bed file, but has problems with subject start values larger than end values.

I mean turn this:

scaffold01109    994    1071    HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109    1018    1071    HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109    1071    1009    HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109    2289    2245    HWUSI-EAS1662:3:24:9622:2914#5/1

I got this file doing:

cat blast+tabularfile.input | awk '{print $2,$9,$10,$1}'> almostbedfile.output

into this:

scaffold01109   994     1071    HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109   1018    1071    HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109 1009 1071 HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109 2245 2289 HWUSI-EAS1662:3:24:9622:2914#5/1

cat input.tsv | awk '{if ($3<$2) print  $1,$3,$2,$4; else if ($3>$2)print $0}' > output.bed

But remember to sort it before use it!