I'd like to visualise the results of a BLAST search in a genome browser. Is there a simple way to get the results in GFF format without having to write a parser myself?
I'd like to visualise the results of a BLAST search in a genome browser. Is there a simple way to get the results in GFF format without having to write a parser myself?
Start with tabulated blast output myfile.blast.out. Then check two-liners from:
http://bergman-lab.blogspot.com/2009/12/ncbi-blast-tabular-output-format-fields.html
Few lines tooutput proper gff are missing, but you may either go for minimalistic gff or try to encode everything in column 9. Also you may try validating your gff3 here:
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
NOTE: The blog linked above does not seem to exist anymore, here is the content of it from the wayback machine:
NCBI Blast Tabular output format fields
Certainly, with the new NCBI Blast+ tools, you won't need this anymore, but as long as we are sticking with the old blastall programm with its horrible documentation, I keep forgetting the format of the BLAST tabular reports. Tabular format is created when you specify -m 8
. This is the most useful format for parsing blast yourself without having to learn strange libraries like BioPerl, BioJava, BioPython or BioErlang (doesn't this exist yet, Mike?)
So here is the meaning of the fields:
queryId, subjectId, percIdentity, alnLength, mismatchCount,
gapOpenCount, queryStart, queryEnd, subjectStart, subjectEnd, eVal, bitScore
Parsing is then simple
Python:
for line in open("myfile.blast"):
(queryId, subjectId, percIdentity, alnLength, mismatchCount,
gapOpenCount, queryStart, queryEnd, subjectStart,
subjectEnd, eVal, bitScore) = line.split("\t")
Perl:
while (<>) {
($queryId, $subjectId, $percIdentity, $alnLength, $mismatchCount,
$gapOpenCount, $queryStart, $queryEnd, $subjectStart,
$subjectEnd, $eVal, $bitScore) = split(/\t/)
}
http://jperl.googlecode.com/svn-history/r16/trunk/Blast2Gff.pl
You can use the script Pierre found swith a slight modification, actually it is a bit crude and does no real error checking but it works. The error is it does not work if the blast file has a header like this:
# BLASTN 2.2.21 [Jun-14-2009]
# Query: 16383 sequences
# Database: genomedata/GenomeOfDeath.fas
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
GeneOfDeath GenomeOfDeath 100.00 295 0 0 1 295 152626 152920 4e-168 585
So, one should filter out lines beginning with "#" and it does no harm to skip lines which are empty or contain only white spaces. So edit the file Blast2Gff.pl: in line 149 add:
next if (/^\#/ || /^\s*$/); # filter comments and empty lines
Such that this part looks like below, then try again.
while (<BLASTIN>)
{
next if (/^\#/ || /^\s*$/); # filter comments and empty lines
$HitNum++;
my ($QryId, $SubId, $PID, $Len,
$MisMatch, $GapOpen,
$QStart,$QEnd, $SStart, $SEnd,
$EVal, $BitScore) = split(/\t/);
The tool bp_search2gff.pl
can do this fairly well.
Install bioperl (cpanm --notest BioPerl
)
Then run blast
and the bp_search2gff.pl
blastn -query query.fa -db blast.fa > blast.txt
bp_search2gff.pl --input blast.txt --addid --version 3 --type hit
If you use the --match
option it can group HSP too
Have you tried these scripts??
Maybe the PSL format is better to represent an alignment. You can also look at the BED format so later you can play with BedTools
If your genome browser of choice displays psl tracks (as does IGV, wink wink), the save your blast results as XML and convert them in two steps to psl using the Jim Kent utility blastXmlToPsl
Here's an example call:
blastXmlToPsl -pslx -tName=Hit_id -qName=query-def0 blast_hits.xml blast_hits.psl
Else, take it another step, and generate a .bed, with
pstToBed blast_hits.psl blast_hits.bed
Hi! I tried something similar to what Darked89 explained, but I have a problem:
The subject end is not always larger than the subject start (plus/minus), and bedtools complains:
"Error: malformed BED entry at line 3. Start was greater than end. Exiting."
Does anyone know a way to change start and end values when end is smaller?
Thank you
I solved my problem (wasn't too complicate) so I'm posting my solution here.
In case someonle else want to turn a blast tabular output into a simple bed file, but has problems with subject start values larger than end values.
I mean turn this:
scaffold01109 994 1071 HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109 1018 1071 HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109 1071 1009 HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109 2289 2245 HWUSI-EAS1662:3:24:9622:2914#5/1
I got this file doing:
cat blast+tabularfile.input | awk '{print $2,$9,$10,$1}'> almostbedfile.output
into this:
scaffold01109 994 1071 HWUSI-EAS1662:3:114:11269:10361#5/2
scaffold01109 1018 1071 HWUSI-EAS1662:3:1:4638:10404#5/1
scaffold01109 1009 1071 HWUSI-EAS1662:3:23:15834:10646#5/1
scaffold01109 2245 2289 HWUSI-EAS1662:3:24:9622:2914#5/1
cat input.tsv | awk '{if ($3<$2) print $1,$3,$2,$4; else if ($3>$2)print $0}' > output.bed
But remember to sort it before use it!
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
The blog post referred to above has moved to:
http://bergmanlab.smith.man.ac.uk/?p=41
Updated again :) http://bergmanlab.genetics.uga.edu/?p=41
Using the examples this was relatively simple to do. The GFF3 validator helped too.