Question

Automate finding locations of BLASTN hits in genome browser?

0

Entering edit mode

4.6 years ago

jamie.pike ▴ 80

I have 9 genomes which I have recently used for a BLASTN search, each has roughly 100 hits. I want to check the locations of my hits, and see if they overlap with any predicted genes in the genomes. Normally, I would do this manually loading the gff files into Artemis and searching the locations of the hits. However, as I have ~900 hits to search for I was wondering if there was a faster, easier way to do this? I have an outfmt 6 file for the BLASTN hits.

Any advice would be appreciated; even if it is for a genome browser that may make the searches easier.

Thanks

BLASTN Genome Browser • 1.4k views

ADD COMMENT • link updated 4.6 years ago by Alex Reynolds 36k • written 4.6 years ago by jamie.pike ▴ 80

score 2 · Accepted Answer · 2020-04-14

2

Entering edit mode

4.6 years ago

lieven.sterck 15k

bedtools intersect is your weapon of choice here

you will need to have the gff files of the annotation of each genome and a bed format file of your blast hit regions. Bed format is a very simple tabulated format you can easily obtain by reformatting your blast output file.

the bedtools intersect will report the intersection of both files, being thus blast hits that overlap with genes (or even vice versa depending on the param settings)

ADD COMMENT • link 4.6 years ago by lieven.sterck 15k

0

Entering edit mode

Thank you, I will have a look into the bedtools intersect!

ADD REPLY • link 4.6 years ago by jamie.pike ▴ 80

score 2 · Accepted Answer · 2020-04-14

Another approach is to convert your BLASTN hits to sorted BED, using awk and sort-bed:

$ awk -v FS="\t" -v OFS="\t" '{print($2, ($7-1), $8, $1, $11)}' blastHits.mtx | sort-bed - > blastHits.bed

Use gff2bed to make a sorted BED file from GENCODE annotations, e.g., assuming you are working in hg38:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.basic.annotation.gff3.gz \
    | gunzip -c - \
    | gff2bed - \
    | grep -w gene - \
    > gencode.v33.genes.bed

Repeat for other genomes/sources of annotations.

Then map the BLAST hits to genes with bedmap:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped --delim '\t' blastHits.bed gencode.v33.genes.bed > answer.bed

The file answer.bed will contain each BLAST hit that has overlaps with one or more GENCODE gene annotations, along with their IDs. Hits without overlapping genes are left out of the result; if those are needed, just take out the --skip-unmapped option.

You'd repeat the map step for the other eight of your nine genomes/assemblies (assuming I am reading your question correctly).

If you'd want to automate that, you'd need to do some scripting work, i.e., using a for loop or its equivalent in whatever language you're most comfortable in (Python, bash, etc.). Genome browsers probably won't help here, as you'd still need to manually load hits in for each genome, separately.