Automate finding locations of BLASTN hits in genome browser?
2
0
Entering edit mode
4.6 years ago
jamie.pike ▴ 80

I have 9 genomes which I have recently used for a BLASTN search, each has roughly 100 hits. I want to check the locations of my hits, and see if they overlap with any predicted genes in the genomes. Normally, I would do this manually loading the gff files into Artemis and searching the locations of the hits. However, as I have ~900 hits to search for I was wondering if there was a faster, easier way to do this? I have an outfmt 6 file for the BLASTN hits.

Any advice would be appreciated; even if it is for a genome browser that may make the searches easier.

Thanks

BLASTN Genome Browser • 1.4k views
ADD COMMENT
2
Entering edit mode
4.6 years ago

bedtools intersect is your weapon of choice here

you will need to have the gff files of the annotation of each genome and a bed format file of your blast hit regions. Bed format is a very simple tabulated format you can easily obtain by reformatting your blast output file.

the bedtools intersect will report the intersection of both files, being thus blast hits that overlap with genes (or even vice versa depending on the param settings)

ADD COMMENT
0
Entering edit mode

Thank you, I will have a look into the bedtools intersect!

ADD REPLY
2
Entering edit mode
4.6 years ago

Another approach is to convert your BLASTN hits to sorted BED, using awk and sort-bed:

$ awk -v FS="\t" -v OFS="\t" '{print($2, ($7-1), $8, $1, $11)}' blastHits.mtx | sort-bed - > blastHits.bed

Use gff2bed to make a sorted BED file from GENCODE annotations, e.g., assuming you are working in hg38:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.basic.annotation.gff3.gz \
    | gunzip -c - \
    | gff2bed - \
    | grep -w gene - \
    > gencode.v33.genes.bed

Repeat for other genomes/sources of annotations.

Then map the BLAST hits to genes with bedmap:

$ bedmap --echo --echo-map-id-uniq --skip-unmapped --delim '\t' blastHits.bed gencode.v33.genes.bed > answer.bed

The file answer.bed will contain each BLAST hit that has overlaps with one or more GENCODE gene annotations, along with their IDs. Hits without overlapping genes are left out of the result; if those are needed, just take out the --skip-unmapped option.

You'd repeat the map step for the other eight of your nine genomes/assemblies (assuming I am reading your question correctly).

If you'd want to automate that, you'd need to do some scripting work, i.e., using a for loop or its equivalent in whatever language you're most comfortable in (Python, bash, etc.). Genome browsers probably won't help here, as you'd still need to manually load hits in for each genome, separately.

ADD COMMENT
0
Entering edit mode

Thank you for your detailed answer. I was wondering if gff2bed specifically requires GENCODE annotations? I am working with fungal genomes, and so far (I have very limited bioinformatic experience), I have not come across GENCODE. After a quick google, it seems like it is for human and mouse genomes?

ADD REPLY
1
Entering edit mode

GTF is just a format. This isn't specific to GENCODE. A lot of questions here are human or mouse specific and so that's why I mentioned GENCODE as an example. If you have access to GTF files for your genome, perhaps from some consortium working with and publishing fungal genome data, they should work fine.

ADD REPLY

Login before adding your answer.

Traffic: 2316 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6