Another approach is to convert your BLASTN hits to sorted BED, using awk
and sort-bed
:
$ awk -v FS="\t" -v OFS="\t" '{print($2, ($7-1), $8, $1, $11)}' blastHits.mtx | sort-bed - > blastHits.bed
Use gff2bed
to make a sorted BED file from GENCODE annotations, e.g., assuming you are working in hg38
:
$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.basic.annotation.gff3.gz \
| gunzip -c - \
| gff2bed - \
| grep -w gene - \
> gencode.v33.genes.bed
Repeat for other genomes/sources of annotations.
Then map the BLAST hits to genes with bedmap
:
$ bedmap --echo --echo-map-id-uniq --skip-unmapped --delim '\t' blastHits.bed gencode.v33.genes.bed > answer.bed
The file answer.bed
will contain each BLAST hit that has overlaps with one or more GENCODE gene annotations, along with their IDs. Hits without overlapping genes are left out of the result; if those are needed, just take out the --skip-unmapped
option.
You'd repeat the map step for the other eight of your nine genomes/assemblies (assuming I am reading your question correctly).
If you'd want to automate that, you'd need to do some scripting work, i.e., using a for
loop or its equivalent in whatever language you're most comfortable in (Python, bash, etc.). Genome browsers probably won't help here, as you'd still need to manually load hits in for each genome, separately.
Thank you, I will have a look into the bedtools intersect!