You could get gene annotations for your assembly of Arabidopsis and convert them to BED with BEDOPS gff2bed:
$ wget -qO- https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff | awk '$3=="gene"' | gff2bed - > TAIR10.genes.bed
With TAIR10.genes.bed in hand, you can map your coordinates to these genes and get their IDs.
Here's an example of an ad-hoc region being mapped with BEDOPS bedmap:
$ echo -e 'Chr1\t20259962\t20260494' | bedmap --echo --echo-map-id --delim '\t' - TAIR10.genes.bed
Chr1 20259962 20260494 AT1G54270
Notice the difference in chromosome names: the annotations use the TAIR convention (Chr1 and so on). To map multiple regions, you'll need to make the chromosome names in your regions TAIR-compatible for querying, and sort the regions with BEDOPS sort-bed:
$ tail -n+2 introns.txt | awk -vFS="\t" -vOFS="\t" '{ print "Chr"$0 }' | sort-bed - > introns.bed
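As a side note, the awk step above prefixes every line unconditionally, so a file whose names already start with Chr would end up with names like ChrChr1. Here is a minimal sketch of a more defensive variant. The sample rows and the file name introns_sample.txt are made up for illustration, and it assumes a tab-delimited file with one header line:

```shell
# Hypothetical sample input: a header line plus tab-delimited regions,
# mixing bare ("1") and already-prefixed ("Chr2") chromosome names.
printf 'chrom\tstart\tend\n1\t20259962\t20260494\nChr2\t100\t200\n' > introns_sample.txt

# Skip the header, then add "Chr" only where the first column lacks it.
tail -n +2 introns_sample.txt \
  | awk -F'\t' -v OFS='\t' '{ if ($1 !~ /^Chr/) $1 = "Chr" $1; print }' \
  > introns_sample.renamed.txt

cat introns_sample.renamed.txt
```

The renamed output would then go through sort-bed in the same way as in the command above.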
Then you can map the sorted, BED-formatted introns:
$ bedmap --echo --echo-map-id --delim '\t' introns.bed TAIR10.genes.bed > answer.bed
If an intron overlaps multiple genes, then the ID column will contain a semicolon-delimited list of those gene names (the separator can be changed with the --multidelim option).
Otherwise, there will be either one name (for one overlap) or nothing (in the case of no overlaps). If you want to filter out introns that do not overlap a gene annotation, you can add the --skip-unmapped option:
$ bedmap --echo --echo-map-id --delim '\t' --skip-unmapped introns.bed TAIR10.genes.bed > answer.bed
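When an intron overlaps several genes, it can be convenient to have one row per intron-gene pair. Here is a minimal sketch with awk. It assumes the ID list is in the fourth column and is separated by semicolons (bedmap's default for multi-value output). It runs on made-up rows in a hypothetical answer_sample.bed rather than on real bedmap output, and GENE_A and GENE_B are placeholder IDs:

```shell
# Hypothetical stand-in for answer.bed: chrom, start, end, ID list.
# Row 2 has two placeholder IDs; row 3 has no overlap (empty ID column).
printf 'Chr1\t20259962\t20260494\tAT1G54270\nChr2\t100\t200\tGENE_A;GENE_B\nChr3\t5\t10\t\n' > answer_sample.bed

# Split column 4 on ";" and print one line per ID.
# Rows with an empty ID column produce no output.
awk -F'\t' -v OFS='\t' '{
  n = split($4, ids, ";")
  for (i = 1; i <= n; i++) print $1, $2, $3, ids[i]
}' answer_sample.bed > answer_sample.long.bed

cat answer_sample.long.bed
```

If you have changed the separator with --multidelim, the split character in the awk script would need to match.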
What exactly went wrong? Did you try all the solutions in that thread?