Convert intron coordinates to gene names
3
0
Entering edit mode
5.3 years ago

I have a huge list of intron coordinates present in the Arabidopsis thaliana genome, I want to get the gene names based on intron coordinates.

Sampe itron coordinates:

chr start   end
1   20259962    20260494
1   20634981    20635060
1   18880866    18881011

Desired output:

chr start   end Name
1   20259962    20260494    AT1G54270
1   20634981    20635060    AT1G55320
1   18880866    18881011    AT1G50950

This thread (Find Out The Genes That Correspond To My Coordinates) is very close but I wasn't able to get the desired output.

RNA-Seq sqtl • 2.1k views
ADD COMMENT
0
Entering edit mode

but I wasn't able to get the desired output.

What exactly went wrong? Did you try all the solutions in that thread?

ADD REPLY
3
Entering edit mode
5.3 years ago

You could get gene annotations for your assembly of Arabidopsis and convert them to BED with BEDOPS gff2bed:

$ wget -qO- https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff | awk '$3=="gene"' | gff2bed - > TAIR10.genes.bed

With TAIR10.genes.bed in hand, you can map your coordinates to these genes and get their ID.

Here's an example of ad-hoc region being mapped with BEDOPS bedmap:

$ echo -e 'Chr1\t20259962\t20260494' | bedmap --echo --echo-map-id --delim '\t' - TAIR10.genes.bed 
Chr1    20259962    20260494    AT1G54270

Notice the difference in chromosome names, here using the TAIR convention. To do multiple regions, you'll need to take your regions and get their chromosome names "TAIR-compatible" for querying, as well as sorting them with BEDOPS sort-bed:

$ tail -n+2 introns.txt | awk -vFS="\t" -vOFS="\t" '{ print "Chr"$0 }' | sort-bed - > introns.bed

Then you can map the sorted, BED-formatted introns:

$ bedmap --echo --echo-map-id --delim '\t' introns.bed TAIR10.genes.bed > answer.bed

If an intron overlaps multiple genes, then the ID column will contain a comma-delimited list of those gene names.

Otherwise, there will be either one name (for one overlap) or nothing (in the case of no overlaps). If you want to filter out introns that do not overlap a gene annotation, one can add the --skip-unmapped option:

$ bedmap --echo --echo-map-id --delim '\t' --skip-unmapped introns.bed TAIR10.genes.bed > answer.bed
ADD COMMENT
1
Entering edit mode
5.3 years ago

use bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html to build a BED file of intron + name.

$ wget -q  -O - "ftp://ftp.ensemblgenomes.org/pub/release-44/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.44.gtf.gz" |\
gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().flatMap(G->G.getTranscripts().stream()).flatMap(T->T.getIntrons().stream()).forEach(I->println(I.getContig()+"\t"+(I.getStart()-1)+"\t"+I.getEnd()+"\t"+I.getTranscript().getGene().getGeneName()));' |\
sort -t $'\t' -k1,1 -k2,2n | uniq > introns.bed

$ head introns.bed
1   3913    3995    NAC001
1   4276    4485    NAC001
1   4605    4705    NAC001
1   5095    5173    NAC001
1   5326    5438    NAC001
1   7069    7156    ARV1
1   7232    7383    ARV1
1   7450    7563    ARV1
1   7649    7761    ARV1
1   7649    8235    ARV1

then use bedtools intersect to get the intersection with your files.

ADD COMMENT
1
Entering edit mode
5.3 years ago
ATpoint 86k

If you have a GTF file simply filter for entries with the "gene" feature and then use bedtools closest which will return 0 as distance in case of an overlap. From there you can awk out the format you want. It is unfortunate you have no strand information, so you will have multiple genes sometimes per intron.

ADD COMMENT

Login before adding your answer.

Traffic: 2091 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6