Intron and exon coordinates of all genes from Hg19
3
1
Entering edit mode
5.0 years ago
yliueagle ▴ 290

Is there a way to get Intron and exon coordinates of all genes from Hg19? The exons can be got from this post: Exon Coordinates Of Hg19 Genome Download, but how can I also get intron coordinates? I would keep the corresponding gene names in the result

Thanks!

gene intron exon cordinates • 5.6k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
4
Entering edit mode
5.0 years ago
vkkodali_ncbi ★ 3.8k

Here's a way to extract exon and intron features from GFF3 files in BED format. I am using the RefSeq's latest GRCh37 (hg19) annotation but I'd expect the following code to work (may be with minimal changes) with any other GFF3 file.

I make a distinction between 'all' and 'unique' features as follows. Let's consider the following gene:

GENE            ######################################################
ISOFORM-1       #########--------#######-------####----------#########
                E1.1             E1.2          E1.3          E1.4
                         I1.1           I1.2       I1.3
ISOFORM-2       #########--------###########---------###-----#########
                E2.1             E2.2                E2.3    E2.4
                         I2.1               I2.2        I2.3

In my 'all' output files, each and every exon and intron will be returned. In my 'uniq' output files, identical features are omitted. So, the 'uniq' exon file will include only one of E1.1/E2.1 and E1.4/E2.4, and the uniq intron file will include only one of I1.1/I2.1.

## download GFF3 file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

## extract all exons
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -k1,1 -k2,2n > hg19_exons.all.bed 

## extract unique exons  
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*/\1/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_exons.uniq.bed 

## extract all introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.all.bed 

## extract unique introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{split($4,a,"|"); $4=a[1]; print $0}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.uniq.bed
ADD COMMENT
1
Entering edit mode
5.0 years ago
$ curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | head | awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s,EXON\n",$1,$2,$3,S[i],E[i]); if(i+1<=n) printf("%s,%s,%s,%s,%s,INTRON\n",$1,$2,$3,int(E[i]),int(S[i+1]));  }}' 
uc001aaa.3,chr1,+,11873,12227,EXON
uc001aaa.3,chr1,+,12227,12612,INTRON
uc001aaa.3,chr1,+,12612,12721,EXON
uc001aaa.3,chr1,+,12721,13220,INTRON
uc001aaa.3,chr1,+,13220,14409,EXON
uc010nxr.1,chr1,+,11873,12227,EXON
uc010nxr.1,chr1,+,12227,12645,INTRON
uc010nxr.1,chr1,+,12645,12697,EXON
uc010nxr.1,chr1,+,12697,13220,INTRON
uc010nxr.1,chr1,+,13220,14409,EXON
uc010nxq.1,chr1,+,11873,12227,EXON
uc010nxq.1,chr1,+,12227,12594,INTRON
uc010nxq.1,chr1,+,12594,12721,EXON
uc010nxq.1,chr1,+,12721,13402,INTRON
uc010nxq.1,chr1,+,13402,14409,EXON
uc009vis.3,chr1,-,14361,14829,EXON
uc009vis.3,chr1,-,14829,14969,INTRON
uc009vis.3,chr1,-,14969,15038,EXON
uc009vis.3,chr1,-,15038,15795,INTRON
uc009vis.3,chr1,-,15795,15942,EXON
uc009vis.3,chr1,-,15942,16606,INTRON
uc009vis.3,chr1,-,16606,16765,EXON
uc009vit.3,chr1,-,14361,14829,EXON
uc009vit.3,chr1,-,14829,14969,INTRON
(...)
ADD COMMENT
0
Entering edit mode
5.0 years ago
Ram 44k

You can use R to add 1 to the 5th field and subtract 1 from the 4th field in the next line (as long as gene identifier matches) in the curl solution from that post. You may need a custom function.

Or you can create a bed file from the exon coordinates, another bed file with one line per chromosome, with start = min(all exon start coordinates for that chr) and end=max(all exon end coordinates for that chr). You can then use bedtools or bedops to subtract the exon bed from the other bed to get all introns. I'd personally go with this solution as it would be easier to implement.

ADD COMMENT

Login before adding your answer.

Traffic: 1985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6