Question

Intron and exon coordinates of all genes from Hg19

1

Entering edit mode

5.3 years ago

yliueagle ▴ 290

Is there a way to get Intron and exon coordinates of all genes from Hg19? The exons can be got from this post: Exon Coordinates Of Hg19 Genome Download, but how can I also get intron coordinates? I would keep the corresponding gene names in the result

Thanks!

gene intron exon cordinates • 6.0k views

ADD COMMENT • link updated 5.3 years ago by vkkodali_ncbi ★ 3.8k • written 5.3 years ago by yliueagle ▴ 290

0

Entering edit mode

If you have an annotation file:

A: how to get intronic and intergenic sequences based on gff file?

ADD REPLY • link 5.3 years ago by ATpoint 87k

score 4 · Answer 1 · 2019-12-10

Here's a way to extract exon and intron features from GFF3 files in BED format. I am using the RefSeq's latest GRCh37 (hg19) annotation but I'd expect the following code to work (may be with minimal changes) with any other GFF3 file.

I make a distinction between 'all' and 'unique' features as follows. Let's consider the following gene:

GENE            ######################################################
ISOFORM-1       #########--------#######-------####----------#########
                E1.1             E1.2          E1.3          E1.4
                         I1.1           I1.2       I1.3
ISOFORM-2       #########--------###########---------###-----#########
                E2.1             E2.2                E2.3    E2.4
                         I2.1               I2.2        I2.3

In my 'all' output files, each and every exon and intron will be returned. In my 'uniq' output files, identical features are omitted. So, the 'uniq' exon file will include only one of E1.1/E2.1 and E1.4/E2.4, and the uniq intron file will include only one of I1.1/I2.1.

## download GFF3 file
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz

## extract all exons
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -k1,1 -k2,2n > hg19_exons.all.bed 

## extract unique exons  
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id=/)' \
  | cut -f1,4,5,7,9 \
  | perl -pe 's/ID.*gene=([^;]*).*/\1/g' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2-1,$3,$5,"1000",$4}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_exons.uniq.bed 

## extract all introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.all.bed 

## extract unique introns 
zgrep -v '^#' GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
  | awk 'BEGIN{FS="\t";OFS="\t"}($3=="exon" && $9~/transcript_id/)' \
  | perl -pe 's/ID.*gene=([^;]*).*transcript_id=([^;]*).*/\1|\2/g' \
  | cut -f1,4,5,7,9 | sort -k1,1 -k5,5 -k2,2n -k3,3n \
  | awk 'BEGIN{FS="\t";OFS="\t";chr=0;d=0;rs="NA"}{if(chr==$1 && rs==$5){print $1,d,$2-1,rs,0,$4; d=$3}else{chr=$1;rs=$5;d=$3}}' \
  | awk 'BEGIN{FS="\t";OFS="\t"}{split($4,a,"|"); $4=a[1]; print $0}' \
  | sort -u | sort -k1,1 -k2,2n > hg19_introns.uniq.bed

score 1 · Answer 2 · 2019-12-10

$ curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | head | awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s,EXON\n",$1,$2,$3,S[i],E[i]); if(i+1<=n) printf("%s,%s,%s,%s,%s,INTRON\n",$1,$2,$3,int(E[i]),int(S[i+1]));  }}' 
uc001aaa.3,chr1,+,11873,12227,EXON
uc001aaa.3,chr1,+,12227,12612,INTRON
uc001aaa.3,chr1,+,12612,12721,EXON
uc001aaa.3,chr1,+,12721,13220,INTRON
uc001aaa.3,chr1,+,13220,14409,EXON
uc010nxr.1,chr1,+,11873,12227,EXON
uc010nxr.1,chr1,+,12227,12645,INTRON
uc010nxr.1,chr1,+,12645,12697,EXON
uc010nxr.1,chr1,+,12697,13220,INTRON
uc010nxr.1,chr1,+,13220,14409,EXON
uc010nxq.1,chr1,+,11873,12227,EXON
uc010nxq.1,chr1,+,12227,12594,INTRON
uc010nxq.1,chr1,+,12594,12721,EXON
uc010nxq.1,chr1,+,12721,13402,INTRON
uc010nxq.1,chr1,+,13402,14409,EXON
uc009vis.3,chr1,-,14361,14829,EXON
uc009vis.3,chr1,-,14829,14969,INTRON
uc009vis.3,chr1,-,14969,15038,EXON
uc009vis.3,chr1,-,15038,15795,INTRON
uc009vis.3,chr1,-,15795,15942,EXON
uc009vis.3,chr1,-,15942,16606,INTRON
uc009vis.3,chr1,-,16606,16765,EXON
uc009vit.3,chr1,-,14361,14829,EXON
uc009vit.3,chr1,-,14829,14969,INTRON
(...)

score 0 · Answer 3 · 2019-12-10

You can use R to add 1 to the 5th field and subtract 1 from the 4th field in the next line (as long as gene identifier matches) in the curl solution from that post. You may need a custom function.

Or you can create a bed file from the exon coordinates, another bed file with one line per chromosome, with start = min(all exon start coordinates for that chr) and end=max(all exon end coordinates for that chr). You can then use bedtools or bedops to subtract the exon bed from the other bed to get all introns. I'd personally go with this solution as it would be easier to implement.