Hi,
I was struggling to find an intronic GTF for some RNA-seq alignment. Several page showed how to get an intronic bed, but I couldn't find a way to convert it into a GTF just like the exonic GTF, with "gene" and "transcript" parent structure. Finally I was able to find some tools and adjusted some commands to get what I wanted. Here, I am sharing my method, so others can comment if I missed something or use these commands to make intronic GTF for there own exonic GTF. I have visualized my output GTF and it looks what I wanted. For single exonic genes, you end up with a transcript with no exons, so I assume it will give zero counts in a STAR/HISAT2 run where exons are required.
Step 1: First, download you GTF of interest and unzip it. I am using the GENCODE GRCh38 p13 release 33 comprehensive GTF (PRI). I have sorted it, but don't know if that is essential. Extract the transcript information from it into a bed file. You need convert2bed from BEDOPS and subtract from BEDtools.
awk 'BEGIN{FS=OFS="\t"} $3 == "transcript"' gencdH38p13r33.sorted.gtf | convert2bed -i gtf - > trans.bed
Step 2: Get exons information in bed format from same .gtf
awk 'BEGIN{FS=OFS="\t"} $3 == "exon"' gencdH38p13r33.sorted.gtf | convert2bed -i gtf - > exons.bed
The exons include CDS and UTRs.
Step 3: Subtract exons from transcripts to get intron segment information from transcripts in bed format.
bedtools subtract -a trans.bed -b exons.bed > introns.bed
Step 4: Convert introns.bed to GTF format :
awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' introns.bed > intron.gtf
Step 5: Replace "transcript" in column 3 with "exons" :
awk 'BEGIN{FS=OFS="\t"} {gsub("transcript","exon",$3);print}' intron.gtf > intronb.gtf
Step 6: Extract lines specifiying "gene" or "transcript" information from original gtf.
awk 'BEGIN{FS=OFS="\t"} $3 == "gene" || $3 == "transcript"' gencdH38p13r33.sorted.gtf > genetrans.gtf
Step 7: Now concatenate genetrans.gtf and intronb.gtf, this will result in final intronic gtf with gene, transcript and exon hierarchical structure. Here introns are labelled as "exons". You could also add the intronb.gtf to the full GTF, in that case, change the "transcript" label to "intron" (Step 5) and specify "intron" for your alignment.
cat genetrans.gtf intronb.gtf > intronic.gtf
Voila!!! Check it in IGV and compare with original GTF to see if it worked. I was able to get introns for overlapping genes, even if they were on same or opposite strands. The code did give two errors during subtract steps, but my random gene check didn't give an unwanted results. Please let me know your feedback. Hope this is useful!
Piyush
Hi Juke,
Thanks. I actually wanted a GTF file which I can use for to count intronic reads for all genes, just like normal alignment for exons. I didn't know that intron is not a valid specification for GTF, so thanks for pointing that out. I also wanted to have "gene_id" and "transcript_id" information associated with each introns, so I can have intronic transcripts for each transcript, which I was able to get following the way I posted. I didn't post it in question as I had already figured out what I wanted and just wanted to share the pipeline.
Also thanks for your solution and feedback.