How to make an intronic GTF for alignment
2
1
Entering edit mode
4.6 years ago
piyushjo ▴ 710

Hi,

I was struggling to find an intronic GTF for some RNA-seq alignment. Several page showed how to get an intronic bed, but I couldn't find a way to convert it into a GTF just like the exonic GTF, with "gene" and "transcript" parent structure. Finally I was able to find some tools and adjusted some commands to get what I wanted. Here, I am sharing my method, so others can comment if I missed something or use these commands to make intronic GTF for there own exonic GTF. I have visualized my output GTF and it looks what I wanted. For single exonic genes, you end up with a transcript with no exons, so I assume it will give zero counts in a STAR/HISAT2 run where exons are required.

Step 1: First, download you GTF of interest and unzip it. I am using the GENCODE GRCh38 p13 release 33 comprehensive GTF (PRI). I have sorted it, but don't know if that is essential. Extract the transcript information from it into a bed file. You need convert2bed from BEDOPS and subtract from BEDtools.

awk 'BEGIN{FS=OFS="\t"} $3 == "transcript"' gencdH38p13r33.sorted.gtf | convert2bed -i gtf - > trans.bed

Step 2: Get exons information in bed format from same .gtf

awk 'BEGIN{FS=OFS="\t"} $3 == "exon"' gencdH38p13r33.sorted.gtf | convert2bed -i gtf - > exons.bed

The exons include CDS and UTRs.

Step 3: Subtract exons from transcripts to get intron segment information from transcripts in bed format.

bedtools subtract -a trans.bed -b exons.bed >  introns.bed

Step 4: Convert introns.bed to GTF format :

awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' introns.bed > intron.gtf

Step 5: Replace "transcript" in column 3 with "exons" :

awk 'BEGIN{FS=OFS="\t"} {gsub("transcript","exon",$3);print}' intron.gtf > intronb.gtf

Step 6: Extract lines specifiying "gene" or "transcript" information from original gtf.

awk 'BEGIN{FS=OFS="\t"} $3 == "gene" || $3 == "transcript"' gencdH38p13r33.sorted.gtf > genetrans.gtf

Step 7: Now concatenate genetrans.gtf and intronb.gtf, this will result in final intronic gtf with gene, transcript and exon hierarchical structure. Here introns are labelled as "exons". You could also add the intronb.gtf to the full GTF, in that case, change the "transcript" label to "intron" (Step 5) and specify "intron" for your alignment.

cat genetrans.gtf intronb.gtf > intronic.gtf

Voila!!! Check it in IGV and compare with original GTF to see if it worked. I was able to get introns for overlapping genes, even if they were on same or opposite strands. The code did give two errors during subtract steps, but my random gene check didn't give an unwanted results. Please let me know your feedback. Hope this is useful!

Piyush

RNA-Seq intronic-gtf • 4.6k views
ADD COMMENT
7
Entering edit mode
4.6 years ago
Juke34 8.9k

If I understand well you wanted to add introns in a GTF file? First of all and just for information GTF specification does not accept intron as feature type (except GTF version1) (find information here).

Good you found your way but here another way to do probably easier by using AGAT:

conda install -c bioconda agat
agat_sp_add_introns.pl --gff gencdH38p13r33.sorted.gtf -o gencdH38p13r33.sorted.intron.gff

Then if you want to go back to a (non-official) GTF file

agat_convert_sp_gff2gtf.pl --gff gencdH38p13r33.sorted.intron.gff --relax -o  gencdH38p13r33.sorted.intron.gtf
ADD COMMENT
1
Entering edit mode

Hi Juke,

Thanks. I actually wanted a GTF file which I can use for to count intronic reads for all genes, just like normal alignment for exons. I didn't know that intron is not a valid specification for GTF, so thanks for pointing that out. I also wanted to have "gene_id" and "transcript_id" information associated with each introns, so I can have intronic transcripts for each transcript, which I was able to get following the way I posted. I didn't post it in question as I had already figured out what I wanted and just wanted to share the pipeline.

Also thanks for your solution and feedback.

ADD REPLY
3
Entering edit mode
4.6 years ago

Its not clear what an intronic GTF should contain. Consider the following situation:

You have a gene that has two transcripts:

|>>>>>|---------|>>>>>| transcript 1
|>>>>>|------|>>>>>>>>| transcript 2

if you want the introns are you looking for:

      |>>>>>>>>>|       option 1 - ever intronic
      |>>>>>>|          option 2 - always intronic
                        option 3 - both of the above (per-transcript introns)
      |>>>>>>|>>|       option 4 - intron segments

you can generate all these options using the gtf2gtf tool form cgat-apps.

Note that if you want to quantify with featureCounts/HTSeq-count, then option 3 probably won't work - featureCounts will treat option 3 as option 1 or 2 depending on what settings you have, or worst ignore the reads mapping to the overlapping part altogether.

Option 1:

cgat gtf2gtf --stdin=input.gtf.gz --method=exons2introns  \
| awk -v OFS="\t" '{$3="exon"; print} \
| gzip > outfile.gtf.gz

Option 2:

cgat gtf2gtf --stdin=input.gtf.gz --method=merge-exons \
| cgat gtf2gtf --method=exons2introns --stdout=outfile.gtf.gz \
| awk -v OFS="\t" '{$3="exon"; print} \
| gzip > outfile.gtf.gz

Option 3:

cgat gtf2gtf --method=set-gene-to-transcripts --stdin=infile.gtf.gz \
| cgat gtf2gtf --method=exons2introns --stdout=outfile.gtf.gz \
| awk -v OFS="\t" '{$3="exon"; print} \
| gzip > outfile.gtf.gz

Option 4:

cgat gtf2gtf --method=genes-to-unique-chunks --stdin=infile.gtf.gz \
| bedtools intersect -v -a - -b <(zcat infile.gtf.gz | awk '$3=="exon"' ) \
| gzip > outfile.gtf.gz

You should be able to install cgat-apps from bioconda.

I think you should be able to ignore the awk bit by telling featureCounts to use the "intron" feature for its counting, but I'm not sure.

ADD COMMENT

Login before adding your answer.

Traffic: 1712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6