Question

hg38 gtf file containing all RNAs and features


6.4 years ago

CrisMar ▴ 80

Hi all, I annotated a bed file with an hg38 gtf file (from GENCODE) using bedtools intersect. The bed file contains crosslinking-induced truncation sites (CITS) from an iCLIP experiment, given as chromosome locations.

Although it worked well, the attributes column in the hg38 file is a mess.

Separating the attributes so that each one gets its own column turned out to be tricky (for me).
Not all the attributes I need are present. For example, I eventually need to show how many 3'UTRs, 5'UTRs, etc. are present.
I tried downloading a specific gtf file from the UCSC browser, but I cannot get all the RNAs (mRNAs, miRNAs, lncRNAs, etc.) in one file for the bedtools analysis.
I tried different gtf/gff parsers, but none seems to work well (or they are difficult to use).

Any suggestions appreciated.

-Learning.

gtf annotations parser attributes bedtools • 2.3k views



Can you show which files you have and what you are trying to get? In general, I wouldn't recommend running bedtools intersect on a gtf file, because bedtools doesn't really understand the relations between features (gene -> transcript -> exon), and your output file might get very messed up. Definitely check the result in a genome browser to see whether all your exons, transcripts, etc. are in place.
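To make the point about feature relations concrete, here is a minimal Python sketch of one thing a plain interval intersect cannot do by itself: telling 5' from 3' UTRs. It assumes the gtf labels both kinds simply as `UTR` (check the feature column of your own file), that each feature has already been read into a dict, and that the key names and the `label_utrs` helper are made up for illustration. The coordinates in the example are invented.

```python
def label_utrs(features):
    """Label UTR features as 5'UTR or 3'UTR using the CDS of the same transcript.

    `features` is a list of dicts with the (hypothetical) keys
    transcript_id, type, start, end, strand.
    """
    # First pass: record the full CDS span of every coding transcript.
    cds_span = {}
    for f in features:
        if f["type"] == "CDS":
            tid = f["transcript_id"]
            lo, hi = cds_span.get(tid, (f["start"], f["end"]))
            cds_span[tid] = (min(lo, f["start"]), max(hi, f["end"]))

    # Second pass: a UTR upstream of the CDS (in transcript direction) is 5'.
    labelled = []
    for f in features:
        if f["type"] != "UTR":
            continue
        span = cds_span.get(f["transcript_id"])
        if span is None:
            # No CDS seen for this transcript, so the side cannot be decided.
            labelled.append((f, "UTR"))
            continue
        if f["strand"] == "+":
            upstream = f["end"] < span[0]
        else:
            upstream = f["start"] > span[1]
        labelled.append((f, "5'UTR" if upstream else "3'UTR"))
    return labelled


# Invented coordinates, one transcript per strand.
example = [
    {"transcript_id": "tx_plus", "type": "UTR", "start": 100, "end": 149, "strand": "+"},
    {"transcript_id": "tx_plus", "type": "CDS", "start": 150, "end": 300, "strand": "+"},
    {"transcript_id": "tx_plus", "type": "UTR", "start": 301, "end": 400, "strand": "+"},
    {"transcript_id": "tx_minus", "type": "CDS", "start": 150, "end": 300, "strand": "-"},
    {"transcript_id": "tx_minus", "type": "UTR", "start": 301, "end": 400, "strand": "-"},
]

for feature, label in label_utrs(example):
    print(feature["transcript_id"], feature["start"], feature["end"], label)
```

The grouping by `transcript_id` is the important part: the same genomic interval can be a 5'UTR in one transcript and coding sequence in another, which is why the answer depends on the transcript and not only on the position.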

ADD REPLY • link 6.4 years ago by marina.v.yurieva ▴ 580

score 0 · Answer 1 · 2018-08-06

Yes, I have this bed file:

$ head CITS.bed

chr1    568974  568975  CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]   12   +
chr1    2239149 2239150 CITS_2[gene=chr1_f_c1136][PH=7][PH0=0.40][P=2.21e-04]   7   +
chr1    2239899 2239900 CITS_3[gene=chr1_f_c1138][PH=6][PH0=0.21][P=3.56e-04]   6   +
chr1    2461199 2461200 CITS_4[gene=chr1_f_c1237][PH=5][PH0=0.17][P=1.46e-04]   5   +

And I want to get something like this (as a random example), with each attribute in its own column and each column holding the same attribute in every row.

chr1    568974  568975  CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]   12   +   Gene_ID:EST000000   Gene_name: GeneX  Transcript_name: Transcript X  Feature: 5'UTR
chr1    2239149 2239150 CITS_2[gene=chr1_f_c1136][PH=7][PH0=0.40][P=2.21e-04]   7   + Gene_ID:EST0000001   Gene_name: GeneY  Transcript_name: Transcript Y  Feature: lnRNA
chr1    2239899 2239900 CITS_3[gene=chr1_f_c1138][PH=6][PH0=0.21][P=3.56e-04]   6   + Gene_ID:EST0000002   Gene_name: GeneZ  Transcript_name: Transcript Z  Feature: miRNA001
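A table like this could be produced roughly as follows. This is only a sketch under stated assumptions: the annotation features are assumed to be already loaded as dicts with gtf-style coordinates (1-based, inclusive), the key names and the `annotate_site` helper are hypothetical, and the feature values are placeholders, not real annotation. The one detail worth getting right is the coordinate convention, since bed is 0-based and half-open.

```python
def annotate_site(bed_fields, features):
    """Return one output row per annotation feature overlapping a bed site.

    bed_fields: [chrom, start, end, name, score, strand] as strings.
    features:   list of dicts with hypothetical keys chrom, start, end,
                strand, gene_id, gene_name, transcript_name, type.
    """
    chrom, start, end, _name, _score, strand = bed_fields
    start, end = int(start), int(end)
    rows = []
    for f in features:
        if f["chrom"] != chrom or f["strand"] != strand:
            continue
        # Convert the gtf interval [start, end] (1-based, inclusive) to the
        # bed convention [start - 1, end) before testing for overlap.
        if f["start"] - 1 < end and start < f["end"]:
            rows.append(list(bed_fields) + [
                f["gene_id"],
                f["gene_name"],
                f.get("transcript_name", "."),
                f["type"],
            ])
    return rows


site = ["chr1", "568974", "568975",
        "CITS_1[gene=chr1_f_c24][PH=12][PH0=0.29][P=1.01e-12]", "12", "+"]

# Placeholder feature, not taken from any real annotation.
placeholder_features = [
    {"chrom": "chr1", "start": 568900, "end": 569000, "strand": "+",
     "gene_id": "GENE_ID_X", "gene_name": "GeneX",
     "transcript_name": "TranscriptX", "type": "UTR"},
]

for row in annotate_site(site, placeholder_features):
    print("\t".join(row))
```

The linear scan over all features is fine for a handful of sites but slow for a whole annotation; for real data an interval index, or bedtools followed by a column-splitting step, would be the more practical route.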

The bed file contains only RNA reads (mRNAs, miRNAs, lncRNAs, snRNAs). I had originally converted the gtf file into a bed file before using bedtools intersect.

But yes, you are correct: the attributes column of the gtf file (gencode.v28.annotation.hg38.gtf) is really messy:

chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
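The attributes column is more regular than it looks: it is a list of `key value;` pairs, where the value is either quoted or a bare token. A minimal Python sketch for splitting it into columns, assuming a tab-separated gtf with the attributes in the ninth field (the function names and the choice of `wanted` keys are mine, not part of any library):

```python
import re

# One attribute: a key, then a quoted string or a bare token, then ';'.
ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_attributes(column):
    """Turn a gtf attribute string into a dict.

    Keys that occur more than once (for example `tag`) are joined with ','.
    """
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(column):
        value = quoted or bare
        attrs[key] = attrs[key] + "," + value if key in attrs else value
    return attrs


def gtf_row_to_columns(line, wanted=("gene_id", "gene_name", "gene_type",
                                     "transcript_name")):
    """Return the 8 fixed gtf fields plus one column per wanted attribute.

    Attributes missing from a row are written as '.', so every row ends up
    with the same number of columns.
    """
    fields = line.rstrip("\n").split("\t")
    attrs = parse_attributes(fields[8])
    return fields[:8] + [attrs.get(key, ".") for key in wanted]


# The gene line shown above, rebuilt with tabs between the nine fields.
gene_line = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; '
    'gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";',
])

print("\t".join(gtf_row_to_columns(gene_line)))
```

Filling absent attributes with `.` is what keeps the columns aligned: a `gene` row has no `transcript_name`, so without a placeholder every later column would shift.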