Question

What does Ensembl/GENCODE gene_type "processed_transcript" really mean?

1

8.1 years ago

james.lloyd ▴ 100

I am using gencode.v19.annotation.gtf (head of the GTF below) to assign gene_types to the transcripts in my study via Ensembl gene IDs. An example line from the GTF is also below.

Some gene_types have the name processed_transcript while others are lincRNA, antisense, etc.

Ensembl just lists the processed_transcript "biotype" under long non-coding transcripts. That makes sense given that, as I understand it, processed transcripts are those that do not have an ORF: http://uswest.ensembl.org/Help/Faq?id=468

But what is unclear to me is the difference between a processed_transcript and these other long non-coding transcripts. According to Vega (http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html), processed_transcript sits above these other long ncRNAs in a hierarchy, which makes sense, except that I see many transcripts with this annotation and not one of the subtypes like lincRNA. Why would that be?

Based on what GENCODE has written about biotypes (https://www.gencodegenes.org/gencode_biotypes.html), I guess something would be processed_transcript if it has no ORF and does not meet the criteria for other categories like lincRNA or antisense. Does anyone know if this is true?

Header:

##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74) 
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05

Example line:

chr1    HAVANA  gene    11869   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
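A minimal sketch of the gene ID to gene_type lookup described in this question, assuming a tab-separated GENCODE-style GTF in which `gene` rows carry `gene_id` and `gene_type` attributes. The function names and the option to strip the version suffix (`.4` in `ENSG00000223972.4`) are illustrative assumptions, not something taken from the GENCODE documentation.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000223972.4";  or  level 2;
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict (repeated keys keep the last value)."""
    return dict(ATTR_RE.findall(attr_field))


def gene_types_from_gtf_lines(lines, strip_version=True):
    """Map gene_id -> gene_type using only the 'gene' feature rows."""
    gene_types = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines such as ##provider: GENCODE
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs["gene_id"]
        if strip_version:
            gene_id = gene_id.split(".")[0]
        gene_types[gene_id] = attrs["gene_type"]
    return gene_types


if __name__ == "__main__":
    with open("gencode.v19.annotation.gtf") as handle:
        lookup = gene_types_from_gtf_lines(handle)
    print(lookup.get("ENSG00000223972"))
```

Stripping the version is only useful if the study's IDs are unversioned; pass `strip_version=False` to keep IDs exactly as they appear in the GTF.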

RNA-Seq annotation ensembl • 6.4k views

updated 8.1 years ago by i.sudbery 20k • written 8.1 years ago by james.lloyd ▴ 100

1

For pseudogenes, the 'processed' vs 'nonprocessed' categories are well defined in the Wikipedia link. Processed pseudogenes arise by retrotransposition and have a CDS-like gene structure without introns.

8.1 years ago by microfuge ★ 1.9k

0

Thanks, but these are not pseudogenes. Pseudogenes have a different gene_type in the databases, and I understand that distinction.

8.1 years ago by james.lloyd ▴ 100

score 2 · Answer 1 · 2016-10-18

I think your assumption is right: "processed_transcript if it has no ORF and does not meet the criteria for other categories like lincRNA or antisense".

The processed transcript category would have been in use long before the things now called antisense and lncRNA were annotated. There are now more data and guidelines for annotating the latter, so they no longer fall into the broad category of processed_transcript (i.e. transcripts that are 'processed' by the cell: they are spliced and can have a polyA tail added to them).

Everything that does not get classified as lncRNA or ncRNA will be tagged as processed_transcript (i.e. the unclassified category in the VEGA help page).

score 1 · Answer 2 · 2016-10-18

1

8.1 years ago

i.sudbery 20k

My understanding was that "processed_transcripts" are non-coding transcripts associated with a gene that does have a coding isoform. For a transcript to be a lincRNA, it needs to be part of an entirely non-coding gene.

8.1 years ago by i.sudbery 20k

0

I think your understanding is right @i.sudbery. Perhaps my answer was a bit confusing... processed_transcript can be used either as a gene type or as a transcript type. One can indeed have a processed transcript in a locus that is coding. That's really common. In cases like that, the gene type will be protein_coding (not processed_transcript) and the non-coding transcript will be processed_transcript. One can also have a locus where both the gene and transcript types are 'processed_transcript' (perhaps not too common, e.g. AC005614.5). The lncRNA is a transcript in a gene type that is classified by GENCODE as 'processed_transcript'. There should not be a lncRNA in a gene that is coding. I always find it useful to look at those tricky cases using the browser, and BioMart can help out when trying to find these examples (just search for processed_transcript in the FILTERS under gene type or transcript type).
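The same gene type versus transcript type check can be roughed out directly on the GTF rather than in BioMart. This is only a sketch, assuming a tab-separated GENCODE-style file whose `transcript` rows carry both `gene_type` and `transcript_type` attributes (as in the v19 example line in the question); `biotype_pairs` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches attribute pairs such as: gene_type "protein_coding";  or  level 2;
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')


def biotype_pairs(lines):
    """Count (gene_type, transcript_type) combinations over 'transcript' rows."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip ## header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        counts[(attrs.get("gene_type"), attrs.get("transcript_type"))] += 1
    return counts


if __name__ == "__main__":
    with open("gencode.v19.annotation.gtf") as handle:
        pairs = biotype_pairs(handle)
    # Show only the combinations that involve processed_transcript.
    for (gene_type, transcript_type), n in pairs.most_common():
        if "processed_transcript" in (gene_type, transcript_type):
            print(gene_type, transcript_type, n)
```

Rows where the gene type is protein_coding and the transcript type is processed_transcript correspond to the common case described above; rows where both are processed_transcript correspond to the less common one.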

8.1 years ago by Denise CS ★ 5.2k

0

It is a little confusing that a gene biotype can contain the word "transcript" and contain multiple transcripts whose biotype is not "processed_transcript".

processed_transcript implies that it refers to a transcript, and a single transcript at that.

8.1 years ago by i.sudbery 20k