I am using gencode.v19.annotation.gtf (head of the GTF below) to assign gene_types to the transcripts in my study via Ensembl gene IDs. An example line from the GTF is also below.
Some gene_types have the name processed_transcript while others are lincRNA, antisense, etc.
Ensembl just lists the processed_transcript "biotype" under long non-coding transcripts. That makes sense, given that, as I understand it, processed transcripts are those that do not have an ORF: http://uswest.ensembl.org/Help/Faq?id=468
But what is unclear to me is the difference between a processed_transcript and these other long non-coding transcripts. According to Vega (http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html), processed_transcript sits above these other long ncRNAs in a hierarchy, which makes sense, except that I see many transcripts with this annotation and not one of the subtypes like lincRNA. Why would that be?
Based on what GENCODE has written about biotypes (https://www.gencodegenes.org/gencode_biotypes.html), I guess something would be processed_transcript if it has no ORF and does not meet the criteria for other categories like lincRNA or antisense. Does anyone know if this is true?
Header:
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
Example line:
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
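For context, the gene ID to gene_type lookup described above could be sketched roughly as follows in Python. This is only a minimal sketch, not a reference implementation: the function names (`parse_attributes`, `gene_types_from_gtf`) are hypothetical, the regular expression assumes attributes of the form `key "value";` or `key value;` as in the example line, and stripping the version suffix (`.4`) from the gene ID is an assumption about how the IDs in the study are written.

```python
import re

# Matches attribute pairs such as: gene_type "pseudogene";  or  level 2;
ATTR_RE = re.compile(r'(\S+)\s+"?([^";]+)"?;')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict (a repeated key keeps its last value)."""
    return dict(ATTR_RE.findall(attr_field))


def gene_types_from_gtf(lines, strip_version=True):
    """Map gene_id -> gene_type using only rows whose feature column is 'gene'."""
    gene_types = {}
    for line in lines:
        # Skip blank lines and the '##' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Split on any whitespace so both tab- and space-separated lines work;
        # the ninth field keeps the whole attribute string.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        if strip_version:
            # Assumption: study IDs lack the ".N" version suffix.
            gene_id = gene_id.split(".")[0]
        gene_types[gene_id] = attrs.get("gene_type")
    return gene_types


example = (
    'chr1\tHAVANA\tgene\t11869\t14412\t.\t+\t.\t'
    'gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; '
    'gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; level 2;'
)
print(gene_types_from_gtf(["##format: gtf\n", example]))
# {'ENSG00000223972': 'pseudogene'}
```

For the full file, the same function could be fed an open file handle (for example `with open("gencode.v19.annotation.gtf") as fh: gene_types_from_gtf(fh)`), and counting the resulting values would show how many genes carry processed_transcript versus lincRNA, antisense, and so on.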
For pseudogenes, the 'processed' vs 'nonprocessed' categories are well defined in the Wikipedia link. Processed pseudogenes involve retrotransposition and have a CDS-like gene structure without introns.
Thanks, but these are not pseudogenes. Pseudogenes have a different gene_type in the databases, and I understand that distinction.