I just downloaded a GTF file from Ensembl and noticed that the transcript_id tag is missing from some records in the attributes field. I read that transcript_id and gene_id are mandatory tags (see https://genome.ucsc.edu/FAQ/FAQformat.html#format4). Is the file corrupted, or have the requirements been relaxed?
Thanks in advance!
These are the attribute fields of the first two records (see missing transcript_id in the first one).
Written 9.6 years ago by Pfs; updated 22 months ago by Ram.
Which GTF file are you talking about, and why do you care about the transcript_id field? From the description it looks like you are using the human GTF file. I'd be very interested to know which version; I'd bet Ensembl 38, the latest release. My understanding is that the GTF file was modified in the latest release. Here is the problem I encountered: "A: RNA-SeQC error no output". Scroll down to the bottom; it isn't my question, but read the comments below it.
I don't think it is a bug; I believe it is a conscious decision by Ensembl. I am talking about the Homo sapiens GTF file only. If you download a few different GTF versions from http://www.gencodegenes.org/releases/ and compare them, you will notice that the attributes of the gene line used to have both a gene_id field and a transcript_id field. Here is the list of GTF files that I compared:
gencode.v19.annotation.gtf
gencode.v20.annotation.gtf
gencode.v21.annotation.gtf
gencode.v22.annotation.gtf
gencode.v7.annotation.gtf
Homo_sapiens.GRCh37.62.gtf
Homo_sapiens.GRCh37.74.gtf
Homo_sapiens.GRCh37.75.gtf
Homo_sapiens.GRCh38.76.gtf
Homo_sapiens.GRCh38.77.gtf
Homo_sapiens.GRCh38.78.gtf
Homo_sapiens.GRCh38.79.gtf
In all of those GTF files except the two latest ones, Homo_sapiens.GRCh38.79 and gencode.v22.annotation.gtf (which are the same annotation from two different sources), the transcript_id field is present. BUT if you look closely, the transcript_id field in the gene line has the same value as the gene_id! I understand this was a bug of some sort, so the latest GTF is actually an improved version, although I suspect many tools have not adapted to it yet.
I'd be very interested to know more on this topic, because I feel it's important to understand whether this is or isn't a bug. As I mentioned in my previous comment, I couldn't produce an RNA-SeQC report when I used the Ensembl 38 genome annotation.
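For anyone who wants to repeat this comparison on their own files, here is a minimal sketch that tallies, per feature type, how many records lack transcript_id and how many carry a transcript_id identical to their gene_id. The function names and the demo records are made up for illustration, and the attribute parsing assumes the usual `key "value";` layout of column 9, so unquoted values are simply skipped.

```python
import re
from collections import Counter

# Matches key "value" pairs in the attributes column (column 9).
# Assumption: values are double-quoted; unquoted values are ignored.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the attributes column as a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(field))


def tally_transcript_id(lines):
    """Count, per feature type, records without transcript_id and
    records whose transcript_id equals their gene_id."""
    missing, same_as_gene = Counter(), Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature = cols[2]
        attrs = parse_attributes(cols[8])
        if "transcript_id" not in attrs:
            missing[feature] += 1
        elif attrs["transcript_id"] == attrs.get("gene_id"):
            same_as_gene[feature] += 1
    return missing, same_as_gene


# Invented records, only to show the two cases discussed in this thread.
demo = [
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "demo";',
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\told\tgene\t100\t900\t.\t+\t.\tgene_id "G2"; transcript_id "G2";',
]
missing, same_as_gene = tally_transcript_id(demo)
print(dict(missing), dict(same_as_gene))
```

On a real file you would pass an open file handle instead of the `demo` list.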
I've noticed this too with these GTF files and I was also quite surprised by the lack of transcript_id fields. I'm also not entirely sure whether this is intentional or a bug. From googling around when I first noticed this, it seems that the presence of "transcript_id" isn't always specified in descriptions of GTF. I think much of the problem is that there's no real gold standard specification for the format. The closest I've seen is from Ensembl, which basically says, "it's GFF version 2". In fact, even the examples that Ensembl gives lack transcript_ids. This makes sense now that some sources are including "gene" entries, for which a transcript_id has no meaning.
Perhaps we should push to get GTF taken over by the GA4GH file formats team. That'd at least allow a single format definition.
Edit: If others are in favor of the GA4GH route I'd be happy to contact them. Format spec. inconsistencies like this really need to be nipped in the bud.
AFAIK the GTF 2.0 format is actually defined by having the fields gene_id and transcript_id present. Otherwise it would be a GFF 2.0 file. On the other hand it was clearly ... what is even the right word ... unwise ... to introduce a new "format" called GTF for the sole reason of enforcing these two attributes.
A file that mixes rows of GFF and GTF is still a valid GFF file and as such should be called GFF. Of course, it does not help that there is a GFF 3 format that is similar to GFF 2.0.
I agreed with you until I found Ensembl explicitly defining GTF as GFF 2.0. I'm of the opinion that that was a bad move by Ensembl, but it becomes a question of who gets to define things. I think GTF2.2 as defined by the Brent lab is what most of us think of as the format, but even they mention revising the Ensembl GTF (aka GFF 2.0) definition.
I think I have an answer for this one. I wrote to Ensembl and here is the reply from them.
Until release 74, the gtf files had no gene lines, only transcripts. Since then, we have added these in, but they do not have a transcript_id attribute as a gene can have several transcripts and it is not a one-to-one relationship.
According to the gtf specifications, any non-required field should be ignored, so we did not anticipate that some software would break because of this addition.
We are now aware of this issue and are investigating a solution that would cover both use cases. In the mean time, I would recommend removing the gene lines from the gtf file before submitting them to the RNA-SeQC tool.
This was obviously in response to my particular need for the transcript_id tag. In my case I simply removed all gene feature lines from the GTF file and RNA-SeQC worked like a charm. I feel that the GENCODE GTF files misled me by having a transcript_id tag in the gene feature line. In the GENCODE GTF, the value of transcript_id in the gene feature line is identical to the gene_id value, which is misleading in my view. And, as mentioned in the email, a gene can have more than one transcript.
I hope this info will help clear up some confusion. It has certainly helped me.
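The workaround of removing the gene lines can be sketched in a few lines of Python. This is only an illustration: `drop_gene_lines` and `filter_gtf` are made-up names, and it assumes a tab-separated file with the feature type in column 3.

```python
def drop_gene_lines(lines):
    """Yield every line except records whose feature type (column 3) is 'gene'."""
    for line in lines:
        if not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) > 2 and cols[2] == "gene":
                continue  # skip gene records, which carry no transcript_id
        yield line  # header comments and all other records pass through


def filter_gtf(src_path, dst_path):
    """Write a copy of the GTF at src_path without its gene records."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_gene_lines(src))


# Invented records for a quick check.
demo = [
    "#!genome-build demo\n",
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
kept = list(drop_gene_lines(demo))
print(len(kept))
```

The filter only touches records whose third column is exactly `gene`, so transcript, exon and CDS records are written out unchanged.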
Personally, I decided to distinguish the old Ensembl GTF format (up to release 74) from the new Ensembl GTF format (after release 74) by calling the latter "GTF3".
With the old version it was a bit painful to rebuild the transcripts and the genes from the exon and CDS features. Now the format is moving toward something close to the GFF3 defined by the Sequence Ontology consortium (http://www.sequenceontology.org/resources/gff3.html).
I try to use the GFF3 format as much as possible, since it has a well-defined specification.
I hope that in the future Ensembl shifts from the GTF format to GFF3. That format would still allow them to use their own Ensembl-specific attributes (9th column).
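To give an idea of what rebuilding transcripts and genes from an old-style file involves, here is a rough sketch that derives transcript and gene spans from exon records only. It is a simplification under stated assumptions: the names are hypothetical, CDS and codon features are ignored, and every exon is assumed to carry quoted gene_id and transcript_id attributes.

```python
import re

# Assumption: attributes follow the usual key "value"; layout.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def _extend(table, key, seqname, strand, start, end):
    """Grow the stored span for key so that it covers [start, end]."""
    if key in table:
        _, _, old_start, old_end = table[key]
        start, end = min(start, old_start), max(end, old_end)
    table[key] = (seqname, strand, start, end)


def spans_from_exons(lines):
    """Return (transcripts, genes): dicts mapping id -> (seqname, strand, start, end)."""
    transcripts, genes = {}, {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue  # only exon records are used in this sketch
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        _extend(transcripts, transcript_id, cols[0], cols[6], start, end)
        _extend(genes, gene_id, cols[0], cols[6], start, end)
    return transcripts, genes


# Invented exon records: one gene with two transcripts.
demo = [
    '1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tdemo\texon\t500\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tdemo\texon\t200\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]
transcripts, genes = spans_from_exons(demo)
print(transcripts, genes)
```

A file that already contains explicit gene and transcript records makes this bookkeeping unnecessary.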
Written 9.5 years ago by Juke34; updated 22 months ago by Ram.
I tried to run RNA-SeQC on my data using the danRer10 GTF file downloaded from Ensembl (latest version). As described here, the format is newer than that of release 74, and RNA-SeQC failed because of a format compatibility issue. The error is:
java.lang.RuntimeException: No rRNA found in GTF transcript_type field
at org.broadinstitute.cga.rnaseq.TranscriptList.toRRNAIntervalList(TranscriptList.java:414)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.createRefGeneAndRRNAFiles(RNASeqMetrics.java:1288)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.prepareFiles(RNASeqMetrics.java:191)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.execute(RNASeqMetrics.java:165)
at org.broadinstitute.cga.rnaseq.RNASeqMetrics.main(RNASeqMetrics.java:135)
My GTF file looks like this: there is no "transcript_type" attribute, only "transcript_biotype".
Then I edited the file, renaming gene_biotype and transcript_biotype to gene_type and transcript_type, and it works fine. Is this the right and simplest way to adapt the latest GTF format for RNA-SeQC, or will it cause any issues in the results?
I hope someone can clear this up for me :)
Thanks.
Justin
Written 9.2 years ago by kani; updated 2.2 years ago by Ram.
Please post things like this as a new question next time.
What you did should be fine; I'm surprised RNA-SeQC doesn't allow you to just specify the change with an option.
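For reference, the rename described above can be sketched as follows. The function name is made up, and the assumption that these two keys are the only ones RNA-SeQC needs comes solely from the report in this thread that the rename worked.

```python
import re

# Old attribute key -> key that RNA-SeQC reportedly looks for (per this thread).
RENAMES = {"gene_biotype": "gene_type", "transcript_biotype": "transcript_type"}

# Match a key only when it is directly followed by a quoted value.
KEY_RE = re.compile(r'\b(gene_biotype|transcript_biotype)(?= ")')


def rename_biotype_keys(line):
    """Rename the biotype keys in the attributes column (column 9) of one GTF line."""
    if line.startswith("#"):
        return line  # header comments are left untouched
    cols = line.split("\t")
    if len(cols) < 9:
        return line  # not a complete GTF record
    cols[8] = KEY_RE.sub(lambda m: RENAMES[m.group(1)], cols[8])
    return "\t".join(cols)


# Invented record for a quick check.
demo = ('1\tensembl\ttranscript\t1\t2\t.\t+\t.\t'
        'gene_id "G1"; gene_biotype "rRNA"; transcript_biotype "rRNA";\n')
renamed = rename_biotype_keys(demo)
print(renamed, end="")
```

Only the keys change; the values (for example "rRNA") and all other columns are kept as they are.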
I would say that the presence of that attribute is mandatory, and its absence is perhaps a bug.
GTF 2.2 specification: http://mblab.wustl.edu/GTF22.html
I am referring to Homo_sapiens.GRCh38.79.gtf from Ensembl (yes, the latest release).
I realise that in my previous post I didn't supply the correct link to the GitHub issue. Here it is: https://github.com/broadinstitute/RNA-SeQC/issues/1