A GTF file I downloaded from iGenomes contains a "tss_id" attribute.
I suspect it is used to group annotated features whose transcripts originate from the same transcription start site (TSS). Am I correct?
How is this attribute determined? Where does it come from?
Is there a TSS database somewhere in which these tss_id values are used?
tss_id (and p_id, the protein id) are required by the cuffdiff program to perform all the differential splicing/coding contrasts. I have never seen them used anywhere else.
It is possible to assign tss_id to a GTF file. All transcripts of a given gene having the same start position of their first exon should be assigned the same tss_id.
Similarly, all transcripts having the same coding sequence (though different UTR) should be assigned the same p_id.
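The two grouping rules above can be sketched in Python. This is only an illustration under stated assumptions, not how Illumina, cuffcompare, or the R script mentioned below actually do it. The record layout (plain dicts with GTF-like fields), the function name assign_ids, and the "TSS1"/"P1" naming scheme are all hypothetical. The sketch also assumes that "start of the first exon" means the 5' end, so on the minus strand it uses the highest exon end coordinate.

```python
from collections import defaultdict

def assign_ids(records):
    """Group transcripts by shared TSS and by shared coding sequence.

    records: iterable of dicts with keys gene_id, transcript_id, feature
    ('exon' or 'CDS'), seqname, strand, start, end (hypothetical layout).
    Returns (tss_ids, p_ids), each mapping transcript_id -> group label.
    """
    exons = defaultdict(list)
    cds = defaultdict(list)
    info = {}
    for r in records:
        tid = r["transcript_id"]
        info[tid] = (r["gene_id"], r["seqname"], r["strand"])
        if r["feature"] == "exon":
            exons[tid].append((r["start"], r["end"]))
        elif r["feature"] == "CDS":
            cds[tid].append((r["start"], r["end"]))

    tss_keys, tss_ids = {}, {}
    for tid in sorted(exons):
        gene, seqname, strand = info[tid]
        # 5' end of the first exon: lowest start on '+', highest end on '-'
        # (assumption: the rule is applied in a strand-aware way).
        if strand == "-":
            pos = max(end for _, end in exons[tid])
        else:
            pos = min(start for start, _ in exons[tid])
        key = (gene, seqname, strand, pos)
        if key not in tss_keys:
            tss_keys[key] = f"TSS{len(tss_keys) + 1}"
        tss_ids[tid] = tss_keys[key]

    p_keys, p_ids = {}, {}
    for tid in sorted(cds):
        gene, seqname, strand = info[tid]
        # Same set of CDS intervals within a gene -> same p_id,
        # regardless of differences in the UTRs.
        key = (gene, seqname, strand, tuple(sorted(cds[tid])))
        if key not in p_keys:
            p_keys[key] = f"P{len(p_keys) + 1}"
        p_ids[tid] = p_keys[key]
    return tss_ids, p_ids

def rec(gene, tid, feature, strand, start, end):
    """Build one toy record on a made-up chromosome 'chrI'."""
    return {"gene_id": gene, "transcript_id": tid, "feature": feature,
            "seqname": "chrI", "strand": strand, "start": start, "end": end}

# Toy data: t1 and t2 share a TSS and a CDS; t3 has a different TSS but
# the same CDS; t5 and t6 are non-coding minus-strand transcripts.
demo = [
    rec("g1", "t1", "exon", "+", 100, 200), rec("g1", "t1", "exon", "+", 300, 400),
    rec("g1", "t1", "CDS", "+", 150, 200), rec("g1", "t1", "CDS", "+", 300, 350),
    rec("g1", "t2", "exon", "+", 100, 200), rec("g1", "t2", "exon", "+", 300, 450),
    rec("g1", "t2", "CDS", "+", 150, 200), rec("g1", "t2", "CDS", "+", 300, 350),
    rec("g1", "t3", "exon", "+", 120, 200), rec("g1", "t3", "exon", "+", 300, 400),
    rec("g1", "t3", "CDS", "+", 150, 200), rec("g1", "t3", "CDS", "+", 300, 350),
    rec("g2", "t5", "exon", "-", 500, 600), rec("g2", "t5", "exon", "-", 700, 800),
    rec("g2", "t6", "exon", "-", 450, 600), rec("g2", "t6", "exon", "-", 700, 800),
]
tss_ids, p_ids = assign_ids(demo)
print(tss_ids)
print(p_ids)
```

Writing the labels back into the attribute column of each GTF line is left out here; transcripts without CDS records simply receive no p_id in this sketch.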
Annotations from iGenomes already have tss_id and p_id assigned according to these rules. There is no external source of additional information. I asked Illumina how they assign them, and they declined to answer.
For convenience, I have written an R script, cuffdiff_gtf_attributes, which does this, and I have tested it with Ensembl GTF files. The modified GTF files allow me to perform differential isoform and protein detection with cuffdiff.
Thanks for clarifying what tss_id and p_id correspond to and for the potentially useful script.
You say that a given tss_id refers to the same start position of the first exon.
In the particular case of C. elegans, many genes are subject to trans-splicing: the 5' end of the transcript is replaced by an RNA coming from somewhere else (see http://wormbook.org/chapters/www_transsplicingoperons/transsplicingoperons.html). The transcription start site can therefore be upstream of the first exon. I don't know whether the GTF files from Illumina use the real TSS (when it is known) or the start position of the first exon when they assign a tss_id.
This attribute comes from, and is only needed by, the Cufflinks suite of programs. You can take any GTF file that lacks these attributes and add them with cuffcompare. If you don't need to use Cufflinks, you can ignore tss_id completely.
I do not think the GTF format provides any specific way to represent trans-splicing.