I am working on a 200-sample RNA-seq dataset and I plan to use tin.py to calculate the TIN, which measures the integrity of the transcripts in a sample. The script was first published in 2016 to normalize counts accounting for biases generated by RNA degradation (the manuscript is available here: TIN).
Since my samples are FFPE-embedded, I observe different levels of degradation in the RNA-seq data. tin.py takes a BED12 containing all the transcripts and calculates one TIN per transcript; I generated the BED12 from the GENCODE .gtf file (v39). My intention is to normalize counts for DE at the GENE level, but this produces multiple TINs per gene (one per transcript); my current per-sample run is sketched below.
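For context, this is roughly how I am running tin.py per sample (file names are placeholders for my own data; `-i` and `-r` are the RSeQC tin.py arguments for the BAM input and the transcript-level BED12 reference, as far as I understand them):

```python
# Rough sketch of the per-sample tin.py call; file names are placeholders.
import subprocess

bed12 = "gencode.v39.annotation.bed12"                  # converted from the GENCODE v39 GTF
bams = [f"sample_{i:03d}.bam" for i in range(1, 201)]   # 200 FFPE samples

for bam in bams:
    # tin.py should write a per-transcript *.tin.xls table plus a per-sample summary
    subprocess.run(["tin.py", "-i", bam, "-r", bed12], check=True)
```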
In the paper they normalize by the median TIN across all transcripts belonging to a given gene (Methods in the link above, section "Normalizing gene level read counts using TIN metric"). Given that not all transcripts are expressed to the same extent, is it reasonable to expect that some gene-level TINs will be under- or over-estimated if I take the median across all isoforms?
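This is a minimal sketch (pandas) of what I understand the paper's gene-level summary to be: the median TIN across all transcripts of a gene. The column names ("geneID", "TIN") are what I recall from the tin.py output, and the ENST-to-ENSG map ("tx2gene.tsv") is assumed to have been extracted from the GENCODE v39 GTF:

```python
# Collapse per-transcript TINs to one median TIN per gene (assumed file/column names).
import pandas as pd

tin = pd.read_csv("sample_001.bam.tin.xls", sep="\t")        # per-transcript TINs from tin.py
tx2gene = pd.read_csv("tx2gene.tsv", sep="\t",
                      names=["transcript_id", "gene_id"])    # ENST -> ENSG map

merged = tin.merge(tx2gene, left_on="geneID", right_on="transcript_id")
gene_tin = (merged.groupby("gene_id")["TIN"]
                  .median()
                  .rename("median_TIN")
                  .reset_index())
gene_tin.to_csv("sample_001.gene_TIN.tsv", sep="\t", index=False)
```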
As far as I understand from the manuscript, the TIN is essentially the percentage of nucleotides with 'homogeneous' coverage per transcript. Is it reasonable to calculate the TIN on a GENE-based BED built from the union of non-overlapping exons across all the isoforms of the same gene?
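This is a hedged sketch of how I built that "GENE" BED12: for each gene, merge the exons of all its isoforms into non-overlapping blocks and emit one BED12 line per ENSG. It assumes a GENCODE-style GTF ("gencode.v39.annotation.gtf") with `gene_id` in the attribute field, and takes the strand from the exon records:

```python
# Build a gene-level BED12 from the union of non-overlapping exons of all isoforms.
import re
from collections import defaultdict

gene_exons = defaultdict(list)   # gene_id -> [(chrom, start0, end, strand), ...]
gene_id_re = re.compile(r'gene_id "([^"]+)"')

with open("gencode.v39.annotation.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[2] != "exon":
            continue
        gid = gene_id_re.search(f[8]).group(1)
        # GTF is 1-based inclusive; BED is 0-based half-open
        gene_exons[gid].append((f[0], int(f[3]) - 1, int(f[4]), f[6]))

with open("gencode.v39.gene_union.bed12", "w") as out:
    for gid, exons in gene_exons.items():
        chrom, strand = exons[0][0], exons[0][3]
        # merge overlapping/adjacent exon intervals into non-overlapping blocks
        merged = []
        for _, s, e, _ in sorted(exons, key=lambda x: x[1]):
            if merged and s <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        start, end = merged[0][0], merged[-1][1]
        sizes = ",".join(str(e - s) for s, e in merged)
        starts = ",".join(str(s - start) for s, _ in merged)
        out.write("\t".join(map(str, [chrom, start, end, gid, 0, strand,
                                      start, end, 0, len(merged),
                                      sizes, starts])) + "\n")
```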
I tried this with one sample but, as you can see from the image below, the 'GENE' version of the BED (x axis, ENSG) reports lower TIN values per gene than the 'TRANSCRIPT' version (median TIN across all transcripts, y axis, ENST). It is possible that merging all transcripts yields intervals that are too wide for TIN calculation, leading to TIN underestimation; in that scenario, the median TIN would buffer the effect of lowly expressed isoforms.