Hello,
I am a complete beginner in the world of RNA-seq and I am trying to do a de novo isoform analysis of a particular gene using TCGA data. So far, I have managed to download the exon_quantification data and, using hg19, I worked out the locus of each exon for this gene. So now I have the read counts for each exon. My data looks like this:
raw_counts median_length_normalized RPKM
56 0.9254658 2.097887247
43 0.8739496 2.174684905
82 1 4.927216087
341 0.9669247 1.455338147
48 1 3.003161125
52 1 4.383085855
109 1 5.512573364
119 1 4.570871422
158 1 4.942702685
621 0.571371 0.759681418
And there are about 550 patients. I do not really know what I should do with these. Should I focus on raw_counts or RPKM? How can I figure out whether this gene is spliced and, if so, where exactly, and how do I quantify it? What does median_length_normalized tell me?
Thank you very much for your help.
Hi, is your splice isoform of interest a known isoform? If yes, then you could look into the rsem.isoform.normalized files and check for your isoform of interest there. If your isoform of interest is a novel one, then I am not sure you would get much use out of the TCGA Level 3 data (which is what you are looking at now). Some details about the different files available are here.
You could still try your luck with the junction_quantification files.
I am not sure, though, whether the parameters provided to RSEM (the program used) allowed it to output novel junctions as well. If not, then you might want to get access to the BAM files from TCGA (through a license).
If you are going to compare values across samples, then you should use RPKM rather than raw counts, because RPKM corrects for exon length and sequencing depth. Also, be aware that there might be batch effects across samples (for example, samples processed on different dates), so you might want to do the batch effect removal yourself or use the data from here.
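To make the raw_counts vs. RPKM point concrete, here is a minimal sketch using the standard RPKM definition (reads per kilobase of feature per million mapped reads). The function name and the "exon N" labels are my own, and I am assuming the rows of the table in the question are exons 1 to 10 in order. Within one sample, raw_counts / RPKM is proportional to exon length, so the ratio hints at which exons are simply long rather than more highly expressed.

```python
def rpkm(raw_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return raw_count * 1e9 / (feature_length_bp * total_mapped_reads)


# (raw_counts, RPKM) for three rows copied from the table in the question.
# The "exon N" labels assume the rows are listed in exon order.
exons = {
    "exon 1": (56, 2.097887247),
    "exon 4": (341, 1.455338147),
    "exon 10": (621, 0.759681418),
}

# Within a single sample the library size is a constant, so
# raw_counts / RPKM is proportional to the exon length.
relative_length = {name: raw / value for name, (raw, value) in exons.items()}

# Print each exon's length relative to exon 1.
for name, rel in relative_length.items():
    print(name, round(rel / relative_length["exon 1"], 1))
```

If the ratio is much larger for the rows with the highest raw counts, those counts are high mostly because the exons are long, which is one reason raw counts alone do not point to a different isoform.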
Thank you very much for your help. I also downloaded the rsem.isoform.results file, which gives me the isoform ID, the raw_count and the scaled_estimate. I guess I need to use the scaled_estimate to compare across samples? I also did not understand why I cannot use the exon_quantification Level 3 data for a novel isoform. According to your link, this file has the counts mapped to specific exons, so can't we just use those? For instance, in my example the raw counts for exons 4 and 10 are a lot higher. Does that mean they are transcribed as part of another isoform? Sorry if this sounds very naive, but I am completely lost. I will download the junction_quantification file too and see if I can find a way to use it. Thanks again for your help.
The reason I said that the Level 3 data might not be useful is that some of the transcript assemblers I have used, like Cufflinks or StringTie, have an option to return expression levels for known transcript isoforms only. You can of course turn it off and ask for both known and novel isoforms. For the TCGA RNA-seq Level 3 data, RSEM was used, and I do not know how it was run, i.e. whether it was asked to return values for novel isoforms or not.
Again, I cannot say anything helpful about how to interpret the exon_quantification results, as I haven't read much about the RSEM methodology. I have only used the gene-level or isoform-level data. The one thing I can think of is this: does your isoform have a unique exon-exon junction, one that is not present in any other isoform from that locus? If yes, then you can look for that junction in the junction_quantification file. That would be the lowest-hanging fruit you can aim for, I guess.
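To make the junction lookup concrete, here is a minimal sketch. It assumes the junction_quantification file is tab-delimited with `junction` and `raw_counts` columns, and that junction IDs look like `chr1:12227:+,chr1:12595:+` (donor position, then acceptor position). Please check a real file before relying on that. The coordinates, counts and function names below are made up for illustration.

```python
import csv
import io


def read_junction_counts(handle):
    """Map each junction ID to its raw count (assumed column names)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["junction"]: float(row["raw_counts"]) for row in reader}


def junction_key(chrom, donor, acceptor, strand):
    """Build a junction ID in the assumed 'chr:pos:strand,chr:pos:strand' form."""
    return f"{chrom}:{donor}:{strand},{chrom}:{acceptor}:{strand}"


# Stand-in for one patient's file; with real data, pass open(path) instead.
example = io.StringIO(
    "junction\traw_counts\n"
    "chr1:12227:+,chr1:12595:+\t0\n"
    "chr1:12721:+,chr1:13225:+\t7\n"
)

counts = read_junction_counts(example)

# Hypothetical junction that is unique to the isoform of interest.
key = junction_key("chr1", 12721, 13225, "+")
hits = counts.get(key, 0.0)  # 0.0 if the junction is not listed at all
print(key, hits)
```

Running this over all patients gives one count per sample for the isoform-specific junction. A junction that is absent from every file may simply not have been reported, so a zero here is not proof that the isoform does not exist.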
Thank you for your input. Let me try it and see how it goes. I will post an update here for everyone else.