Hello Members,
I am a complete newbie when it comes to interpreting RNA-Seq information (I'm actually studying computer science), but I have an interest in learning more about scientific research to see if this is a career I'd like to pursue. For this reason, I decided to do a summer internship in a biology lab to learn more about wet-lab techniques and possibly use my computer knowledge to help the lab I'm in.
Currently I'm running into some problems that I'm sure someone more familiar with this data can easily point out. I have been able to download an archive for the disease of interest, and since I'm interested in the RNASeqV2 information to determine expression levels, I retrieved that data. After extracting the archive I have 28 samples, each with the following types of files:
.junction_quantification.txt
.rsem.genes.results
.rsem.isoforms.results
.rsem.genes.normalized_results
.rsem.isoforms.normalized_results
.bt.exon_quantification.txt
I have been told that the RNA expression is the most important, so I've been focusing on interpreting the .rsem.genes.results file and the .rsem.genes.normalized_results file, but have been having difficulty. The first file is composed of 4 columns labeled gene_id, raw_count, scaled_estimate, and transcript_id.
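For reference, here is a minimal sketch of how one of these files could be read in Python. It assumes the file is tab-delimited with the four column names listed above, and that transcript_id holds a comma-separated list; the toy rows and gene names below are made up for illustration and are not real TCGA values.

```python
import csv
import io

# Toy stand-in for one *.rsem.genes.results file.
# ASSUMPTION: tab-delimited, header row as described in the post.
# All values and identifiers below are invented for illustration.
TOY_GENES_RESULTS = (
    "gene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "GENE_A|1\t1200.00\t2.5e-05\ttx1,tx2\n"
    "GENE_B|2\t0.00\t0\ttx3\n"
    "GENE_C|3\t350.50\t8.0e-06\ttx4\n"
)


def read_genes_results(handle):
    """Parse a tab-delimited genes.results table into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "gene_id": row["gene_id"],
            "raw_count": float(row["raw_count"]),
            "scaled_estimate": float(row["scaled_estimate"]),
            # ASSUMPTION: several transcripts are separated by commas.
            "transcript_id": row["transcript_id"].split(","),
        })
    return rows


rows = read_genes_results(io.StringIO(TOY_GENES_RESULTS))
for r in rows:
    print(r["gene_id"], r["raw_count"], r["scaled_estimate"], len(r["transcript_id"]))
```

To read a real file, the io.StringIO object would be replaced by an opened file handle, e.g. open(path, newline="").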
So I guess my first question is: what is meant by the raw_count and scaled_estimate columns?
The second file (i.e. the .rsem.genes.normalized_results file) has only 2 columns, labeled gene_id and normalized_count. What is meant by normalized count?
Also, the people I work with have told me that having normal cells to act as a control versus the cancer cells is important. Does the normalized results file include this information?
Any information you guys can give me would be greatly appreciated.
Thanks for the information, man! This really helps. When you say that it's better to look at between-tumor differences, how would this information be useful? I've looked around the forum for a while and see that a lack of controls seems to be a problem many are facing (Tcga Lack Of Controls - Workarounds?). Have you discovered a workaround that involves looking at between-tumor differences?
I can't necessarily say it's better to look at between-tumor differences, just that it's probably all you can do. But just as you might look for clues about what is going on in a tumor by comparing it with adjacent normal tissue, you can do the same by looking at what differentiates a given tumor (or set of tumors) from other tumors. This is basically how all gene expression studies are performed, as adjacent normal tissue is hard to come by and, in the case of my field (BLCA), rarely normal. Also, if you have a set of tumors representing all stages and grades of disease, you will have a fairly broad biological spectrum that you can use to tease out the most important themes in the data.
The link you provided discusses methylation data which I feel is more of my home turf. Whether or not you need normals here again depends on the question you want to ask. I'm generally interested in characterizing what differentiates one group of tumors from another, and for that I don't necessarily need normals.
So, adjust the question you want to ask to what your data can answer.
Makes sense, thanks again!
@Mattias Aine, do you have any new insight into why the sum of fractions is not one, and why a sum <0.8 is not uncommon? I have made the same observation.
I found that genes don't include all isoforms, which explains why isoform-level scaled estimates sum to 1 while gene-level scaled estimates don't. Details are in https://gitlab.com/zyxue/understanding-firebrowse-data-format/blob/master/confirm-relationship-between-gene-level-and-isoform-level-scaled-estimates.ipynb
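A rough sketch of that check, using made-up toy numbers rather than real TCGA values. It assumes that a gene's scaled estimate is the sum of the scaled estimates of the isoforms assigned to it; the names isoform_scaled and gene_to_isoforms are hypothetical.

```python
import math

# Toy isoform-level scaled estimates (invented; chosen to sum to 1).
isoform_scaled = {"tx1": 0.30, "tx2": 0.20, "tx3": 0.25, "tx4": 0.15, "tx5": 0.10}

# Toy gene -> isoform membership. tx4 and tx5 belong to no listed gene,
# mimicking isoforms that are absent from the gene-level file.
gene_to_isoforms = {"GENE_A": ["tx1", "tx2"], "GENE_B": ["tx3"]}


def gene_level_estimates(membership, isoform_values):
    """ASSUMPTION: gene-level estimate = sum over the gene's isoforms."""
    return {
        gene: sum(isoform_values[tx] for tx in txs)
        for gene, txs in membership.items()
    }


gene_scaled = gene_level_estimates(gene_to_isoforms, isoform_scaled)

isoform_total = sum(isoform_scaled.values())
gene_total = sum(gene_scaled.values())

# Isoforms not covered by any gene account for the missing fraction.
covered = {tx for txs in gene_to_isoforms.values() for tx in txs}
missing_total = sum(v for tx, v in isoform_scaled.items() if tx not in covered)

print("isoform-level sum:", round(isoform_total, 6))
print("gene-level sum:", round(gene_total, 6))
print("fraction in isoforms without a gene:", round(missing_total, 6))
```

On real data, the same comparison would use the scaled_estimate column of the isoform-level and gene-level files of one sample, with the gene-to-isoform membership taken from the transcript_id column.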
The link is private!
Sorry, I just made it public.
Hello Mr Aine,
I am doing prognostic analysis, treating TF activity as a two-gene ratio across samples. This works well on probeset datasets (hgu133a, hgu133plus2). But when I use the normalized count (RSEM) from TCGA, I found that the ratio strategy doesn't work anymore, while the direct difference (geneAcount - geneBcount) gives good results. Do you think it's feasible to do so?
Thank you!
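To make the two scores in the question above concrete, here is a small sketch that computes both per sample. The sample names and counts are invented toy values, and the pseudocount added before dividing is an assumption introduced here only to avoid division by zero; it is not part of the original question.

```python
# Toy normalized counts per sample (invented values, not TCGA data).
samples = {
    "S1": {"geneA": 800.0, "geneB": 200.0},
    "S2": {"geneA": 150.0, "geneB": 300.0},
    "S3": {"geneA": 50.0, "geneB": 0.0},
}


def ratio_score(a, b, pseudocount=1.0):
    """Two-gene ratio; the pseudocount (an assumption) guards against b == 0."""
    return (a + pseudocount) / (b + pseudocount)


def difference_score(a, b):
    """Direct difference, geneAcount - geneBcount."""
    return a - b


scores = {
    name: {
        "ratio": ratio_score(c["geneA"], c["geneB"]),
        "difference": difference_score(c["geneA"], c["geneB"]),
    }
    for name, c in samples.items()
}

for name, s in scores.items():
    print(name, s["ratio"], s["difference"])
```

The ratio is scale-free while the difference grows with the overall count level, so the two scores can rank the same samples differently.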