2.3 years ago
qiz218591
I am new to TCGA data analysis and want to run a differential gene expression analysis on BRCA samples. Is the data on the harmonized TCGA GDC portal (protein profiling data and transcriptome sequencing data) already normalized or not? I have checked multiple sites; some say it is, others say it is not.
RNA-Seq data (FPKM, TPM) are already normalized for library size, but for careful analysis, cross-sample normalization with external tools such as edgeR or DESeq is normally desired.
How can I validate that the data are already normalized? Is there any code or method to check this? For example, you mentioned library size, so I would like to know how to validate that.
Not sure what you mean by "validate". FPKM and TPM are standard normalization methods that have been around for more than 10 years, and both divide by a per-sample total (the library size), if that is what you are asking.
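If "validate" means checking what kind of values a downloaded column holds, a rough sanity check can be sketched as below. This is only an illustration in plain Python: the function names (`fpkm`, `tpm`, `looks_like_tpm`, `looks_like_raw_counts`) and the toy numbers are made up for this example and are not part of any GDC or Bioconductor tool. It assumes the usual textbook definitions of FPKM and TPM.

```python
def fpkm(counts, lengths_bp):
    """FPKM: fragments per kilobase of gene per million mapped fragments."""
    total = sum(counts)  # library size of this sample
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM: length-normalize first, then scale so the sample sums to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [r / denom * 1e6 for r in rates]


def looks_like_tpm(values, tol=1e-3):
    """A full TPM column should sum to about one million."""
    return abs(sum(values) - 1e6) <= tol * 1e6


def looks_like_raw_counts(values):
    """Raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for v in values)


# Toy sample (hypothetical numbers): three genes, counts and lengths in bp.
counts = [100, 200, 700]
lengths = [1000, 2000, 1000]

print(looks_like_raw_counts(counts))        # raw counts pass the integer check
print(looks_like_tpm(tpm(counts, lengths))) # TPM column sums to ~1e6
print(fpkm(counts, lengths))
```

Note that the sum-to-a-million check only holds for a complete TPM column; if you have already filtered genes (for example, kept only protein-coding genes), the sum will be lower.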
For RNA-seq, the GDC portal gives you the raw counts as well as the normalized counts (upper-quartile normalized FPKM [FPKM-UQ]).
Normalization is a tricky subject because there are literally an infinite number of ways to normalize, and yes, some methods work much better than others (and different normalization methods are suitable for different types of analyses). So "normalized counts" doesn't really mean much unless you tell us what kind of normalization you are looking for.
I have started using the edgeR package, and later I am thinking of running each differential expression package to normalize my data and see how the results vary. I am also confused about how to follow everything. I need to understand the exact normalization procedure to follow in order to get properly normalized results, and I have not been able to find an article that lays this out. As you mentioned that some methods work better than others, I would like to know which normalization methods those are. I went through the manuals of several Bioconductor packages, but it is hard to tell from the results whether normalization has been performed or not.
So, all in all, what I am looking for at present is normalization of RNA-seq raw counts with the edgeR package.
Then, in answer to your question, GDC TCGA does NOT give you edgeR-normalized counts.
You download the counts from GDC, load them into edgeR, and from there you can get edgeR-normalized counts.
Thank you so much, I got it.
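To give a feel for what "normalized counts" means at that stage, here is a deliberately simplified sketch in plain Python of a trimmed-mean-of-log-ratios scaling factor followed by counts per million. It is not edgeR's implementation: edgeR's TMM also trims on average expression, uses precision weights, and picks its reference sample differently. The function names and the toy matrix are hypothetical, and the first sample is simply assumed to be the reference.

```python
import math


def simple_tmm_factors(samples, trim=0.3):
    """Simplified scaling factors: trimmed mean of log2 ratios vs. the first sample.

    samples: list of per-sample count lists, all in the same gene order.
    """
    lib_sizes = [sum(s) for s in samples]
    ref, ref_size = samples[0], lib_sizes[0]
    factors = []
    for s, size in zip(samples, lib_sizes):
        # Log2 ratio of library-size-scaled counts, skipping zero counts.
        m_vals = sorted(
            math.log2((c / size) / (r / ref_size))
            for c, r in zip(s, ref)
            if c > 0 and r > 0
        )
        k = int(len(m_vals) * trim)
        kept = m_vals[k:len(m_vals) - k] or m_vals  # fall back if trimming empties it
        factors.append(2 ** (sum(kept) / len(kept)) if kept else 1.0)
    # Rescale so the factors multiply to 1, as edgeR does by convention.
    geo_mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
    return [f / geo_mean for f in factors]


def cpm(samples, factors):
    """Counts per million using effective library size = library size * factor."""
    out = []
    for s, f in zip(samples, factors):
        eff = sum(s) * f
        out.append([c / eff * 1e6 for c in s])
    return out


# Toy matrix (hypothetical): sample 2 is sample 1 sequenced twice as deep.
samples = [[10, 20, 30, 40], [20, 40, 60, 80]]
factors = simple_tmm_factors(samples)
normalized = cpm(samples, factors)
print(factors)
print(normalized)
```

In the toy matrix the two samples differ only in sequencing depth, so the factors come out as 1 and the CPM values agree between samples. In edgeR itself the corresponding steps are building a DGEList, calling calcNormFactors, and then cpm.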