Hi there, I want to use the gene expression data from GDC/TCGA for further analysis, e.g. clustering across multiple cancer types. GDC offers the gene expression data in three versions: counts, FPKM, and FPKM-UQ. I am aware that gene expression data sets need to undergo some preprocessing steps, e.g. filtering for outliers and normalization. I am a newbie to this kind of data processing and analysis.
Now my questions:

- What is the state-of-the-art preprocessing pipeline for the raw counts? I have found so many sources in my online search that I am totally unsure what best practice is.
- Does the data in FPKM-UQ format require any preprocessing, and should I use it at all? The official documentation page makes it sound as if the data have already been normalized with cross-sample comparison in mind, but I have not found any workflow or similar working with this kind of data yet. Also, I most often read that filtering should be done before normalization, which would not be possible here.
Any suggestion or help would be highly appreciated!
First of all you have to decide which files are best suited for your analysis, and then you can do a quality control analysis to see whether further processing is required for your approach. Once you have selected your files, you have to check whether your downstream analysis requires its own internal normalization. For example, I used the HTSeq count files for a DESeq analysis: I first did QC and then let the program perform the normalization itself. I hope this helps.
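To make that concrete, here is a minimal sketch of that kind of DESeq2 workflow on raw counts. The names `count_matrix`, `sample_info` and the `condition` column are placeholders for your own count table and sample annotation, and the filtering threshold is just an example:

```r
## Minimal DESeq2 sketch: QC-style pre-filtering, then the tool's own
## internal (median-of-ratios) normalization.
library(DESeq2)

## count_matrix: genes x samples matrix of raw integer counts (placeholder)
## sample_info:  data.frame with one row per sample, e.g. a "condition" column (placeholder)
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData   = sample_info,
                              design    = ~ condition)

## Drop genes with almost no counts across all samples
dds <- dds[rowSums(counts(dds)) >= 10, ]

## Run DESeq2; size factors are estimated internally
dds <- DESeq(dds)
norm_counts <- counts(dds, normalized = TRUE)

## For clustering or PCA, a variance-stabilizing transformation is usually
## preferred over the normalized counts themselves
vsd <- vst(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")  # quick QC: do samples group as expected?
```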
Hi Lila, thanks for your reply (it seems I do not get notifications about comments...). In the end I decided to use the raw counts and have now built something I can call a preprocessing pipeline by adapting some of the sample workflows I found on Bioconductor. What I understand now is that there seems to be no single state of the art at all, only common practices tied to whichever tool you use. Unfortunately, this makes it rather hard for newbies to decide on a preprocessing strategy.
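In case it helps anyone finding this later: a typical filter-then-normalize sketch of the kind those workflows describe, using edgeR here purely as an illustration (`count_matrix` and `group` are placeholders for your own count table and sample grouping):

```r
## One common-practice pattern: filter lowly expressed genes first, then
## normalize, then work with log-CPM values for clustering.
library(edgeR)

y <- DGEList(counts = count_matrix, group = group)  # placeholders for your data

## Remove genes with too few reads to be informative
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

## TMM normalization, then log-CPM values for downstream clustering
y <- calcNormFactors(y, method = "TMM")
logcpm <- cpm(y, log = TRUE, prior.count = 2)
```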