I'm fairly new to bioinformatics so please excuse my basic questions...
I am trying to analyse RNA-Seq data from TCGA and I came across this tutorial, "Survival analysis of TCGA patients integrating gene expression (RNASeq) data" ... It advises removing genes whose expression is 0 in more than 50% of the samples... Since this data has already undergone some level of preprocessing, I was wondering whether an expression level of 0 means that the gene has no expression and so should not be considered in further analysis, or whether it means that the expression level was so low that it was set to 0?
Thanks!
Just some general advice on TCGA data: I would recommend analysing the data from scratch, starting from FASTQ files (which you can regenerate from the BAMs using Picard Tools), rather than relying on the provided BAM files as they are. This is the only way to be sure that proper QC has been performed.
The BAM and FASTQ files are access controlled, of course. I recently re-analysed TCGA RNA-seq data, but starting from the HTSeq raw counts (open access). I never use pre-normalised counts from Broad, TCGAbiolinks, or other sources.
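For the filtering rule mentioned in the question (dropping genes whose expression is 0 in more than 50% of the samples), here is a minimal sketch in plain Python, assuming the raw counts have already been loaded into a dictionary mapping gene to a list of per-sample counts. The function name `filter_mostly_zero_genes`, the `max_zero_fraction` parameter, and the toy gene names are all made up for illustration; they are not from the tutorial or from any TCGA tool.

```python
def filter_mostly_zero_genes(counts, max_zero_fraction=0.5):
    """Keep genes whose fraction of zero-count samples is at most max_zero_fraction.

    counts: dict mapping gene name -> list of raw counts, one per sample
            (hypothetical layout, chosen only for this sketch).
    """
    kept = {}
    for gene, values in counts.items():
        if not values:
            # No samples for this gene: nothing to evaluate, so skip it.
            continue
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        # "More than 50% zeros" is removed; exactly 50% is kept.
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept


# Toy data with four samples per gene (invented values, not real TCGA counts).
toy_counts = {
    "GENE_A": [0, 0, 0, 5],   # 75% zeros -> removed
    "GENE_B": [12, 0, 7, 3],  # 25% zeros -> kept
    "GENE_C": [0, 0, 4, 9],   # 50% zeros -> kept (not more than 50%)
}

filtered = filter_mostly_zero_genes(toy_counts)
print(sorted(filtered))
```

Note that this only counts exact zeros. Whether a 0 reflects true absence of expression or just too few reads to detect it cannot be told from the count alone, which is one reason to apply the filter to raw counts rather than to values that have already been normalised.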