I was comparing gene expression in TCGA data obtained from microarrays and from RNA sequencing. I downloaded normalized data from firebrowse, and when comparing the gene names across the two platforms I noticed that nearly 1000 genes are present in the microarray data but not in the RNA-seq data. This contradicts my understanding of RNA sequencing, because I thought RNA sequencing should give us the whole transcriptome! So why are there genes that are not present in the TCGA normalized RNA-seq data?
(Sorry, I am a beginner in bioinformatics!)
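For anyone wanting to reproduce the comparison, here is a minimal sketch of the gene-name check described above, written as a plain set difference. The function name and the toy gene lists are made up for illustration; they are not taken from the firebrowse files, and loading the real gene names from the downloaded tables is left out.

```python
def genes_missing_from(reference_genes, other_genes):
    """Return the genes in reference_genes that are absent from other_genes, sorted."""
    return sorted(set(reference_genes) - set(other_genes))


# Toy gene lists for illustration only (not real TCGA content).
microarray_genes = ["TP53", "BRCA1", "EGFR", "MYC"]
rnaseq_genes = ["TP53", "EGFR", "KRAS"]

# Genes reported by the microarray platform but not by the RNA-seq platform.
only_in_microarray = genes_missing_from(microarray_genes, rnaseq_genes)
print(only_in_microarray)  # ['BRCA1', 'MYC']
```

Listing the missing names, rather than only counting them, makes it easier to see whether they share something in common.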
Is the data prefiltered, e.g. with all genes expressed below 1 FPKM removed? To what depth was the data sequenced? Perhaps the genes you are missing are just very lowly expressed.
Or is the RNA-seq polyA enriched and the microarray isn't, and you're looking at rRNA genes?
Just a few thoughts...
Thanks. I think the data is not filtered, because many genes have 0 reads across many samples, and there are exactly 20532 genes in every cohort.
I will take a look at the preparation methods, but I think both platforms use polyA enrichment and measure mRNA levels.
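The prefiltering check discussed above could be sketched roughly like this. It assumes the expression table has already been read into a dictionary mapping each gene to its per-sample values; the function name, the toy values, and the default threshold of 1 (echoing the 1 FPKM suggestion, in whatever unit the file actually uses) are assumptions for illustration.

```python
def summarize_low_expression(expression, threshold=1.0):
    """Summarize genes that a minimum-expression filter would have removed.

    expression: dict mapping gene name -> list of per-sample values.
    threshold: hypothetical cutoff; a gene is 'below' if every sample is under it.
    """
    all_zero = sorted(
        gene for gene, values in expression.items() if all(v == 0 for v in values)
    )
    below = sorted(
        gene for gene, values in expression.items() if max(values) < threshold
    )
    return {"n_genes": len(expression), "all_zero": all_zero, "below_threshold": below}


# Toy values for illustration only (not real TCGA measurements).
toy_expression = {
    "GENE_A": [0.0, 0.0, 0.0],
    "GENE_B": [0.2, 0.0, 0.5],
    "GENE_C": [12.1, 8.4, 30.0],
}

summary = summarize_low_expression(toy_expression)
print(summary["n_genes"])          # 3
print(summary["all_zero"])         # ['GENE_A']
print(summary["below_threshold"])  # ['GENE_A', 'GENE_B']
```

If genes that are zero or below the cutoff in every sample are still present in the table, an expression filter at that cutoff is unlikely to explain the missing genes.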