I couldn't find proper documentation of the software used to generate the read counts in the TCGA level 3 data.
I ran a 21 normal vs. 21 tumor sample analysis on TCGA RNASeq level 3 data to find differentially expressed genes using DESeq.
I then took the Illumina Body Map SRA file, aligned it with TopHat, and generated counts with HTSeq. These HTSeq read counts from the TopHat BAM files were compared with the 21 tumor samples from the TCGA level 3 data.
I expected the differentially expressed genes from the two DESeq comparisons, "Illumina Body Map vs. 21 tumor samples" and "21 normal vs. 21 tumor samples from TCGA", to overlap well. But very few genes overlap.
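To put a number on "good overlap", the two DESeq result lists can be compared as sets. This is only a minimal sketch: the function name `overlap_summary` and the gene lists are hypothetical, and it assumes both lists already use the same gene identifier scheme (mismatched identifiers between annotations would by themselves shrink the overlap).

```python
def overlap_summary(genes_a, genes_b):
    """Summarise the overlap between two collections of gene identifiers."""
    set_a, set_b = set(genes_a), set(genes_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "n_a": len(set_a),
        "n_b": len(set_b),
        "n_shared": len(shared),
        # Jaccard index: shared genes divided by all genes seen in either list
        "jaccard": len(shared) / len(union) if union else 0.0,
        "shared": sorted(shared),
    }


# Hypothetical gene lists, for illustration only (not real results)
tcga_hits = ["TP53", "MYC", "EGFR", "KRAS"]
bodymap_hits = ["MYC", "KRAS", "GAPDH"]

summary = overlap_summary(tcga_hits, bodymap_hits)
print(summary["n_shared"], summary["shared"])
```

A low Jaccard index alone does not say which pipeline is at fault; it only confirms that the two comparisons disagree.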
Does this mean there is something wrong with my processing of the Illumina Body Map file, or is it due to differences in the protocol followed for the TCGA data?
Could anyone tell me how the read counts in the TCGA level 3 data are generated, and with which program?
I'm 99% certain that they use RSEM (after mapsplice, I think), though I don't know the version numbers or any options that they specify. I imagine that this could give rather divergent results from tophat2 -> htseq-count...there'd at least be a batch effect.
I downloaded TCGA RNASeqV1, which uses RPKM instead of RSEM; I think RNASeqV2 uses RSEM? And if it is a batch effect, what can be done to get rid of it?
And as far as I know, TCGA RNASeqV1 uses the Tuxedo pipeline (TopHat2 + Cufflinks + Cuffdiff) for the analysis. But if that is the case, how do they generate raw counts?
Ah, yeah, V1 data is different and I don't know offhand how that was made. If they did use cufflinks then it's unlikely that they used raw counts at any point (though you can use the merged GTF file with htseq-count to get them).
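For reference, RPKM is a length- and depth-normalized value rather than a raw count, which is why the distinction matters for count-based tools such as DESeq. A minimal sketch of the standard RPKM definition follows; the function name and the numbers are made up for illustration and say nothing about how TCGA actually computed its values.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical values: 500 reads on a 2 kb gene in a 25M-read library
example_rpkm = rpkm(500, 2000, 25_000_000)
print(example_rpkm)
```

Because both the gene length and the library size are divided out, the original count cannot be recovered from an RPKM value unless both of those quantities are also known.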
Comparison across data sets generated by different groups is not something that you should expect to work well. To make matters worse, the data processing for the different sets appears to be quite different. A lot of folks seem to make the assumption that since "it is all RNA-seq", it should be possible to make comparisons between any two datasets. Unfortunately, that is generally not true. The same problems exist as for microarrays. Batch effect is something that can be minimized, but not ignored.
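One way to see why the cross-dataset comparison is harder than the within-TCGA one is to tabulate condition against data source. The sketch below is an illustration under stated assumptions, not anyone's actual pipeline: the function names are hypothetical, and the Body Map sample count of 2 is a placeholder since the thread does not give it. When every batch contains only one condition, batch and condition are perfectly confounded, and no statistical model can separate the two effects.

```python
from collections import defaultdict


def batch_condition_table(samples):
    """Count samples per (batch, condition) from (condition, batch) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for condition, batch in samples:
        table[batch][condition] += 1
    return {batch: dict(conds) for batch, conds in table.items()}


def fully_confounded(samples):
    """True if there are several batches and each holds a single condition."""
    table = batch_condition_table(samples)
    return len(table) > 1 and all(len(conds) == 1 for conds in table.values())


# Within-TCGA design: both conditions come from the same source
tcga_only = [("normal", "TCGA")] * 21 + [("tumor", "TCGA")] * 21

# Cross-dataset design: normals from Body Map (placeholder count), tumors from TCGA
mixed = [("normal", "BodyMap")] * 2 + [("tumor", "TCGA")] * 21

print(fully_confounded(tcga_only))
print(fully_confounded(mixed))
```

In the mixed design, any difference between the pipelines or labs shows up as an apparent normal-vs-tumor difference, which is consistent with the poor overlap described in the question.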
If I could up-vote this more than once I would. It's amazing how many people try to ignore the simple truth that, "Batch effect is something that can be minimized, but not ignored."
I totally agree. As I mentioned below, we re-processed all of the RNA-Seq data from multiple datasets as part of our pipeline, in order to minimize the batch effect. You still have differences but at least it can be minimized. TCGA was especially messy due to different aligners used, different genomes used, etc.
How did you solve your problem in the end? I just want to combine TCGA level 3 data with my own RNA-seq htseq-count data to find differentially expressed genes. But I'm not sure which parameters to use when rerunning my raw RNA-seq data through MapSplice and RSEM so that it is consistent with the TCGA level 3 data.