I would like to find the CCLE RNA expression file that has either effective gene lengths or FPKM/RPKM values (computed from RSEM expected counts), so that we can do our own upper quartile normalization of CCLE gene expression. I don’t like the way the protein-coding TPM files have been generated: the values for the protein-coding subset appear to have been extracted as is from the larger TPM files covering 53,000+ analytes. The RSEM output should first have been filtered to protein-coding genes only, and TPM should then have been recalculated on that subset. That would give a different result, in which the protein-coding TPMs in each sample add up to the same total of 1 million. To me it looks like this may not have been done properly.

Therefore, I would like to perform my own normalization using only protein-coding genes. I can see a gene count file and an RPKM file under CCLE 2019, but the gene counts are not RSEM expected counts (I think they are raw counts), and it is unclear whether RPKM was calculated with effective or constant gene lengths, and whether it was derived from RSEM output or just from the gene counts (i.e., raw counts) file.
The original FASTQ data are available: https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=1&WebEnv=MCID_64d12e6c7645fa11d55e0f0f&o=acc_s%3Aa
Reprocessing it would be a significant amount of work (it looks like about 32 TB of raw data), but you could then generate counts or expression values in whatever format you need.
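
For reference, here is a minimal sketch of the protein-coding renormalization described in the question, followed by an upper quartile scaling step. It relies on the fact that TPM is proportional to count divided by effective length, so rescaling the protein-coding rows of an existing TPM matrix to sum to 1 million per sample should give the same values as filtering first and recomputing TPM, provided the TPM values and the filter come from the same quantification run. The function names, the toy matrix, the gene mask, and the target value of 1000 are all illustrative assumptions, not part of any CCLE pipeline.

```python
import numpy as np


def renormalize_tpm(tpm, is_protein_coding):
    """Keep only protein-coding rows and rescale each sample (column) to sum to 1e6.

    tpm: genes x samples array of TPM values (hypothetical input layout).
    is_protein_coding: boolean mask over genes (hypothetical annotation).
    """
    tpm = np.asarray(tpm, dtype=float)
    mask = np.asarray(is_protein_coding, dtype=bool)
    subset = tpm[mask]
    return subset / subset.sum(axis=0, keepdims=True) * 1e6


def upper_quartile_normalize(expr, target=1000.0):
    """Scale each sample so the 75th percentile of its nonzero values equals target.

    The target of 1000 is an arbitrary choice for this sketch.
    A sample with no nonzero values would raise an error here.
    """
    expr = np.asarray(expr, dtype=float)
    uq = np.array([np.percentile(col[col > 0], 75) for col in expr.T])
    return expr / uq * target


# Toy data: 4 genes x 2 samples; the third gene is treated as non-coding.
toy_tpm = [
    [500000, 200000],
    [300000, 300000],
    [100000, 400000],
    [100000, 100000],
]
toy_mask = [True, True, False, True]

pc_tpm = renormalize_tpm(toy_tpm, toy_mask)
pc_uq = upper_quartile_normalize(pc_tpm)
print(pc_tpm.sum(axis=0))  # per-sample totals after renormalization
```

Upper quartile scaling on its own is invariant to the per-sample rescaling, so if that is the only goal, applying it directly to the protein-coding rows of the published TPM file should give the same result as renormalizing first.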