I downloaded a bulk RNA-seq expression table for this article; all of the data came from here.
In supplementary_table_4, the first sheet is a READ ME stating that sheet 8 contains "Raw RNA-seq log2(TPM+1) values for all 261 samples without batch-effects correction". That's exactly what I need: TPM-normalized values.
I downloaded this sheet and undid the log2 transformation with df = 2^df - 1, but the column sums are nowhere near 1 million, which suggests the data is not actually TPM normalized.
The column (per-sample) sums I'm getting are around 200,000 to 240,000. I have done nothing to the data other than selecting the samples I want, that's it. How can they label this log2(TPM+1) (raw, with no batch correction) when the column sums, after undoing the log2 transformation, are so far from 1 million?
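For reference, this is the sanity check being described: in a genuine TPM matrix, undoing log2(TPM+1) should give columns that each sum to roughly 1 million. A minimal sketch with made-up numbers (a real table would be read from the supplementary Excel file, e.g. with pandas.read_excel):

```python
import math

# toy "true" TPM values for 4 genes in one sample (sum to exactly 1e6);
# these numbers are hypothetical, purely for illustration
true_tpm = [500_000.0, 300_000.0, 150_000.0, 50_000.0]

# what the authors would publish: log2(TPM + 1)
published = [math.log2(t + 1) for t in true_tpm]

# undo the transform: TPM = 2**x - 1
recovered = [2**x - 1 for x in published]

# for a genuine TPM matrix this sum should be ~1_000_000 per sample
print(sum(recovered))
```

If the recovered column sums are far below 1e6, something was done to the matrix after TPM normalization.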
What answer do you expect? With public data you're bound to whatever the authors uploaded, which can be helpful or utter nonsense. If it doesn't look reliable, don't use it: obtain raw counts instead, either by processing the raw data yourself or by getting a count matrix via a package such as recount or a website such as BioJupies.
Perhaps someone is familiar with this paper? Maybe someone has faced a situation like this before; I really don't know. I've read the paper (most of it) and still can't make sense of this TPM normalization.
Agreed. Unfortunately, in many clinical datasets the raw reads are not readily available due to privacy issues; not sure whether that's the case here. If so, hopefully the raw counts (the ones generated by STAR) are uploaded somewhere.
Otherwise, I'd just recommend emailing the corresponding author (and the first author, if possible).
The table contains fewer than 20k genes, so they removed genes based on some filter. Filtering genes out of a matrix after TPM normalization breaks the per-sample sum of 1 million, which would explain what you're seeing. They reanalyzed public data anyway, so why not merge the tables yourself?
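To illustrate why gene filtering would produce exactly this symptom, here is a toy sketch (the gene values and the filter are hypothetical, not taken from the paper): dropping rows from a TPM column lowers its sum below 1e6, and rescaling can restore a within-sample sum of 1 million, though the result is then TPM only over the retained gene set.

```python
# toy TPM column for 6 genes, summing to exactly 1e6 (hypothetical values)
tpm = [400_000.0, 250_000.0, 200_000.0, 100_000.0, 40_000.0, 10_000.0]

# suppose the authors dropped the last two genes (e.g. a low-expression filter)
filtered = tpm[:4]
print(sum(filtered))  # 950000.0 — no longer 1e6, same effect as the ~200k-240k sums above

# rescale the remaining genes so the column sums to 1e6 again;
# note the values are now TPM relative to the retained genes only
scale = 1_000_000 / sum(filtered)
rescaled = [t * scale for t in filtered]
print(round(sum(rescaled)))  # 1000000
```

With a real filter removing most of the annotation (down to <20k genes), sums in the 200k range are plausible.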