Hi all!
I'm trying to integrate scRNAseq data from 2 different papers: Neftel et al. and Darmanis et al. The Darmanis data were generated with SmartSeq2 and are provided as raw counts, while the Neftel paper provides 2 datasets: one from 10X in raw counts (UMIs, I suppose) and the other from SmartSeq2 in TPM (raw data are not available). I'm working with scanpy and I haven't seen any tutorials that use TPM for scRNAseq (usually only CPM), but since I can't recover raw counts from TPM, I figured I need to transform the Neftel 10X data and the Darmanis SmartSeq2 data into TPM so that I can then integrate all three datasets together.
Now, I am not sure how to normalize for gene length. As far as I understand, the TPMs provided for SmartSeq2 by the Neftel group were obtained using RSEM, and it seems that RSEM computes a gene's effective length independently for each sample, as the weighted average of the effective lengths of its isoforms (weighted by 'IsoPct').
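To make the per-sample weighting concrete, here is a minimal sketch of an IsoPct-weighted gene length as described above. It is not RSEM's actual code: the function name, the example isoform lengths, and the fallback for a gene with no expressed isoforms are all assumptions made for illustration.

```python
import numpy as np

def gene_effective_length(iso_eff_lengths, iso_pct):
    """IsoPct-weighted mean of isoform effective lengths for one gene in one sample."""
    lengths = np.asarray(iso_eff_lengths, dtype=float)
    weights = np.asarray(iso_pct, dtype=float)
    total = weights.sum()
    if total == 0:
        # Assumption: with no expressed isoform, fall back to the unweighted mean.
        return float(lengths.mean())
    return float(np.dot(lengths, weights) / total)

# Hypothetical gene with two isoforms (effective lengths 1000 and 3000 bases)
# whose isoform usage differs between two cells.
cell_a = gene_effective_length([1000.0, 3000.0], [90.0, 10.0])
cell_b = gene_effective_length([1000.0, 3000.0], [10.0, 90.0])
print(cell_a, cell_b)
```

The same gene ends up with a different length in each cell, which is exactly what a single fixed length per gene cannot reproduce.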
So my questions are:
1) If I just download all transcript lengths from BioMart, compute an average transcript length for each gene, and then use it to calculate TPMs, will it be reasonable to use these TPMs to integrate the three datasets? Or will they differ too much from the RSEM TPMs to be compatible?
2) Should I just use 1 as the transcript length for the 10X dataset?
3) Is it even a good idea to transform raw counts to TPM for this type of analysis, or should I just remove the Neftel SmartSeq2 dataset from the analysis and proceed with raw counts and CPM? I also plan to identify cell clusters in the combined dataset and find marker genes for them.
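For reference, the calculation behind questions 1 and 2 can be sketched as follows, assuming a cells x genes count matrix and one fixed length per gene. The function name and the toy numbers are hypothetical. Setting every length to 1 reduces the result to CPM, which is what question 2 amounts to for the 10X data.

```python
import numpy as np

def counts_to_tpm(counts, gene_lengths):
    """Convert a cells x genes count matrix to TPM using one fixed length per gene."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths, dtype=float)
    rate = counts / lengths                    # reads per base, per gene
    totals = rate.sum(axis=1, keepdims=True)   # per-cell normalizer
    totals[totals == 0] = 1.0                  # avoid division by zero for empty cells
    return rate / totals * 1e6

# Toy example: 2 cells x 3 genes, with hypothetical gene lengths in bases.
counts = np.array([[10, 20, 70],
                   [0, 50, 50]])
lengths = np.array([1000.0, 2000.0, 7000.0])

tpm = counts_to_tpm(counts, lengths)
cpm = counts_to_tpm(counts, np.ones(3))  # length 1 for every gene gives CPM
print(tpm.sum(axis=1), cpm[0])
```

Note that this uses a single length per gene for all cells, so it would not match RSEM-style TPMs whenever isoform usage varies between cells.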
Unless the scRNA-seq variants of salmon or kallisto were used to calculate the TPMs, I would recommend either reprocessing the data to get raw counts or excluding the dataset. Calculating TPMs without building transcript-level models often leads to misleading results, since you don't know which isoform(s) are expressed and therefore which length to normalize by.