I am working with transcript counts produced by RSEM which gives me expected_count, TPM and FPKM values. I usually work with TPM values as the counts have been normalized for transcript length. I would like to use ComBat-seq for batch effect removal. The documentation https://github.com/zhangyuqing/ComBat-seq says ComBat-seq requires
untransformed, raw count matrix as input
It also says:
ComBat-seq provides adjusted data which preserves the integer nature of counts.
Since none of the counts produced by RSEM are integer, I'm not clear on what ComBat-seq is asking me to provide. It would seem that TPM would be appropriate as transcript length has been taken into account but the 'integer' part brings that into question.
Can anyone provide clarity on what I should pass into ComBat-seq? Thank you.
Does ComBat-seq work on transcript-level counts rather than gene-level counts? If so, does mapping uncertainty play a role here?
The ComBat-seq paper only mentions gene-level counts. I would guess that mapping uncertainty would be a major issue for transcript-level counts, and that ComBat-seq is only designed for gene-level counts, but the ComBat-seq authors would have to confirm.
It isn't clear to me whether OP has gene level or transcript level counts. The question says "transcript counts" but also mentions "gene lengths".
Thank you for the response. I am using transcript counts. Sorry, I should have said transcript length; I have now edited the post to reflect that.
I am interested in the relative expression of transcripts within and between samples, which is why I was using TPM: the relative expression of transcripts within a sample has already been normalized in the TPM (i.e. expected_count / effective_length). My counts were also measured with different technologies: some via NanoString, which counts the existence of a particular sequence, and some via RNA-Seq, where multiple reads map to a single transcript and the counts are therefore amplified by effective_length. My current pipeline takes TPM values from RNA-Seq and counts from NanoString and normalizes them together using geometric means, as in DESeq2. I then wanted to use ComBat-seq to correct for batch effects. If TPM is out of the question for ComBat-seq, do you have any suggestions for how to unify these data? Should I amplify my NanoString counts by effective_length to simulate expected_counts and pass these to ComBat-seq?
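For anyone following along, the two normalizations described above can be sketched roughly as follows. This is only an illustration in plain Python, not the code RSEM or DESeq2 actually run: the function names `tpm` and `size_factors` are hypothetical, the inputs are assumed to be plain lists in a consistent transcript order, and the geometric-mean step assumes the usual median-of-ratios idea in which transcripts with a zero in any sample are skipped.

```python
import math
from statistics import median


def tpm(expected_counts, effective_lengths):
    """Convert one sample's expected counts to TPM (illustrative sketch)."""
    # Length-normalize each transcript: expected_count / effective_length.
    rates = [c / l if l > 0 else 0.0
             for c, l in zip(expected_counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so the values in the sample sum to one million.
    return [r / total * 1e6 for r in rates]


def size_factors(samples):
    """Median-of-ratios size factors (illustrative sketch).

    `samples` is a list of samples, each a list of per-transcript values.
    """
    n_features = len(samples[0])
    # Log geometric mean of each transcript across samples;
    # transcripts with a zero in any sample are skipped (None).
    log_geo = []
    for j in range(n_features):
        vals = [s[j] for s in samples]
        if all(v > 0 for v in vals):
            log_geo.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_geo.append(None)
    if all(g is None for g in log_geo):
        raise ValueError("no transcript is non-zero in every sample")
    factors = []
    for s in samples:
        # Median log ratio of this sample to the per-transcript reference.
        log_ratios = [math.log(s[j]) - log_geo[j]
                      for j in range(n_features) if log_geo[j] is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy example: the second sample is exactly twice the first.
toy = [[10, 20, 30], [20, 40, 60]]
factors = size_factors(toy)
normalized = [[v / f for v in s] for s, f in zip(toy, factors)]
```

In the toy example the two samples become identical once each is divided by its size factor. Note that both steps return non-integer values, which is exactly the tension with the integer counts that ComBat-seq describes.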
Hello, I'm running into the same issue with RNA-Seq and NanoString data. If you find a solution, would you mind sharing it with me? Thanks in advance.