Hi everybody! I'm working on several metagenomes (so not transcriptomes), and I mapped high-quality reads to a database of nucleotide sequences with bowtie2. I want to convert the raw mapping counts of each gene to RPKM. I know this method might not be appropriate for this purpose, so please set that aside.
I obtained the formula to use from here. So, this is the question: in my case, should 'totalNumReads' refer to the total number of reads that successfully aligned to a gene (that is, the sample-wise sum of the counts), or to the library size, that is, the total number of reads in the metagenome (so also counting unmapped reads)?
I am curious to know your opinion about this. Thanks
It's the reads mapped to genes. Basically, if you have a matrix of counts with genes as rows and samples as columns, totalNumReads is the sum of each column. Including unmapped reads makes no sense, because the apparent expression of a gene would then change whenever a sample has more unmapped reads (e.g. a noisier dataset), and that has nothing to do with the reads assigned to genes.
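To make the point about unmapped reads concrete, here is a minimal Python sketch. It assumes the usual RPKM definition (count * 1e9 / (gene length in bp * totalNumReads)), which may or may not match the linked formula exactly, and every number and name in it is made up for illustration.

```python
def rpkm(gene_count, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million reads (usual definition, assumed here)."""
    return gene_count * 1e9 / (gene_length_bp * total_reads)

# Hypothetical sample: the reads assigned to genes are identical in both scenarios.
gene_count = 200          # reads mapped to gene X
gene_length_bp = 1000     # length of gene X
mapped_reads = 1_000_000  # column sum: reads assigned to any gene

# Two hypothetical library sizes that differ only in the number of unmapped reads.
clean_library = 1_250_000  # 250k unmapped reads
noisy_library = 2_000_000  # 1M unmapped reads

by_mapped = rpkm(gene_count, gene_length_bp, mapped_reads)
by_clean = rpkm(gene_count, gene_length_bp, clean_library)
by_noisy = rpkm(gene_count, gene_length_bp, noisy_library)

print(by_mapped, by_clean, by_noisy)
```

With the column sum as denominator the value depends only on the reads assigned to genes. With the whole library as denominator it drops as the unmapped fraction grows, even though nothing about gene X has changed.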
Ok, I got you, but in this case I'm not taking the size of the library into account, right? For example, if I have metagenome1 with 20M reads and metagenome2 with 10M reads, the count of reads mapping to a gene X might be higher in metagenome1 solely because it has more reads. And this might be independent of the sum of reads mapping to all the genes. This is the point that confuses me most.
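The library size is the sum of the column, so it is already taken into account. If metagenome1 is sequenced twice as deeply, then all else being equal both the count for gene X and the column sum roughly double, and the ratio stays the same. The colSum is the effective library size, and that is the one that matters.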
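A scaled-down toy version of the metagenome1 vs metagenome2 case, as a rough sketch. The counts and gene lengths are invented, the function name rpkm_matrix is hypothetical, and the usual RPKM definition (count * 1e9 / (gene length in bp * column sum)) is assumed.

```python
def rpkm_matrix(counts, gene_lengths_bp):
    """RPKM for a genes x samples count matrix, using column sums as library size."""
    n_samples = len(counts[0])
    # Effective library size per sample: reads assigned to any gene.
    col_sums = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [row[j] * 1e9 / (length * col_sums[j]) for j in range(n_samples)]
        for row, length in zip(counts, gene_lengths_bp)
    ]

# Rows are genes, columns are (metagenome1, metagenome2).
# metagenome1 is sequenced twice as deeply, so every count is doubled.
counts = [
    [400, 200],   # gene X
    [1000, 500],  # gene Y
    [600, 300],   # gene Z
]
gene_lengths_bp = [1000, 2500, 1500]

result = rpkm_matrix(counts, gene_lengths_bp)
for gene, (m1, m2) in zip("XYZ", result):
    print(gene, m1, m2)
```

The raw count for gene X is twice as high in metagenome1, but the column sum is also twice as high, so the two RPKM values come out equal. Real samples will not scale this cleanly, but the denominator still absorbs the depth difference.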