Question

Difference between CPM and RPM for RNA-seq reada quantification

2

Entering edit mode

7.2 years ago

Emilio Marmol ▴ 180

Hello everyone, I have a dataset of differentially expressed miRNAs.

I aligned my microRNA reads (10 millions per sample aprox) against reference genome, obtaining an average of 7.5 million aligned read per sample, and used FeatureCounts from RSubRead to count them asigning to reference miRNA GTF in miRBase, obtaining an average of 60% of asignment. Then, I used DESeq2 to perform differential expression analysis.

As said in Mullokandov et al. (2013), functional miRNAs should only be considered when expressed above 100-1000 RPM on average. I calculated CPM from FeatureCounts counts file and obtained average expression per group (2 groups contast) and expression per sample.

I would like to know if there is any difference between the concept of RPM and CPM, as I think the are quite the same, but not sure about that. I assumed miRNAs with a cpm expression below 100 to be spurious and noise, but I would like to be sure about that.

Any suggestions and clarifications?

Thanks

RNA-Seq miRNAs CPM RPM • 12k views

ADD COMMENT • link updated 7.2 years ago by mforde84 ★ 1.4k • written 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

RPM=CPM=FPM (single end reads).

CPM cutoff of 100 sounds like a lot. But I have limited experience in miRNA sequence analysis. Here is one post for CPM cutoff for miRNA data.

ADD REPLY • link 7.2 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

I understand what that post says, but this is not refereing to biologically active miRNAs but to initially threshold filtering for eliminating reads with zeros and very small counts.

100-1000 or more RPM (or CPM but I am not sure XD) threshold is considered for withing the cell miRNA capability to drive a significant response in mRNA targeting and repressing gene expression and not just slight and spurious reactions. It is a matter of selecting really interesting DE miRNAs for discussion rather than for statistical analyses, in which I already applied such pre-filtering anyway.

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

Following sentence was not clear

I assumed miRNAs with a cpm expression below 100 to be spurious and noise

Gave an impression that you are looking for cutoff for expression and not function.

ADD REPLY • link 7.2 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

well, I assumed less than 100 CPMs to be a very low expression level so that the miRNA could not drive a significant function in repressing gene expression. If you have differential expression but with single or very scarce miRNA molecules, the effect is not expected to be such a thing to take into consideration. Maybe for profiling in cancer or so could be useful, but not for infering metabolic changes.

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

No CPM, aka TPM, and F(R)PKM are not the same!! I think it is time for some StatQuest. You may also want to google for blogs and videos of Lior Pachter, one of the leading RNA-seq analysis guys on that matter.

ADD REPLY • link 7.2 years ago by ATpoint 85k

0

Entering edit mode

CPM is not TPM...

Question is the difference between counts per million and reads per million.

ADD REPLY • link 7.2 years ago by mforde84 ★ 1.4k

0

Entering edit mode

I'm not asking for RPKM, TPM or FPKM sorry...

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

Hi v82masae,

I have few queries regarding your miRNA requirement.

You might have used Illumina sequencing platform to sequence miRNA. After adapter trimming what was the length of the small RNA you kept?
Did you compare your sequenced miRNA against miRBase database?
What tools did you use to generate differentially expressed miRNAs?

ADD REPLY • link 7.2 years ago by bioinforesearchquestions ▴ 370

1

Entering edit mode

II did miRNA sequencing yes, with HiSeq 2500 Illumina, single-end.

I used Cutadapt for trimming adapters with an acceptance window of 15-30 nt (quite big, as reported windows are normally 18-25, but I wanted to retain big fragments for contrasting against circRNAs and precursor forms). I tried a smaller window and the output did not differ quite much, just a few thousand reads if not less.

After trimming, I aligned the reads against reference genome with about 80% mapping, and used FeatureCounts for counting reads in BAM files with an average 60% of successful assignment to GTF miRNA annotation file downloaded from miRBase v21 yes.

For differential expression I used DESeq2 and EdgeR, both giving quite similar results and kept DESeq2 just for department papers homogeneity as we are using DESeq2 for the majority of other papers from same project.

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

Thanks, V82masae.

Based on this paper, I used the mirna_soft tool (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4138881/).

Even we sequenced using Illumina, 51 bp single-end. After trimming, I retained the length between 18-35nt. Then I compared the sequenced reads against the miRBase v21 database. For Bovine, I extracted bovine specific miRNA from mature.fa file from miRBase v21.

Did you do alignment with your species miRNAs or reference genome?
Did you use bowtie2 aligner?

I got the output file with miRNA id, sequence, count of unique matches and count of multi-matches. For one of the samples, total trimmed reads- 4,679,140. Unique matches to the bovine miRNA sequences (from miRBase) was 800,000 matches. unmatched reads - 3,796,387

ADD REPLY • link 7.2 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

I did alignment from trimmed reads to reference genome yes. As my model organism (pig) had a very poorly annotated miRNAome, I also aligned my reads agains human and mouse genome and did all the pipeline to detect miRNAs differentially expressed that were not described in the miRNA GTD annotation in pigs, considering that within phylogenetically close species, miRNA sequences are commonly conserved.

I tried some de novo miRNA finding algorithms like mirDeep2, but the output did not convince me, too noisy. I opted for finding manually some DE non-annotated miRNAs in human and mouse and looking for their positions in pig assembly, then doing folding to see if secondary structure was equivalent to expected miRNA. Found 12 new miRNAs that way, 7 of them were in fact new miRNAs as they were included in the recently new pig assembly that was released one month ago, and the other 5 were compatible with fragments of circRNAs, but considered as miRNAs in human and mouse, so maybe some fragmentation of circRNA could be involved.

I used bowtie1, it is better suited for small reads than bowtie2, without considering any mismatches in seed (not taking into consideration ismoMIRs), maximum of 20 multimappings and --best --strata mode with one best alignment per read reported.

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180

0

Entering edit mode

Hi V82masae,

I am working on both bovine and pig.

What do you think about this workflow, from miRBase I extracted bovine and pig related miRNAs from mature.fa (miRBase) file. Using miRNA_soft tool as per the research article I provided in the previous comment, I counted the reads mapped to bovine and pig miRNA sequences extracted from mature.fa. I got an output file with the following details:

miRNA id, miRNA sequence, Unique matches and Multi-matches.

I got similar output for all the samples. Then again using miRNA_soft tool, I compared the counts generated in the previous step to generate differential miRNA expression profiling.

I am a naive person in miRNA sequencing. Do you recommend this workflow?

What is miRNA GTD?

Can you share the pig reference genome genbank id or URL?

Why didn't you compare your sequenced reads with miRBase database instead you compared it with the pig, mouse and human reference genomes?

ADD REPLY • link 7.2 years ago by bioinforesearchquestions ▴ 370

score 0 · Answer 1 · 2017-09-05

0

Entering edit mode

7.2 years ago

mforde84 ★ 1.4k

reads per million  = counts per feature / (total reads / 1000000)
counts per million = (counts per feature / total reads) * 1000000

ADD COMMENT • link 7.2 years ago by mforde84 ★ 1.4k

0

Entering edit mode

Ain't these equations same, just different representations?

ADD REPLY • link 7.2 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

Watch the StatQuest video I linked. You will see that the difference in the order of the operations is quiet remarkable on the result.

ADD REPLY • link 7.2 years ago by ATpoint 85k

0

Entering edit mode

Order of operations are different for TPM and RPKM. But the question is what is the difference between CPM and RPM. I think there is none and the equations above also do not appear different from each other.

ADD REPLY • link 7.2 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

So your claim is that...

counts per feature / (total reads / 1000000) == (counts per feature / total reads) * 1000000

ADD REPLY • link 7.2 years ago by mforde84 ★ 1.4k

0

Entering edit mode

Well, for me those two equations are the same. Let's say I have a total of 5 million reads per sample i, and 1000 counts per miRNA j, so as you wrote:

RPM = 1000 / (5000000 / 1000000) = 1000 / 5 = 200 CPM = (1000 / 5000000) * 1000000 = 0.0002 * 1000000 = 200

I used the RPM formula you posted for calculations, but contrasting it with cpm function in EdgeR package, the output is the same, so that's why I got confused about terminology. I initially tried counting reads in fastq files (like those are reads, not counts, and then looking for sequences matching the miRNAs of interest, but then I realised that this would make the alignment and counting procedure spurious, so that all statistics were based on counting by assigning to miRNA locations, not read numbers itself. Soooo, I just kept CPM considering as the same of RPM, but I am not sure about that, and that's why I asked...

ADD REPLY • link 7.2 years ago by Emilio Marmol ▴ 180