Question

Converting transcript_count_matrix to TPM values obtained from RNA seq data

2.0 years ago

Manav • 0

Hi all, I recently received my RNA-seq data, which includes both 'gene_count_matrix' and 'transcript_count_matrix' files. I have performed the DGE analysis in edgeR using the gene_count_matrix file.

I would like to get a transcripts per million (TPM) matrix for downstream analysis. Can anyone explain how I can do that, or point me to where I can find more information on this?

Thanks a lot.

TPM edgeR • 1.7k views

updated 2.0 years ago by swbarnes2 14k • written 2.0 years ago by Manav • 0

Hi Manav,

Which tool was used to estimate the abundance of your genes/transcripts? I mean RSEM, kallisto, salmon...

Best,

Rodo

ADD REPLY • link 2.0 years ago by rodolfo.peacewalker ▴ 390

Hi, I am not sure. We outsourced the sequencing to a company, which gave us the raw counts for both genes and transcripts. This is what they mention in their methods:

After the final transcriptome was generated, StringTie and ballgown was used to estimate the expression levels of all transcripts.

ADD REPLY • link 2.0 years ago by Manav • 0

All right,

After reading the StringTie reference manual, it looks like the tool produces a raw count matrix as input for DESeq or edgeR. In this case, I suggest you estimate the TPM for each gene/transcript from your raw count matrix. To calculate the length of each gene, use the biomaRt package to retrieve the start and end coordinates of the genes.

Best,

Rodo

ADD REPLY • link 2.0 years ago by rodolfo.peacewalker ▴ 390

That will work for transcripts, not genes.

ADD REPLY • link 2.0 years ago by swbarnes2 14k

Hi swbarnes2, just to confirm: the method Rodo suggested above will only be useful if I use the transcript counts file, right?

ADD REPLY • link 2.0 years ago by Manav • 0

Gene length is not meaningful here, because you don't want to include introns, and if you have multiple transcripts of different lengths present for one gene, you'd have to account for that too. Programs like RSEM and Salmon will do this math for you, but it would be tricky to do it yourself. Transcript-based TPM would be much more straightforward to do yourself. One transcript has one length only.
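
As a rough illustration of the transcript-level calculation, here is a minimal sketch in plain Python (the same arithmetic carries over directly to R). It assumes you have, for one sample, a raw count and a length in base pairs for every transcript. The function names, transcript values, and gene IDs below are hypothetical and only there to show the math. It uses the annotated transcript length rather than an effective length, so the numbers will not exactly match what RSEM or Salmon would report. Summing transcript TPMs per gene is one common way to get a gene-level value without needing a "gene length" at all.

```python
from collections import defaultdict


def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for ONE sample into TPM.

    counts     -- raw read counts, one per transcript
    lengths_bp -- transcript lengths in base pairs, same order as counts
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")

    # Step 1: length-normalise each count (reads per kilobase).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]

    # Step 2: scale so that the values for the sample sum to one million.
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(rpk)
    return [r / total * 1e6 for r in rpk]


def sum_tpm_by_gene(tpm, gene_ids):
    """Collapse transcript TPMs to gene level by summing per gene."""
    per_gene = defaultdict(float)
    for value, gene in zip(tpm, gene_ids):
        per_gene[gene] += value
    return dict(per_gene)


# Hypothetical example: three transcripts, two of them from the same gene.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]
gene_ids = ["geneA", "geneA", "geneB"]

tpm = counts_to_tpm(counts, lengths_bp)
gene_tpm = sum_tpm_by_gene(tpm, gene_ids)
```

Apply `counts_to_tpm` to each sample column of the transcript count matrix separately, since the scaling to one million is done per sample.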