2.0 years ago
Manav
Hi all, I recently received my RNA-seq data, which includes both a 'gene_count_matrix' and a 'transcript_count_matrix' file. I have performed the DGE analysis in edgeR using the gene_count_matrix file.
I would like to get a transcripts per million (TPM) matrix for downstream analysis. Can anyone explain how I can do that, or point me to more information on this?
Thanks a lot.
Hi Manav,
Which algorithm did you use to estimate the abundance of your genes/transcripts? For example RSEM, kallisto, salmon...
Bests,
Rodo
Hi, I am not sure. We outsourced the sequencing to a company, which gave us the raw counts for both genes and transcripts. This is what they state in their methods:
After the final transcriptome was generated, StringTie and ballgown was used to estimate the expression levels of all transcripts.
All right,
After reading the StringTie reference manual, it looks like the algorithm produces a raw count matrix as input for DESeq or edgeR. In this case, I suggest you estimate the TPM for each gene/transcript from your raw count matrix. To calculate the length of each gene, use the biomaRt package to retrieve the start and end coordinates of the genes.
Bests,
Rodo
That will work for transcripts, not genes.
Hi swbarnes2, just to confirm: the method Rodo suggested above is only valid if I use the transcript counts file, right?
Gene length is not meaningful here, because you don't want to include introns, and if you have multiple transcripts of different lengths present for one gene, you'd have to account for that too. Programs like RSEM and Salmon will do this math for you, but it would be tricky to do it yourself. Transcript-based TPM would be much more straightforward to do yourself. One transcript has one length only.
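For reference, the transcript-level calculation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `counts_to_tpm`, the transcript IDs, and the numbers are all made up, and the lengths are assumed to be spliced (exon-only) transcript lengths in bases that you supply yourself. It uses the plain annotated length rather than an effective length, so the values will not exactly match what RSEM or Salmon would report.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw transcript counts to TPM for a single sample.

    counts     : dict of transcript ID -> raw read count (hypothetical input)
    lengths_bp : dict of transcript ID -> spliced transcript length in bases
    """
    # Step 1: divide each count by the transcript length in kilobases
    # (reads per kilobase, RPK).
    rpk = {}
    for tid, count in counts.items():
        length = lengths_bp[tid]
        if length <= 0:
            raise ValueError(f"non-positive length for transcript {tid}")
        rpk[tid] = count / (length / 1000.0)

    # Step 2: scale so that the values of the sample sum to one million.
    total_rpk = sum(rpk.values())
    if total_rpk == 0:
        return {tid: 0.0 for tid in rpk}
    return {tid: value / total_rpk * 1e6 for tid, value in rpk.items()}


# Made-up example values for one sample.
example_counts = {"tx1": 100, "tx2": 200, "tx3": 0}
example_lengths = {"tx1": 1000, "tx2": 4000, "tx3": 500}

tpm = counts_to_tpm(example_counts, example_lengths)
for tid, value in tpm.items():
    print(tid, round(value, 2))
```

In this toy example tx2 has twice the reads of tx1 but is four times longer, so it ends up with half the TPM. For a full count matrix, the same two steps would be applied to each sample column separately.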
Hi, yes, I will use the transcript counts as you advised. Thanks.