Hi there,
I am currently working on a project related to pan-cancer analysis. I have done differential expression analysis for miRNAs and genes with edgeR. I know that edgeR only takes raw counts as input, so I downloaded the HTSeq-Count data from the GDC data portal. After the differential expression analysis, I'd like to obtain normalized expression values (RPKM/FPKM) for the miRNAs and genes to use in downstream analysis, such as computing Pearson correlations between miRNAs and mRNAs to construct a miRNA-mRNA regulatory network. However, I have been stuck for days on how to get normalized expression values for both miRNAs and genes. Here are my questions:
There is a function called "cpm" (counts per million) in edgeR, but it says it doesn't take gene length into account. edgeR also provides another version of normalized counts, "pseudo.counts"; however, some people say these are quite difficult to interpret. So I am wondering whether I could use "logCPM" as the normalized expression values for the downstream analysis?
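For reference, the idea behind CPM and log-CPM can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not edgeR's implementation: edgeR's own "cpm" can also apply normalization factors and scales its prior count by library size, so its values will differ slightly. The function names and the `prior` parameter here are hypothetical.

```python
import math

def cpm(counts):
    """Counts per million. counts maps feature ID -> list of raw counts (one per sample)."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total raw counts in each sample
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }

def log_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior avoids log2(0). Simplified compared with edgeR."""
    return {
        gene: [math.log2(value + prior) for value in row]
        for gene, row in cpm(counts).items()
    }
```

Because CPM only rescales each sample by its library size, it is comparable for the same feature across samples, which is what a per-feature correlation needs; it is not comparable between features of different lengths.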
If not, I realized that there is also a function called "rpkm" in edgeR which can calculate normalized expression values. However, it needs gene/miRNA length information to work. I do not know where to find the length information for genes and miRNAs, since the HTSeq-Count files do not contain it. Could anyone please tell me how to do this? Should I download the gene information from Ensembl and the miRNA information from miRBase, and calculate the lengths myself? Is there an R package that could do this instead?
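To make the role of the length information concrete, here is a minimal sketch of the RPKM arithmetic in plain Python: raw count divided by feature length in kilobases and by library size in millions. It assumes the lengths are already available as a dictionary in base pairs; the function name and argument names are hypothetical, and this is not edgeR's code.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads.

    counts:     feature ID -> list of raw counts (one per sample)
    lengths_bp: feature ID -> feature length in base pairs (assumed input)
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total raw counts in each sample
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            # count / (length / 1e3) / (library size / 1e6)
            row[j] * 1e9 / (lengths_bp[gene] * lib_sizes[j])
            for j in range(n_samples)
        ]
        for gene, row in counts.items()
    }
```

Since mature miRNAs are all roughly 22 nt long, dividing by length changes their relative values very little, so the length step mainly matters for the genes.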
Could I just download the RPKM files of miRNAs and genes from the GDC data portal and use them to construct the miRNA-mRNA regulatory network? Would that be correct? It seems to be the easiest option for me.
Any help would be really appreciated.
Could you please tell me where to download the GTF file? It seems that TCGA does not provide this file for genes. Also, how can I obtain normalized miRNA expression data from the HTSeq-Count data? Thanks.
I assumed that you already had the RPKM values, based on this: "Could I just download the RPKM files of miRNAs and genes from GDC data portal"
I would suggest speaking to someone in your workplace who does bioinformatics. If you are just starting with RNA-Seq analysis and want to work with TCGA data, it takes a lot of work and guidance.
I eventually found the GTF file on the Ensembl website. Anyway, thank you very much for your help. I am working on it.
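For anyone following the same route, one common way to turn a GTF file into gene lengths is to take the union of all exons annotated for each gene. Below is a minimal sketch in plain Python, assuming a standard 9-column, tab-separated GTF with a gene_id attribute. The function name is hypothetical, and other length definitions (for example, the longest transcript) are equally possible.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_lengths_from_gtf(lines):
    """Return gene_id -> total length (bp) of the union of its exons."""
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue                      # keep exon features only
        match = GENE_ID.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:      # overlapping or adjacent: merge
                cur_end = max(cur_end, end)
            else:                         # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths
```

Usage would look like `with open("annotation.gtf") as fh: lengths = gene_lengths_from_gtf(fh)`, where the filename is a placeholder. It is worth checking that the gene IDs in the GTF match those in the count files exactly, including any version suffix, before combining them.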