4.2 years ago
ginny
I have just started working with TCGA data, and I noticed that the RNA-seq (HTSeq counts) files also contain Ensembl gene IDs for miRNAs. Does that mean the expression values of miRNA genes are also present in the RNA-seq files?
So then why does TCGA have a separate miRNA quantification dataset (files ending with .mirbase.mirna.quantification)?
I am confused because I plan to find both differentially expressed genes and differentially expressed miRNAs, and I don't know which dataset to use with DESeq2.
Please help! :(
You need to download them separately.
The mirbase.mirna.quantification files are what you want for miRNA DE analysis. If your HTSeq counts files contain roughly 50,000 rows (harmonized data), you will also want to subset them to coding genes only (~20,000).
Thank you so much! Any idea how I can filter out only the coding genes?
I have code here (https://github.com/BarryDigby/TCGA_Biolinks/blob/master/TCGA_Biolinks.Rmd) that does everything you want: downloading the data, preparing the metadata, filtering for coding genes, and differential expression analysis. It's a good basic template to start with.
It was run on TCGA PRAD. Install the packages as required, change PRAD to your tissue type of interest, and you are good to go.
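For anyone who just wants the idea behind the coding-gene filter, here is a minimal sketch in plain Python. It is not the code from the linked Rmd. It assumes you already have a lookup from unversioned Ensembl gene ID to biotype (for example, built from a GENCODE/Ensembl annotation), and the gene IDs, counts, and function names below are invented for illustration.

```python
def strip_version(gene_id):
    """Drop the version suffix: 'ENSG00000000001.4' -> 'ENSG00000000001'."""
    return gene_id.split(".", 1)[0]


def filter_protein_coding(counts, biotypes):
    """Keep only rows whose gene is annotated as 'protein_coding'.

    counts   : dict mapping gene ID (possibly versioned) -> list of raw counts
    biotypes : dict mapping unversioned gene ID -> biotype string
    Rows missing from `biotypes` (e.g. HTSeq summary rows) are dropped.
    """
    return {
        gene_id: row
        for gene_id, row in counts.items()
        if biotypes.get(strip_version(gene_id)) == "protein_coding"
    }


# Toy count matrix: three samples per row. All IDs and values are made up.
counts = {
    "ENSG00000000001.4": [120, 98, 143],  # hypothetical coding gene
    "ENSG00000000002.1": [0, 2, 1],       # hypothetical miRNA gene
    "ENSG00000000003.7": [15, 22, 9],     # hypothetical coding gene
    "__no_feature": [5000, 4800, 5100],   # HTSeq summary row, not a gene
}

# Hypothetical annotation lookup (assumption: built from your annotation file).
biotypes = {
    "ENSG00000000001": "protein_coding",
    "ENSG00000000002": "miRNA",
    "ENSG00000000003": "protein_coding",
}

coding_counts = filter_protein_coding(counts, biotypes)
print(sorted(coding_counts))
```

The same logic carries over to R: strip the version suffix from the row names, match against the annotation, and keep the rows whose biotype is protein_coding before passing the matrix to DESeq2.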