Hey all,
This is my first time posting, so I hope this question isn't too open-ended; sorry if it's a bit long. Anyways, I'm a current bioinformatics master's student, and I've just joined a cancer research lab as an intern. They have RNA-seq data that they received from outsourcing their sequencing, and from it, they'd like me to get them a list of the most significant differentially expressed genes by fold change and p-value.
The problem is that they don't have the raw data. The facility they outsourced their sequencing to did some of the data analysis for them, so what I have to work with is a data frame for each trial: the control and three separate tests. Each data frame contains the gene ID, the transcript ID(s), the length, the expected count, and the FPKM.
This type of analysis is new to me, and in reading how to complete this task using tools such as edgeR, it seems important to have the raw read counts, which unfortunately I don't have and don't think I can get. I don't believe the expected_count is the same thing, is it? They do supply an equation for the FPKM as FPKM = (10^6 * C) / (N * L / 10^3), where C is the number of fragments uniquely aligned to the gene, N is the total number of fragments uniquely aligned to all genes, and L is the number of bases in the gene. Would C in this equation be equal to the raw read count? It appears to be approximately equal to the expected count.
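For what it's worth, that equation is algebraically invertible: if you also knew N (the per-sample library size, which the facility may or may not report), you could back-calculate C from the FPKM. A minimal sketch in Python, with made-up numbers for illustration:

```python
# Sketch: the facility's FPKM formula, and its inversion to recover C.
# The numbers below are invented purely for illustration.

def fpkm(C, N, L):
    """FPKM = (10^6 * C) / (N * L / 10^3)."""
    return (1e6 * C) / (N * L / 1e3)

def fragments_from_fpkm(f, N, L):
    """Invert the formula to recover C, the uniquely aligned fragment count."""
    return f * (N * L / 1e3) / 1e6

C, N, L = 250, 30_000_000, 2_000  # fragments, library size, gene length (bp)
f = fpkm(C, N, L)
print(round(fragments_from_fpkm(f, N, L)))  # recovers the original 250
```

The catch is that N usually isn't in the per-gene tables, so in practice the expected_count column is the more direct route to counts.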
Any ideas on how to solve this problem are much appreciated!
Don't waste your time with this. Your group paid for the sequencing; whoever did it will happily give you the fastq files.
Yeah I'm going to attempt to get the fastq files. I was just wondering if there was any use to what I have currently. I believe that it's RSEM output.
¯\_(ツ)_/¯
In all seriousness, do you know how they produced the expected counts? Because if you do, you might be able to use tximport to produce counts and afterwards use edgeR :). HOWEVER! Take into account that if you want to publish, it is likely you will be asked to upload the raw data to a publicly available repository.
So looking into the problem a bit more, it looks like this is the direct output from RSEM, with the expected counts defined in the docs as:
"'expected_count' is the sum of the posterior probability of each read comes from this transcript over all reads. Because 1) each read aligning to this transcript has a probability of being generated from background noise; 2) RSEM may filter some alignable low quality reads, the sum of expected counts for all transcript are generally less than the total number of reads aligned."
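If it really is RSEM output, then expected_count is already the count estimate that downstream tools use: tximport with type = "rsem" takes that column from each sample's .genes.results file. A rough Python equivalent of that import step, assuming standard RSEM output columns and hypothetical file names:

```python
# Sketch of what tximport does for RSEM output (type = "rsem"): it reads the
# expected_count column from each sample's *.genes.results file as the count
# estimate. File names and the exact column layout are assumptions here,
# based on standard RSEM output.
import csv

def read_expected_counts(path):
    """Map gene_id -> expected_count from one RSEM .genes.results file."""
    counts = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts[row["gene_id"]] = float(row["expected_count"])
    return counts

# Hypothetical usage: build a per-sample count matrix for edgeR.
# samples = {"control": "control.genes.results", "test1": "test1.genes.results"}
# matrix = {name: read_expected_counts(path) for name, path in samples.items()}
# edgeR wants integer-like counts, so round the expected counts before export.
```

Note that expected counts are fractional (they're posterior expectations, as the quote says), which is why they get rounded or handled via tximport's offset machinery before the edgeR model is fit.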