I'm a student doing a data analysis project and have been given a dataset containing the mean reads per kilobase per million mapped reads (RPKM) for 6 mRNA samples that have undergone high-throughput sequencing, for about 11,000 genes. The samples have been split into 2 categories to compare, so I'm assuming the RPKM values have been averaged in some way, as there are just 2 columns of RPKM values: one column for each category, with each row corresponding to a gene. How do I find which genes have changed expression? I assume I have to use a program like R. I have some previous experience with R, but only through the Rcmdr package, and I don't know if I need a different package here. Any help is much appreciated, and thank you in advance :)
It's rather bad if the RPKMs are averaged as you describe. You should have individual measurements for each sample, so that you can estimate dispersion and the biological/technical variability unrelated to your trait/treatment of interest.
Yes, all I have been given is a table with 3 columns: the 1st is the RefSeq ID for the gene in question, and the next two columns are the RPKM for the 2 categories (viable and non-viable). I haven't been told that they were averaged, but since I don't have 6 different RPKM values for each gene, I'm assuming they have been.
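With a table like that (one RPKM value per category), about the only quantity that can be computed per gene is a fold change. Here is a minimal sketch of that, in Python rather than R for brevity. The RefSeq IDs and RPKM values are made up, and the pseudocount of 1 is an arbitrary assumption to avoid dividing by zero. Note that this only ranks genes; it is not a statistical test, because there is no information about variability between samples.

```python
import math

# Hypothetical rows in the layout described above:
# (RefSeq ID, viable RPKM, non-viable RPKM). IDs and values are invented.
rows = [
    ("NM_000001", 10.0, 40.0),
    ("NM_000002", 25.0, 25.0),
    ("NM_000003", 8.0, 0.0),
]

PSEUDOCOUNT = 1.0  # assumption: avoids log of zero; the value is arbitrary


def log2_fold_change(viable, non_viable, pseudocount=PSEUDOCOUNT):
    """log2 ratio of non-viable to viable RPKM, with a pseudocount added to both."""
    return math.log2((non_viable + pseudocount) / (viable + pseudocount))


fold_changes = {gene: log2_fold_change(v, nv) for gene, v, nv in rows}

# Rank genes by absolute log2 fold change, largest change first.
ranked = sorted(fold_changes.items(), key=lambda kv: abs(kv[1]), reverse=True)
for gene, lfc in ranked:
    print(f"{gene}\t{lfc:+.2f}")
```

A positive value means higher RPKM in the non-viable category, a negative value means lower. Genes with very low RPKM in both categories tend to give unstable ratios, which is one reason the choice of pseudocount matters.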
I think the details here are not so important. You have about 11,000 instances of a hypothesis testing problem, one per gene. Using simple tests like the t-test can be useful! I emphasize that I agree with @decosterwouter: you must not compare just the average values, and you should use the variation within the samples.
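To illustrate the per-gene t-test idea, here is a minimal sketch in Python using only the standard library. It assumes you can obtain the individual per-sample values (3 viable and 3 non-viable), which the averaged table does not contain. The gene names and numbers are invented, and treating them as log2-transformed RPKM is my assumption. It computes Welch's t statistic and approximate degrees of freedom; turning these into p-values needs a t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    se2 = va + vb
    if se2 == 0:
        # No variation in either group: the statistic is undefined.
        return float("nan"), float("nan")
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-sample values (assumed log2 RPKM): (viable, non-viable).
genes = {
    "geneA": ([5.0, 5.2, 4.8], [7.0, 7.1, 6.9]),
    "geneB": ([3.0, 6.0, 4.5], [4.0, 5.5, 4.2]),
}

results = {
    gene: welch_t(viable, non_viable)
    for gene, (viable, non_viable) in genes.items()
}
for gene, (t, df) in results.items():
    print(f"{gene}\tt={t:.2f}\tdf={df:.1f}")
```

Here geneA has a large difference relative to its within-group spread, while geneB's difference is small compared with its variation, which is exactly the information that is lost once the samples are averaged. With only 3 samples per group the variance estimates are very noisy, which is part of why dedicated RNA-seq methods exist.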
There exist advanced methods for finding differentially expressed genes (DEGs); most of them were developed for microarray data. But you can find some methods for finding DEGs in RNA-seq data, like DEGseq.
OK, how does DEGseq work? Is it just a package in R that will compute the expression difference from the RPKM values? And what do you mean by saying t-tests can be helpful?
I strongly encourage you to follow Michael's advice below. This analysis is a complete waste of everyone's time.