Hello all,
I'm relatively new to bioinformatics, and am trained primarily as a molecular biologist, so please bear with me if my questions seem quite rudimentary.
I've received a dataset of RNA-sequence data that has already been converted from its native form to an Excel spreadsheet. It contains data from 2 conditions, one experimental and one control. Each is done in replicate, and both the raw values and the log transformed values are present. I'm trying to utilize this data to compile a list of differentially expressed genes, so my questions are as follows:
1) Is there a way in Excel to isolate the differentially expressed genes from the large dataset that I have now? Is there a way in Excel to statistically test the difference in expression of each gene, and keep only the genes whose difference in expression has p<0.05?
2) From this dataset, is there a way in Excel to isolate the up-regulated genes and the down-regulated genes separately?
Again, I apologize if these questions are basic, and would greatly appreciate any assistance from the community.
Channeling Pierre Lindenbaum's spirit,
Although I understand R is probably a much more efficient way to do the tasks I've described, my experience with R and programming in general is quite limited. In the long term, I'm aiming to become proficient in the language, and in the interim I would appreciate any suggestions for conducting the analyses in Excel.
Please don't. You cannot reproduce anything you do in Excel, and we cannot give you specific instructions to be implemented in Excel. What you have is a matrix of numbers, most probably, and you cannot run statistical analyses in Excel. Import the data into R and people here will be able to help you much better. Plus, all of your analyses can be recorded, reproduced and debugged. Excel is the worst possible way to do an analysis of this magnitude.
Sorry to say, but don't even try in Excel. RNA-seq analysis is quite simple from the actual user's perspective because excellent standard software is available that provides the necessary statistical framework, but these tools are implemented in R. If you lack the knowledge, I recommend either working your way into it, e.g. by following the DESeq2 guide plus the web for R help, or collaborating with an experienced R bioinformatician. Trying to put together custom solutions in Excel is not recommended, especially if you are not an expert in statistics.
I see. Thank you for the replies @ATpoint and @RamRS. I am currently looking into DESeq2. How would you guys suggest that I proceed from here? Will I have to train myself in R or should I jump to looking at the DESeq tutorial? My PI has asked for data analysis relatively soon.
You should learn R as you're working on DESeq2. Once you learn how to read data into a data.frame in R and then subset it by picking rows or columns, you can do anything in R that you can in Excel. DESeq2 might have its own objects, so the tutorial should walk you through that. Most of bioinformatics is in R/python, you'll almost never need MATLAB.
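As a rough illustration of the read-then-subset idea above, here is a minimal sketch in Python (one of the two languages mentioned in this reply), using only the standard library. The gene names, column names and values are made up, and the spreadsheet is assumed to have been exported as CSV. This only shows how rows can be picked out of a table by a condition. It is not a statistical test and does not replace DESeq2.

```python
import csv
import io

# Hypothetical CSV export of the spreadsheet; all names and values are invented.
CSV_TEXT = """gene,control_1,control_2,treated_1,treated_2
geneA,100,120,400,440
geneB,250,230,60,60
geneC,80,82,81,81
"""


def read_table(text):
    """Read CSV text into a list of row dicts, converting the counts to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (v if k == "gene" else float(v)) for k, v in row.items()})
    return rows


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def split_by_direction(rows):
    """Subset genes by whether the treated mean is above or below the control mean.

    This is plain row filtering with no p-values or multiple-testing correction.
    """
    up, down = [], []
    for row in rows:
        control = mean([row["control_1"], row["control_2"]])
        treated = mean([row["treated_1"], row["treated_2"]])
        if treated > control:
            up.append(row["gene"])
        elif treated < control:
            down.append(row["gene"])
    return up, down


rows = read_table(CSV_TEXT)
up_genes, down_genes = split_by_direction(rows)
print("higher in treated:", up_genes)
print("lower in treated:", down_genes)
```

The same pattern (load the table once, then filter rows by a condition on some columns) is what subsetting a data.frame in R amounts to. Deciding which differences are significant is the part that needs a proper statistical package.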
Additionally, would I be able to perform the analyses in MATLAB, or is R preferred?
Have you tried using SeqGeq?
No, I haven't. Is it open-source software?
Doesn't look open source or even free.
mlai2567, try BRB-ArrayTools to analyze RNA-seq data in Excel. Link to the BRB-ArrayTools documentation: https://brb.nci.nih.gov/BRB-ArrayTools/Documentation.html