Background: I have been learning how to analyze RNA-seq data for my research lab; our model organism is Arabidopsis thaliana. I have ~2 years of experience working with R and feel comfortable learning how to use new functions and packages. I can comfortably read and edit scripts written by others, but I am still practicing writing my own. The RNA-seq analysis steps I have completed so far were performed either in R or with a user-friendly online platform. I have only basic knowledge of Python.
I would like advice on how to proceed with my RNA-seq analysis. I have specific ideas in mind, but would appreciate receiving some direction along the way. Additionally, if something I describe is pure nonsense, I would appreciate someone correcting me.
Experimental design: four groups (genotypes), each with and without treatment, with three biological replicates per condition (4 x 2 x 3 = 24 samples total).
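To make the design concrete, here is a minimal sketch of the sample sheet in Python. The genotype and treatment labels are placeholders I made up for illustration; the real names would replace them.

```python
from itertools import product

# Placeholder labels (hypothetical); substitute the real genotype/treatment names.
genotypes = ["G1", "G2", "G3", "G4"]
treatments = ["control", "treated"]
replicates = [1, 2, 3]

# One row per sample: (sample_id, genotype, treatment, replicate)
samples = [
    (f"{g}_{t}_rep{r}", g, t, r)
    for g, t, r in product(genotypes, treatments, replicates)
]

print(len(samples))  # 4 * 2 * 3 = 24
for row in samples[:3]:
    print(row)
```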
Steps I have completed:
- I have pseudo-aligned the reads to the transcriptome and quantified transcript abundance with Kallisto.
- I have performed differential expression analysis using the Sleuth package in R.
- For visualization of the results, I am using Integrative Genomics Viewer (IGV).
Now that I have information about differential expression, I would like to make sense of the data (i.e. compare treated vs. untreated samples within each group, and compare groups after treatment). I have some ideas and questions that I will post below.
I am familiar with Gene Set Enrichment Analysis using Gene Ontology terms, but I am also interested in what I believe is called pathway analysis. I have downloaded annotations for Arabidopsis metabolic pathways from AraCyc, and I know there are other databases, such as KEGG. Could someone point me toward how to perform pathway analysis with these annotations?
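To show what I currently understand by pathway analysis, here is a minimal sketch of an over-representation test (a one-sided hypergeometric test) in plain Python. This is only my assumption of the simplest approach, not necessarily what established tools do. The gene IDs, pathway names, and the list of differentially expressed genes are all made up; real input would come from the AraCyc annotations and the Sleuth results.

```python
from math import comb

def enrichment_p_value(universe_size, pathway_size, de_size, overlap):
    """One-sided hypergeometric test: probability of seeing at least
    `overlap` pathway genes among `de_size` genes drawn at random from
    a universe in which `pathway_size` genes belong to the pathway."""
    max_overlap = min(pathway_size, de_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, de_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(universe_size, de_size)

# Toy data (hypothetical): 20 genes tested, two pathways, 5 DE genes.
universe = {f"gene{i}" for i in range(1, 21)}
pathways = {
    "pathway_A": {f"gene{i}" for i in range(1, 6)},
    "pathway_B": {f"gene{i}" for i in range(6, 16)},
}
de_genes = {"gene1", "gene2", "gene3", "gene4", "gene16"}

results = {}
for name, members in pathways.items():
    members = members & universe  # ignore genes that were never tested
    overlap = len(members & de_genes)
    results[name] = enrichment_p_value(
        len(universe), len(members), len(de_genes), overlap
    )

# Smallest p-value first
for name, p in sorted(results.items(), key=lambda item: item[1]):
    print(name, round(p, 4))
```

As far as I understand, with many pathways the p-values would also need a multiple-testing correction (for example Benjamini-Hochberg), and the universe should probably be the set of genes that were actually tested rather than the whole genome.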
After applying the pathway annotations, I would like to visualize the results in a way that helps me compare the different groups. I am not yet sure how to reach this point, but I believe this is the next step.
Also, is it common to use annotations from only a single database at a time, or can two annotation databases be combined? I have done a little reading on an algorithm known as SetRank, which exists for this purpose. However, I would like to learn the 'usual' methods first.
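To illustrate what I mean by combining databases, here is a minimal sketch of the naive approach: pooling the gene sets under database-prefixed names and checking how much sets from different databases overlap. The pathway names and gene IDs are made up, and this is not how SetRank works; it is only meant to show the redundancy I am worried about.

```python
from itertools import combinations

def combine_databases(databases):
    """Flatten {db_name: {set_name: genes}} into {"db_name:set_name": genes}."""
    return {
        f"{db}:{name}": set(genes)
        for db, gene_sets in databases.items()
        for name, genes in gene_sets.items()
    }

def jaccard(a, b):
    """Shared genes divided by total distinct genes (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy annotations (hypothetical pathway names and gene IDs)
databases = {
    "AraCyc": {"pathway_X": {"gene1", "gene2", "gene3"}},
    "KEGG": {"pathway_Y": {"gene2", "gene3", "gene4"}},
}

combined = combine_databases(databases)

# Report pairs of gene sets from different databases that share genes
for (name_a, set_a), (name_b, set_b) in combinations(sorted(combined.items()), 2):
    if name_a.split(":")[0] != name_b.split(":")[0]:
        print(name_a, name_b, jaccard(set_a, set_b))
```

If two sets overlap heavily, testing both would presumably report nearly the same result twice, which is why I am unsure whether simple pooling is acceptable.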
Ultimately, I would like to perform what I believe is called correlation network analysis. For example, I would like to create a visual network showing how transcription factors are associated with metabolic pathways. I've read a few papers that have done this using Cytoscape: metabolic pathways are represented by circles and transcription factors by triangles. The closer a triangle is to a circle, the 'stronger' the association. In addition, the lines running between transcription factors and metabolic pathways are colored red or blue to represent positive or negative correlation. I do not currently understand how to produce the data required for input into Cytoscape, but I believe I am capable of learning given some direction.
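My rough guess at the kind of input Cytoscape needs is an edge table with one row per transcription factor and pathway pair. Below is a minimal sketch under several assumptions of mine: each pathway is summarized by the per-sample mean expression of its genes, the association is a Pearson correlation, and edges are kept when |r| is at least 0.8. The expression values, names, and threshold are all made up, and the papers I read may have used a different method.

```python
import csv
import io
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

# Toy expression values across 6 samples (hypothetical)
tf_expr = {
    "TF1": [1, 2, 3, 4, 5, 6],
    "TF2": [6, 5, 4, 3, 2, 1],
}
pathway_gene_expr = {
    "pathway_A": {
        "geneA": [2, 4, 6, 8, 10, 12],
        "geneB": [1, 2, 3, 4, 5, 6],
    },
}

# Summarize each pathway as the per-sample mean of its member genes
pathway_score = {
    name: [sum(values) / len(values) for values in zip(*genes.values())]
    for name, genes in pathway_gene_expr.items()
}

threshold = 0.8  # arbitrary cutoff chosen for illustration
edges = []
for tf, tf_values in tf_expr.items():
    for pathway, scores in pathway_score.items():
        r = pearson(tf_values, scores)
        if abs(r) >= threshold:
            sign = "positive" if r > 0 else "negative"
            edges.append((tf, pathway, r, sign))

# Write the edge table as CSV text with source and target columns
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source", "target", "correlation", "sign"])
writer.writerows(edges)
print(buffer.getvalue())
```

If this is the right idea, the `sign` column could drive the red or blue edge color and the absolute correlation could drive the layout distance, but I would appreciate confirmation that this is how such figures are usually produced.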
Most of what I have learned about analyzing RNA-seq data has come from reading journal articles and applying their methods to my lab's data. However, the gaps in my knowledge, and the fact that I have learned the methods out of order, have left me without a sense of direction. I would be extremely grateful for any advice, resource links, or general clarification.