Question

Working out differentially expressed genes


10.2 years ago

dineshtripathy9658 ▴ 10

I have approximately 2,500 lncRNAs and want to find out which of them are differentially expressed. I took the data for the lncRNAs from gene_exp.diff. Some of the FPKM values, in both control and stress, are 0. I read in one paper that you should first normalize the FPKM values by adding 0.0001, then calculate the fold change, and call differentially expressed genes as follows:

Upregulated: fold change >= 2 and p-value <= 0.05

Downregulated: fold change <= 0.5 and p-value <= 0.05

Yet in another paper I read that you should first filter, keeping only genes with FPKM >= 0.1 in any tissue.

Then, after filtering, add 0.0001 to the FPKM values and call upregulated and downregulated genes as above.

My question: which way should I proceed, and what is the difference between the two?

next-gen • 2.7k views

updated 2.5 years ago by Ram 45k • written 10.2 years ago by dineshtripathy9658 ▴ 10


I can't really tell from the information you have given; I am guessing you used Cufflinks. However, FPKM, RPKM and similar measures should always be taken with a pinch of salt. You need to know which tools were used to align the reads, and how the counting was done. It would also help if you could post what samples you have and what conditions you were testing (different tissues, time series, different treatments?).

The logic behind the filter is that a few reads aligned to a gene don't really mean anything (it is the law of large numbers: better coverage/sequencing depth gives a better approximation of the 'real' expression).
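To make the difference between the two procedures from the question concrete, here is a minimal Python sketch. The thresholds (0.0001, 0.1, fold change 2 / 0.5, p <= 0.05) come from the question; the function names and the example gene are made up for illustration, and the p-value is assumed to come from gene_exp.diff rather than being computed here.

```python
PSEUDOCOUNT = 0.0001  # value quoted in the question
MIN_FPKM = 0.1        # filter threshold quoted in the question


def fold_change(control, stress, pseudocount=PSEUDOCOUNT):
    """Stress/control ratio after adding a pseudocount to both values."""
    return (stress + pseudocount) / (control + pseudocount)


def classify(control, stress, p_value, min_fpkm=None):
    """Label a gene 'up', 'down', 'ns' (not significant) or 'filtered'.

    min_fpkm=None mimics the first paper (no expression filter).
    A number mimics the second paper: the gene is kept only if its
    FPKM reaches min_fpkm in at least one of the two samples.
    """
    if min_fpkm is not None and max(control, stress) < min_fpkm:
        return "filtered"
    fc = fold_change(control, stress)
    if p_value <= 0.05 and fc >= 2:
        return "up"
    if p_value <= 0.05 and fc <= 0.5:
        return "down"
    return "ns"


# Hypothetical gene: FPKM 0 in control, 0.05 in stress, p = 0.01
print(fold_change(0.0, 0.05))                         # roughly 501
print(classify(0.0, 0.05, 0.01))                      # no filter
print(classify(0.0, 0.05, 0.01, min_fpkm=MIN_FPKM))   # with filter
```

Without the filter, a gene that is barely detectable in one sample and absent in the other gets a fold change of about 500 purely because of the tiny pseudocount. With the filter it is dropped before the fold change is ever computed, which is the practical difference between the two approaches.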

Typically, differentially expressed genes are shown on an MA plot: the log fold change (M) against the mean expression level (A). If a gene is well expressed and changes a lot, it is a good candidate. Otherwise, you can't conclude much.
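As a rough illustration of what goes on the two axes of an MA plot, the sketch below computes M (log2 fold change) and A (mean log2 expression) for a pair of values. The pseudocount of 1 is an assumption on my part to avoid taking log2 of zero, and the gene names and values are hypothetical.

```python
import math


def ma_values(control, stress, pseudocount=1.0):
    """Return (A, M) for one gene.

    A = mean of the two log2 expression values (x axis)
    M = log2 fold change, stress over control (y axis)
    """
    c = math.log2(control + pseudocount)
    s = math.log2(stress + pseudocount)
    return (c + s) / 2, s - c


# Hypothetical genes: (control, stress) expression values
genes = {"geneA": (0.0, 0.05), "geneB": (50.0, 200.0)}
for name, (control, stress) in genes.items():
    a, m = ma_values(control, stress)
    print(f"{name}: A={a:.2f} M={m:.2f}")
```

A gene like geneB sits far to the right (high A) and well above zero (high M), which makes it a good candidate; geneA sits near the origin, where nothing can be concluded.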

updated 2.5 years ago by Ram 45k • written 10.2 years ago by cyril-cros ▴ 950


Yes, Cufflinks was used. The samples are 3 rice cultivars under the conditions control, desiccation and salinity. So how should I proceed in this case?

updated 2.5 years ago by Ram 45k • written 10.2 years ago by dineshtripathy9658 ▴ 10

Ram · Answer 1 · 2015-05-22

I mainly use R for this task. Download an annotation of the rice gene CDS in GTF/GFF format (though I guess you already have one). You can then use:

- CummeRbund: the follow-up tool in the Cufflinks workflow. It turns your Cufflinks output into a database and allows powerful statistical analysis. See the tools section of the Cufflinks website.
- The I-don't-trust-Cufflinks-a-lot route:
  - Read the quick intro to DESeq2 (http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf). It gives hints on how to get your data into the right format.
  - Do read the HTSeq / bedtools multicov manuals (tools for counting aligned reads, with lots of rules on which reads should be counted or not). This step and the alignment are the most important ones.
  - Go to http://pastebin.com/3FB8vuLp for a model script (courtesy of my teachers) and some nice graphs. It works on two conditions (hypoxic or not, in the yeast C. albicans), so you need to adapt it.

The idea is that with DESeq2 you can choose how you count reads and how you normalize the counts.
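For a feel of what count normalization means here, below is a plain-Python sketch of the median-of-ratios idea that DESeq-style size factors are based on. It is not the package's code, it works on raw read counts rather than FPKM, and the function name and sample data are made up for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order for every sample.
    Returns a dict mapping sample name -> size factor.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean per gene; genes with a zero in any sample are skipped
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median ratio of a sample's counts to the geometric means
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical counts: sample "b" was sequenced about twice as deeply as "a"
raw = {"a": [10, 20, 30, 0], "b": [20, 40, 60, 5]}
print(size_factors(raw))
```

Dividing each sample's counts by its size factor puts the samples on a common scale. The method relies on the assumption that most genes do not change between conditions.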

You can then use tools such as MultiExperiment Viewer (MeV) to cluster your most differentially expressed genes, and run a gene ontology enrichment search (see Go Finder).

These steps are not trivial; I suggest you find someone experienced in your lab to help you. As a joke, my teachers asked my class to find the top 10 most differentially expressed genes in a simple data set, and there were lots of differences between our answers. What matters most is that you understand the assumptions you make at each step (most genes have stable expression across your conditions, you disregard reads that align to several places, etc.). Choosing to keep only genes with a minimal FPKM is one such assumption.

Also, do statistical tests and look at the p-values. You will always get a huge list of candidate genes, so you need to select only the most relevant ones. Use RT-qPCR for confirmation.
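One common way to trim a long candidate list when thousands of genes are tested at once is to adjust the p-values for multiple testing. The answer does not name a method; the Benjamini-Hochberg procedure below is my assumption, written as a small sketch with made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
raw_p = [0.01, 0.04, 0.03, 0.2]
for p, q in zip(raw_p, benjamini_hochberg(raw_p)):
    print(f"p={p:.3f} adjusted={q:.3f}")
```

Genes are then kept when the adjusted value, rather than the raw p-value, falls below the chosen cutoff.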