Question

Differential expression gene analysis

0

Entering edit mode

6.8 years ago

bioinfo456 ▴ 150

I input a count matrix which does not contain any gene id information in it to deseq. I obtained the result table.

Now, how can i obtain the row numbers corresponding to a particular p value, as the row numbers corresponds to gene id in the original data set. Please help.
How can i verify that the subset of genes selected by deseq2 is the best subset?

rna-seq Deseq2 • 2.6k views

ADD COMMENT • link updated 6.8 years ago by h.mon 35k • written 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

How will you get any info if you are using a count matrix without gene id?

ADD REPLY • link 6.8 years ago by Arup Ghosh 3.2k

0

Entering edit mode

Like i already said, row 1 corresponds to first gene in the original dataset. Likewise row 20530 corresponds to 20530th gene. Okay, can you tell me how to feed a data set in excel format to deseq2? Thanks.

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

1

Entering edit mode

Why you removed the gene name in the first place? If you are sure about 20530 will map to 20530th gene then merge the gene_id with res.

ADD REPLY • link 6.8 years ago by Arup Ghosh 3.2k

0

Entering edit mode

I need to train a classifier using the output of deseq2 which is why i decided to go with row numbers that way it is easy for me to extract them. how do you suggest i go ahead?

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

what about the 2nd part of my question? GSEA perhaps?

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

your earlier question states that you've got gene ids present in your excel files. Just add them back into your dataset. Presumably your counts are stored in a DGEList (or similar). So you can just add them into the genes entry of that DGEList, or put them back into the rownames. (ps, GSEA will not validate anything about a classifier; all you can do is split up your dataset and cross-validate or compare to a related study or get back to the bench with your selected gene sets)

ADD REPLY • link 6.8 years ago by russhh 5.7k

GenoMax · Answer 1 · 2018-03-20

1

Entering edit mode

6.8 years ago

h.mon 35k

Now, how can i obtain the row numbers corresponding to a particular p value, as the row numbers corresponds to gene id in the original data set. Please help.

As arup said, you don't need to remove the gene names. If you want to subset by p-value:

dds <- DESeqDataSetFromMatrix( countData = counts, 
                               colData = coldata, 
                               design = ~ treatment )
dds <- DESeq( dds )
res <- results( dds, contrast = c( "treatment", "A", "B" ), alpha = 0.05 )

select <- res[( res$padj < 0.05 & !is.na( res$padj ) ), ]

counts( dds )[ rownames( select ), ]

How can i verify that the subset of genes selected by deseq2 is the best subset?

Define "best".

ADD COMMENT • link 6.8 years ago by h.mon 35k

0

Entering edit mode

Thanks for that sir. I still have a couple of clarifications!!

I intend to do differential expression analysis based on pairs ie; untreated vs treated of the same sample. How do you suggest i go ahead?
I had a total of 20530 genes in the data set. Out of which almost 12000 are up or down regulated. What about the remaining 8000 odd genes?

Thanks in advance.

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

1

Entering edit mode

page 42 of the limma use'rs guide will give you an appropriate design

ADD REPLY • link 6.8 years ago by russhh 5.7k

0

Entering edit mode

Thanks, I'll go through that. Any thoughts about the other questions?

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

no - I didn't see any valid questions. I think you should get the basics done well first.

ADD REPLY • link 6.8 years ago by russhh 5.7k

0

Entering edit mode

May be I'm not making myself clear enough to you.

Which criteria is used to decide whether a gene is up or down regulated?
I had a total of 20530 genes in the data set. Out of which almost 12000 are up or down regulated. What about the remaining 8000 odd genes? Do they fail to reject the null hypothesis? If so, what is the definition of null hypothesis?

I'm pretty sure this is defined somewhere in the manual. Please spare me the time.

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

You define the null hypothesis and you encode it within your contrasts and your p-value thresholds. Since you haven't posted a line of code, I have absolutely no idea how you selected which genes are up or downregulated. In h.mons code genes were filtered after fdr adjustment using a threshold of 0.05.

ADD REPLY • link 6.8 years ago by russhh 5.7k

0

Entering edit mode

#create a dge list
dds <- DGEList(counts = rkich, genes = genes)

#initialize DESeqDataSet
countData <- dds$counts

condition <- factor(rep(c("Tum","Norm"),each=25))

dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)

#main
dds <- DESeq(dds)

res <- results(dds, alpha = 0.05)

summary(res)

out of 19325 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 5988, 31% 

LFC < 0 (down)   : 5990, 31% 

outliers [1]     : 0, 0% 

low counts [2]   : 4, 0.021% 

(mean count < 0)

[1] see 'cooksCutoff' argument of ?results

[2] see 'independentFiltering' argument of ?results

Where can i define the null hypothesis?

ADD REPLY • link updated 6.8 years ago by GenoMax 148k • written 6.8 years ago by bioinfo456 ▴ 150

2

Entering edit mode

In the line res <- results(...) you defined the null hypothesis; this was done implicitly by DESeq2. Based on your design, (dds <- ... ~ condition), and that you haven't specified any alternative contrast of interest inyour workflow, DESeq2 has assumed that you want to check the null that the coefficient corresponding to the second column of your design matrix is zero (curiously this is: Normal minus Tumour, it might be more natural to define the 'levels' of your condition factor more naturally)

your design is wrong for reasons that you have outlined above - the tumour/normal samples are matched. As I said, get the basics fixed before you read too much into your results. That is, put indicators for patient-of-origin into your design. Please read how to do this in the limma user's guide, in the pages I indicated.

ADD REPLY • link 6.8 years ago by russhh 5.7k

0

Entering edit mode

#create a dge list

dds <- DGEList(counts = rkich, genes = genes)

#initialize DESeqDataSet

df <- data.framePatient.ID = factor(rep(1:3,each=2)), Treatment = 

factor(rep(c("Pre","On"),3),levels=c("Pre","On")))

countData <- dds$counts

dds <- DESeqDataSetFromMatrix(countData, DataFrame(df), ~ Patient.ID + Treatment )

#main

dds <- DESeq(dds)


res <- results(dds, name ="Treatment_On_vs_Pre", alpha = 0.05)

summary(res)

out of 18545 with nonzero total read count

adjusted p-value < 0.05

LFC > 0 (up)     : 2531, 14% 

LFC < 0 (down)   : 2010, 11% 

outliers [1]     : 0, 0% 

low counts [2]   : 2145, 12%

Thanks for your advice. I have accordingly changed my design matrix as follows.

"curiously this is: Normal minus Tumour" Should it be the other way around?

The whole purpose of this experiment is to find out the genes which is significantly affected on its transition from before and after treatment. Does the above mentioned design achieve this goal?

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

1

Entering edit mode

Where can i define the null hypothesis?

If you want to use DESeq2:

1) read carefully DESeq2 documentation: Analyzing RNA-seq data with DESeq2

2) ?results

3) search for DESeq2 paired samples - you will find plenty of useful resources, like here or here.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

Correct me if I'm wrong. As for my understanding,

alpha is a thresh hold value. If the p value of a particular gene is less than alpha, the null hypothesis is rejected and the gene becomes a part of the result. Does that mean that the definition of null hypothesis is that all genes are not significant?
What is the significance of adjusted p value?
Is the concept of up or down regulated genes decided by the log fold change value?
What is the significance of plotMA() in this scenario?

Thanks a lot for your help.

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

1

Entering edit mode

Answer to 1) is taken from from DESeq2 documentation I linked above:

DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the estimated standard error of a log2 fold change to test if it is equal to zero, and the likelihood ratio test (LRT). The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero.

Answer to 2) is taken from ?results:

pAdjustMethod: the method to use for adjusting p-values, see ?p.adjust

Answer to 3) is taken from ?results:

The results table when printed will provide the information about the comparison, e.g. "log2 fold change (MAP): condition treated vs untreated", meaning that the estimates are of log2(treated / untreated), as would be returned by contrast=c("condition","treated","untreated"). Multiple results can be returned for analyses beyond a simple two group comparison, so results takes arguments contrast and name to help the user pick out the comparisons of interest for printing a results table. The use of the contrast argument is recommended for exact specification of the levels which should be compared and their order.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

If you could tell me in simple words the answer to my questions I'd be really greatful.

How do i know if a particular gene is up or down regulated?

Thanks for you quick response.

ADD REPLY • link 6.8 years ago by bioinfo456 ▴ 150

0

Entering edit mode

You never really know

ADD REPLY • link 6.8 years ago by russhh 5.7k