Question

Only one gene is upregulated?

1

Entering edit mode

3.5 years ago

mskr_ ▴ 20

I just start learning bioinformatics from RNA-seq/DEGs. I am trying to replicate an article. I very much appreciate any help. I used salmon in combination with tximport and DESeq2. PCA plot showed 77%PC1 8%PC2. Apparently, different conditions are mixed and not well separated. Is it healthy to continue this analysis? PCA

library(tidyverse) 
library(tximport) 
library(DESeq2)

sample_table <- read.csv("SraTable.txt") %>% arrange(sample_table$Run) %>% select(Run)

sample_table$Run <- c("pre_ischemia1","ischemia1","pre_ischemia2","ischemia2","pre_ischemia3", "ischemia3","pre_ischemia4","ischemia4","pre_ischemia5","ischemia5")

SRR10700832_quant/quant.sf sample_files <- paste0(c("SRR10700832_quant","SRR10700833_quant","SRR10700835_quant","SRR10700836_quant","SRR10700838_quant", "SRR10700839_quant","SRR10700841_quant","SRR10700842_quant","SRR10700844_quant","SRR10700846_quant"), "/quant.sf")

names(sample_files) <- pull(sample_table, Run)
gene_map <- read.csv("gene_map.csv", col.names = c("enstid", "ensgid")) 
count_data <- tximport(files = sample_files, type="salmon", tx2gene= gene_map, ignoreTxVersion = TRUE)
sample_table <- as.data.frame(sample_table)
sample_table$condition <- factor(rep(c("pre_ischemia", "ischemia"), times=5), levels = c("pre_ischemia", "ischemia"))

deseq_dataset <- DESeqDataSetFromTximport(txi= count_data, colData = sample_table, design = ~condition) 
deseq_dataset <- DESeq(deseq_dataset)
vst <- varianceStabilizingTransformation(deseq_dataset)

plotPCA(vst, intgroup='condition')
results <- results(deseq_dataset)

summary(results)

res <- results(deseq_dataset) yielded following;

out of 38464 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 1, 0.0026%
LFC < 0 (down)     : 0, 0%
outliers [1]       : 263, 0.68%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

There is only 1 gene upregulated, zero down and zero low counts which are not likely. Which step could be the problem ? I would like to know sanity checkpoint in the entire flow.

Thank you for your time

P.S sorry it this hurts your eyes. I couldnt format the code in the post.

DESeq2 gene low R • 2.1k views

ADD COMMENT • link updated 3.5 years ago by Carlo Yague 8.9k • written 3.5 years ago by mskr_ ▴ 20

1

Entering edit mode

what is the result of plotPCA(vst, intgroup='condition') from your code?

ADD REPLY • link 3.5 years ago by cpad0112 21k

0

Entering edit mode

Hi, 77%PC1 8%PC2 I added the plot

ADD REPLY • link 3.5 years ago by mskr_ ▴ 20

0

Entering edit mode

p-value < 0.1

ADD REPLY • link 3.5 years ago by tothepoint ▴ 940

0

Entering edit mode

Hi thank you for the reply, I changed it to 0.05, but there is no difference since there is one gene that fulfills that condition.

ADD REPLY • link 3.5 years ago by mskr_ ▴ 20

0

Entering edit mode

How can anyone help you with code we cannot see?

ADD REPLY • link 3.5 years ago by swbarnes2 14k

0

Entering edit mode

Hi thank you for your reply, since it didn't give any error I thought there is something fundamentally wrong. I updated the code

ADD REPLY • link 3.5 years ago by mskr_ ▴ 20

0

Entering edit mode

@mskr_ You can enable proper code formatting following the markdown rules, so wrapping code blocks in ``` or by selecting the code chunk and then using the 10101 button. We made the changes for you now :)

ADD REPLY • link 3.5 years ago by ATpoint 85k

0

Entering edit mode

Hi, I placed the long code in a place after clicking 10101 button but it didnt update. Just trying to figure out, sorry and thank you :)

ADD REPLY • link 3.5 years ago by mskr_ ▴ 20

1

Entering edit mode

3.5 years ago

Carlo Yague 8.9k

Hi mskr_,

Your interpretation of the PCA plot is correct. The dispersion of samples in the PC1-PC2 space, not clustering according to condition, indicates that the residual variability is much higher than the factorial variability (condition effect) in your sample. In consequence, it is not surprising that you find only one DEG in the differential expression analysis.

There are multiple hypothesis that might explain the lack of condition effect:

There could be truly few difference between the transcriptome of pre-ischemia and ischemia cells. Not much you can do about this.
The sample labeling might have been mixed up. You need to double check that SRR10700832 is really pre-ischemia and so forth.
There could be another source of variability (batch effect, etc...) that strongly affects the measure of the transcriptome and hides the true signal. If you identify such an explainable source of variability, you might want to include that in the model to control for it which will possibly improve your results.

In the article you try to replicate, were there many DEG ? Have you read in detail how they analyzed their data ?

ADD COMMENT • link 3.5 years ago by Carlo Yague 8.9k

0

Entering edit mode

Thank you for the detail explanation. As ATpoint pointed out , they filter based on pvalue, rather than padj. Then it fits.

Just a curiosity what pc1 /pc2 values are off the limits? Or you would consider too much?

ADD REPLY • link 3.5 years ago by mskr_ ▴ 20

1

Entering edit mode

Units of PC are difficult to interprets since they reflect the combination of multiple variable (thousands variable in this case). So usually I don't really look at PC values, just at the percentage of variance explained by each component, and how the samples are distributed according to the PCs. If you want more insight into PCA interpretation in the context of RNA-seq, there are multiple resources out there, including this one : https://hbctraining.github.io/DGE_workshop/lessons/03_DGE_QC_analysis.html

ADD REPLY • link 3.5 years ago by Carlo Yague 8.9k

score 3 · Accepted Answer · 2021-05-28

Hello,

I just played with these data a bit as the salmon quantification directories were available at GEO, so this was no effort.

It seems the authors filtered on p-value based on the methods text...

The significantly differentially expressed genes (DEGs) between groups were defined at cut-off criteria of |log2 fold-change| ≥ 1 and p-value < 0.05.

...instead of padj. When you check some of the top20 DEGs that they provide in Supp. Table S3 you will see that these "significant" ones only pass the p-value cutoff but all have a padj >>> 0.05, so there indeed simply are no DEGs at standard cutoffs, at least not without correcting for potential batch effects. I did not look into the paper at any detail, but this at least seems to "replicate" the results they have. Not discussing now whether this strategy they chose was good or not.