Question

Gene labels problem in enhanced volcano

0

17 months ago

anasjamshed ▴ 140

I have gene expression data from which I selected 16 genes stored in the df4 variable and tried to make an enhanced volcano:

library(EnhancedVolcano)

# Specify target genes for labeling
target_genes <- c(
  "ACTN2", "CRYAB", "BMP10",
  "CSRP3", "DES", "FHOD3",
  "FLNC", "LDB3", "MYZAP",
  "MYPN", "MYOZ2", "NEXN",
  "PDLIM3", "PDLIM5",
  "TCAP", "TTN"
)

# Subset data frame to include only target genes
df4 <- subset(df3, gene %in% target_genes)

# Create EnhancedVolcano plot with labels for the subset of target genes

# Assuming your data frame has a column named 'gene' for gene names
p8 <- EnhancedVolcano(
  df4,
  lab = rownames(df4),  # Labels come from the row names of df4
  x = 'logFC',
  y = 'P.Value',
  title = 'Dilated Cardiomyopathy vs Control',
  pCutoff = 0.05,
  FCcutoff = 0,
  pointSize = 4.0,
  labSize = 3.0
)
p8
ggsave(filename = "enhancedvolcano.jpeg", plot = p8)

The problem is that it does not label the green dots, where only one of the two conditions is true:

[image: enhvolc]

How can I solve this problem? Do I need to pass other arguments to the EnhancedVolcano function?

Also, rownames(df4) gives the following gene labels:

'PDLIM3' 'FHOD3' 'TTN' 'FLNC' 'LDB3' 'MYOZ2' 'ACTN2' 'CSRP3' 'CRYAB' 'PDLIM5'

R ggplot2 enhancedvolcano • 4.8k views

ADD COMMENT • link updated 17 months ago by Mensur Dlakic ★ 29k • written 17 months ago by anasjamshed ▴ 140

Ram · Answer 1 · 2024-02-01

3

17 months ago

Ram 45k

Always read the manual first.

EnhancedVolcano has a way to handle your use case without you needing to do much: https://bioconductor.org/packages/release/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html#only-label-key-variables

Also, don't draw just your genes of interest. Draw everything and highlight your genes of interest - this is what I talked about in your previous post.

ADD COMMENT • link 17 months ago by Ram 45k

0

Yes, I also tried to draw all genes and label only the genes of interest, like this:

# Create an EnhancedVolcano plot with labels for the subset of target genes
p8 <- EnhancedVolcano(
      df3,
      lab = as.character(subset_df$gene),  # Extract gene names as a character vector
      x = 'logFC',
      y = 'P.Value',
      title = 'Dilated Cardiomyopathy vs Control',
      pCutoff = 0.05,
      FCcutoff = 0,
      pointSize = 3.0,
      labSize = 6.0
    )

but it gives me the following error:

Error in `$<-.data.frame`(`*tmp*`, "lab", value = c("PDLIM3", "FHOD3",  : 
  replacement has 10 rows, data has 3882

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

2

Did you even read the manual section I linked to? I've highlighted the relevant parts.

In many situations, people may only wish to label their key variables / variables of interest. One can therefore supply a vector of these variables via the ‘selectLab’ parameter, the contents of which have to also be present in the vector passed to ‘lab’.

ADD REPLY • link 17 months ago by Ram 45k
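
The error and the manual excerpt above point at the same rule: `lab` needs one entry per row of the plotted data frame, while `selectLab` carries the short list of names to show. A minimal base-R sketch on a made-up data frame (`toy`, `short_lab`, and all values are hypothetical, not the poster's data) reproduces the length mismatch and then builds a valid pair of vectors:

```r
# Hypothetical toy data: 7 rows standing in for the full results table (df3)
toy <- data.frame(
  gene    = c("TTN", "DES", "FLNC", "NEXN", "TCAP", "LDB3", "MYPN"),
  logFC   = c(0.3, -1.4, 0.1, 1.2, -0.2, 0.4, -0.6),
  P.Value = c(0.20, 0.001, 0.60, 0.01, 0.04, 0.03, 0.50)
)
target_genes <- c("TTN", "FLNC", "LDB3", "ACTN2")  # ACTN2 is absent from toy

# Assigning a label vector shorter than nrow(toy) fails, as in the thread
short_lab <- c("TTN", "FLNC", "LDB3")
err <- tryCatch(
  { toy$lab <- short_lab; NA_character_ },
  error = function(e) conditionMessage(e)
)
print(err)

# Valid inputs: one label per row, plus the subset that should be shown
lab_full   <- as.character(toy$gene)
select_lab <- intersect(target_genes, lab_full)
print(select_lab)
```

Using `intersect` also drops any target gene that is missing from the table, which is one way to see why fewer labels appear than were requested.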

0

Yes, I tried this:

# Assuming your data frame has a column named 'gene' for gene names
p8 <- EnhancedVolcano(
  df3,
  lab = df3$gene,  # Use the 'gene' column for labels
  x = 'logFC',
  y = 'P.Value',
  title = 'Treatment vs Control',
  selectLab = c("ACTN2", "CRYAB", "CSRP3", "FHOD3", "FLNC", "LDB3", "MYOZ2", "PDLIM3", "PDLIM5", "TTN"),
  #xlab = bquote(~Log[2]~ 'fold change'),
  pCutoff = 0.05,
  FCcutoff = 0,
  pointSize = 2,
  labSize = 2,
  colAlpha = 0.7,
  legendPosition = 'right',
  legendLabSize = 8,
  legendIconSize = 3
)
p8
ggsave(filename = "enhancedvolcano.jpeg", plot = p8)

and it gives me this plot: [image: emhvolc]

Still, it shows only 5 genes. How can I show all 10 available genes?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

1

They could be right on top of each other. Try the boxedLabels and drawConnectors options. Also, it looks weird that your volcano plot is all green and red. Don't pick meaningless thresholds for logFC/p-value, use sensible thresholds so most of the dots are grey.

ADD REPLY • link 17 months ago by Ram 45k
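
A sketch of the suggestion above, assuming the EnhancedVolcano package is installed. The data frame `toy` is randomly generated stand-in data, and the gene names `g1` to `g200` and the `picked` vector are hypothetical. `drawConnectors` and `boxedLabels` are the options named in the reply; `FCcutoff = 1` follows the advice given later in this thread.

```r
set.seed(1)
# Hypothetical stand-in for the full results table (df3 in the thread)
toy <- data.frame(
  gene    = paste0("g", 1:200),
  logFC   = rnorm(200, mean = 0, sd = 1),
  P.Value = runif(200)
)
picked <- c("g1", "g2", "g3", "g4", "g5")  # hypothetical genes of interest

# Only draw the plot when the package is available
if (requireNamespace("EnhancedVolcano", quietly = TRUE)) {
  p <- EnhancedVolcano::EnhancedVolcano(
    toy,
    lab = toy$gene,          # one label per row of toy
    x = "logFC",
    y = "P.Value",
    selectLab = picked,      # restrict labels to these names
    pCutoff = 0.05,
    FCcutoff = 1,
    drawConnectors = TRUE,   # connect each label to its point
    widthConnectors = 0.5,
    boxedLabels = TRUE       # draw each label inside a box
  )
  print(p)
}
```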

0

What is a sensible threshold in my case?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

2

Maybe use 1?

ADD REPLY • link 17 months ago by Ram 45k

0

For upregulated genes, we normally take logFC > 0 and p-value < 0.05, and for downregulated genes, we take logFC < 0 and p-value < 0.05. Is this a standard threshold?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

2

Ram already answered your question:

Maybe use 1?

It is important to understand the meaning of LogFC, and it doesn't seem like you do. LogFC of 0 means fold-change of 1, which is identical expression (or no fold-change between the conditions). To pick a LogFC > 0 literally means that any fold-change greater than 1 will be picked for coloring, and that is going to be pretty much all the points in your plot. Even a fold-change of 1.00001 between the two conditions will be colored, which makes no sense.

What Ram suggested, a LogFC > 1, means to color only genes where the difference in expression is at least 2-fold, which makes more sense and in general is an accepted threshold. In practical terms as it relates to your plot, that means only points to the right of +1 on the X-axis will be colored, and only points to the left of -1 on the X-axis will be colored. That should clear the picture so that hopefully your genes of interest are visible.

ADD REPLY • link 17 months ago by Mensur Dlakic ★ 29k
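
The arithmetic behind this explanation can be checked directly, assuming logFC is on a log base 2 scale (as the 2-fold reading above implies). The values below are made up for illustration:

```r
# Hypothetical log2 fold changes
logFC <- c(-1.5, -0.5, 0.00001, 0, 0.5, 1, 2)

# Back-transform to plain fold change: FC = 2^logFC
fold_change <- 2^logFC
print(round(fold_change, 3))

# Which points clear an absolute logFC cutoff?
passes_0 <- abs(logFC) > 0   # everything except an exact zero
passes_1 <- abs(logFC) > 1   # only changes larger than 2-fold
print(sum(passes_0))
print(sum(passes_1))
```

With a cutoff of 0, even the value 0.00001 passes, which is the reason the earlier plot was almost entirely colored.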

0

I changed FCcutoff to 1 and it gives the following plot: [image]

But no red dots and result looking meaningless

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

1

no red dots

There are plenty of red dots. You're only supposed to have a small number of genes that pass that threshold.

result looking meaningless

On the contrary, your earlier plot was meaningless. This one looks like every other Volcano plot out there.

ADD REPLY • link 17 months ago by Ram 45k

0

NS simply means that neither the logFC nor the p-value condition is satisfied. Blue dots show that only the p-value condition is satisfied, which is true for 2 genes. So how should I interpret this? Which genes are statistically significant? Just 2 genes?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

2

The blue dots are still one level below the red ones in significance. You should be looking at creating genes of interest from the DE genes, not the other way around. This plot shows me that a couple of your genes of interest are minimally DE to a good level of certainty (wouldn't even count as DE for the most part but we're scraping the bottom of the barrel) and none of them are significantly differentially expressed to a good level of certainty. That is, there are a few genes that have (pval < 0.05 AND 0 < abs(logFC) < 1) but none where (pval < 0.05 AND abs(logFC) > 1). I'd also look at the padj: with adjusted p-values, there are probably no genes among your genes of interest that are actually blue or red.

ADD REPLY • link 17 months ago by Ram 45k
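
The two conditions in the reply above can be tabulated directly. This is a sketch on made-up numbers (`res`, its gene names, and its values are hypothetical). The category names are illustrative, and `p.adjust` with the Benjamini-Hochberg method is one common way to obtain adjusted p-values when the table has no padj column.

```r
# Hypothetical results table
res <- data.frame(
  gene    = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  logFC   = c(0.30, -0.20, 1.80, -1.30, 0.05),
  P.Value = c(0.010, 0.030, 0.001, 0.200, 0.700)
)

p_ok  <- res$P.Value < 0.05      # p-value condition
fc_ok <- abs(res$logFC) > 1      # fold-change condition

# Four categories: both conditions, one of them, or neither
res$category <- ifelse(p_ok & fc_ok, "p-value and logFC",
                ifelse(p_ok,         "p-value only",
                ifelse(fc_ok,        "logFC only", "NS")))

# Adjust for multiple testing (Benjamini-Hochberg); this should be computed
# on the full table of tested genes, not on a hand-picked subset
res$padj <- p.adjust(res$P.Value, method = "BH")

print(res)
```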

0

So can we say these 2 genes, PDLIM3 and FHOD3, are statistically significant?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

1

Sure, but significant for what? Their logFC is too low for them to matter. Please read Mensur Dlakic's excellent summary of why none of your genes are actually meaningful.

ADD REPLY • link 17 months ago by Ram 45k

0

so if none of my genes are meaningful then what should I do? Should I change logfc and pvalue? How can I include it in my study?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

1

so if none of my genes are meaningful then what should I do? Should I change logfc and pvalue?

You are asking if you should change the meaning of the word "meaningful" if there is nothing meaningful in your current question. Please think about what you're saying.

How can I include it in my study?

You seem to have made up your mind about these genes being important to you regardless of their significance as observed in your experiment. No one can help you there.

Like I said earlier, "You should be looking at creating genes of interest from the DE genes, not the other way around"

ADD REPLY • link 17 months ago by Ram 45k

0

But my target genes are

PDLIM3
FHOD3
TTN
FLNC
LDB3
MYOZ2
ACTN2
CSRP3
CRYAB
PDLIM5

First I found them in 6 GEO samples and then did the analysis between 2 conditions.

ADD REPLY • link updated 17 months ago by Ram 45k • written 17 months ago by anasjamshed ▴ 140

0

I did the test and found:

data:  diffexp_data$logFC
t = 1.5344, df = 9, p-value = 0.1593
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.04855688  0.25329738
sample estimates:
mean of x 
0.1023703

The data that I used was 3882 differentially expressed genes

ADD REPLY • link updated 17 months ago by Ram 45k • written 17 months ago by anasjamshed ▴ 140

0

There are 3882 DE genes? DE genes should have both abs(logFC) > 1 AND pval < 0.05.

What means are you comparing?

ADD REPLY • link 17 months ago by Ram 45k
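
For reference, output of the shape posted above is what R's `t.test` prints when given a single numeric vector: a one-sample test of whether the mean of that vector differs from 0. A sketch with made-up logFC values for 10 genes (not the poster's data):

```r
# Hypothetical logFC values for 10 genes
logFC <- c(0.31, 0.12, -0.05, 0.22, 0.08, -0.18, 0.40, 0.02, 0.15, -0.03)

# One-sample t-test: the null hypothesis is that the mean logFC is 0
tt <- t.test(logFC)
print(tt)

# Degrees of freedom are n - 1
print(unname(tt$parameter))
```

Such a test asks whether the average logFC across the chosen genes is nonzero. It does not test whether any individual gene is differentially expressed between the two conditions.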

0

Basically, from 3882 genes, I selected 10 target genes between 2 conditions (dilated cardiomyopathy vs control) and ran the t-test on the logFC column. It gives the following results:

data:  diffexp_data$logFC
t = 1.5344, df = 9, p-value = 0.1593
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.04855688  0.25329738
sample estimates:
mean of x 
0.1023703

ADD REPLY • link updated 17 months ago by Ram 45k • written 17 months ago by anasjamshed ▴ 140

0

Please consult a statistician - I cannot help you any more. Here's the last piece of direction I have for you - one of your genes is TTN, which is the longest protein in the human proteome and not accounting for that WILL influence your observations.

ADD REPLY • link 17 months ago by Ram 45k

0

Every observation is separate, so how can TTN affect my results?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

0

Please consult a statistician. I cannot help you any more.

ADD REPLY • link 17 months ago by Ram 45k

0

Thanks. Can I take logFC as 0.5?

ADD REPLY • link 17 months ago by anasjamshed ▴ 140

1

None of the genes you seem to be interested in would have significantly changed expression even if you take LogFC=0.5, as all of their absolute LogFC values are < 0.5. In case you don't know what that means, LogFC=0.5 means a 1.41-fold change in expression (2^0.5). As to whether you could take that cutoff: some people will accept a 1.41-fold change as significantly changed if it is also significant according to p-values, and others will not. Many people, myself included, like to see at least a 2-fold change in expression. In your case it doesn't matter much because only two genes satisfy the p-value cutoff, and their absolute LogFC values are small.

You have received plenty of feedback here, and your eyes should be telling you something that your brain possibly refuses to accept. There is no point in asking the same question but in different ways, which is pretty much what you have been doing for the past 2-3 days. I suggest you read some DE papers and tutorials on the internet and hopefully it will become clear why in your results there is not much that will inspire confidence in most people.

ADD REPLY • link 17 months ago by Mensur Dlakic ★ 29k