Question

scatterplot in R

2.2 years ago

bioinformatics ▴ 40

Hi,

Would anyone be able to help me add an error bar to my scatterplot of expression values derived from a microarray differential expression analysis?

Please find below the commands I have used to make the graph:

library(ggplot2)

path <- "/Users/DDLPS.csv"
df <- read.csv(path, header = TRUE, sep = ",")

# One point per sample, coloured by tumour type
Plot <- ggplot(df, aes(Samples, Expression.value, colour = Tumour.type)) + geom_point()
print(Plot + ggtitle("Gene expression differences of X between WDLPS and DDLPS"))

Expression values are in the table below.

 head(df)
            Samples Expression.value Tumour.type
    1 GSM766533.CEL        10.013128       DDLPS
    2 GSM766534.CEL         9.293059       DDLPS
    3 GSM766535.CEL        10.821439       DDLPS
    4 GSM766536.CEL        10.494755       DDLPS
    5 GSM766537.CEL        10.736248       DDLPS
    6 GSM766538.CEL        10.067121       DDLPS

I have tried to add the error bar with the following command but received an error message:

 TIMP1Plot <- ggplot(df, aes(Expression.value, Samples, colour = Tumour.type)) + geom_point() +  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 0.05)

Error in mean - sd : non-numeric argument to binary operator

Can anyone help me correct this?

Thanks!

microarray expression gene • 2.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 2.2 years ago by bioinformatics ▴ 40

Why not simply put a geom_boxplot alongside the scatter?

ADD REPLY • link 2.2 years ago by Asaf 10k

Thanks! Where exactly do I put the geom_boxplot in the command line?

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

+ geom_point() + geom_boxplot()

ADD REPLY • link 2.2 years ago by Asaf 10k

OK, thanks. I have now done this; however, the error bar appeared on the legend, not on the points. If I wanted to add an error bar for each point on the plot, does the command change?

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

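Sorry, I didn't read your question through. The expression in geom_errorbar is evaluated against df, so df needs columns named mean and sd. Without them, mean and sd resolve to the base R functions of the same names, which is why subtracting one from the other fails.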
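
The error can be reproduced outside ggplot. This is only a small sketch (the variable name err is illustrative): with no data columns called mean and sd in scope, R finds the functions and cannot subtract them.

```r
# With no columns named mean/sd in scope, the names resolve to base R functions,
# and subtracting one function from another raises the error seen in the question.
err <- tryCatch(mean - sd, error = function(e) conditionMessage(e))
print(err)
```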

ADD REPLY • link 2.2 years ago by Asaf 10k

In response to this comment: You've got a bit of a mess here. First of all, you switched the parameters Samples and Expression.value: Samples should be on the X axis (first parameter) and Expression.value on the Y axis (second). Next, what is it that you're plotting? Is it a single gene in multiple samples? What is the meaning of the mean and SD here? If you want the mean and SD of the expression in all the samples, then geom_boxplot should give you this (it actually gives you something a bit different, see https://www.r-bloggers.com/2012/06/whisker-of-boxplot/).

Thank you for your feedback. I have now switched Samples and Expression values to the correct parameters. I'm plotting the expression values of a single gene in 92 samples, which are classed as either WDLPS or DDLPS tumours. I have calculated the mean expression of all samples and then the SD using the mean.

I have used the following commands:

TIMP1Plot <- ggplot(df, aes(Expression.value, Samples, colour = Tumour.type)) + geom_point() + geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 0.05)

and I still get a graph with all the error bars spread across one point on the y axis.

head(df)
            Samples Expression.value Tumour.type     mean         sd
    1 GSM766533.CEL        10.013128       DDLPS 10.69257 0.48043835
    2 GSM766534.CEL         9.293059       DDLPS 10.69257 0.98960439
    3 GSM766535.CEL        10.821439       DDLPS 10.69257 0.09112390
    4 GSM766536.CEL        10.494755       DDLPS 10.69257 0.13987658
    5 GSM766537.CEL        10.736248       DDLPS 10.69257 0.03088478
    6 GSM766538.CEL        10.067121       DDLPS 10.69257 0.44225940

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

What is the meaning of sd for an individual point? SD is relevant when considering a population. You can compute the mean and SD of the populations you have here (samples in the two tumor types).

ADD REPLY • link 2.2 years ago by Asaf 10k

The sd of an individual point is how far the point (the expression value of the gene in that sample) lies from the mean expression of all the samples. Also, the smaller the bar, the more reliable the value, and the larger the bar, the less reliable. To calculate this I used the following formula: =STDEV.S(B2:D2), where B2 = mean expression and D2 = expression value of the particular sample.

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

You're misusing this formula. STDEV.S estimates the standard deviation of a population from a sample of its values. What you are computing is basically the distance of each point from the mean, which is meaningless in this context.
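
To see why, note that the sample standard deviation of just two numbers reduces to their absolute difference divided by sqrt(2). A quick check in R, using the first row of the table above (the variable names are only illustrative, and R's sd() uses the same n - 1 formula as STDEV.S):

```r
m <- 10.69257    # "mean" column, first row of the table
x <- 10.013128   # Expression.value, first row of the table

sd_two <- sd(c(m, x))                 # sample SD of the two cells
dist_scaled <- abs(m - x) / sqrt(2)   # distance from the mean, rescaled

print(c(sd_two = sd_two, dist_scaled = dist_scaled))
```

Both numbers agree with each other and, up to rounding of the mean, with the 0.48043835 in the sd column, so that column carries no information beyond how far each sample is from the overall mean.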

ADD REPLY • link 2.2 years ago by Asaf 10k

Ok thanks for your help. How might I correctly calculate the SD?

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

Compute it using all the expression values in the population (samples in each tumor type I assume). You will end up with one value for each population, alongside one mean value for each population.
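
As a rough sketch of that calculation in R, assuming the column names from the head(df) output earlier in the thread (the toy values and the name group_stats are made up for illustration):

```r
# Toy data standing in for the real table; the values are invented
df <- data.frame(
  Samples = paste0("S", 1:6),
  Expression.value = c(10.0, 9.3, 10.8, 10.5, 10.7, 10.1),
  Tumour.type = rep(c("DDLPS", "WDLPS"), each = 3)
)

# One mean and one SD per tumour type, computed from all samples in that group
group_mean <- tapply(df$Expression.value, df$Tumour.type, mean)
group_sd   <- tapply(df$Expression.value, df$Tumour.type, sd)

# Collect the summaries in their own small table, separate from the per-sample df
group_stats <- data.frame(
  Tumour.type = names(group_mean),
  mean = as.numeric(group_mean),
  sd = as.numeric(group_sd)
)
print(group_stats)
```

The result has one row per tumour type rather than one row per sample.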

ADD REPLY • link 2.2 years ago by Asaf 10k

Ok thanks. I have ended up with a mean value and sd for each population. Is this table correct?

head(df)
        Samples Expression.value Tumour.type mean.DDLPS   sd.DDLPS mean.WDLPS sd.WDLPS
1 GSM766533.CEL        10.013128       DDLPS   10.82059 0.09052158    10.5941 5.306758
2 GSM766534.CEL         9.293059       DDLPS         NA         NA         NA       NA
3 GSM766535.CEL        10.821439       DDLPS         NA         NA         NA       NA
4 GSM766536.CEL        10.494755       DDLPS         NA         NA         NA       NA
5 GSM766537.CEL        10.736248       DDLPS         NA         NA         NA       NA
6 GSM766538.CEL        10.067121       DDLPS         NA         NA         NA       NA

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

Each row is one sample, so the mean and SD should not be part of the table. There are elegant ways to plot a summary of a population with ggplot. A boxplot is the most common one (it shows the median and quartiles rather than the mean and SD, but it still describes the population).
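
A minimal sketch of the boxplot approach, assuming the tumour type rather than the sample goes on the x axis so that each box summarises a whole group (the toy values and the name p_box are made up):

```r
library(ggplot2)

# Toy data (invented values) with the same column names as the thread's table
df <- data.frame(
  Samples = paste0("S", 1:8),
  Expression.value = c(10.0, 9.3, 10.8, 10.5, 10.7, 10.1, 10.9, 10.4),
  Tumour.type = rep(c("DDLPS", "WDLPS"), each = 4)
)

# Tumour.type on the x axis gives each box a whole population to summarise
p_box <- ggplot(df, aes(Tumour.type, Expression.value, colour = Tumour.type)) +
  geom_boxplot(outlier.shape = NA) +     # hide outlier marks; raw points are drawn below
  geom_jitter(width = 0.1, height = 0)   # spread points horizontally only

print(p_box)
```

With Samples on the x axis instead, every box would contain a single observation, which is probably why the boxplot seemed to show up only in the legend.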

ADD REPLY • link 2.2 years ago by Asaf 10k

Ok thanks for your response. If the sd and mean values are not part of the table where should I put them?

I'm hoping to end up with 2 error bars, one across the WDLPS points and another across the DDLPS points.

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40

You can use stat_summary with the function mean_cl_boot for instance.
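
For reference, mean_cl_boot relies on the Hmisc package and draws a bootstrap confidence interval of the mean. If the goal is one mean ± SD bar per tumour type, a small helper can be passed to stat_summary instead. The sketch below is only one way to do it: mean_sd is a hypothetical helper, not a ggplot2 function, and the data values are invented.

```r
library(ggplot2)

# Toy data (invented values) with the same column names as the thread's table
df <- data.frame(
  Samples = paste0("S", 1:8),
  Expression.value = c(10.0, 9.3, 10.8, 10.5, 10.7, 10.1, 10.9, 10.4),
  Tumour.type = rep(c("DDLPS", "WDLPS"), each = 4)
)

# Hypothetical helper: returns the y/ymin/ymax columns that fun.data expects
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

p <- ggplot(df, aes(Tumour.type, Expression.value, colour = Tumour.type)) +
  geom_jitter(width = 0.1, height = 0) +                    # individual samples
  stat_summary(fun.data = mean_sd, geom = "errorbar",
               width = 0.2, colour = "black") +             # mean +/- SD per group
  stat_summary(fun.data = mean_sd, geom = "point",
               size = 3, colour = "black")                  # group mean

print(p)
```

Because the summary is computed by stat_summary from the raw values, df needs no mean or sd columns at all.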

ADD REPLY • link 2.2 years ago by Asaf 10k

Thanks for your help, it still didn't work.

ADD REPLY • link 2.2 years ago by bioinformatics ▴ 40