Question

Too Many Deg With EdgeR Output?

0

Entering edit mode

11.2 years ago

biotech ▴ 570

Hi there,

I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. Two samples to compare and no replicates. Reads were generated with Ion Torrent PGM using 316 chip. One for each sample and performed in different days.

Since I had too many differentially expressed genes, ¿Should I be more conservative assigning edgeR dispersion value? Also, there are considerable more up-regulated genes in exvivo than in plate sample.

See logFC_vs_logCPM figure:

https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp=sharing

Thanks for you help, Bernardo

tmap code:

tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose

tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose

Flasgstat:

exvivo:

3240242 + 0 in total (QC-passed reads + QC-failed reads)

2132481 + 0 mapped (65.81%:nan%)

plate:

3774075 + 0 in total (QC-passed reads + QC-failed reads)

3510438 + 0 mapped (93.01%:nan%)

count:

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID exvivo.sam HPNK.gff > exvivo.counts

python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID plate.sam HPNK.gff > plate.counts
count stats:

ex-vivo stats

no_feature 777946

ambiguous 1

too_low_aQual 0

not_aligned 1107761

alignment_not_unique 0

plate stats

no_feature 776707

ambiguous 47

too_low_aQual 0

not_aligned 263637

alignment_not_unique 0

edgeR code:

library(edgeR)

files <- dir(pattern="*\.counts$")

RG <- readDGE(files, header=FALSE)

RG

keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one count per million (cpm) in at least three samples

RG <- RG[keep,]

dim(RG)

RG <- calcNormFactors(RG)

RG$samples

plotMDS(RG)

bcv <- 0.2 #Assigned dispersion value of 0.2

m <- as.matrix(RG)

d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample number. Also can be adapted to replicated samples, see'?DGEList'.

d

et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, pair=(1:2),dispersion=bcv^2)

et

top <- topTags(et)

top

cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes:

summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR is given by'decideTestsDGE'.

[,1]

-1 200

0 1176

1 769

Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 are down-regulated.

detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% FDR

plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% FDR

abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes.

title("plate vs ex-vivo")

dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf##

edger rna-seq bacteria ion-torrent • 5.8k views

ADD COMMENT • link 10.8 years ago by biotech ▴ 570

score 3 · Answer 1 · 2013-10-21

3

Entering edit mode

11.2 years ago

Sean Davis 27k

First, you simply need to do replicates. There is no way around this fact if you want a robust, reproducible result. That said, a gene list can be ranked by FDR, so you are free to choose an FDR of 0.0001 or 0.01, or 0.1 if you like to generate a list of reasonable size; 0.05 is not a magical value. If you choose not to do replicates, then plan on a lot of PCR to validate your findings.

ADD COMMENT • link 11.2 years ago by Sean Davis 27k

0

Entering edit mode

How many PCRs should I perform if I don't do replicates?

ADD REPLY • link 11.2 years ago by biotech ▴ 570

0

Entering edit mode

When publishing, is it possible to avoid PCR validation having some replicates or is always necessary to do some PCRs?

ADD REPLY • link 11.2 years ago by biotech ▴ 570

0

Entering edit mode

There is no rule here. The important message is that to believe data, one needs some level of replication. How important a role PCR will play depends on the experimental system and hypotheses being tested. The goal of both PCR and replication is to reduce the cost of false positive results; without one or the other (or both), the cost could be high.

ADD REPLY • link 11.2 years ago by Sean Davis 27k

score 2 · Answer 2 · 2014-04-14

2

Entering edit mode

10.7 years ago

Irsan ★ 7.8k

The amount of DEGs you get out is reallly dependent on the dispersion you manually put in. Have you put all your samples in one group, only keep 250 housekeeping genes and estimate the common dispersion? That dispersion estimate is the one you should put in. Also have a look at the edgeR manual to see a more detailed explanation what to do when you have no replicates (which is a bad idea already mentioned by Sean. If budget was the reason for not including replicates, its better to reduce sequencing yield per sample and include replicates).

ADD COMMENT • link 10.7 years ago by Irsan ★ 7.8k

0

Entering edit mode

Hi Irsan, thank you for your answer, it's really helpful, specially because you pointed out the detail about how to pick the dispersion value. I'm much obliged to you.

ADD REPLY • link 10.7 years ago by biotech ▴ 570