Too Many Deg With EdgeR Output?
2
0
Entering edit mode
11.1 years ago
biotech ▴ 570

Hi there,

I'm dealing with bacterial RNA-seq analysis. My experiment is very simple. Two samples to compare and no replicates. Reads were generated with Ion Torrent PGM using 316 chip. One for each sample and performed in different days.

Since I had too many differentially expressed genes, ¿Should I be more conservative assigning edgeR dispersion value? Also, there are considerable more up-regulated genes in exvivo than in plate sample.

See logFC_vs_logCPM figure:

https://docs.google.com/file/d/0B8-ZAuZe8jldY0Y5SWJSdXh1WGc/edit?usp=sharing

Thanks for you help, Bernardo

  • tmap code:

tmap map2 -f HPNK_clean.fsa -r exvivo.fastq -i fastq -s exvivo.sam --verbose

tmap map2 -f HPNK_clean.fsa -r plate.fastq -i fastq -s plate.sam --verbose

  • Flasgstat:

exvivo:

3240242 + 0 in total (QC-passed reads + QC-failed reads)

2132481 + 0 mapped (65.81%:nan%)

plate:

3774075 + 0 in total (QC-passed reads + QC-failed reads)

3510438 + 0 mapped (93.01%:nan%)

  • count:

    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID exvivo.sam HPNK.gff > exvivo.counts

    python -m HTSeq.scripts.count -m intersection-nonempty -t CDS -i ID plate.sam HPNK.gff > plate.counts

  • count stats:

ex-vivo stats

no_feature 777946

ambiguous 1

too_low_aQual 0

not_aligned 1107761

alignment_not_unique 0

plate stats

no_feature 776707

ambiguous 47

too_low_aQual 0

not_aligned 263637

alignment_not_unique 0

  • edgeR code:

library(edgeR)

files <- dir(pattern="*\.counts$")

RG <- readDGE(files, header=FALSE)

RG

keep <- rowSums(cpm(RG)>1) >= 2 #we keep genes that achieve at least one count per million (cpm) in at least three samples

RG <- RG[keep,]

dim(RG)

RG <- calcNormFactors(RG)

RG$samples

plotMDS(RG)

bcv <- 0.2 #Assigned dispersion value of 0.2

m <- as.matrix(RG)

d <- DGEList(counts=m, group=(1:2)) #modify 'group' depending on sample number. Also can be adapted to replicated samples, see'?DGEList'.

d

et <- exactTest(d, pair=(1:2),dispersion=bcv^2) #exactTest(RG, pair=(1:2),dispersion=bcv^2)

et

top <- topTags(et)

top

cpm(RG)[rownames(top), ] #Check the individual cpm values for the top genes:

summary(de <- decideTestsDGE(et)) #The total number of DE genes at 5% FDR is given by'decideTestsDGE'.

[,1]

-1 200

0 1176

1 769

Of the 'number' tags identified as DE, 769 are up-regulated ex-vivo and 200 are down-regulated.

detags <- rownames(RG)[as.logical(de)] #detags contains the DE genes at 5% FDR

plotSmear(et, de.tags=detags) #plot all genes and highlight DE genes at 5% FDR

abline(h=c(-1, 1), col="blue") #The blue lines indicate 2-fold changes.

title("plate vs ex-vivo")

dev.copy2pdf(file = "Figure_1.pdf") #Save as .pdf##

edger rna-seq bacteria ion-torrent • 5.8k views
ADD COMMENT
3
Entering edit mode
11.1 years ago

First, you simply need to do replicates. There is no way around this fact if you want a robust, reproducible result. That said, a gene list can be ranked by FDR, so you are free to choose an FDR of 0.0001 or 0.01, or 0.1 if you like to generate a list of reasonable size; 0.05 is not a magical value. If you choose not to do replicates, then plan on a lot of PCR to validate your findings.

ADD COMMENT
0
Entering edit mode

How many PCRs should I perform if I don't do replicates?

ADD REPLY
0
Entering edit mode

When publishing, is it possible to avoid PCR validation having some replicates or is always necessary to do some PCRs?

ADD REPLY
0
Entering edit mode

There is no rule here. The important message is that to believe data, one needs some level of replication. How important a role PCR will play depends on the experimental system and hypotheses being tested. The goal of both PCR and replication is to reduce the cost of false positive results; without one or the other (or both), the cost could be high.

ADD REPLY
2
Entering edit mode
10.6 years ago
Irsan ★ 7.8k
The amount of DEGs you get out is reallly dependent on the dispersion you manually put in. Have you put all your samples in one group, only keep 250 housekeeping genes and estimate the common dispersion? That dispersion estimate is the one you should put in. Also have a look at the edgeR manual to see a more detailed explanation what to do when you have no replicates (which is a bad idea already mentioned by Sean. If budget was the reason for not including replicates, its better to reduce sequencing yield per sample and include replicates).
ADD COMMENT
0
Entering edit mode

Hi Irsan, thank you for your answer, it's really helpful, specially because you pointed out the detail about how to pick the dispersion value. I'm much obliged to you.

ADD REPLY

Login before adding your answer.

Traffic: 1699 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6