I was trying to a differential gene expression analysis by using Deseq2 [1] with samples like this [2], and would like to plot a heatmap using differentially expressed genes (DEGs) against all the samples, this is my initial try [3], that did not work out because I tried to subset the "resCellA" object, get the order from the subsetted object and use that to pull out rows of "rld" based on numerical indexing, but the index are not linked. I did that because I am not so sophisticated with R coding, and tried to solve the problem by aping the example in the vignette.
Actually I got kind suggestion that in stead of using index I need to use gene ID which is consistent across the res and mat. Could anyone advise me how to do that? I have two DataFrame [4], I guess what I need to do is to get the geneIDs from res with padj < 0.05, and retrieve those geneIDs in assay(rld),but how to do that?
Thanks very much in advance. (I understand that in the long run, I need to do more homework about R coding...)
[1]
sampleTable <- read.csv("samples.csv",row.names=1)
filenames <- paste0(sampleTable$samplename,".bam")
library("Rsamtools")
bamfiles <- BamFileList(filenames, yieldSize=2000000)
gtffile <- ("mm10genes.gtf")
library("GenomicFeatures")
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()))
(ebg <- exonsBy(txdb, by="gene"))
library("GenomicAlignments")
library("BiocParallel")
register(MulticoreParam())
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
mode="Union",
singleEnd=TRUE,
ignore.strand=TRUE,
fragments=FALSE )
library("DESeq2")
(colData(se) <- DataFrame(sampleTable))
dds <- DESeqDataSet(se, design = ~ compartment + genotype)
dds <- dds[rowSums(counts(dds)) > 1,]
dds <- DESeq(dds)
rld <- rlog(dds, blind=FALSE)
library("gene filter")
[2]
"samples.csv"
"ids","samplename","genotype","compartment"
"457CellA","457CellA","WT","CellA"
"457CellB","457CellB","WT","CellB"
"457CellC","457CellC","WT","CellC"
"457CellD","457CellD","WT","CellD"
"458CellA","458CellA","cKO","CellA"
"458CellB","458CellB","cKO","CellB"
"458CellC","458CellC","cKO","CellC"
"458CellD","458CellD","cKO","CellD"
"459CellA","459CellA","WT","CellA"
"459CellB","459CellB","WT","CellB"
"459CellC","459CellC","WT","CellC"
"459CellD","459CellD","WT","CellD"
"460CellA","460CellA","cKO","CellA"
"460CellB","460CellB","cKO","CellB"
"460CellC","460CellC","cKO","CellC"
"460CellD","460CellD","cKO","CellD"
[3]
res <- results(dds)
sigGenes <- subset(res,padj <0.05)
sigGenes <- head(order(sigGenes$padj),decreasing=TRUE,40)
mat <- assay(rld)[ sigGenes, ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(rld)[,c("compartment","genotype")])
pheatmap(mat, annotation_col=df,fontsize=9,fontsize_number = 0.4 * fontsize)
dev.off()
[4]
> head(assay(rld))
457CellA 458CellB 459CellC 460CellD
0610005C13Rik 1.808277 1.6076085 1.533645 1.695644
0610007P14Rik 8.191532 7.5024585 8.660520 8.334527
0610009B22Rik 5.807757 5.6569778 5.989643 6.123990
0610009L18Rik 3.881685 3.5666925 3.786769 3.921051
0610009O20Rik 6.800692 6.8433131 6.951461 7.218018
0610010B08Rik -1.224636 -0.8949234 -1.230797 -1.226311
> res <- results(dds)
> res
log2 fold change (MAP): compartment CellA vs CellB
Wald test p-value: compartment CellA vs CellB
DataFrame with 23372 rows and 6 columns
baseMean log2FoldChange lfcSE stat padj
<numeric> <numeric> <numeric> <numeric> <numeric>
0610005C13Rik 3.842899 -0.81806436 0.6916742 -1.18273074 0.23691588
0610007P14Rik 297.131215 0.22129241 0.1115736 1.98337585 0.04732546
0610009B22Rik 60.686295 0.43130300 0.2452162 1.75886844 0.07859986
0610009L18Rik 14.512324 0.02226444 0.3606164 0.06173996 0.95076992
0610009O20Rik 125.067373 0.19888416 0.1723663 1.15384599 0.24856332
Hi Goutham, Thank you very much. I guess I had worked out a more tedious version but yours are more succinct. So I learnt a lot from your answer.
Just have another question, why "& !is.na(res$padj)" is needed? I can not imaging a case in which padj value is missing. TKS in advance.
I was looking for this thank your for the solution but i have one doubt why are you using this can you explain
& !is.na(res$padj),
?Thank you for your post, its was simple and it works!