Hi All,
I am working on TCGA Breast Cancer dataset. I downloaded the Count data using TCGA2STAT.
counts.brca <- getTCGA(disease="BRCA", data.type="RNASeq", type="count")
Extracted the mutation data and clinical data using library("cgdsr")
For downstream analysis I am using DESeq2.
ddsMat<- DESeqDataSetFromMatrix(tcga_comb, sampletable, design=~mut.status)
dds<-DESeq(ddsMat)
dds<-DESeq(dds)
res<-results(dds, contrast= c("mutvswt", "Mut", "WT"))
resOrdered <- res[order(res$log2FoldChange,decreasing=TRUE),]
resOrdered
log2 fold change (MLE): mutvswt Mut vs WT
Wald test p-value: mutvswt Mut vs WT
DataFrame with 20502 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
RPL13AP17 3.544641 2.310630 0.3485938 6.628432 3.392722e-11
PNLIP 1.872278 1.916453 0.4180853 4.583879 4.564283e-06
CST4 616.052645 1.806587 0.1851868 9.755485 1.747628e-22
LOC100287718 2.395814 1.653026 0.3204551 5.158369 2.491100e-07
SEMG2 1.345911 1.647315 0.4141256 3.977814 6.955174e-05
padj
<numeric>
RPL13AP17 1.446313e-09
PNLIP 4.743272e-05
CST4 4.715005e-20
LOC100287718 3.851038e-06
SEMG2 4.909754e-04
However, when I plotted heatmap of the upregulated genes. I did not see clear clustering between Mutant vs WT so I checked the raw count of top upregulated gene(RPL13AP17), I observed only 4 patients have expression higher than 1000 and the rest is super low which I think shouldn't pop-up as significantly upregulated gene. Also the 2nd top upregulated gene (PNLIP) was similar.
select<- rownames(subset(res, padj < 0.001 & log2FoldChange >1))
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,show_colnames= FALSE, cluster_cols=FALSE)
summary(res)
out of 20490 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 1906, 9.3%
LFC < 0 (down) : 6424, 31%
outliers [1] : 0, 0%
low counts [2] : 795, 3.9%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
table(assay(dds)['RPL13AP17',])
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
452 73 75 16 19 2 4 2 4 2 1 2 2 1 1 2
19 25 26 27 47 89 93 132 228 369 488 577 731 1737 1754 1983
1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
6852
1
table(assay(dds)['PNLIP',])
0 1 2 3 4 5 6 7 8 9 10 12 13 15 16 17 18 20 22 25
526 29 50 4 13 1 9 1 4 1 3 5 2 2 1 1 2 2 1 1
26 30 36 42 45 50 51 60 70 73 76 79 83 110 111 131 423
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I thought independent filtering was taking care of these low count genes in results() function. I will be really glad if you could suggest me how can I remove these genes to get the distinctly differentially expressed genes.
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)
Matrix products: default
BLAS: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRblas.so
LAPACK: /mnt/software/stow/R-3.4.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] hexbin_1.27.1 TCGA2STAT_1.2
[3] pheatmap_1.0.8 genefilter_1.58.1
[5] gplots_3.0.1 RColorBrewer_1.1-2
[7] vsn_3.44.0 org.Hs.eg.db_3.4.1
[9] DESeq2_1.16.1 BiocParallel_1.10.1
[11] GenomicAlignments_1.12.2 SummarizedExperiment_1.6.5
[13] DelayedArray_0.2.7 matrixStats_0.52.2
[15] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2
[17] Biobase_2.36.2 Rsamtools_1.28.0
[19] Biostrings_2.44.2 XVector_0.16.0
[21] GenomicRanges_1.28.5 GenomeInfoDb_1.12.2
[23] IRanges_2.10.3 S4Vectors_0.14.5
[25] BiocGenerics_0.22.0 BiocInstaller_1.28.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.1 gtools_3.5.0
[4] Formula_1.2-2 affy_1.54.0 latticeExtra_0.6-28
[7] blob_1.1.0 GenomeInfoDbData_0.99.0 RSQLite_2.0
[10] backports_1.1.1 lattice_0.20-35 limma_3.32.7
[13] digest_0.6.12 checkmate_1.8.4 colorspace_1.3-2
[16] preprocessCore_1.38.1 htmltools_0.3.6 Matrix_1.2-11
[19] R.oo_1.21.0 plyr_1.8.4 pkgconfig_2.0.1
[22] XML_3.98-1.9 biomaRt_2.32.1 zlibbioc_1.22.0
[25] xtable_1.8-2 scales_0.5.0 gdata_2.18.0
[28] affyio_1.46.0 htmlTable_1.9 tibble_1.3.4
[31] annotate_1.54.0 ggplot2_2.2.1 nnet_7.3-12
[34] lazyeval_0.2.0 survival_2.41-3 magrittr_1.5
[37] memoise_1.1.0 R.methodsS3_1.7.1 foreign_0.8-69
[40] tools_3.4.1 data.table_1.10.4 stringr_1.2.0
[43] munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.6
[46] compiler_3.4.1 caTools_1.17.1 rlang_0.1.2
[49] grid_3.4.1 RCurl_1.95-4.8 htmlwidgets_0.9
[52] labeling_0.3 bitops_1.0-6 base64enc_0.1-3
[55] gtable_0.2.0 DBI_0.7 gridExtra_2.3
[58] knitr_1.17 rtracklayer_1.36.4 bit_1.1-12
[61] Hmisc_4.0-3 KernSmooth_2.23-15 stringi_1.1.5
[64] Rcpp_0.12.13 geneplotter_1.54.0 rpart_4.1-11
[67] acepack_1.4.1
Try basing the independent filtering on the median count rather than mean or sum.
Hi Devon,
Thanks for your reply. I tried your suggestion, but I have doubts about my method. I made the change at the results() function as :
Is this the correct approach ?
Even though the padj became NA, I still get the same genes as top hits. What is more, I plotted the top upregulated genes after omitting the NAs, but my heatmap still doesnt show the differential gene expression between Mutant vs wt.How can I improve the heatmap ?
How are you plotting the heatmap? I bet the problem is with the plot, not the analysis.
This dataset seems to contain something between 50-100 samples, with this number of samples there is statistical power even for genes with low counts, so unless you are definitely not interested in low counts genes, i would keep them.
Hi h.mon,
We are trying to find genes that are significantly differentially expressed to identify as signature genes so we need set of genes that show distinct expressional difference between Mutant vs WT.
There is 676 patients in that heatmap:
This is how I plot the heatmap. I used both normTrnasform and vsd normalization but it did not change the heatmap:
Try this: