Hello guys, I'm analyzing the differential expression genes in 4 groups, with data from RNAseq. I made this analysis from the "Deseq2" package and this is an result example of 1 group:
#out of 20172 with nonzero total read count
#adjusted p-value < 0.1
#LFC > 0 (up) : 20, 0.099%
#LFC < 0 (down) : 2, 0.0099%
#outliers [1] : 15, 0.074%
#low counts [2] : 9575, 47%
#(mean count < 41)
After that, I went to do the enrichment analysis with the "goseq" package, with the following commands:
#Nelore - atividade
all_genes_N_atv=row.names(res_N_atv)
de_genes_N_atv=all_genes_N_atv[which(res_N_atv$padj<=0.1)]
de_N_atv_vector=c(t(de_genes_N_atv))
all_genes_N_vector_atv=c(t(all_genes_N_atv))
gene.vector.N.atv=as.integer(all_genes_N_vector_atv%in%de_N_atv_vector)
names(gene.vector.N.atv)=all_genes_N_vector_atv
table(gene.vector.N.atv)
pwf_N_atv=nullp(gene.vector.N.atv, "bosTau4",'ensGene')
go_N_atv=goseq(pwf_N_atv,'bosTau4','ensGene')
head(go_N_atv)
enriched_N_atv=go_N_atv$category[go_N_atv$over_represented_pvalue<=0.1]
capture.output(for(go in enriched_N_atv) {
print(GOTERM[[go]])
cat("--------------------------\n")
}, file = "enriched_N_atv.txt")
First, the organisms supported by this package list 5 genomes of the organism I study (beef cattle). The reference genome I used was more current, ARS-UCD1.2, obtained from Ensembl. However, I don't know if I really chose the right Genome for the analysis, because I didn't find which of the goseq options is the one I need, so I used "bosTau4" and "ensGene".
Second, enrichment analysis shows me more genes than what I got from differential expression analysis:
> length(enriched_N_atv)
[1] 257
> table(gene.vector.N.atv)
gene.vector.N.atv
0 1
27585 22
What could have happened?
The goseq output should give you the gene sets, which are significantly enriched in your differntially expressed genes, so
length(enriched_N_atv)
should give you the number of significant gene sets, not genes? Can you show the content of enriched_N_atv?The advantage of GOseq is, that it is able to correct for the bias in the different length of genes of the different GO-terms. So since you already use GOseq, you could include this bias correction in the analysis during execution of the
nullp
function.Hello!
Thanks for the info!
enriched_N_atv has:
Hi, so as you see, the result object
enriched_N_atv
holds not gene names, but gene ontology term IDs, like "GO:0001803". A gene ontology term is basically a list of genes, which belong for example to a certain biological process. In case of "GO:0001803", its longer name is "regulation of type III hypersensitivity", so it holds genes, which were in the past described to play a role in this process. Using the GOseq package correctly, this result would tell you, that genes e.g. from the gene list "GO:0001803" were significantly enriched in your list of significantly expressed genes. It seems weird, though, that you get 257 differential GO terms with only 22 differentially expressed genes.Can someone help please?