I am using goseq to correct for RNA-seq length bias. Since my dog genome (canFam3) is not supported in goseq database, I tried to make TXDb object and GOmap by myself. However, I am not sure if I am doing it correctly because:
Here is my PWF plot
When I applied
GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF"))
, I still got all the GO categories.The build-in KEGG pathway analysis doesn't work for non-native gene identifier.
So, I don't even sure if I can trust my results. If it's correct, it could potentially be a tutorial and benefit other users since the goseq documentation isn't very self-explanatory.
Does anyone have better solution about doing gene length bias corrected KEGG pathway analysis for non-native species in goseq? Any recommendation?
The following is my R script:
Get DESeq2 results
countdata=read.table("all.genes.txt", sep=" ", header=TRUE, row.names=1)
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3),
rep("control",2), rep("rapid",3), rep("slow",3),
rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(c(rep("t1",8), rep("t2",8), rep("t3",8)))
sampleTable <- data.frame(condition = as.factor(condition),
timepoints = as.factor(timepoints))
d.deseq<-DESeqDataSetFromMatrix(countData = countdata,
colData = sampleTable,
design = ~condition+timepoints+condition:timepoints)
ddsTC<-DESeq(d.deseq, test="LRT", reduced = ~ condition+timepoints)
ddsTC$condition <- relevel(ddsTC$condition, ref="control")
resTC<-results(ddsTC, alpha = 0.05)
Get GO terms from biomaRt
resTC$ensembl <- sapply( strsplit( rownames(resTC), split="\\+" ), "[", 1 )
ensembl = useMart( "ensembl", dataset = "cfamiliaris_gene_ensembl" )
genemap <- getBM( attributes = c("hgnc_symbol","go_id","ensembl_gene_id"),
filters = "ensembl_gene_id",
values = resTC$ensembl,
mart = ensembl )
idx <- match(resTC$ensembl, genemap$ensembl_gene_id)
resTC$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
resTC$go_id<- genemap$go_id[ idx ]
head(resTC)
log2 fold change (MLE): conditionslow.timepointst3
LRT p-value: '~ condition + timepoints + condition:timepoints' vs '~ condition + timepoints'
DataFrame with 6 rows and 9 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSCAFG00000000001 634.0943577 -1.4429324 0.5365985 8.8758883 0.06427766
ENSCAFG00000000002 0.7144562 0.8500395 5.1053260 0.8445657 0.93237552
ENSCAFG00000000005 16.6377634 0.8308960 0.7964801 3.6893245 0.44968051
padj ensembl hgnc_symbol go_id
<numeric> <character> <character> <character>
0.3191870 ENSCAFG00000000001 ENPP1 GO:0016787
NA ENSCAFG00000000002
0.8662957 ENSCAFG00000000005 PARD6G GO:0005515
Make TXDb object
cfamiliaris_gene_ensembl <- makeTxDbFromBiomart(dataset="cfamiliaris_gene_ensembl")
txsByGene=transcriptsBy(cfamiliaris_gene_ensembl,"gene")
lengthData=median(width(txsByGene))
select_genes <- as.vector(names(lengthData)%in%names(genes))
new_lengthData <- lengthData[select_genes]
Make PWF plot
pwf=nullp(genes,bias.data=new_lengthData)
Make GOmap
GOlist=resTC[,c(7,9)]
GOmap = as.data.frame(GOlist)
GO.wall=goseq(pwf,gene2cat=GOmap)
head(GO.wall)
category over_represented_pvalue under_represented_pvalue numDEInCat
105 GO:0003824 4.722997e-11 1.0000000 66
609 GO:0016627 7.830729e-06 0.9999998 6
607 GO:0016620 1.674500e-04 0.9999779 8
numInCat
105 359
609 8
607 24
term
105 catalytic activity
609 oxidoreductase activity, acting on the CH-CH group of donors
607 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor
ontology
105 MF
609 MF
607 MF
637 MF
547 CC
213 CC
Using GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF"))
will give me the same result as GO.wall
.
Since the "limiting GO categories and other category based tests" are not working for non-native gene identifier, the KEGG pathway analysis doesn't work for me.
Does anyone have better solution about how to get KEGG pathway analysis for non-native species that take gene length bias into account?
Thanks a lot,
Candice
Based on my understanding, there are several papers talking about the influence of gene length bias and how to adjust for it. It made me feel uncomfortable to ignore the gene length bias and just pass the list of DE genes to gene ontology.
For your references: Gene ontology analysis for RNA-seq: accounting for selection bias Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression
I hope that would help!
wow. I didn't think the gene lenth bias before ever.
It seems not very popular, I didn't see any tutorial talk about it too.
For enrichment, just a gene list , is enough, I am pretty sure, it's just a statistics.
For choosing DE genes, that's a trick question, not only the gene length , but also the p-value , fold changes should be taken care of .