Question

How to do gene length bias corrected KEGG pathway analysis for non-native species in goseq?

1

Entering edit mode

9.1 years ago

CandiceChuDVM ★ 2.5k

I am using goseq to correct for RNA-seq length bias. Since my dog genome (canFam3) is not supported in goseq database, I tried to make TXDb object and GOmap by myself. However, I am not sure if I am doing it correctly because:

Here is my PWF plot
When I applied GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF")), I still got all the GO categories.
The build-in KEGG pathway analysis doesn't work for non-native gene identifier.

So, I don't even sure if I can trust my results. If it's correct, it could potentially be a tutorial and benefit other users since the goseq documentation isn't very self-explanatory.

Does anyone have better solution about doing gene length bias corrected KEGG pathway analysis for non-native species in goseq? Any recommendation?

The following is my R script:

Get DESeq2 results

countdata=read.table("all.genes.txt", sep=" ", header=TRUE, row.names=1)
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3),
                  rep("control",2), rep("rapid",3), rep("slow",3),
                  rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(c(rep("t1",8), rep("t2",8), rep("t3",8)))
sampleTable <- data.frame(condition = as.factor(condition),
                      timepoints = as.factor(timepoints))
d.deseq<-DESeqDataSetFromMatrix(countData = countdata,
                            colData = sampleTable,
                            design = ~condition+timepoints+condition:timepoints)
ddsTC<-DESeq(d.deseq, test="LRT", reduced = ~ condition+timepoints)
ddsTC$condition <- relevel(ddsTC$condition, ref="control")
resTC<-results(ddsTC, alpha = 0.05)

Get GO terms from biomaRt

resTC$ensembl <- sapply( strsplit( rownames(resTC), split="\\+" ), "[", 1 )
ensembl = useMart( "ensembl", dataset = "cfamiliaris_gene_ensembl" )
genemap <- getBM( attributes = c("hgnc_symbol","go_id","ensembl_gene_id"),
              filters = "ensembl_gene_id",
              values = resTC$ensembl,
              mart = ensembl )
idx <- match(resTC$ensembl, genemap$ensembl_gene_id)
resTC$hgnc_symbol <- genemap$hgnc_symbol[ idx ]
resTC$go_id<- genemap$go_id[ idx ]

head(resTC)

log2 fold change (MLE): conditionslow.timepointst3 
LRT p-value: '~ condition + timepoints + condition:timepoints' vs '~ condition + timepoints' 
DataFrame with 6 rows and 9 columns
                  baseMean log2FoldChange     lfcSE      stat     pvalue
                 <numeric>      <numeric> <numeric> <numeric>  <numeric>
ENSCAFG00000000001 634.0943577     -1.4429324 0.5365985 8.8758883 0.06427766
ENSCAFG00000000002   0.7144562      0.8500395 5.1053260 0.8445657 0.93237552
ENSCAFG00000000005  16.6377634      0.8308960 0.7964801 3.6893245 0.44968051

                    padj            ensembl hgnc_symbol       go_id
               <numeric>        <character> <character> <character>
0.3191870 ENSCAFG00000000001       ENPP1  GO:0016787
NA ENSCAFG00000000002                        
0.8662957 ENSCAFG00000000005      PARD6G  GO:0005515

Make TXDb object

cfamiliaris_gene_ensembl <- makeTxDbFromBiomart(dataset="cfamiliaris_gene_ensembl")
txsByGene=transcriptsBy(cfamiliaris_gene_ensembl,"gene")
lengthData=median(width(txsByGene))
select_genes <- as.vector(names(lengthData)%in%names(genes))
new_lengthData <- lengthData[select_genes]

Make PWF plot

pwf=nullp(genes,bias.data=new_lengthData)

Make GOmap

GOlist=resTC[,c(7,9)]
GOmap = as.data.frame(GOlist)
GO.wall=goseq(pwf,gene2cat=GOmap)

head(GO.wall)

 category over_represented_pvalue under_represented_pvalue numDEInCat
 105 GO:0003824            4.722997e-11                1.0000000         66
 609 GO:0016627            7.830729e-06                0.9999998          6
 607 GO:0016620            1.674500e-04                0.9999779          8

numInCat
 105      359
 609        8
 607       24

                                                                                           term
 105                                                                              catalytic activity
 609                                    oxidoreductase activity, acting on the CH-CH group of donors
 607 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor

ontology
 105       MF
 609       MF
 607       MF
 637       MF
 547       CC
 213       CC

Using GO.MF=goseq(pwf,gene2cat=GOmap, test.cats=c("GO:MF")) will give me the same result as GO.wall.

Since the "limiting GO categories and other category based tests" are not working for non-native gene identifier, the KEGG pathway analysis doesn't work for me.

Does anyone have better solution about how to get KEGG pathway analysis for non-native species that take gene length bias into account?

Thanks a lot,
Candice

goseq biomaRt R RNA-Seq gene ontology • 4.5k views

ADD COMMENT • link updated 9.1 years ago by jimmy_zeng ▴ 90 • written 9.1 years ago by CandiceChuDVM ★ 2.5k

score 0 · Answer 1 · 2016-05-16

0

Entering edit mode

9.1 years ago

jimmy_zeng ▴ 90

Since I didn't use the package “goseq ” ever, I can't answer the question about this package.

But for doing KEGG pathway analysis, the majority us will choose enrichment, there's nothing matter to gene length, just choose the a list of interesting genes . and then do hyperGtest for the gene set.

Hope this can help you .

ADD COMMENT • link 9.1 years ago by jimmy_zeng ▴ 90

0

Entering edit mode

Based on my understanding, there are several papers talking about the influence of gene length bias and how to adjust for it. It made me feel uncomfortable to ignore the gene length bias and just pass the list of DE genes to gene ontology.

For your references: Gene ontology analysis for RNA-seq: accounting for selection bias Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression

I hope that would help!

ADD REPLY • link 9.1 years ago by CandiceChuDVM ★ 2.5k

0

Entering edit mode

wow. I didn't think the gene lenth bias before ever.

It seems not very popular, I didn't see any tutorial talk about it too.

For enrichment, just a gene list , is enough, I am pretty sure, it's just a statistics.

For choosing DE genes, that's a trick question, not only the gene length , but also the p-value , fold changes should be taken care of .

ADD REPLY • link 9.0 years ago by jimmy_zeng ▴ 90