Hi there Biostars, I'm working on a GO enrichment analysis for some wheat RNAseq data. I'd like to use the package GOseq for this and have been following the vignette. The package requires three inputs. First: a vector of all the genes in your transcriptome, with a '1' denoting DE genes and a '0' denoting non-DE genes. Second: a vector with the length of each gene. Third: either a data frame with two columns, gene and GO term (each gene can have multiple GO terms, so gene names repeat across rows), OR a list of lists where the name of each list is the gene name and its contents are that gene's GO terms.
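For anyone following along, here is a minimal sketch of what those three inputs could look like on toy data. All gene IDs, lengths, GO assignments, and the helper names all_genes and de_hits are made up for illustration; only the object names DEgenes, my_length_vector, and wheat_GO_terms come from the post.

```r
# Toy illustration of the three inputs (gene IDs, lengths and GO terms are invented).
all_genes <- c("geneA", "geneB", "geneC", "geneD")
de_hits   <- c("geneA", "geneC")  # hypothetical set of DE genes

# 1. Named 0/1 vector: 1 = DE, 0 = not DE
DEgenes <- as.integer(all_genes %in% de_hits)
names(DEgenes) <- all_genes

# 2. Gene lengths, reordered to match DEgenes
lengths_lookup <- c(geneD = 640, geneA = 1200, geneC = 2300, geneB = 850)
my_length_vector <- lengths_lookup[all_genes]

# 3. Two-column gene-to-GO mapping; a gene repeats once per GO term
wheat_GO_terms <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneC"),
  GO      = c("GO:0008150", "GO:0003674", "GO:0008150", "GO:0005575"),
  stringsAsFactors = FALSE
)
```

The main thing this is meant to show is that the DE vector and the length vector cover the same genes in the same order.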
I had no problem fitting the Probability Weighting Function (PWF) with: pwf = nullp(DEgenes, bias.data = my_length_vector)
The GO terms I downloaded for wheat from BioMart are in the two column data frame format, so that's what I tried first with the code:
GO.wall = goseq(pwf, gene2cat = wheat_GO_terms)
but get three errors:
Error: node stack overflow.
Error during wrapup: node stack overflow.
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Does anyone know how to overcome these errors in GOseq?
I manually created a very short list of lists to see if that works, and it does, but I am struggling to create the list of lists from the two column data frame with repeating row values. The GOseq manual indicates the data frame approach should work.
I'd love to hear from you if you've had success with the data frame input format. OR if you can help with converting a data frame of repeated gene names, each paired with one GO term in the second column, into a list of lists where each gene name holds a list of its GO terms, that would be fantastic. Thank you!!
is just a typo, right? You have tried split(df$GO, df$gene_id)?
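As a rough sketch of that split() suggestion on toy data (the data frame contents here are invented; the column names gene_id and GO follow the call above):

```r
# Toy two-column mapping with repeated gene rows (values are invented).
df <- data.frame(
  gene_id = c("geneA", "geneA", "geneB"),
  GO      = c("GO:0008150", "GO:0003674", "GO:0008150"),
  stringsAsFactors = FALSE
)

# Group the GO column by gene: one named list element per gene,
# each holding a character vector of that gene's GO terms.
gene2cat_list <- split(df$GO, df$gene_id)
str(gene2cat_list)
```

The first argument is the values to be grouped and the second is the grouping variable, so swapping them gives the GO-to-gene mapping instead.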
Hi yes, my mistake, that is a typo! Thanks for your comment.
My gene names are formatted based on the reference genome annotation, for example "TraesCS4A02G403700". The .txt file for the df has gene names and GO terms, which came directly from the Ensembl BioMart download. I read in the .txt file with read_delim( file path, delim = ",") and get two columns of character variables.
split(df$GO, df$gene_id) produces a list of lists whose total length is the number of unique gene_ids, but the lists have a different gene name format, for example "ENSRNA050007810", and the list is just "NA".
When I run the inverse of what GOseq wants, split(df$gene_id, df$GO), I get a nice list of lists whose total length is the number of unique GO terms. The name of each list is a GO term, and each is filled with the associated gene_ids in the appropriate format, "TraesCS4A02G403700".
I am pretty stumped - I've never come across something like this before. Thanks!
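One possible explanation, offered as a guess rather than a confirmed diagnosis: split() sorts the group names alphabetically, so any "ENSRNA..." IDs in the export would appear before the "TraesCS..." ones, and genes with no GO annotation would show up as elements holding only NA. A minimal sketch on toy data (the two gene IDs are copied from the post; the GO terms and the names has_go, df_clean, gene2go_raw, and gene2go are invented):

```r
# Toy stand-in for the BioMart export; the GO terms are invented.
df <- data.frame(
  gene_id = c("TraesCS4A02G403700", "TraesCS4A02G403700", "ENSRNA050007810"),
  GO      = c("GO:0005515", "GO:0006468", NA),
  stringsAsFactors = FALSE
)

# split() orders groups alphabetically, so the "ENSRNA..." gene comes first
# and its only entry is NA because it has no GO term in this toy table.
gene2go_raw <- split(df$GO, df$gene_id)
names(gene2go_raw)[1]

# Keep only rows that actually carry a GO term, then split again.
has_go   <- !is.na(df$GO) & nzchar(df$GO)
df_clean <- as.data.frame(df[has_go, ])  # plain data.frame, not a tibble
gene2go  <- split(df_clean$GO, df_clean$gene_id)
```

If that is what is happening, looking at the end of the list with tail(names(gene2go_raw)) should show the "TraesCS..." genes with their GO terms. Separately, since read_delim returns a tibble rather than a base data frame, it may be worth passing as.data.frame(df) to goseq; whether the tibble is what triggers the node stack overflow is an assumption on my part, not something confirmed in this thread.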