How to get GO terms for E. Coli genes in R?
1
0
Entering edit mode
10.4 years ago
mamillerpa ▴ 40

I just did a differential gene expression analysis with cufflinks and imported it into R with cummerbund.

My reference genome and GFF annotation were from Ensembl, and the Ensembl IDs for genes were carried through instead of the mnemonic gene symbols. (see below)

Is there a way to get GO terms for each of my E. coli genes within R/bioconductor? Can I do it with gene identifiers like "gene:b3281", or do I need to translate those to some other identifiers like "aroE" ?

biolinux@biolinux-VirtualBox[biolinux] grep -C 1 aroE Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.22.gff3
Chromosome ensembl transcript 3429766 3430023 . - . ID=transcript:AAC76305;Parent=gene:b3280;biotype=protein_coding;external_name=yrdB-1;logic_name=ena
Chromosome ensembl gene 3430020 3430838 . - . ID=gene:b3281;biotype=protein_coding;description=dehydroshikimate reductase%2C NAD(P)-binding;external_name=aroE;logic_name=ena
Chromosome ensembl transcript 3430020 3430838 . - . ID=transcript:AAC76306;Parent=gene:b3281;biotype=protein_coding;external_name=aroE-1;logic_name=ena
Chromosome ensembl gene 3430843 3431415 . - . ID=gene:b3282;biotype=protein_coding;description=tRNA(ANN) t(6)A37 threonylcarbamoyladenosine modification protein%2C threonine-dependent ADP-forming ATPase;external_name=tsaC;logic_name=ena


biolinux@biolinux-VirtualBox[fromoffice] grep b3281 gene_exp.diff
gene:b3281 gene:b3281 - Chromosome:3429235-3430838 q1 q2 OK 24.697 33.6117 0.444627 0.672035 0.3566 0.504885 no
RNA-Seq bacteria go R • 4.3k views
ADD COMMENT
0
Entering edit mode

I have just parsed the "b numbers" and gene symbols out of the Ensmebl GFF file. Then I can get symbol:GO term mappings from topGO

Does this look even remotely reasonable?

EcExpGFF <- read.delim("Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.22.gff3", header=FALSE, stringsAsFactors=FALSE)

#I'm only interested in genes. Column 9 contains semicolon delimited key:value pairs, including the "b number" and the gene symbol "external gene id"
EcGFFgenes <- EcExpGFF[EcExpGFF$V3=="gene", 9]

#split the semicolon delimited list
EcLookupFromGFF <- read.csv(text = as.character(EcGFFgenes), header = FALSE, sep=";", stringsAsFactors = FALSE)

#remove the Key: portion of the text.

EcGenes<-substr(EcLookupFromGFF$V1,4,max(length(EcLookupFromGFF$V1)))
EcSyms <-substr(EcLookupFromGFF$V4,15,max(length(EcLookupFromGFF$V4)))

#now get the mapping between GO terms and the gene symbols. Use topGO library.

GO_BP2sym_list <- annFUN.org(whichOnto = "BP", EcSyms, mapping = "org.EcK12.eg.db", ID = "symbol")
ADD REPLY
0
Entering edit mode
10.4 years ago
oganm ▴ 60

You can use biomart package for this purpose. But it will not be an enrichment analysis and won't convey much useful information.

mart = useMart("ensembl", dataset="mmusculus_gene_ensembl") #replace mus musculus with your favorite species
goTerms = getBM(attributes = c("ensembl_gene_id", "goslim_goa_accession",
                               "goslim_goa_description"),
                filters = "mgi_symbol",
                values = geneIDs,
                mart = mart) #geneIDs is a vector containing your ensemble gene IDs
ADD COMMENT
0
Entering edit mode

Thanks, oganm.

> mart = useMart('ensembl') ; 
> listDatasets(mart)

doesn't show anything for E. coli. I was under the impression that Ensembl Bacterial didn't have a mart.

ADD REPLY

Login before adding your answer.

Traffic: 1436 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6