I have a list of genes with Ensembl IDs and GO terms matched by biomaRt. How can I extract a specific group from the list by GO term in R? For example, I would like to extract "GO:0003700 sequence-specific DNA binding transcription factor" into a new table that keeps all the information from the original table. Thanks
The .csv file contains tab-separated columns (please see below). For example, I would like to make a list with only genes containing "GO:0005216". How shall I tell R to do it? Thanks.
ENSMUSG00000000001 GO:0016021| GO:0005216| GO:0005244| GO:0005272
ENSMUSG00000000002 GO:0008150| GO:0005576| GO:0005575| GO:0003674
ENSMUSG00000000003 GO:0008150| GO:0005216| GO:0005524| GO:0003674
If you've already run biomaRt to obtain a data frame, then your aim is to extract subsets of a data frame. I can post examples if this is what you want to do; have a look at ?subset and ?grep in R.
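As a minimal sketch of the ?subset and ?grep route, assuming the table has one gene ID column and one column holding all GO IDs as a single string (the column names `ensembl_id` and `go_terms` are made up for illustration):

```r
# Toy data mirroring the three example rows; the column names are assumptions.
genes <- data.frame(
  ensembl_id = c("ENSMUSG00000000001", "ENSMUSG00000000002", "ENSMUSG00000000003"),
  go_terms = c(
    "GO:0016021| GO:0005216| GO:0005244| GO:0005272",
    "GO:0008150| GO:0005576| GO:0005575| GO:0003674",
    "GO:0008150| GO:0005216| GO:0005524| GO:0003674"
  ),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the GO ID literally instead of as a regular expression.
# subset() keeps every column of the matching rows.
hits <- subset(genes, grepl("GO:0005216", go_terms, fixed = TRUE))
hits$ensembl_id
```

Because the match is done on the whole string, this works no matter how many GO IDs a row holds.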
Yes, just add a GO term filter to your biomaRt query: use 'go' as the filter and your term(s) of interest as the values. If you want a subset of another table, run the query without the GO filter, then run the same query with the GO filter.
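A rough sketch of such a filtered query might look like the following. The helper name `fetch_genes_by_go`, the mouse dataset, and the chosen attributes are assumptions for illustration; the available filter names can differ between Ensembl releases, so it is worth checking them with `listFilters(mart)` first.

```r
# Hypothetical helper; calling it needs the biomaRt package and network access.
fetch_genes_by_go <- function(go_ids, dataset = "mmusculus_gene_ensembl") {
  # Connect to the Ensembl mart for the chosen dataset (assumed: mouse).
  mart <- biomaRt::useMart("ensembl", dataset = dataset)
  # Restrict the query to genes annotated with the given GO ID(s).
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "go_id"),
    filters = "go",
    values = go_ids,
    mart = mart
  )
}

# Example call (not run here):
# channel_genes <- fetch_genes_by_go("GO:0005216")
```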
# load the data frame; read.table splits on whitespace by default
dat <- read.table("~/tmp/1.txt", stringsAsFactors = FALSE)
# strip the trailing "|" from every character or factor column
dt <- as.data.frame(
  lapply(dat, function(x) if (is.character(x) || is.factor(x)) gsub("|", "", x, fixed = TRUE) else x),
  stringsAsFactors = FALSE)
# find the rows that contain "GO:0005216" in any GO column (dt[, 2:5]):
# which() on the comparison matrix returns column-major linear indices, so
# shift to a 0-based index, take it modulo the number of rows, and shift
# back to 1-based (R indexes from 1) to recover the row numbers
w <- (which(dt[, 2:5] == "GO:0005216") - 1) %% nrow(dt) + 1
# print the result
dt[w, ]
> dt[w,]
V1 V2 V3 V4 V5
1 ENSMUSG00000000001 GO:0016021 GO:0005216 GO:0005244 GO:0005272
3 ENSMUSG00000000003 GO:0008150 GO:0005216 GO:0005524 GO:0003674
If you already know that V3 is the column you are interested in, querying it is easier:
# by using the exact value
w = which(dt$V3 == "GO:0005216")
# or using regex
w = which(grepl("GO:0005216",dt$V3))
dt[w,]
I tried this and ran into another problem: the GO IDs in each row are concatenated (not separated by a tab or any symbol other than the vertical bar "|"). Also, the number of GO IDs differs from row to row. As a result, the search for "GO:0005216" returned numeric(0). How can I split the GO IDs into columns and define the number of columns to search?
If I understand correctly, there may be a problem with your file, i.e. dat = read.table("~/tmp/1.txt") doesn't work, right? Please check the read.table manual for how it treats separators (by default the separator is 'white space', that is, one or more spaces, tabs, newlines or carriage returns), and either pre-format your data before loading it into R or try specifying sep="|". You can re-format the text, for example, in vim using substitution. I used your data as posted and it works as is, because all the values are separated by spaces.
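One way to cope with a varying number of GO IDs per row, sketched under the assumption that the file has a tab between the gene ID and a single "|"-delimited GO field, is to split that field into a list instead of fixed columns. The column names and the fourth row below are made up for illustration:

```r
# Inline stand-in for the file; the fourth row is hypothetical and only
# shows a row with a different number of GO IDs.
lines <- c(
  "ENSMUSG00000000001\tGO:0016021| GO:0005216| GO:0005244| GO:0005272",
  "ENSMUSG00000000002\tGO:0008150| GO:0005576| GO:0005575| GO:0003674",
  "ENSMUSG00000000003\tGO:0008150| GO:0005216| GO:0005524| GO:0003674",
  "ENSMUSG00000000004\tGO:0005216"
)
dat <- read.delim(text = lines, header = FALSE,
                  col.names = c("ensembl_id", "go_terms"),
                  stringsAsFactors = FALSE)

# Split each GO field on the literal "|" and trim the surrounding spaces.
go_list <- lapply(strsplit(dat$go_terms, "|", fixed = TRUE), trimws)

# Keep the rows whose GO list contains the exact ID.
has_term <- vapply(go_list, function(ids) "GO:0005216" %in% ids, logical(1))
selected <- dat[has_term, ]

# Optional long format: one row per gene/GO pair.
long <- data.frame(
  ensembl_id = rep(dat$ensembl_id, lengths(go_list)),
  go_id = unlist(go_list),
  stringsAsFactors = FALSE
)
```

With a real file, `read.delim("yourfile.txt", header = FALSE, ...)` would replace the `text = lines` argument. The exact `%in%` comparison avoids partial matches that a substring search could produce.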
Sorry, I am not familiar with biomaRt, but if you can describe what your data look like and what you would like to get, I can help with R tricks.