I didn't post the full question here, the posting guide is broken. If anyone has any ideas, I'd be very grateful!
https://stackoverflow.com/questions/70934464/reformatting-data-for-gene-ontologygo-analysis-using-r
I didn't post the full question here, the posting guide is broken. If anyone has any ideas, I'd be very grateful!
https://stackoverflow.com/questions/70934464/reformatting-data-for-gene-ontologygo-analysis-using-r
My tmp.csv is just a file that matches your old format.
library(pacman)
p_load(tidyverse)
tmp <- read.csv('tmp.csv') %>%
tibble() %>%
separate(col = V1, into = c('id', 'term'), sep = ' ;')
tmp %>%
group_by(id) %>%
nest() %>%
ungroup() %>%
mutate(data = map(data, unlist),
data = map(data, paste0, collapse = ' ;')) %>%
unnest()
Good luck!
Are you just trying to collect GO
terms associated with each ENSIPUG
ID into a single row?
#Your data
df <- read.table(text = "ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758", sep = ",")
#Libraries we need.
library(magrittr)
library(stringr)
library(dplyr)
#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>%
rowwise() %>%
mutate(V2 = unlist(str_split(V1, " ;"))[2],
V1 = unlist(str_split(V1, " ;"))[1])
#Collecting each ENSI identifier's set of GO terms into
#a single row.
df %<>%
group_by(V1) %>%
mutate(V2 = paste0(V2, collapse = "; ")) %>%
ungroup() %>%
distinct(V1, .keep_all = TRUE)
#Taking a glance at the output.
df
# # A tibble: 3 × 2
# V1 V2
# <chr> <chr>
# 1 ENSIPUG00000001371 GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
# 2 ENSIPUG00000001374 GO:0005515; GO:0003674; GO:0005488
# 3 ENSIPUG00000001375 GO:0008152; GO:0008150; GO:0016758
#Write to file.
#File's saved as "newformat.csv" to whatever location getwd()
#indicates.
write.table(df, file = "newformat.csv", quote = TRUE, sep = ",")
I tried this, thank you for your work! However, in the new .csv all it will return for V2 is NA in every column in the same amount that the GO terms should be. Do you know any reason that it wouldn't display the GO terms?
Below is a sample of the output I'm getting.
"V1","V2"
"1","ENSIPUG00000000007","NA; NA; NA; NA; NA"
"2","ENSIPUG00000000011","NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA"
Here is the head function from the end of what happened when I ran your solution.
head(df)
# A tibble: 6 × 2
V1 V2
<chr> <chr>
1 ENSIPUG00000000007 NA; NA; NA; NA; NA
2 ENSIPUG00000000011 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
3 ENSIPUG00000000017 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
4 ENSIPUG00000000020 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
5 ENSIPUG00000000023 NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA; NA…
6 ENSIPUG00000000024 NA; NA; NA; NA; NA; NA
Here is what the first few lines of the input file are:
ENSIPUG00000000007 ;GO:0055114
ENSIPUG00000000007 ;GO:0008152
ENSIPUG00000000007 ;GO:0008150
ENSIPUG00000000007 ;GO:0016020
ENSIPUG00000000007 ;GO:0005575
ENSIPUG00000000011 ;GO:0006120
ENSIPUG00000000011 ;GO:0055114
Thank you again, I appreciate you following up with me!
Oh I meant head()
for the object containing the input file's data. I wanted to check if ENSIPUG00000000007 ;GO:0055114
is being imported as a single column into R
(I am assuming you're importing it into a data.frame
).
I'm guessing the whitespace between ENSIPUG00000000007 ;GO:0055114
is a tab character (\t
)? If it is, can you please replace
#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>%
rowwise() %>%
mutate(V2 = unlist(str_split(V1, " ;"))[2],
V1 = unlist(str_split(V1, " ;"))[1])
With this
#Putting the ENSI identifiers and GO terms in their
#own columns.
df %<>%
rowwise() %>%
mutate(V2 = unlist(str_split(V1, "[\\s\\t]+;"))[2],
V1 = unlist(str_split(V1, "[\\s\\t]+;"))[1])
And try running the script again?
Using data.table:
library(data.table)
# example data
d <- fread(text = "ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758", header = FALSE)
d[, paste(V2, collapse = "; "), by = V1]
# V1 V1
# 1: ENSIPUG00000001371 GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
# 2: ENSIPUG00000001374 GO:0005515; GO:0003674; GO:0005488
# 3: ENSIPUG00000001375 GO:0008152; GO:0008150; GO:0016758
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
you don't need to load the data in R just for that. Try this:
if you want it in R, try one of these:
input: