5 months ago
resug
Dear Biostars community,
I have a tab-separated file with two columns: the first contains gene names, and the second contains Gene Ontology IDs separated by commas. I need to end up with one row per gene name that lists each of its Gene Ontology IDs only once, collapsing duplicate IDs and repeated gene entries.
From this
Pr_g33687.t1 GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1 GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1 GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1 GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1 GO:0003700,GO:0006355,GO:0043565
into this
Pr_g33687.t1 GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1 GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565
Thank you,
Rom
What have you tried? This is a pretty straightforward operation using dplyr's group_by and summarise.
I am trying to use dplyr to process my data after seeing your comment, but it's not straightforward. All the examples I found process different types of data. I will keep digging into this. Perhaps methods other than R would help too. Thanks.
You are trying to create one group per column-1 value and combine the unique column-2 values into a comma-separated list. So first you'll need to split col2 on commas so that each col1-col2 pair gets its own row, then take the unique set of pairs from this dataset, and finally group by col1 and join the col2 values with commas.
It is not simple but it is straightforward.
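Since methods other than R were asked about, here is a minimal Python sketch of the same split, de-duplicate, and regroup steps. It uses only the standard library and assumes the two columns are separated by a single tab. The function name merge_go_terms and the sample list are made up for illustration. It keeps genes and GO IDs in the order they first appear, which matches the desired output in the question.

```python
def merge_go_terms(lines):
    """Collapse repeated gene rows into one row of unique GO IDs per gene.

    `lines` is any iterable of strings shaped like "gene<TAB>GO:1,GO:2,...".
    (Hypothetical helper; the tab delimiter is an assumption.)
    """
    merged = {}  # gene -> dict used as an insertion-ordered set of GO IDs
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        gene, go_field = line.split("\t", 1)
        terms = merged.setdefault(gene, {})
        # Split col2 on commas; dict keys drop duplicates but keep order.
        for go_id in go_field.split(","):
            go_id = go_id.strip()
            if go_id:
                terms[go_id] = None
    # Re-join the unique IDs with commas, one row per gene.
    return [f"{gene}\t{','.join(terms)}" for gene, terms in merged.items()]


# Sample rows taken from the question.
sample = [
    "Pr_g33687.t1\tGO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625",
    "Pr_g33687.t1\tGO:0003735,GO:0009129,GO:0006412",
    "Pr_g15244.t1\tGO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
    "Pr_g15244.t1\tGO:0003700,GO:0006355,GO:0043565",
]

for row in merge_go_terms(sample):
    print(row)
```

To run it on a real file, pass an open file handle instead of the list, for example `merge_go_terms(open("genes_go.tsv"))`, where the file name is a placeholder.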
That makes sense. The detailed explanation helps a lot. Thanks for your help.