Question

Sorting repetitive Gene Ontology entries in one line

9 months ago

resug ▴ 40

Dear Biostars community,

I have a tab-separated file with two columns: the first contains gene names, and the second contains Gene Ontology IDs separated by commas. I need to collapse it so that each gene name appears on exactly one row, with each of its Gene Ontology IDs listed only once, removing both duplicate IDs and repeated rows for the same gene.

From this

Pr_g33687.t1    GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1    GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565

into this

Pr_g33687.t1    GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565

Thank you,
Rom

Gene-Ontology • 988 views

updated 9 months ago by Ram 45k • written 9 months ago by resug ▴ 40

What have you tried? This is a pretty straightforward operation using dplyr's group_by and summarise.

9 months ago by Ram 45k

I am trying to use dplyr to process my data after seeing your comment, but it's not straightforward. All the examples I found process different types of data. I will keep digging into this. Perhaps methods other than R would help too. Thanks.

9 months ago by resug ▴ 40

You are trying to create one group per column-1 value and combine the unique column-2 values into a comma-separated list. So first you'll need to split col2 on commas so that each col1-col2 pair gets its own row, then take the unique set of pairs from this dataset, and finally group by col1 and join the col2 values with commas.

It is not simple but it is straightforward.

9 months ago by Ram 45k

That makes sense. The detailed explanation helps a lot. Thanks for your help.

9 months ago by resug ▴ 40

score 2 · Accepted Answer · 2024-06-27

9 months ago

resug ▴ 40

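I also found a solution to this question using the following command, found elsewhere. awk splits column 2 on commas and prints one gene/GO pair per line; datamash then groups by column 1 and collapses column 2 to its unique values. Note that datamash -g only groups adjacent lines with the same key (add -s to sort the input first if a gene's rows are scattered), and that unique returns the values in sorted order: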

$ awk -F'\t' '{n=split($2,a,","); for(i=1;i<=n;i++) print $1 FS a[i]}' input.tsv | datamash -g1 unique 2 > output.tsv