Sorting repetitive Gene Ontology entries in one line
5 months ago
resug ▴ 40

Dear Biostars community,

I have a tab-separated file with two columns: the first contains gene names and the second contains comma-separated Gene Ontology IDs. I need exactly one row per gene name, with each of its Gene Ontology IDs listed only once, so that duplicates within a row and across rows for the same gene are discarded.

From this

Pr_g33687.t1    GO:0003735,GO:0003735,GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625
Pr_g33687.t1    GO:0003735,GO:0009129,GO:0006412
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565
Pr_g15244.t1    GO:0003700,GO:0006355,GO:0043565

into this

Pr_g33687.t1    GO:0003735,GO:0005840,GO:0006412,GO:0022618,GO:0022625,GO:0009129
Pr_g15244.t1    GO:0000978,GO:0003700,GO:0005634,GO:0006357,GO:0034605,GO:0006355,GO:0043565

Thank you,
Rom

Gene-Ontology • 700 views

What have you tried? This is a pretty straightforward operation using dplyr's group_by and summarise.

I tried using dplyr after seeing your comment, but it is not straightforward for me. All the examples I found process different types of data. I will keep digging into this. Perhaps methods other than R would help too. Thanks.

You are trying to create one group per column-1 value and combine the unique column-2 values of that group into one comma-separated list. So first you'll need to split col2 on commas so that each col1-col2 pair gets its own row, then take the unique set of pairs from this dataset, and finally group by col1 and join the col2 values with commas.

It is not simple but it is straightforward.
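
For readers who would rather not use R, the three steps above can be sketched as a plain shell pipeline. This is only a minimal illustration, assuming a tab-separated input file and standard `awk` and `sort`; the function name `merge_go_sorted` is made up for this example. Because the unique pairs come from `sort -u`, genes and GO IDs come out in sorted order rather than in the order they appear in the input.

```shell
# Hypothetical helper; the name merge_go_sorted is illustrative only.
# Stage 1 (awk): split column 2 on commas so every gene/GO pair gets its own row.
# Stage 2 (sort -u): keep each pair once; this also sorts by gene, then by GO ID.
# Stage 3 (awk): group adjacent rows by gene and join the GO IDs with commas.
merge_go_sorted() {
    awk -F'\t' '{ n = split($2, a, ","); for (i = 1; i <= n; i++) print $1 FS a[i] }' "$1" |
    LC_ALL=C sort -u |
    awk -F'\t' '
        $1 != prev { if (NR > 1) print line; prev = $1; line = $1 FS $2; next }
        { line = line "," $2 }
        END { if (NR > 0) print line }
    '
}
```

It could be called as `merge_go_sorted input.tsv > output.tsv`, where both file names are placeholders.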

That makes sense. The detailed explanation helps a lot. Thanks for your help.

5 months ago
resug ▴ 40

I also came across a solution elsewhere, using the following command:

$ awk -F'\t' '{n=split($2,a,","); for(i=1;i<=n;i++) print $1 FS a[i]}' input.tsv | datamash -s -g1 unique 2 > output.tsv
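
One thing to be aware of: datamash's `unique` operation returns the values in sorted order, so the GO IDs will not appear in the same order as in the desired output shown in the question. If first-appearance order matters, or datamash is not installed, a single awk pass may be enough. The following is a rough sketch under that assumption; the function name `merge_go_ordered` and the array names are illustrative only.

```shell
# Hypothetical helper; the name merge_go_ordered is illustrative only.
merge_go_ordered() {
    awk -F'\t' '
    {
        # Remember genes in the order they first appear.
        if (!($1 in merged)) { order[++ngenes] = $1; merged[$1] = "" }
        n = split($2, ids, ",")
        for (i = 1; i <= n; i++) {
            # Skip empty fields and gene/GO pairs that were already recorded.
            if (ids[i] == "" || seen[$1, ids[i]]++) continue
            merged[$1] = (merged[$1] == "" ? ids[i] : merged[$1] "," ids[i])
        }
    }
    END {
        # Print one row per gene, with its GO IDs joined by commas.
        for (g = 1; g <= ngenes; g++) print order[g] FS merged[order[g]]
    }' "$1"
}
```

Usage would look like `merge_go_ordered input.tsv > output.tsv`. On the example input from the question, this keeps both genes and GO IDs in the order in which they first occur.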

Nice! Please go ahead and accept your answer to provide closure to the post.
