Split comma-separated list of GO terms into multiple rows, keeping the gene identifier on each row
5.7 years ago
Biogeek ▴ 470

I have a tab-separated dataset in which the GO terms for each gene are comma-separated.

GENEID1   GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

I want to reshape it so that each GO term is on its own line together with the gene identifier, and the output stays tab-separated:

GENEID1  GO:XXXXX
GENEID1  GO:YYYYYY
GENEID1  GO:ZZZZZZ

Many thanks.

Gene ontology data manipulation • 2.3k views

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

Is it all tab-separated? There are what look like both tabs and commas in your example input.
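
One quick way to check is to look at how a raw line splits. This is only a sketch, assuming the layout described in the question (gene ID, one tab, then a comma-separated GO list); `describe_line` is a hypothetical helper name, not part of any library.

```python
def describe_line(line):
    """Report how a raw line splits on tabs and commas (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "n_tab_fields": len(fields),
        # Count GO terms only if a second tab-separated column exists
        "n_go_terms": len(fields[1].split(",")) if len(fields) > 1 else 0,
    }

print(describe_line("GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n"))
```

If `n_tab_fields` comes back as 1, the columns are probably separated by spaces rather than tabs.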

A Perl one-liner could be:

perl -ane '{print map {$F[0]."\t".$_."\n" } @F[1..$#F] }' your_input_file |sed -s 's/,$//'

This outputs the same as the input if your_input_file is created with echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > your_input_file. Might there be a slight typo?
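
The likely cause is that Perl's `-a` autosplit breaks the line on whitespace only, so the whole GO list stays in a single field and the line is printed back as it came in. A rough Python analogue of that behaviour (not the Perl code itself, just an illustration with made-up variable names):

```python
line = "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ\n"

# Whitespace split, analogous to Perl's -a autosplit: commas are not separators
fields = line.split()

# Pair the first field with every remaining field, as the one-liner does
rebuilt = "".join(fields[0] + "\t" + rest + "\n" for rest in fields[1:])

print(fields)           # two fields; the GO list is still one string
print(rebuilt == line)  # True: the line is reproduced unchanged
```

Splitting the second field on commas is the missing step.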

5.7 years ago
jean.elbers ★ 1.7k

Do you want to do this in R (possible) or other tools (also possible)?

echo -e "GENEID1\tGO:XXXXX,GO:YYYYYY,GO:ZZZZZZ" > test.txt

In R:

library("tidyr")
test <- read.table("test.txt", sep = "\t", header = FALSE)
test
#        V1                           V2
# 1 GENEID1 GO:XXXXX,GO:YYYYYY,GO:ZZZZZZ

# Use tidyr::separate_rows() to convert A1\tGO:1,GO:2 into
#   A1  GO:1
#   A1  GO:2
test2 <- tidyr::separate_rows(data = test, V2, sep = ",")

test2
#        V1        V2
# 1 GENEID1  GO:XXXXX
# 2 GENEID1 GO:YYYYYY
# 3 GENEID1 GO:ZZZZZZ

Many thanks for the R version!

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

5.7 years ago
microfuge ★ 2.0k

My apologies for the mistake. Can you please try this:

cat your_input_file |perl -ane '{print map {$F[0]."\t".$_."\n" } split (/,/,$F[1]) }'

Perfect! This works. Thanks very much!

5.7 years ago
st.ph.n ★ 2.7k
#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as f:
    for line in f:
        fields = line.strip().split('\t')
        if len(fields) < 2:
            continue  # skip blank lines and rows without a GO column
        gene_id, go_terms = fields[0], fields[1]
        # Print one "gene<TAB>term" row per comma-separated GO term
        for term in go_terms.split(','):
            print(gene_id + '\t' + term)

Save as go_tab.py, run as python go_tab.py input.txt > output.txt
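
If the same logic is needed inside a larger pipeline, it could also be wrapped in a function built on the standard `csv` module. This is a sketch rather than a drop-in replacement: `split_go_rows` is a hypothetical name, and skipping rows with no GO column and tolerating spaces after commas are assumptions that go beyond what the question asks for.

```python
import csv
import io

def split_go_rows(in_handle, out_handle):
    """Write one gene/GO-term pair per row (hypothetical helper)."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    for row in reader:
        if len(row) < 2:
            continue  # assumption: rows without a GO column are skipped
        gene_id, go_list = row[0], row[1]
        for term in go_list.split(","):
            term = term.strip()  # assumption: tolerate "GO:1, GO:2"
            if term:             # ignore empty entries such as a trailing comma
                writer.writerow([gene_id, term])

# Small in-memory demonstration with made-up input
demo_in = io.StringIO("GENEID1\tGO:XXXXX,GO:YYYYYY, GO:ZZZZZZ\nGENEID2\t\n")
demo_out = io.StringIO()
split_go_rows(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

For real files, the handles can come from `open(path, newline="")`, with `sys.stdout` as the output handle.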

Many thanks for the python version also!
