Question

How to write a script to split GO terms, one per line

0

8.1 years ago

Anny ▴ 30

Hi all,

I have a file whose first column contains an ID and whose second column contains the annotated Gene Ontology (GO) terms, as follows:

CPIW_00004002-RA    GO:0005515
CPIW_00004002-RA    GO:0010997|GO:0097027|GO:1904668
CPIW_00004003-RA    GO:0003824|GO:0008152
CPIW_00004003-RA    GO:0003987|GO:0016208|GO:0019427
CPIW_00004004-RA    GO:0006506|GO:0016021|GO:0016758
CPIW_00004005-RA    GO:0004360|GO:1901137
CPIW_00004005-RA    GO:0097367|GO:1901135
CPIW_00004006-RA    GO:0005515
CPIW_00004007-RA    GO:0016787
CPIW_00004016-RA    GO:0003824|GO:0046872

I want to split them so that each line has one ID and one GO term, like this:

CPIW_00004002-RA    GO:0005515
CPIW_00004002-RA    GO:0010997
CPIW_00004002-RA    GO:0097027
CPIW_00004002-RA    GO:1904668
CPIW_00004003-RA    GO:0003824
CPIW_00004003-RA    GO:0008152

How can I write a script to do this?

Thanks!

Alexie

python linux perl • 2.0k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 8.1 years ago by Anny ▴ 30

0

This is a programming question, not a bioinformatics one. Ask on StackOverflow.

ADD REPLY • link 8.1 years ago by Jean-Karim Heriche 27k

1

8.1 years ago

Chirag Parsania ★ 2.0k

Using R

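library("psych")  # provides read.clipboard()
# Read the two-column table from the clipboard and coerce every column to character
dat <- read.clipboard(header = FALSE)
dat <- apply(dat, 2, as.character)
# Build one (geneID, GO term) matrix per input row. lapply() always returns a list,
# whereas apply() simplifies to a matrix when every row has the same number of terms.
out <- lapply(seq_len(nrow(dat)), function(i) {
        terms <- unlist(strsplit(dat[i, 2], "\\|"))
        cbind(geneID = rep(dat[i, 1], length(terms)), goID = terms)
})

# Stack the per-row matrices into one long table
do.call("rbind", out)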

EDIT: A second alternative, in only a few lines

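library("psych")  # provides read.clipboard()
dat <- read.clipboard(header = FALSE)
# Split each GO field on "|" to get a list with one character vector per row
out <- strsplit(as.character(dat$V2), "\\|")
# Repeat each ID once per GO term and pair it with the flattened terms
cbind(rep(as.character(dat$V1), lengths(out)), unlist(out))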

ADD COMMENT • link 8.1 years ago by Chirag Parsania ★ 2.0k

score 2 · Accepted Answer · 2017-07-26

2

8.1 years ago

Pierre Lindenbaum 166k

 awk '{n=split($2,a,/\|/); for(i=1;i<=n;++i) print $1,a[i];}' input.txt

ADD COMMENT • link 8.1 years ago by Pierre Lindenbaum 166k

score 2 · Accepted Answer · 2017-07-26

2

8.1 years ago

st.ph.n ★ 2.7k

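#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as f:
        for line in f:
                fields = line.split()  # splits on tabs or runs of spaces
                if len(fields) < 2:
                        continue  # skip blank or incomplete lines
                gene_id, go_terms = fields[0], fields[1]
                for term in go_terms.split('|'):
                        print(gene_id, term, sep='\t')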
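
If you want to check the logic without an input file, the same idea can be sketched as a small Python 3 generator. This is only an illustration: the function name `split_go_terms` is hypothetical, and it assumes the two columns are separated by tabs or spaces and that the GO terms are joined with `|`, as in the question.

```python
def split_go_terms(lines):
    """Yield (gene_id, go_term) pairs, one pair per GO term."""
    for line in lines:
        fields = line.split()  # assumption: columns separated by tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        gene_id, go_field = fields[0], fields[1]
        for term in go_field.split("|"):
            if term:  # ignore empty pieces such as a trailing "|"
                yield gene_id, term


if __name__ == "__main__":
    # Two rows copied from the question
    example = [
        "CPIW_00004002-RA    GO:0005515",
        "CPIW_00004002-RA    GO:0010997|GO:0097027|GO:1904668",
    ]
    for gene_id, term in split_go_terms(example):
        print(f"{gene_id}\t{term}")
```

Because it accepts any iterable of lines, you can also pass it an open file handle directly.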

ADD COMMENT • link 8.1 years ago by st.ph.n ★ 2.7k