How to write a script to split GO terms one per line
7.3 years ago
Anny ▴ 30

Hi all,

I have a file whose first column contains an ID and whose second column contains the annotated Gene Ontology (GO) terms, separated by "|", as follows:

CPIW_00004002-RA    GO:0005515
CPIW_00004002-RA    GO:0010997|GO:0097027|GO:1904668
CPIW_00004003-RA    GO:0003824|GO:0008152
CPIW_00004003-RA    GO:0003987|GO:0016208|GO:0019427
CPIW_00004004-RA    GO:0006506|GO:0016021|GO:0016758
CPIW_00004005-RA    GO:0004360|GO:1901137
CPIW_00004005-RA    GO:0097367|GO:1901135
CPIW_00004006-RA    GO:0005515
CPIW_00004007-RA    GO:0016787
CPIW_00004016-RA    GO:0003824|GO:0046872

I want to split them so that each line has one ID and one GO term, like this:

CPIW_00004002-RA    GO:0005515
CPIW_00004002-RA    GO:0010997
CPIW_00004002-RA    GO:0097027
CPIW_00004002-RA    GO:1904668
CPIW_00004003-RA    GO:0003824
CPIW_00004003-RA    GO:0008152

How can I write a script to do this?

Thanks!

Alexie

python linux perl • 1.6k views

This is a programming question, not a bioinformatics one. Ask on StackOverflow.

7.3 years ago
awk -v OFS='\t' '{n=split($2,a,/\|/); for(i=1;i<=n;++i) print $1,a[i]}' input.txt
7.3 years ago
st.ph.n ★ 2.7k
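#!/usr/bin/env python3
import sys

# Pass the input file as the first command-line argument
with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, go_terms = fields[0], fields[1]
        # Print one "id<TAB>GO term" line per term
        for term in go_terms.split('|'):
            print(gene_id, term, sep='\t')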
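
The same per-term loop can also be sketched as a small generator function, which is easier to reuse and to check against a few sample lines. This is only an illustration: the names `split_go_terms` and `sample` are hypothetical, and it assumes the two columns are separated by whitespace (tabs or spaces) and that the GO terms themselves contain no spaces.

```python
def split_go_terms(lines):
    """Yield (gene_id, go_term) pairs, one pair per GO term."""
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, go_terms = fields[0], fields[1]
        for term in go_terms.split("|"):
            yield gene_id, term


# Two lines taken from the example input in the question
sample = [
    "CPIW_00004002-RA\tGO:0010997|GO:0097027|GO:1904668",
    "CPIW_00004003-RA\tGO:0003824|GO:0008152",
]
for gene_id, term in split_go_terms(sample):
    print(gene_id, term, sep="\t")
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample`.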
7.3 years ago
Chirag Parsania ★ 2.0k

Using R

library("psych") ## provides read.clipboard()
dat <- read.clipboard(header = FALSE) ## read data from clipboard
dat <- apply(dat, 2, as.character)
## lapply always returns a list, even when every row has the same number of GO terms
out <- lapply(seq_len(nrow(dat)), function(i) {
        geneId <- dat[i, 1]
        goIds <- dat[i, 2]
        splitted <- unlist(strsplit(goIds, '\\|'))
        cbind(geneID = rep(geneId, length(splitted)), splitted)
})

do.call("rbind", out)

EDIT: A second alternative, in only a few lines

library("psych") ## provides read.clipboard()
dat <- read.clipboard(header = FALSE) ## read data from clipboard
out <- lapply(as.character(dat$V2),function(elem){unlist(strsplit(elem,"\\|"))})
cbind(rep(as.character(dat$V1),lengths(out)), unlist(out))