Make a .gmt file using bash for loop
2
2
Entering edit mode
10.0 years ago
jon.brate ▴ 310

I want to create a .gmt file for visualizing Gene enrichment sets in the EnrichmentMap plugin for cytoscape

My input file is like this:

scigt000016    GO:0005515
scigt000021    GO:0005515
scigt000027    GO:0044464
scigt000010    GO:0005515
scigt000011    GO:0015074

And I want to convert it like this:

GO:0005515 NA scigt000016 scigt000021 scigt000010
GO:0044464 NA scigt000027
GO:0015074 NA scigt000011

So basically putting the GO-term in column 1, some random text in col2 and the genes from col1 in the input file on a line after the go-term.

I was thinking to use a for loop that greps the go-term from column 2 in the input file, then for each line append col1 to the end of the line. But really I am quite lost here.

bash gmt • 5.2k views
ADD COMMENT
6
Entering edit mode
10.0 years ago

sort on 2nd column, use awk to retrieve the 'previous' value of the 2nd column. If it isn't the same, start a new line.

sort -k2,2 input.txt  | awk '{if($2!=prev) {printf("%s%s\tNA\t%s",(NR==1?"":"\n"),$2,$1);prev=$2;next;} else printf(" %s",$1);} END{printf("\n");}'

GO:0005515    NA     scigt000010 scigt000016 scigt000021
GO:0015074    NA     scigt000011
GO:0044464    NA     scigt000027
ADD COMMENT
0
Entering edit mode

Beautiful solution, thanks!

ADD REPLY
0
Entering edit mode

this command is great but in long line give truncate output damage the format.

ADD REPLY
0
Entering edit mode

What is happening with signposting the colon in this one liner? As when I use this code with gene-set names which contain a colon, it works beautifully. However (to avoid downstream hiccups in R), I tend to strip non-alphanumeric characters from my variables. If I run the code with those gene-set names, the "NA" appears somewhere in the middle of the gene-set names. So what is awk doing with (NR==1?"":"\n") and is there a way of producing the same result with gene-set names that lack the colon, but same format conversion of gene_name "\t" gene_set -> gene_set "\t" "NA" "\t" gene_name ?

Thanks!

ADD REPLY
1
Entering edit mode
10.0 years ago
Zhilong Jia ★ 2.2k

imagine aa.t is the input file and bb.t is the output

awk 'BEGIN{FS=OFS="\t"}{data[$1] = data[$1] "\t" $2}END{for (i in data) print i, "NA", data[i]}' aa.t | sed s/\\t\\t/\\t/ - > bb.t

EDIT: the sep should be \t in the input file.

ADD COMMENT

Login before adding your answer.

Traffic: 1746 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6