I want to create a .gmt file for visualizing gene enrichment sets in the EnrichmentMap plugin for Cytoscape.
My input file is like this:
scigt000016 GO:0005515
scigt000021 GO:0005515
scigt000027 GO:0044464
scigt000010 GO:0005515
scigt000011 GO:0015074
And I want to convert it to this:
GO:0005515 NA scigt000016 scigt000021 scigt000010
GO:0044464 NA scigt000027
GO:0015074 NA scigt000011
So basically, I want the GO term in column 1, some placeholder text in column 2, and then, on the same line, every gene from column 1 of the input file that is annotated with that GO term.
I was thinking of using a for loop that greps each GO term from column 2 of the input file and then appends column 1 of every matching line to the end of the output line. But really, I am quite lost here.
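The grouping described above can be sketched in Python without a grep per term: read the file once and collect genes under each GO term. This is only a minimal sketch, not a definitive implementation. The function name pairs_to_gmt and its description parameter are hypothetical, the input is assumed to be whitespace-separated with the gene in column 1 and the GO term in column 2, and the output is assumed to be tab-separated, as the GMT format expects.

```python
def pairs_to_gmt(lines, description="NA"):
    """Group gene IDs by GO term and return one GMT row per term.

    Assumes each input line is "<gene> <GO term>" separated by whitespace.
    """
    groups = {}  # dicts keep insertion order (Python 3.7+), so terms stay in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene, term = fields[0], fields[1]
        groups.setdefault(term, []).append(gene)
    # GMT row layout: set name, description, then the member genes, all tab-separated
    return ["\t".join([term, description] + genes) for term, genes in groups.items()]


example = [
    "scigt000016 GO:0005515",
    "scigt000021 GO:0005515",
    "scigt000027 GO:0044464",
    "scigt000010 GO:0005515",
    "scigt000011 GO:0015074",
]
for row in pairs_to_gmt(example):
    print(row)
```

To run it on a real file, pass an open file handle instead of the example list, since iterating over a file yields its lines.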
Beautiful solution, thanks!
This command is great, but with long lines it gives truncated output, which damages the format.
What is happening with the colon in this one-liner? When I use this code with gene-set names that contain a colon, it works beautifully. However, to avoid downstream hiccups in R, I tend to strip non-alphanumeric characters from my variables. If I run the code with those gene-set names, the "NA" appears somewhere in the middle of the gene-set names. So what is awk doing with (NR==1?"":"\n"), and is there a way of producing the same result with gene-set names that lack the colon, keeping the same format conversion of gene_name "\t" gene_set -> gene_set "\t" "NA" "\t" gene_name?
Thanks!
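On the awk expression itself: (NR==1?"":"\n") is a ternary that yields an empty string for the first input record and a newline for every later one, so it is typically used to finish the previous output row before starting a new one. The one-liner under discussion is not reproduced here, so why "NA" lands inside colon-free names cannot be diagnosed from this thread. As a hedged alternative, here is a minimal Python sketch of the same separator idea that splits only on tabs, so a colon in the set name plays no role. The function name sorted_pairs_to_gmt is hypothetical, and the input is assumed to be gene_name "\t" gene_set lines already sorted by gene_set.

```python
def sorted_pairs_to_gmt(lines, description="NA"):
    """Convert gene<TAB>set lines, pre-sorted by set name, into GMT text.

    Assumes exactly a tab separates the gene name from the set name.
    """
    out = []
    previous = None  # set name of the row currently being written
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene, gene_set = fields[0], fields[1]
        if gene_set != previous:
            # Same role as awk's (NR==1?"":"\n"): close the previous row,
            # except before the very first row
            out.append("" if previous is None else "\n")
            out.append(gene_set + "\t" + description)
            previous = gene_set
        out.append("\t" + gene)
    if out:
        out.append("\n")  # terminate the last row
    return "".join(out)


print(sorted_pairs_to_gmt(["geneA\tSET1", "geneB\tSET1", "geneC\tSET2"]), end="")
```

If the input is not sorted by set name, the same set would show up on several rows, so sort it first or group the genes in a dictionary instead.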