Concatenating GMT files from MSigDB

Hi everyone!

I have an issue when trying to concatenate several .gmt files downloaded from the MSigDB site. I want to build a custom gmt file from the pathways that I'm interested in from KEGG and REACTOME (and eventually GO BP). However, when I use the following line to store the result in a new gmt or txt file:

cat KEGG_APOPTOSIS.gmt KEGG_ABC_TRANSPORTERS.gmt > my_gmt.gmt

I get the following output:

KEGG_APOPTOSIS  > Apoptosis AIFM1   AKT1    AKT2    AKT3    APAF1   ATM BAD BAX BCL2    BCL2L1  BID BIRC2   BIRC3   CAPN1   CAPN2   CASP10  CASP3   CASP6   CASP7   CASP8   CASP9   CFLAR   CHP1    CHP2    CHUK    CSF2RB  CYCS    DFFA    DFFB    ENDOD1  ENDOG   EXOG    FADD    FAS FASLG   IKBKB   IKBKG   IL1A    IL1B    IL1R1   IL1RAP  IL3 IL3RA   IRAK1   IRAK2   IRAK3   IRAK4   MAP3K14 MYD88   NFKB1   NFKBIA  NGF NTRK1   PIK3CA  PIK3CB  PIK3CD  PIK3CG  PIK3R1  PIK3R2  PIK3R3  PIK3R5  PPP3CA  PPP3CB  PPP3CC  PPP3R1  PPP3R2  PRKACA  PRKACB  PRKACG  PRKAR1A PRKAR1B PRKAR2A PRKAR2B PRKX    RELA    RIPK1   TNF TNFRSF10A   TNFRSF10B   TNFRSF10C   TNFRSF10D   TNFRSF1A    TNFSF10 TP53    TRADD   TRAF2   XIAPKEGG_ABC_TRANSPORTERS   > ABC transporters  ABCA1   ABCA10  ABCA12  ABCA13  ABCA2   ABCA3   ABCA4   ABCA5   ABCA6   ABCA7   ABCA8   ABCA9   ABCB1   ABCB10  ABCB11  ABCB4   ABCB5   ABCB6   ABCB7   ABCB8   ABCB9   ABCC1   ABCC10  ABCC11  ABCC12  ABCC2   ABCC3   ABCC4   ABCC5   ABCC6   ABCC8   ABCC9   ABCD1   ABCD2   ABCD3   ABCD4   ABCG1   ABCG2   ABCG4   ABCG5   ABCG8   CFTR    TAP1    TAP2

What I need in the new .gmt file is one pathway per row, like this:

KEGG_APOPTOSIS  > Apoptosis AIFM1   AKT1    AKT2    AKT3    APAF1   ATM BAD BAX BCL2    BCL2L1  BID BIRC2   BIRC3   CAPN1   CAPN2   CASP10  CASP3   CASP6   CASP7   CASP8   CASP9   CFLAR   CHP1    CHP2    CHUK    CSF2RB  CYCS    DFFA    DFFB    ENDOD1  ENDOG   EXOG    FADD    FAS FASLG   IKBKB   IKBKG   IL1A    IL1B    IL1R1   IL1RAP  IL3 IL3RA   IRAK1   IRAK2   IRAK3   IRAK4   MAP3K14 MYD88   NFKB1   NFKBIA  NGF NTRK1   PIK3CA  PIK3CB  PIK3CD  PIK3CG  PIK3R1  PIK3R2  PIK3R3  PIK3R5  PPP3CA  PPP3CB  PPP3CC  PPP3R1  PPP3R2  PRKACA  PRKACB  PRKACG  PRKAR1A PRKAR1B PRKAR2A PRKAR2B PRKX    RELA    RIPK1   TNF TNFRSF10A   TNFRSF10B   TNFRSF10C   TNFRSF10D   TNFRSF1A    TNFSF10 TP53    TRADD   TRAF2   XIAP
KEGG_ABC_TRANSPORTERS   > ABC transporters  ABCA1   ABCA10  ABCA12  ABCA13  ABCA2   ABCA3   ABCA4   ABCA5   ABCA6   ABCA7   ABCA8   ABCA9   ABCB1   ABCB10  ABCB11  ABCB4   ABCB5   ABCB6   ABCB7   ABCB8   ABCB9   ABCC1   ABCC10  ABCC11  ABCC12  ABCC2   ABCC3   ABCC4   ABCC5   ABCC6   ABCC8   ABCC9   ABCD1   ABCD2   ABCD3   ABCD4   ABCG1   ABCG2   ABCG4   ABCG5   ABCG8   CFTR    TAP1    TAP2

Thanks in advance!

ssGSEA RNAseq Pathway KEGG REACTOME • 2.6k views

Hi, it looks as if there is neither an end-line (\n) nor a carriage return (\r) at the end of the line in each file. Perhaps you originally retrieved these files on Windows? If you open them in vi, what do you see at the end of the line? You could try running dos2unix on each file before running cat.
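One way to check this without vi is to look at the last byte of each file. The following is a minimal sketch, not a definitive test: the demo files and the helper name ends_with_newline are made up for illustration, and it assumes a POSIX shell with tail available.

```shell
# Hypothetical demo files: one without and one with a trailing newline.
tmpdir=$(mktemp -d)
printf 'SET_A\tdesc\tGENE1\tGENE2' > "$tmpdir/no_newline.gmt"
printf 'SET_B\tdesc\tGENE3\tGENE4\n' > "$tmpdir/with_newline.gmt"

# ends_with_newline FILE (hypothetical helper): succeeds if the last byte
# of FILE is a newline. Command substitution strips trailing newlines, so
# an empty result means the last byte was "\n" (an empty file also passes).
ends_with_newline() {
  [ -z "$(tail -c 1 "$1")" ]
}

for f in "$tmpdir"/*.gmt; do
  if ends_with_newline "$f"; then
    echo "$(basename "$f"): ends with a newline"
  else
    echo "$(basename "$f"): no trailing newline"
  fi
done
```

A file that reports "no trailing newline" will be glued to the start of the next file when passed to cat, which matches the output shown in the question.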

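If nothing else works, this is a temporary fix that inserts a line break between the two files (plain echo is used because echo -e "\r" would also attach a stray carriage return to the last gene of the first file):

cat KEGG_APOPTOSIS.gmt <(echo) KEGG_ABC_TRANSPORTERS.gmt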

Kevin


Hi Kevin!

Thanks for your answer. First of all, I retrieved these files on a Mac. Second, after viewing the files in vi, I can't tell whether there is an end-line or carriage return at the end of each row. Your suggested code worked well for concatenating just a few files; I'm trying to figure out how to do it with the rest of them.

Thanks Kevin


I am unsure why all of your files are separated in this way, and also unsure about the end-line issue. In any case, if you have many gmt files in the current working directory, then concatenating all of them could be done with:

find . -name "*.gmt" | while IFS= read -r GMT ; do cat "${GMT}" <(echo) ; done | sed '/^[[:space:]]*$/d'

The same code, better structured:

# Append a newline after every file, then drop any blank lines that result.
find . -name "*.gmt" | while IFS= read -r GMT ;
do
  cat "${GMT}" <(echo) ;
done | sed '/^[[:space:]]*$/d'
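As an alternative sketch (not the approach used above), awk can do the same job in one command, because it prints every input line followed by a newline even when the last line of a file lacks one. The demo files and the output name combined.txt below are hypothetical.

```shell
# Hypothetical demo files: the first has no trailing newline.
tmpdir=$(mktemp -d)
printf 'SET_A\tdesc\tGENE1\tGENE2' > "$tmpdir/a.gmt"
printf 'SET_B\tdesc\tGENE3\tGENE4\n' > "$tmpdir/b.gmt"

# sub() strips a trailing carriage return if one is present (Windows line
# endings); the NF pattern then prints only non-blank lines, each followed
# by a newline, so every gene set ends up on its own row.
awk '{ sub(/\r$/, "") } NF' "$tmpdir"/a.gmt "$tmpdir"/b.gmt > "$tmpdir/combined.txt"

cat "$tmpdir/combined.txt"
```

The output is written with a .txt extension on purpose: if the combined file were saved as a .gmt in the same directory, a later run over "*.gmt" would pick it up as input too.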

Once again, Kevin, thanks a lot! Your code worked very well with all the files. Last night I was working on using for loops to concatenate only the KEGG gmt files. However, my approach did not work with the REACTOME and GOBP files... Perhaps it is an issue related to the source files.

Rodo


You're welcome, Rodo. No problem. See you.
