Hi, I have a text file containing sequence IDs and COG identifiers in the following format:
ID=ANJKHBIP_00005;eC_number=6.3.4.2;Name=pyrG_1;dbxref=COG:COG0504;gene=pyrG_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0A7E5;locus_tag=ANJKHBIP_00005;product=CTP synthase
ID=ANJKHBIP_00037;eC_number=1.2.1.5;Name=puuC_1;dbxref=COG:COG1012;gene=puuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P23883;locus_tag=ANJKHBIP_00037;product=NADP/NAD-dependent aldehyde dehydrogenase PuuC
ID=ANJKHBIP_00038;eC_number=3.6.3.34;Name=fhuC_1;dbxref=COG:COG1120;gene=fhuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P07821;locus_tag=ANJKHBIP_00038;product=Iron(3+)-hydroxamate import ATP-binding protein FhuC
ID=ANJKHBIP_00043;Name=rpsJ_1;dbxref=COG:COG0051;gene=rpsJ_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q6N4T6;locus_tag=ANJKHBIP_00043;product=30S ribosomal protein S10
From this file I want to extract the sequence IDs and COGs in tab-separated format, something like this:
ANJKHBIP_00005 COG0504
ANJKHBIP_00037 COG1012
and so on.
I would highly appreciate it if anyone could suggest some simple code to do this. I tried using grep and cut but could not get the output into the format shown above.
Did you try something in Perl or Python?
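For what it's worth, here is a minimal Python sketch of one way to do it, assuming every line holds semicolon-separated key=value pairs with an ID= field and a dbxref entry of the form COG:COGnnnn, as in the sample above. The function names (extract_id_cog, extract_pairs) and the file name in the usage note are made up for illustration. Lines without a COG entry are skipped.

```python
import re

# ID= must start the line or follow a ';' so other keys ending in "ID" are not matched.
ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")
# Assumes COG cross-references look like "COG:COG0504", as in the sample lines.
COG_RE = re.compile(r"COG:(COG\d+)")


def extract_id_cog(line):
    """Return (sequence_id, cog) for one attribute line, or None if either is missing."""
    id_match = ID_RE.search(line)
    cog_match = COG_RE.search(line)
    if id_match and cog_match:
        return id_match.group(1), cog_match.group(1)
    return None


def extract_pairs(lines):
    """Yield one tab-separated 'ID<TAB>COG' string per line that has both fields."""
    for line in lines:
        pair = extract_id_cog(line.strip())
        if pair is not None:
            yield "\t".join(pair)


# Small demo on two lines shortened from the sample in the question.
sample = [
    "ID=ANJKHBIP_00005;eC_number=6.3.4.2;Name=pyrG_1;dbxref=COG:COG0504;gene=pyrG_1",
    "ID=ANJKHBIP_00043;Name=rpsJ_1;dbxref=COG:COG0051;gene=rpsJ_1",
]
for row in extract_pairs(sample):
    print(row)
```

To run it on a real file, something like `with open("annotations.txt") as fh: print("\n".join(extract_pairs(fh)))` should work, where annotations.txt is a placeholder for your own file name.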