Question

How to extract sequence ID and COG identifiers from a file?

0

Entering edit mode

6.4 years ago

Anuj Tyagi • 0

Hi, I have a text file containing sequence IDs and COG identifiers in following format:

ID=ANJKHBIP_00005;eC_number=6.3.4.2;Name=pyrG_1;dbxref=COG:COG0504;gene=pyrG_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0A7E5;locus_tag=ANJKHBIP_00005;product=CTP synthase

ID=ANJKHBIP_00037;eC_number=1.2.1.5;Name=puuC_1;dbxref=COG:COG1012;gene=puuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P23883;locus_tag=ANJKHBIP_00037;product=NADP/NAD-dependent aldehyde dehydrogenase PuuC

ID=ANJKHBIP_00038;eC_number=3.6.3.34;Name=fhuC_1;dbxref=COG:COG1120;gene=fhuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P07821;locus_tag=ANJKHBIP_00038;product=Iron(3+)-hydroxamate import ATP-binding protein FhuC

ID=ANJKHBIP_00043;Name=rpsJ_1;dbxref=COG:COG0051;gene=rpsJ_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q6N4T6;locus_tag=ANJKHBIP_00043;product=30S ribosomal protein S10

From this file I want to extract sequence IDs and COGs in tab separated format something like below:

ANJKHBIP_00005   COG0504

ANJKHBIP_00037   COG1012

and so on...........

I will highly appreciate if anyone can please suggest me a simple code to do this. I tried to use grep and cut options but could not get the things in proper format required above.

next-gen annotation • 2.7k views

ADD COMMENT • link updated 6.4 years ago by Bastien Hervé 5.9k • written 6.4 years ago by Anuj Tyagi • 0

0

Entering edit mode

Did you try something in Perl or Python ?

ADD REPLY • link 6.4 years ago by Bastien Hervé 5.9k

score 3 · Accepted Answer · 2018-07-06

3

Entering edit mode

6.4 years ago

Pierre Lindenbaum 164k

 tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | paste - -

ADD COMMENT • link 6.4 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

I don't think OP wants to keep COG: information

tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | cut -d ':' -f 2 | paste - -

ADD REPLY • link 6.4 years ago by Bastien Hervé 5.9k

0

Entering edit mode

Thanks a lot. it was keeping COG: information earlier. Now it is in exactly my required format.

ADD REPLY • link 6.4 years ago by Anuj Tyagi • 0

score 3 · Accepted Answer · 2018-07-06

I think you can do this trick with a single Unix command line but as you are actually learning (I guess) how to parse a text file, it's more interesting to understand the step by step process to parse any file at your convinience.

This is a python script.

#Make your new file ready
new_file = open("your_new_file.csv", "a")
#Open txt file
with open("your_file.txt", 'r') as cog_file:
    #For each line of this file
    for line in cog_file:
        #If the line start with ID
        if line.startswith('ID'):
            #Split the line on ; and put the result in an array
            items = line.split(";")
            #For each item in array
            for i in items:
                #You can get a key/value splitting on =
                key = i.split("=")[0]
                value = i.split("=")[1]
                #If the key is ID
                if key == "ID":
                    #Write it in your new file
                    new_file.write(value)
                #If the key is dbxref
                if key == "dbxref":
                    new_file.write("\t")
                    #Split the COG term on : to remove COG:
                    just_cog_code = value.split(":")[1]
                    #Write it in your new file
                    new_file.write(just_cog_code)
            #Line back
            new_file.write("\n")

score 1 · Accepted Answer · 2018-07-06

1

Entering edit mode

6.4 years ago

cpad0112 21k

Assuming that gene is always after cog id:

$ grep -Po -i '(?<=ID=).*(?=;gene)' ids.text | sed 's/;.*:/\t/g'
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051

with sed:

$ sed 's/ID=\(\w\+\).*\(COG[0-9]\+\).*/\1\t\2/g' ids.text 
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051