How to extract sequence ID and COG identifiers from a file?
3
0
Entering edit mode
6.4 years ago
Anuj Tyagi • 0

Hi, I have a text file containing sequence IDs and COG identifiers in following format:

ID=ANJKHBIP_00005;eC_number=6.3.4.2;Name=pyrG_1;dbxref=COG:COG0504;gene=pyrG_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0A7E5;locus_tag=ANJKHBIP_00005;product=CTP synthase

ID=ANJKHBIP_00037;eC_number=1.2.1.5;Name=puuC_1;dbxref=COG:COG1012;gene=puuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P23883;locus_tag=ANJKHBIP_00037;product=NADP/NAD-dependent aldehyde dehydrogenase PuuC

ID=ANJKHBIP_00038;eC_number=3.6.3.34;Name=fhuC_1;dbxref=COG:COG1120;gene=fhuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P07821;locus_tag=ANJKHBIP_00038;product=Iron(3+)-hydroxamate import ATP-binding protein FhuC

ID=ANJKHBIP_00043;Name=rpsJ_1;dbxref=COG:COG0051;gene=rpsJ_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q6N4T6;locus_tag=ANJKHBIP_00043;product=30S ribosomal protein S10

From this file I want to extract sequence IDs and COGs in tab separated format something like below:

ANJKHBIP_00005   COG0504

ANJKHBIP_00037   COG1012

and so on...........

I will highly appreciate if anyone can please suggest me a simple code to do this. I tried to use grep and cut options but could not get the things in proper format required above.

next-gen annotation • 2.7k views
ADD COMMENT
0
Entering edit mode

Did you try something in Perl or Python ?

ADD REPLY
3
Entering edit mode
6.4 years ago
 tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | paste - -
ADD COMMENT
1
Entering edit mode

I don't think OP wants to keep COG: information

tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | cut -d ':' -f 2 | paste - -
ADD REPLY
0
Entering edit mode

Thanks a lot. it was keeping COG: information earlier. Now it is in exactly my required format.

ADD REPLY
3
Entering edit mode
6.4 years ago

I think you can do this trick with a single Unix command line but as you are actually learning (I guess) how to parse a text file, it's more interesting to understand the step by step process to parse any file at your convinience.

This is a python script.

#Make your new file ready
new_file = open("your_new_file.csv", "a")
#Open txt file
with open("your_file.txt", 'r') as cog_file:
    #For each line of this file
    for line in cog_file:
        #If the line start with ID
        if line.startswith('ID'):
            #Split the line on ; and put the result in an array
            items = line.split(";")
            #For each item in array
            for i in items:
                #You can get a key/value splitting on =
                key = i.split("=")[0]
                value = i.split("=")[1]
                #If the key is ID
                if key == "ID":
                    #Write it in your new file
                    new_file.write(value)
                #If the key is dbxref
                if key == "dbxref":
                    new_file.write("\t")
                    #Split the COG term on : to remove COG:
                    just_cog_code = value.split(":")[1]
                    #Write it in your new file
                    new_file.write(just_cog_code)
            #Line back
            new_file.write("\n")
ADD COMMENT
0
Entering edit mode

Hi Bastien, thanks for this script. Yes, I have been learning python but still not very familiar with file parsing. Interesting!

ADD REPLY
1
Entering edit mode
6.4 years ago

Assuming that gene is always after cog id:

$ grep -Po -i '(?<=ID=).*(?=;gene)' ids.text | sed 's/;.*:/\t/g'
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051

with sed:

$ sed 's/ID=\(\w\+\).*\(COG[0-9]\+\).*/\1\t\2/g' ids.text 
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051
ADD COMMENT
0
Entering edit mode

Thanks. yes it does the job.

ADD REPLY

Login before adding your answer.

Traffic: 1920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6