Entering edit mode
3.8 years ago
MSRS
▴
590
Thank you very much for your patience. Thank you very much Ram. I want to merge GO data into fasta file for the DIAMOND database preparation. There is an existing code that works on the COG_2014 database. where the running code is ./merger.py prot2003-2014.fa cog2003-2014.csv cognames2003-2014.tab
like that. Here the input files look
- prot2003-2014.fa
>gi|103485500|ref|YP_615061.1| glutathione S-transferase-like protein [Sphingopyxis alaskensis RB2256] MKLFIGNKAYSSWSLRGWLAARHSGLPFEEVTVPLYNEEWNQRREGDEFAPSGGKVPILWDGADIVVWDSLAIIDYLNEK TGGTRGYWPDDMAARAMARSMAAEMHSSFAALRREHSMNIRRIYPAAELTPEVQADVIRILQIWAEARARFGGEGDYLFG DWSAADMMFAPVVTRFITYSIPLPRFALPYAQAVISHPHMQEWIGGAQAEDWVIEKFEGPVEG >gi|103485503|ref|YP_615064.1| hypothetical protein Sala_0005 [Sphingopyxis alaskensis RB2256] MIARLLILIARAWQLGPSRVLPPTCRYAPSCSEYAIVALRRHGAIKGGWIATKRLLRCHPWGGHGYDPVP >gi|103485501|ref|YP_615062.1| ribosome biogenesis GTP-binding protein YsxC [Sphingopyxis alaskensis RB2256] MSEIELEPGADPERAERARKLFSGPIAFLKSAPALQHLPVPSVPEIAFAGRSNVGKSSLLNALTNRNGLARTSVTPGRTQ ELNYFDVGEPPVFRLVDMPGYGFAKAPKDVVRKWRFLINDYLRGRQVLKRTLVLIDSRHGIKDVDRDVLEMLDTAAVSYR LVLTKADKIKASALADVHAATEAEARKHPAAHPEVIATSSEKGMGIAELRTAVLEAVEL
- cog2003-2014.csv
domain-id genome-name protein-id protein-length domain-start domain-end COG-id membership-class 103485501 Sphingopyxis_alaskensis_RB2256_uid58351 103485501 219 1 219 COG0218 0 103485500 Sphingopyxis_alaskensis_RB2256_uid58351 103485500 223 1 223 COG0625 0 103485503 Sphingopyxis_alaskensis_RB2256_uid58351 103485503 70 1 70 COG0759 0
- cognames2003-2014.tab
COG func name COG0625 O Glutathione S-transferase COG0218 D GTP-binding protein EngB required for normal cell division COG0759 M Membrane-anchored protein YidD, putatitve component of membrane protein insertase Oxa1/YidC/SpoIIIJ
for that existing python code is
#!/usr/bin/env python
import sys, time
prot_file = open(sys.argv[1], "r") # prot2003-2014.fa (note: unzipped)
cog_file = open(sys.argv[2], "r") # cog2003-2014.csv
cog_trans_file = open(sys.argv[3], "r") # cognames2003-2014.tab
outfile = open("merged_cogs.fa", "w")
# Building the dictionary to compare COGs to the GI ID
cog_db = {}
t0 = time.clock()
for line in cog_file:
split = line.split(",")
cog_db[split[0]] = split[6] # GI ID == COG ID
cog_file.close()
t1 = time.clock()
print "Cog file read. Time elapsed: " + str(t1-t0) + " seconds."
# Building the dictionary to add the function code to the COGs
trans_cog_db = {}
for line in cog_trans_file:
trans_cog_db[line.split("\t")[0]] = line.split("\t")[1]
cog_trans_file.close()
# Using the dictionary to write to the outfile the protein sequences with COG info
error_count = 0
for line in prot_file:
if line[0] == ">":
gi_id = line.strip().split("|")[1]
try:
outfile.write(line.strip() + " | " + cog_db[gi_id] + " | " + trans_cog_db[cog_db[gi_id]] + "\n")
except KeyError:
error_count += 1
outfile.write(line.strip() + " | NO COG FOUND | NA\n")
continue
else:
outfile.write(line)
t2 = time.clock()
prot_file.close()
outfile.close()
print "Protein file analyzed. Time elapsed: " + str(t2-t1) + " seconds."
print "Number of sequences without a COG: " + str(error_count)
The output file looks like
>gi|103485500|ref|YP_615061.1| glutathione S-transferase-like protein [Sphingopyxis alaskensis RB2256] | COG0625 | O
MKLFIGNKAYSSWSLRGWLAARHSGLPFEEVTVPLYNEEWNQRREGDEFAPSGGKVPILWDGADIVVWDSLAIIDYLNEK
TGGTRGYWPDDMAARAMARSMAAEMHSSFAALRREHSMNIRRIYPAAELTPEVQADVIRILQIWAEARARFGGEGDYLFG
DWSAADMMFAPVVTRFITYSIPLPRFALPYAQAVISHPHMQEWIGGAQAEDWVIEKFEGPVEG
>gi|103485503|ref|YP_615064.1| hypothetical protein Sala_0005 [Sphingopyxis alaskensis RB2256] | COG0759 | M
MIARLLILIARAWQLGPSRVLPPTCRYAPSCSEYAIVALRRHGAIKGGWIATKRLLRCHPWGGHGYDPVP
>gi|103485501|ref|YP_615062.1| ribosome biogenesis GTP-binding protein YsxC [Sphingopyxis alaskensis RB2256] | COG0218 | D
MSEIELEPGADPERAERARKLFSGPIAFLKSAPALQHLPVPSVPEIAFAGRSNVGKSSLLNALTNRNGLARTSVTPGRTQ
ELNYFDVGEPPVFRLVDMPGYGFAKAPKDVVRKWRFLINDYLRGRQVLKRTLVLIDSRHGIKDVDRDVLEMLDTAAVSYR
LVLTKADKIKASALADVHAATEAEARKHPAAHPEVIATSSEKGMGIAELRTAVLEAVEL
Now needs code modification for COG_2020
Here
- cog-20.fa
>AAM01497_1 Glutamate-1-semialdehyde aminotransferase [Methanopyrus kandleri AV19] MGYEDEFPESLELFKRAERVMPGGVSSPVRRFDPYPFYVERAEGSRLYTVDGHVLIDYCLAFGPLILGHAHPEVVEAVVERVREGFHYGTPTLPELKLAEKVVELVPNVEKVRLVNTGTEATMSAIRLARAYTGREKIVKFEGCYHGAHDAVLVRAGSGASELGAPDSPGIPESVAENTLVCPFNDVEAFVETVERFDEEIGAVIVEPVLGNAGCVPPDEEFLKVLREYCDGTERLLIFDEVITGFRLELGGAQEYYGIDADLVCLGKILGGGLPIGAFGGPEEYMSRVAPEGKVYQAGTFNGNPVSATAGLVTLEVLERERPYDELSSKAERLASALEDGLEDRGIEGVVNRVESMFQVYFGIEEVRDYADVNSADHDAFKRFHRELLEHGVWIAASNYEAWFLSIAHTETDLERTEEAFEEALDRLTG >AAM04025_1 glutamate-1-semialdehyde 2,1-aminomutase [Methanosarcina acetivorans C2A] MVSEVTLDKSRQMYEKAKTLIPGGVSSPVRAIKPYPFYTASADGSKIRDLDGNEYIDYCLAYGPAVLGHNHPVIKAAIKEQLDKGWLYGTPTELEVTLAEKVAGYYPSIDMLRFVSTGTEATMSALRLARGFTRKNKFIKIEGGFHGAHDAVLVKAGSGATTLGEPDSLGIPADFTKYTLQAPYNDIETMTTLVEKNRDDLAAVIIEPVLGNIGPILPLPGYLKELRKLTKENDVLLIFDEVITGFRLAMGGAQEYFGVVPDMTTLGKIVGGGLPIGVFGGRREIMEMIAPSGAVYQAGTFSGSPCSVAAGIAVLDYLKKEDIHAKLNSTGDYMRAVVSEIVEDEGLDYTVCGIASMFKIFFGAEPHNYQEALKCDKEGYLSFFHRMLANGVFLPPSQFETNFISAAHSEEDIEKTLEAYVENL >ABK77793_1 glutamate-1-semialdehyde aminotransferase [Cenarchaeum symbiosum A] MDLEREYRAKTGGSARIFARSKKYHVGGVSHNIRFYEPYPFVTRSASGKHLVDVDGNKYVDYWMGHWSLILGHAPAPVRSAVEGQLRRGWIHGTVNEQTMNLSEIIRGAVSVAEKTRYVTSGTEAVMYAARLARAHTGRKIIAKADGGWHGYASGLLKSVNWPYDVPESGGLVDEEHSISIPYNDLEGSLDVLGRAGDDLACVIIEPLLGGGGCIPADEDYLRGIQEFVHSRGALLVLDEIVTGFRFRFGCAYAAAGLDPDIVALGKIVGGGFPIGVICGKDEVMEISNTISHAKSDRAYIGGGTFSANPATMTAGAAALGELKKRKGTIYPRINSMGDDARDKLSKIFGNRVSVTGRGSLFMTHFVQDGAGRVSNAADAAACDVELLHRYHLDMITRDGIFFLPGKLGAISAAHSKADLKTMYSASERFAEGL
- cog-20.cog.csv
Gene_ID NCBI_Assembly_ID Protein_ID Protein_length COGfootprint Lengthfrootprint COG_ID reserved COG_membership_class PSI-BLAST_bit_score PSI-BLAST_e_value COG_length Protein_footprint_COG CENSYa_1168 GCA_000200715.1 ABK77793.1 434 1-434 434 COG0001 COG0001 0 342 1.86E-114 432 1-429 MK0280 GCA_000007185.1 AAM01497.1 430 1-430 430 COG0001 COG0001 0 570 1E-200 432 3-431 MA_0581 GCA_000007345.1 AAM04025.1 424 1-424 424 COG0001 COG0001 0 574 1E-200 432 3-426
cog-20.def.tab
COG func name Gene_associated Functional_pathway_COG PubMed_ID PDB_ID COG0001 H Glutamate-1-semialdehyde aminotransferase HemL Heme biosynthesis 2CFB
Want the output like
>AAM01497_1 Glutamate-1-semialdehyde aminotransferase [Methanopyrus kandleri AV19] | COG0001 | H
MGYEDEFPESLELFKRAERVMPGGVSSPVRRFDPYPFYVERAEGSRLYTVDGHVLIDYCLAFGPLILGHAHPEVVEAVVERVREGFHYGTPTLPELKLAEKVVELVPNVEKVRLVNTGTEATMSAIRLARAYTGREKIVKFEGCYHGAHDAVLVRAGSGASELGAPDSPGIPESVAENTLVCPFNDVEAFVETVERFDEEIGAVIVEPVLGNAGCVPPDEEFLKVLREYCDGTERLLIFDEVITGFRLELGGAQEYYGIDADLVCLGKILGGGLPIGAFGGPEEYMSRVAPEGKVYQAGTFNGNPVSATAGLVTLEVLERERPYDELSSKAERLASALEDGLEDRGIEGVVNRVESMFQVYFGIEEVRDYADVNSADHDAFKRFHRELLEHGVWIAASNYEAWFLSIAHTETDLERTEEAFEEALDRLTG
>AAM04025_1 glutamate-1-semialdehyde 2,1-aminomutase [Methanosarcina acetivorans C2A] | COG0001 | H
MVSEVTLDKSRQMYEKAKTLIPGGVSSPVRAIKPYPFYTASADGSKIRDLDGNEYIDYCLAYGPAVLGHNHPVIKAAIKEQLDKGWLYGTPTELEVTLAEKVAGYYPSIDMLRFVSTGTEATMSALRLARGFTRKNKFIKIEGGFHGAHDAVLVKAGSGATTLGEPDSLGIPADFTKYTLQAPYNDIETMTTLVEKNRDDLAAVIIEPVLGNIGPILPLPGYLKELRKLTKENDVLLIFDEVITGFRLAMGGAQEYFGVVPDMTTLGKIVGGGLPIGVFGGRREIMEMIAPSGAVYQAGTFSGSPCSVAAGIAVLDYLKKEDIHAKLNSTGDYMRAVVSEIVEDEGLDYTVCGIASMFKIFFGAEPHNYQEALKCDKEGYLSFFHRMLANGVFLPPSQFETNFISAAHSEEDIEKTLEAYVENL
>ABK77793_1 glutamate-1-semialdehyde aminotransferase [Cenarchaeum symbiosum A] | COG0001 | H
MDLEREYRAKTGGSARIFARSKKYHVGGVSHNIRFYEPYPFVTRSASGKHLVDVDGNKYVDYWMGHWSLILGHAPAPVRSAVEGQLRRGWIHGTVNEQTMNLSEIIRGAVSVAEKTRYVTSGTEAVMYAARLARAHTGRKIIAKADGGWHGYASGLLKSVNWPYDVPESGGLVDEEHSISIPYNDLEGSLDVLGRAGDDLACVIIEPLLGGGGCIPADEDYLRGIQEFVHSRGALLVLDEIVTGFRFRFGCAYAAAGLDPDIVALGKIVGGGFPIGVICGKDEVMEISNTISHAKSDRAYIGGGTFSANPATMTAGAAALGELKKRKGTIYPRINSMGDDARDKLSKIFGNRVSVTGRGSLFMTHFVQDGAGRVSNAADAAACDVELLHRYHLDMITRDGIFFLPGKLGAISAAHSKADLKTMYSASERFAEGL
Here, the Protein_Id is the identifier of fa file to CSV file and from CSV file to .tab file GO id is the identifier. Thank you very much.SR
What have you tried? You'd need to change a few lines in the python script based on which IDs map to the other inputs. Maybe try switching to pandas as you're dealing with a couple of tabular files?
Thank you very much Ram. I want to merge GO data into fasta file for the DIAMOND database preparation. Thank you.
The only difference I see is the change from the
gi
to theAA*
identifier between the two FASTA files. Change thegi_id
based indexing logic in the loop a little and test it. This is a good learning exercise, you should be able to solve this on your own.Thank you very much Ram. There are a few differences like from cog-20.fa file has the value like
AAM01497_1
but in CSV file contains the same character in the third column like thatAAM01497.1
.cog_db[split[3]] = split[7]
. I am stack and confused being a very beginning learner of python.Use a lot of
print
statements to understand what is going on. I am confident that you can figure this out.I am really sorry, I can not figure out the
Need expertise assistance.
That statement picks the
gi
identifier from the FASTA header.There is no gi in the new FASTA, but you can repurpose the code and logic to pick a different ID instead.
Thank you so much. Beg pardon if any mistake.