Dear All,
I am trying to convert a batch of old EMBL files to the newly agreed format.
The currently agreed format for Pfam domains is:
/inference="protein motif:PFAM:PF03466" (as an example). An entry can have multiple inferences but no repeated domains, i.e. if a domain is repeated, we should keep just one /inference="..." for it per entry.
The old files contain notes such as
/note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"
so this would become /Inference="protein motif:Pfam:PF09339"
and duplicates per entry should be removed.
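The conversion described above could be sketched in Python roughly as follows. This is only an illustration, not a tested converter: the function name `pfam_inferences` and the `db` parameter are made up for this example, the regex assumes Pfam accessions are `PF` followed by five digits, and it assumes each note has already been read as one complete string. The database label is a parameter because both "PFAM" and "Pfam" appear in this post.

```python
import re

# Accepts both note styles quoted in this thread:
#   /note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"
#   /note="*domain: Pfam:PF02096;60Kd inner membrane protein;9.8E-45;codon 57-247"
PFAM_NOTE = re.compile(
    r'/note=\s*"?\*domain:?\s*"?(?:HMM)?Pfam:(PF\d{5})', re.IGNORECASE
)


def pfam_inferences(qualifiers, db="Pfam"):
    """Return one /inference qualifier per distinct Pfam accession, in input order."""
    seen = set()
    result = []
    for qualifier in qualifiers:
        match = PFAM_NOTE.search(qualifier)
        if match is None:
            continue  # not a Pfam domain note, nothing to convert
        accession = match.group(1).upper()
        if accession not in seen:  # repeats of the same domain are skipped
            seen.add(accession)
            result.append(f'/inference="protein motif:{db}:{accession}"')
    return result


notes = [
    '/note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"',
    # Hypothetical repeat of the same domain, invented to show de-duplication:
    '/note=*domain "HMMPfam:PF09339;HTH_lclR;1e-03;codon 310-350"',
    '/note="*domain: PRINTS:PR00701;60kDa inner membrane protein signature;8.6E-6;codon 131-154"',
]
print(pfam_inferences(notes))
```

The `seen` set is reset on every call, so calling the function once per feature gives the "one inference per domain per entry" behaviour.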
I did the conversion with a Perl regex, but some notes originally ran onto a second line that the script did not convert, so after converting to /inference we are left with something like
FT “495-678”
and several other fragments of comments. What can appear there varies so much that I think it is very difficult to pick out with a regex. These leftovers will prevent the EMBL file from being valid.
Any help would be appreciated. I am also attaching a small file below:
ID Lsalivarius_cp400_4_358425-448092; SV 1; linear; unassigned DNA; STD; UNC; 89668 BP.
XX
FH Key Location/Qualifiers
FH
FT source 1..89668
FT /note="scaffold4|size89668"
FT CDS complement(671..1594)
FT /note="*GO: aspect=; GOid=GO:; term=; evidence=IEA;
FT date=20121112"
FT /note="*GO: aspect=Component; GOid=GO:0016020;
FT term=membrane; evidence=IEA; date=20121112"
FT /note="*db_xref: 07-11-2012"
FT /note="*db_xref: Membrane insertion protein, OxaA/YidC"
FT /note="*domain: PANTHER:PTHR12428;IPR001708;6.4E-35;codon
FT 35-247"
FT /note="*domain: PANTHER:PTHR12428:SF11;T;6.4E-35;codon
FT 35-247"
FT /note="*domain: Pfam:PF02096;60Kd inner membrane
FT protein;9.8E-45;codon 57-247"
FT /note="*domain: PRINTS:PR00701;60kDa inner membrane protein
FT signature;8.6E-6;codon 131-154"
FT /note="*domain: PRINTS:PR00701;60kDa inner membrane protein
FT signature;8.6E-6;codon 214-237"
FT /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT membrane-bound protein predicted to be embedded in the
FT membrane.;-;codon 230-249"
FT /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT membrane-bound protein predicted to be embedded in the
FT membrane.;-;codon 208-224"
FT /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT membrane-bound protein predicted to be outside the
FT membrane, in the extracellular region.;-;codon 157-175"
FT /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT membrane-bound protein predicted to be outside the
FT membrane, in the extracellular region.;-;codon 27-49"
FT /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT membrane-bound protein predicted to be embedded in the
FT membrane.;-;codon 131-156"
FT /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT membrane-bound protein predicted to be outside the
FT membrane, in the cytoplasm.;-;codon 197-207"
FT /note="*domain: Phobius:SIGNAL_PEPTIDE_H_REGION;Hydrophobic
FT region of a signal peptide.;-;codon 8-20"
FT /note="*domain: Phobius:NON_CYTOPLASMIC_DOMAIN;Region of a
FT membrane-bound protein predicted to be outside the
FT membrane, in the extracellular region.;-;codon 225-229"
FT /note="*domain: Phobius:SIGNAL_PEPTIDE_C_REGION;C-terminal
FT region of a signal peptide.;-;codon 21-26"
FT /note="*domain: Phobius:SIGNAL_PEPTIDE;Signal peptide
FT region;-;codon 1-26"
FT /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT membrane-bound protein predicted to be embedded in the
FT membrane.;-;codon 50-74"
FT /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT membrane-bound protein predicted to be outside the
FT membrane, in the cytoplasm.;-;codon 250-307"
FT /note="*domain: Phobius:TRANSMEMBRANE;Region of a
FT membrane-bound protein predicted to be embedded in the
FT membrane.;-;codon 176-196"
FT /note="*domain: Phobius:CYTOPLASMIC_DOMAIN;Region of a
FT
I can send you the entire file.
Thanks CS
Wrap the file in the code format; it would be easier to visualize. There is an icon with 101010 on it.
Hi Bharat, I finally found the way to view it clearly. Can you please help me from here?
Thanks CS
Be sure to specify exactly what you have now and what you want the change to be. You say "/inference" and "/Inference" above but I'm guessing it should be the former. Also, there is no "/domain" tag in the example you show, but there is a "note" tag specifying the domain (i.e., "/note="*domain:..."), is that what you want to change?
Hi SES,
Thanks for the heads up. I have: /note=*domain "HMMPfam:PF09339;HTH_lclR;2e-05;codon 269-306"
so this would become /Inference="protein motif:Pfam:PF09339". I only want Pfam domains to be part of the final files.
Thanks
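The stray second lines mentioned in the question come from qualifier values that wrap over several FT lines. One possible way around this is to join the wrapped lines before applying any regex, so that every qualifier is seen as a single string. Below is a minimal sketch of that idea; the helper name `join_qualifiers` is made up for this example, and it assumes that it is given only FT lines, that quoted values use plain double quotes, and that wrapped pieces can be rejoined with a single space.

```python
def join_qualifiers(ft_lines):
    """Merge wrapped FT lines so each feature key or qualifier is one string."""
    units = []
    for line in ft_lines:
        body = line[2:].strip()  # drop the leading "FT" line code
        # An odd number of double quotes means a quoted value is still open,
        # so this line continues the previous qualifier.
        if units and units[-1].count('"') % 2 == 1:
            units[-1] = f"{units[-1]} {body}".rstrip()
        else:
            units.append(body)
    return units


# Lines taken from the sample file in the question.
sample = [
    "FT CDS complement(671..1594)",
    'FT /note="*domain: Pfam:PF02096;60Kd inner membrane',
    'FT protein;9.8E-45;codon 57-247"',
    'FT /note="*db_xref: 07-11-2012"',
]
for unit in join_qualifiers(sample):
    print(unit)
```

Counting quotes rather than relying on column positions means the grouping does not depend on the original indentation. The sketch does not re-wrap or re-indent the output, so writing the result back as a valid EMBL file would still need a separate formatting step.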
Have you tried contacting ENA about this directly (via datasubs@ebi.ac.uk)? (To be honest, I think removing the ability to store positions of the matches and the number of matches seems a bit strange/silly!)
Thanks Sarah,
I have written to ENA about this issue. Let's see what I get from them. CS