Given this FASTA header as an example, ">WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]", my goal is to extract the accession number, the protein annotation, and the organism name separately.
Right now, I am able to extract the accession number (">WP_024130427.1") and the remainder ("[50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]").
I have problems separating that remainder into its two parts: the protein annotation ("[50S ribosomal protein L16]-arginine 3-hydroxylase") and the organism name ("Citrobacter koseri").
The main issue is the square brackets. For this example, the parts are easy to tokenize. However, square brackets are used in many different styles, and my tokenization does not work for all sequences. A few of the variations are listed below (the list is not comprehensive; I can handle these three, but other headers still fail):
- >WP_011200935.1 cysteine synthase A [[Mannheimia] succiniciproducens]
- >WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]
- >WP_011742684.1 [FeFe] hydrogenase H-cluster radical SAM maturase HydG [Caldanaerobacter subterraneus]
Is there a better way to extract the annotation and the organism name given the unpredictable usage of square brackets?
Right now, the only approach I can think of is to scan backward from the end of the string instead of matching a pattern, making sure the numbers of ] and [ balance. This will work, but I am wondering whether there is a better way.
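
For reference, here is a minimal sketch of the backward-scanning idea in Python. The function name split_header is hypothetical. It assumes that the accession is the first whitespace-delimited token, that the organism is the last top-level bracketed group at the very end of the header, and that brackets inside the organism name are balanced. It also strips the leading ">" from the accession, which is a choice of this sketch. Headers that break these assumptions return None for the organism instead of a guess.

```python
def split_header(header):
    """Split a FASTA header into (accession, annotation, organism).

    Assumes the organism is the final bracketed group and that its
    brackets are balanced; returns organism=None otherwise.
    """
    text = header.strip().lstrip(">")
    # The accession is taken to be everything up to the first space.
    accession, _, rest = text.partition(" ")
    rest = rest.strip()
    if not rest.endswith("]"):
        return accession, rest, None

    depth = 0
    # Walk from the last character toward the first, tracking nesting depth.
    for i in range(len(rest) - 1, -1, -1):
        if rest[i] == "]":
            depth += 1
        elif rest[i] == "[":
            depth -= 1
            if depth == 0:
                # This "[" opens the final top-level group: the organism.
                annotation = rest[:i].strip()
                organism = rest[i + 1:-1].strip()
                return accession, annotation, organism
    # Unbalanced brackets: no reliable split point was found.
    return accession, rest, None


headers = [
    ">WP_011200935.1 cysteine synthase A [[Mannheimia] succiniciproducens]",
    ">WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]",
    ">WP_011742684.1 [FeFe] hydrogenase H-cluster radical SAM maturase HydG [Caldanaerobacter subterraneus]",
]
for h in headers:
    print(split_header(h))
```

Because only the final group is matched, brackets earlier in the annotation (such as "[FeFe]" or "[50S ribosomal protein L16]") never affect the split, and a nested organism name such as "[Mannheimia] succiniciproducens" keeps its inner brackets.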