Find:
^([A-Z][A-Z][0-9].*?) .*(chromosome) ([0-9]+).*
Replace:
$1_$2_$3
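For anyone who would rather run this find/replace from a script than from a text editor, here is a minimal Python sketch of the same substitution. The function name `rename_header` and the sample header are my own assumptions for illustration; the pattern is the one above, with the `$1_$2_$3` replacement written in Python's `\1_\2_\3` syntax. The leading `>` of a FASTA header is handled separately, since the pattern expects the line to start with the accession.

```python
import re

# Same pattern as the Find expression above.
HEADER_RE = re.compile(r"^([A-Z][A-Z][0-9].*?) .*(chromosome) ([0-9]+).*")


def rename_header(header: str) -> str:
    """Rewrite one header as accession_chromosome_number (hypothetical helper).

    Lines that do not match the pattern are returned unchanged.
    """
    # Set the FASTA ">" aside so the pattern sees the accession first.
    prefix = ">" if header.startswith(">") else ""
    body = header[len(prefix):]
    return prefix + HEADER_RE.sub(r"\1_\2_\3", body)


# Made-up header in the usual GenBank style, for illustration only.
example = (">CM025336.1 Vespula pensylvanica isolate Volc-1 "
           "chromosome 1, whole genome shotgun sequence")
print(rename_header(example))
```

Headers that name a scaffold or linkage group instead of a chromosome pass through untouched, which is exactly the gap the rest of this answer tries to close.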
Apologies, I think I misunderstood your post the first time around. You don't have 300 lines; you have 300 organisms (each with many lines).
Here, there are two approaches I can think of ...
First, I think this issue can be considered a bioinformatics problem, which is in part why it is easily solvable.
1. Obtain a list of unique identifiers (UIDs) that exactly match those found in ALL your genome fasta files, using $1 from a regex like the one in my original answer above. For instance, CM10030304.1, CM42994.1, and so on.
2. Go to a large database like RefSeq and, for every UID in your list, pull the associated records from nuccore. Any method of programmatic access you like will do, for instance eFetch.
3. Process the output from step 2 in such a way that you have a hashtable-like object in which the keys are the UIDs and the values are the linked records. For instance, in Python3, a dict comprehension would work, such that you have something like:
Key:
UID=CM025336.1
Values:
/bp:1..19704315
/organism="Vespula pensylvanica"
/mol_type="genomic DNA"
/isolate="Volc-1"
/db_xref="taxon:30213"
/chromosome="1"
/sex="male"
/tissue_type="Whole adult"
/country="USA: Volcano, HI"
/lat_lon="19.43 N 155.21 W"
/collection_date="2017-08"
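The fetching and dictionary-building steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`efetch_url`, `fetch_genbank`, `parse_qualifiers`, `build_attribute_dict`) are hypothetical, the record is requested from the NCBI E-utilities `efetch` endpoint as GenBank flat text, and the qualifier parser is a deliberately naive regex that keeps every `/name="value"` pair it finds in the record, not only those of the `source` feature.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Naive match for quoted qualifiers such as /organism="Vespula pensylvanica".
QUALIFIER_RE = re.compile(r'^\s*/(\w+)="([^"]*)"', re.MULTILINE)


def efetch_url(uid):
    """Build the efetch request URL for one nuccore accession."""
    params = {"db": "nuccore", "id": uid, "rettype": "gb", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_genbank(uid):
    """Download one record as GenBank flat text (needs network access)."""
    with urlopen(efetch_url(uid)) as response:
        return response.read().decode("utf-8")


def parse_qualifiers(genbank_text):
    """Collect qualifier name -> value; a repeated name keeps its last value."""
    return {name: value for name, value in QUALIFIER_RE.findall(genbank_text)}


def build_attribute_dict(uids, fetch=fetch_genbank):
    """Keys are UIDs, values are the qualifier dicts of the linked records."""
    return {uid: parse_qualifiers(fetch(uid)) for uid in uids}
```

Passing `fetch` as a parameter is just a convenience so the dictionary step can be tried on saved records without hitting NCBI. For 300 genomes you would want to pause between requests or batch several IDs per call, since NCBI rate-limits E-utilities traffic.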
Finally, simply loop over all lines in all 300 genome fasta files. For instance, in Python3, we could write:
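    #!/usr/bin/env python3
    import re

    # Assumes `fastas` is a list of open fasta file handles and `allAttributeDict`
    # maps each UID to its attribute strings (built in the previous step).
    # The pattern captures $1 from my answer above; the optional ">" is an assumption.
    uidPattern = re.compile(r"^>?([A-Z][A-Z][0-9].*?) ")
    allLines = [l.strip() for fasta in fastas for l in fasta]
    for ls in allLines:
        match = uidPattern.match(ls)
        if match is None:
            continue  # sequence line or unrecognised header
        nuccoreUID = match.group(1)
        shrinkingLine = ls
        for nuccoreAttribute in allAttributeDict.get(nuccoreUID, []):
            shrinkingLine = shrinkingLine.replace(nuccoreAttribute, "")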
Let me know if you have further questions, but this should more or less take care of everything. shrinkingLine should now be a string containing ONLY the parts of your header that were hardest to define matches for, as you note in your original post (i.e., LG10, chromosome some such, scaffold what-have-you, etc.). At this point, your computer should be holding in memory everything you need to do what you propose in your post easily.
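To make the effect of the shrinking step concrete, here is a small self-contained sketch on a single header. The function name `strip_known_attributes`, the `keys` parameter, and the sample header are assumptions made for illustration. One caveat worth noting: blindly replacing very short values such as a chromosome value of "1" would also eat digits inside the accession, so this sketch only strips the qualifiers you list explicitly.

```python
def strip_known_attributes(header, attributes, keys=("organism", "isolate")):
    """Remove the values of the chosen qualifiers from a header line.

    `attributes` is one record's qualifier dict (name -> value); only the
    qualifiers named in `keys` are stripped, to avoid deleting short values
    like "1" wherever they happen to occur.
    """
    remainder = header
    for key in keys:
        value = attributes.get(key)
        if value:
            remainder = remainder.replace(value, "")
    # Collapse the double spaces left behind by the removals.
    return " ".join(remainder.split())


# Made-up header and qualifiers in the style of the record shown earlier.
header = ("CM025336.1 Vespula pensylvanica isolate Volc-1 "
          "chromosome 1, whole genome shotgun sequence")
attributes = {"organism": "Vespula pensylvanica", "isolate": "Volc-1",
              "chromosome": "1"}
print(strip_known_attributes(header, attributes))
```

What remains is the accession plus the generic wording around the chromosome, scaffold, or linkage-group label, which is a much smaller target for a regex than the full header.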
To close, I think there are other approaches to this problem that could be used, for instance one based on information entropy. I do think that those approaches would no longer be a bioinformatics problem (more akin to statistical learning in that case).
Can you explain to a computer how to pick these "scaffold identity" terms? Will it always be "chromosome" followed by a space and a string that ends in a comma, or the word "scaffold" followed by another scaffold with a number or alphanumeric string attached that ends at a word boundary, or the phrase "linkage group" followed by the required identifier that ends with a comma? If you need to eyeball the header to pick the identifier, the computer cannot be automated to do that for you.
This isn't so much a bioinformatics problem as a general regex problem.
I don't think this is necessarily true. In fact, precisely because it IS a bioinformatics problem, a clean solution is possible...