In an output file, I want to preserve these lines:
Prevotella_sp
Leptospira_interrogans
Leptospira_interrogans
Escherichia_coli
but strip the alphanumeric prefix and the underscore in front of the genus_species in these lines:
ADWS01000032_Escherichia_coli
EQ973222_Bacteroides_fragilis
AEEI01000021_Prevotella_marshii
AEXO01000076_Prevotella_denticola
EQ973222_Bacteroides_fragilis
ACIY01000543_Enterococcus_faecium
ACIY01000542_Enterococcus_faecium
I tried the command below, which was meant to match at least two uppercase letters and two digits in a row, with wildcards on either side, up to the underscore. Neither it nor my other attempts worked.
sed 's/^[^*[A-Z]{2}[0-9]{2}*_]//' input.file > output.file
I appreciate any and all suggestions. I am still not very good with regex.
Bert Gold
I noticed that the strings you want to keep as they are all contain a single underscore, while the ones you want to trim all contain two. That makes them easy to tell apart without a regex. I'm not good at sed, though, so I don't know how to do it there.
In Python, however, you can split each string on the underscore into a list and then filter by the length of the result.
Edit: it seems I misunderstood your question. What you actually want is not a regex search but string manipulation.
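
Here is a minimal Python sketch of the split-on-underscore idea, in case it helps. It assumes every prefixed line has exactly three underscore-separated fields (prefix, genus, species) and every line to preserve has two, which holds for the samples in the question but may not hold for the whole file. The names strip_prefix and clean_file are made up for illustration.

```python
def strip_prefix(line: str) -> str:
    """Drop the leading field when a line has exactly three fields."""
    parts = line.split("_")
    # Assumption: prefixed lines look like PREFIX_Genus_species (two
    # underscores); lines to preserve look like Genus_species (one).
    if len(parts) == 3:
        return "_".join(parts[1:])
    return line


def clean_file(src: str, dst: str) -> None:
    """Write a copy of src to dst with the prefixes removed."""
    with open(src) as fin, open(dst, "w") as fout:
        for raw in fin:
            fout.write(strip_prefix(raw.rstrip("\n")) + "\n")


if __name__ == "__main__":
    # Sample lines taken from the question.
    samples = [
        "Prevotella_sp",
        "ADWS01000032_Escherichia_coli",
        "EQ973222_Bacteroides_fragilis",
    ]
    for s in samples:
        print(strip_prefix(s))
```

To process the real file you would call clean_file("input.file", "output.file"). If some names carry extra fields (a strain suffix, say), the length test would misfire; matching the prefix shape directly with a pattern such as ^[A-Z]{2,}[0-9]{2,}_ would likely be the stricter choice.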