sed regex code help wanted
6.3 years ago
bgold04 • 0

In an output file, I want to preserve these lines:

Prevotella_sp 
Leptospira_interrogans 
Leptospira_interrogans
Escherichia_coli

yet get rid of the alphanumeric and underscore in front of the genus_species in these lines:

ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis     
AEEI01000021_Prevotella_marshii     
AEXO01000076_Prevotella_denticola     
EQ973222_Bacteroides_fragilis     
ACIY01000543_Enterococcus_faecium     
ACIY01000542_Enterococcus_faecium

I tried the command below, intending to match at least two uppercase letters and two digits in a row, with wildcards on each side, up to the underscore, but to no avail.

sed 's/^[^*[A-Z]{2}[0-9]{2}*_]//' input.file > output.file

I appreciate any and all suggestions. I am still not very good with regex.

Bert Gold

regular expressions sed microbiome • 1.7k views

I noticed that the strings you want to keep all contain a single underscore, while the ones you want to change all contain two underscores... So it is pretty easy to separate them without even using a regex. I'm not good at sed, though, so I don't know how to do it in sed.

In Python, you can split each string on the underscore into a list, then filter by the length of the result.

Edit: it seems I misunderstood your question... What you actually want is not a regex search but string manipulation.
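
For what it's worth, the split-and-count idea could be sketched with awk rather than Python. This is only a sketch: the function name drop_first_field is made up for illustration, and it assumes that any line with three or more underscore-separated pieces starts with an ID to discard. The sample data satisfies that, but a name such as Genus_species_strain with no ID in front would lose its genus.

```shell
# Split each line on "_"; when there are three or more pieces,
# remove the first piece and its underscore. Other lines pass through unchanged.
drop_first_field() {
    awk -F'_' 'NF >= 3 { sub(/^[^_]*_/, "") } { print }'
}

printf '%s\n' \
    'Prevotella_sp' \
    'ACIY01000543_Enterococcus_faecium' |
    drop_first_field
```

On these two lines it should leave Prevotella_sp alone and reduce the second line to Enterococcus_faecium.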

6.3 years ago

output

$ sed 's/^\w\+[0-9]\+_//' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

with GNU grep (-P enables Perl-compatible regex):

$ grep -Po '(?<=[0-9]_).*' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

input:

$ cat test.txt 
ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis 
AEEI01000021_Prevotella_marshii 
AEXO01000076_Prevotella_denticola 
EQ973222_Bacteroides_fragilis 
ACIY01000543_Enterococcus_faecium 
ACIY01000542_Enterococcus_faecium
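
One caveat for mixed input: the sed command leaves lines without an ID prefix untouched, but grep -o prints only the matching part of matching lines, so lines such as Prevotella_sp would be dropped from the grep output. A slightly stricter sed variant might look like the sketch below. It assumes every ID is a run of uppercase letters followed by digits and an underscore, as in the sample data, and the function name strip_accession is just an illustrative choice. The -E option (extended regex) is accepted by both GNU and BSD sed.

```shell
# Remove a leading ID of the form LETTERS + DIGITS + "_".
# Lines that do not start with such an ID are printed unchanged.
strip_accession() {
    sed -E 's/^[A-Z]+[0-9]+_//'
}

printf '%s\n' \
    'Prevotella_sp' \
    'Leptospira_interrogans' \
    'ADWS01000032_Escherichia_coli' \
    'EQ973222_Bacteroides_fragilis' |
    strip_accession
# prints:
# Prevotella_sp
# Leptospira_interrogans
# Escherichia_coli
# Bacteroides_fragilis
```

For a file, the same substitution could be run as sed -E 's/^[A-Z]+[0-9]+_//' input.file > output.file.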
6.3 years ago

If all your lines have ID_genus_species only (3 items), then you can probably just use cut:

cut -f 2,3 -d '_' input.file
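
To see why the three-item condition matters, here is a small sketch of what cut does with a two-item line. The function name keep_fields_2_3 is hypothetical and only wraps the command above so it can be fed sample lines.

```shell
# Keep fields 2 and 3 of each "_"-separated line.
keep_fields_2_3() {
    cut -d '_' -f 2,3
}

printf '%s\n' \
    'ACIY01000543_Enterococcus_faecium' \
    'Prevotella_sp' |
    keep_fields_2_3
# prints:
# Enterococcus_faecium
# sp
```

The second line loses its genus, so cut alone is only safe when every line carries an ID prefix.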

Sorry, not all the entries have 3 items; that's why I asked the question... Thanks for thinking about this though.... -- Bert


@cpad0112's solution should work fine.


Maybe you should post those (entries without 3 items) along with the entries in the OP, bgold04.
