Question

How to remove a substring from each line using Python 3

5.4 years ago

geizetomazetto • 0

Hi Folks,

So, I have these header lines:

>CP001830.1_cds_AEH77465.1_1 [locus_tag=SM11_chr0180]  [protein_id=AEH77465.1] [location=195246..195674]
>KI271598.1_cds_ERL64443.1_1  [locus_tag=L248_0985]  [protein_id=ERL64443.1] [location=complement(53545..53919)]
>CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] [db_xref=EnsemblGenomes-Gn:jk1527,EnsemblGenomes-Tr:CAI37700,GOA:Q4JU07,InterPro:IPR001185,UniProtKB/TrEMBL:Q4JU07] [protein_id=CAI37700.1] [location=1801511..1801945]
>HE858529.1_cds_CCI62285.1_1 [locus_tag=SDSE_0788] [db_xref=EnsemblGenomes-Gn:SDSE_0788,EnsemblGenomes-Tr:CCI62285,GOA:K4Q7R5,InterPro:IPR001185,InterPro:IPR019823,UniProtKB/TrEMBL:K4Q7R5] [protein_id=CCI62285.1] [location=complement(732360..732734)]

In some lines I have the information "[db_xref=Ensemb...]", which I want to remove.

I cannot simply remove everything after this information (e.g. using "sed"), because I need the rest of the line. I tried awk and sed. Also, I cannot "cut" or print (awk) by column, because this field is not present in all lines.

So a script using a regular expression would probably be better, I guess.

However, I cannot figure it out... Could you please help?

sequencing • 1.3k views

updated 5.1 years ago by Wayne ★ 2.1k • written 5.4 years ago by geizetomazetto • 0

What is unclear after reading the documentation?

5.4 years ago by WouterDeCoster 47k

Regular expression posted by @JC below should work with sed -r.

5.4 years ago by GenoMax 147k

I don't see why sed can't do this? E.g.,

sed -e 's/\[db_xref=Ensemb[^]]*\]//g'

5.4 years ago by mmfansler ▴ 460

For me, it does not work.

5.4 years ago by geizetomazetto • 0

score 2 · Accepted Answer · 2019-06-13

5.4 years ago

JC 13k

Perl:

perl -pe 's/\[db_xref=Ensembl.+?\]//g' < input > output

5.4 years ago by JC 13k

Hi JC,

Thanks a lot. You saved my day.

5.4 years ago by geizetomazetto • 0

score 2 · Accepted Answer · 2019-10-09

Python 3:
In case someone ends up here given the 'Python 3' portion of the OP's question:

Without a regular expression:

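# `input` is assumed to be an iterable of header lines, e.g. an open file handle
# (note that this name shadows Python's built-in input()).
output = ""
for line in input:
    if "[db_xref=Ensembl" in line:
        # Split once on the start of the tag: text before it, and the rest.
        split_on_tag = line.split("[db_xref=Ensembl", 1)
        # Keep what precedes the tag plus what follows its closing "]".
        output += split_on_tag[0] + split_on_tag[1].split("]", 1)[1]
    else:
        output += line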
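
The same no-regex idea can also be wrapped in a small function. This is only a sketch: `strip_tag` is a hypothetical helper name, the header below is a shortened version of one from the question, and the code assumes at most one `[db_xref=Ensembl...]` tag per line. It relies on `str.partition`, which returns empty strings when the separator is absent, so a line without the tag (or with an unterminated tag) is returned unchanged.

```python
def strip_tag(line, tag_start="[db_xref=Ensembl"):
    """Remove the first bracketed tag beginning with tag_start from line."""
    before, found, rest = line.partition(tag_start)
    if not found:
        return line  # no tag on this line; leave it untouched
    # Drop everything up to and including the tag's closing bracket.
    _, closing, after = rest.partition("]")
    if not closing:
        return line  # unterminated tag; leave the line unchanged
    return before + after


# Shortened header from the question, used purely for illustration.
header = (
    ">CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] "
    "[db_xref=EnsemblGenomes-Gn:jk1527,GOA:Q4JU07] "
    "[protein_id=CAI37700.1] [location=1801511..1801945]\n"
)
print(strip_tag(header), end="")
```

Note that the space on each side of the removed tag is kept, so the cleaned header contains a double space where the tag used to be.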

With regular expressions:

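import re

output = ""
for line in input:  # `input`: an iterable of lines, e.g. an open file handle
    # Raw string so the backslashes reach the regex engine unchanged.
    output += re.sub(r"\[db_xref=Ensembl.+?\]", "", line)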
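
For larger files, a variant that streams line by line instead of building one big string might look like the sketch below. The names `DB_XREF` and `clean_headers` are hypothetical, and the pattern differs slightly from the one above: it uses a negated character class (as in the sed suggestion earlier in the thread) rather than a lazy quantifier, and it also swallows the blanks in front of the tag so that no double space is left behind. The `io.StringIO` objects merely stand in for real files, and the header is a shortened version of one from the question.

```python
import io
import re

# Optional leading blanks, then the tag up to its first closing bracket.
# [^\]]* cannot run past "]", so the match stops at the end of the tag.
DB_XREF = re.compile(r"[ \t]*\[db_xref=Ensembl[^\]]*\]")


def clean_headers(src, dst):
    """Copy lines from src to dst, dropping db_xref=Ensembl tags."""
    for line in src:
        dst.write(DB_XREF.sub("", line))


# In-memory stand-ins for an input and an output file.
src = io.StringIO(
    ">HE858529.1_cds_CCI62285.1_1 [locus_tag=SDSE_0788] "
    "[db_xref=EnsemblGenomes-Gn:SDSE_0788,GOA:K4Q7R5] "
    "[protein_id=CCI62285.1] [location=complement(732360..732734)]\n"
)
dst = io.StringIO()
clean_headers(src, dst)
print(dst.getvalue(), end="")
```

Because `clean_headers` only needs objects that can be iterated over and written to, it should also accept `sys.stdin` and `sys.stdout`, which gives a filter that can be used with shell redirection much like the Perl one-liner.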

Static view with full run through displayed.

Run and edit the code actively in your browser via MyBinder.org here.