Hi Folks,
So, I have these header lines:
>CP001830.1_cds_AEH77465.1_1 [locus_tag=SM11_chr0180] [protein_id=AEH77465.1] [location=195246..195674]
>KI271598.1_cds_ERL64443.1_1 [locus_tag=L248_0985] [protein_id=ERL64443.1] [location=complement(53545..53919)]
>CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] [db_xref=EnsemblGenomes-Gn:jk1527,EnsemblGenomes-Tr:CAI37700,GOA:Q4JU07,InterPro:IPR001185,UniProtKB/TrEMBL:Q4JU07] [protein_id=CAI37700.1] [location=1801511..1801945]
>HE858529.1_cds_CCI62285.1_1 [locus_tag=SDSE_0788] [db_xref=EnsemblGenomes-Gn:SDSE_0788,EnsemblGenomes-Tr:CCI62285,GOA:K4Q7R5,InterPro:IPR001185,InterPro:IPR019823,UniProtKB/TrEMBL:K4Q7R5] [protein_id=CCI62285.1] [location=complement(732360..732734)]
Some lines contain the field "[db_xref=Ensemb...]", which I want to remove.
I cannot simply remove everything after this field (e.g. using "sed"), because I need the rest of the line. I tried awk and sed. I also cannot "cut" or print (awk) by column, because the field is not present in all lines.
So a script using a regular expression would probably be better, I guess.
However, I cannot figure it out... Could you please help?
What is unclear after reading the documentation?
The regular expression posted by @JC below should work with sed -r.
I don't see why sed can't do this. E.g., sed -e 's/\[db_xref=Ensemb[^]]*\]//g'
For me, it does not work.
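One way to check the substitution in isolation is to run it on a single header copied from the question. The following is only a minimal sketch, assuming a POSIX shell and a standard sed; the variable names header and cleaned are made up for illustration. Unlike the command above, it also consumes the space after the closing bracket so that no double space is left behind.

```shell
# Sample header copied from the question (one that has a db_xref field).
header='>CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] [db_xref=EnsemblGenomes-Gn:jk1527,EnsemblGenomes-Tr:CAI37700,GOA:Q4JU07,InterPro:IPR001185,UniProtKB/TrEMBL:Q4JU07] [protein_id=CAI37700.1] [location=1801511..1801945]'

# \[db_xref=  matches the literal start of the field.
# [^]]*       matches any run of characters that are not a closing bracket.
# "] "        matches the closing bracket and the single space after it
#             (a "]" outside a bracket expression needs no escaping).
cleaned=$(printf '%s\n' "$header" | sed -e 's/\[db_xref=[^]]*] //g')

printf '%s\n' "$cleaned"
```

Headers without a db_xref field pass through unchanged, since the pattern simply does not match. To process a whole file, the same sed expression can be given a file argument and the output redirected to a new file. If the result still looks unmodified, it may be worth checking that the quotes in the command are plain single quotes and not typographic ones pasted from a web page.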