I have a long string that represents a phylogenetic gene tree, and I want to filter it in a very specific way.
(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically, every x@y is a species@gene_id pair. What I am trying to do is trim this down so that I only have x instead of x@y, like this:
(Esy,Aar,(Spa,Cpl))...
I tried splitting the string first, but the string has different 'split points' for what I want to achieve, i.e. some x@y parts end with a , and others with a ). I searched for a solution and came across regular expressions, but I am new to Python and wasn't sure whether that is what I should be focusing on. I also thought about strip(), but it seems I would need to specify the characters to be stripped for this.
The main problem is that there is no solid 'pattern' for me to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before an @ character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
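You should be able to use a regex to replace every match (globally) of @[^,)]+ with '' (an empty string), which removes these substrings. The pattern matches an @ followed by every character up to, but not including, the next , or ).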
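For illustration, here is a minimal Python sketch of that replacement using re.sub from the standard library. The shortened tree string and the variable names are made up for the example. Note that the pattern as given also strips the branch length after each leaf (the :0.07... part), since it only stops at , or ). The second pattern, which also stops at :, is my own variant for the case where those branch lengths should be kept.

```python
import re

# Shortened, made-up tree in the same format as the one in the question.
tree = (
    "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726,Aar@AA_maker7399_1:0.1375,"
    "(Spa@Tp2g18720:0.0319,Cpl@CP2_g48793_DN3_SP8_c:0.0273):9.05e-05);"
)

# Remove "@" plus everything up to (not including) the next "," or ")".
# This drops the gene id and the leaf branch length together.
species_only = re.sub(r"@[^,)]+", "", tree)
print(species_only)

# Variant (an assumption about what may be wanted): also stop at ":",
# so only the gene id is removed and leaf branch lengths survive.
keep_lengths = re.sub(r"@[^,):]+", "", tree)
print(keep_lengths)
```

re.sub replaces every non-overlapping match by default, so no extra "global" flag is needed in Python.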