Question

Complex string filtering with python for gene tree phylogeny

4.6 years ago

mdgn ▴ 10

I have a long string that is a phylogenetic gene tree, and I want to do some very specific filtering on it.

(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Basically, every x@y is a species@gene_id pair. What I am trying to do is trim this down so that I only have x instead of x@y.

(Esy,Aar,(Spa,Cpl))...

I tried splitting the string first, but the problem is that the string has different 'split points' for what I want to achieve, i.e. some x@y parts end with a , and others with a ). I searched for a solution and came across regular expression operations, but I am new to Python and I wasn't sure whether that is what I should be focusing on. I also thought about strip(), but it seems I would need to specify the characters to be stripped for that.

The main problem is that there is no solid 'pattern' for me to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before an @ character.

Is there a method that can do what I want? I would be really glad if you could help me out with this problem. Thanks in advance.

python trimming • 785 views

updated 4.6 years ago by JC 13k • written 4.6 years ago by mdgn ▴ 10

You should be able to use a regex to globally replace every match of @[^,)]+ with '' (an empty string), which removes these substrings.

4.6 years ago by Ram 44k
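The regex suggested in the comment above could be tried in Python roughly as follows. This is only a minimal sketch using the standard `re` module; the `tree` string is a shortened, simplified excerpt of the tree from the question, and the variable names are made up for illustration. Note that because `:` is not excluded from the character class, the pattern also removes the branch length of each leaf, while branch lengths after a closing parenthesis are kept.

```python
import re

# Shortened, simplified excerpt of the tree from the question (illustration only).
tree = (
    "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,"
    "Aar@AA_maker7399_1:0.137507902808,"
    "(Spa@Tp2g18720:0.0318934795022,"
    "Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05);"
)

# Remove "@" plus every following character up to (not including)
# the next "," or ")". re.sub replaces all matches by default.
species_only = re.sub(r"@[^,)]+", "", tree)

print(species_only)
# (Esy,Aar,(Spa,Cpl):9.05326020871e-05);
```

This gives the `(Esy,Aar,(Spa,Cpl))`-style leaves asked for in the question, at the cost of dropping the leaf branch lengths.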

score 2 · Accepted Answer · 2020-04-27

$ echo "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;" | perl -pe "s/\@.+?:/:/g"

(Esy:0.0726396855636,Aar:0.137507902808,((Spa:0.0318934795022,Cpl:0.0273465005242):9.05326020871e-05,(((Bst:0.0332592496158,((Aly:0.0328569260951,Ath:0.0391706378372):0.0205924636564,(Chi:0.0954469923893,Cru:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco:0.00823215335663,Hlo:0.0085462978729):0.0144626717872,Hla:0.0225079453622):0.0206478928557,Hse:0.048590776459):0.0372829371381):0.00859075940423,(Esa:0.0378509854703,Aal:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Explanation of perl -pe "s/\@.+?:/:/g":

perl -pe "code" -> runs the Perl code on every line of input and prints each line after the operation
s/pattern/replacement/g -> looks for a pattern and substitutes all matches (g)
\@.+?: -> this is the pattern: it matches an "@" followed by any characters (".+"), as few as possible ("?"), up to the first ":"; each match is then replaced with ":"
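Since the question asked about Python, the same substitution as the Perl one-liner can be sketched with `re.sub`. This is a rough equivalent, not part of the original answer; the function name `strip_gene_ids` is hypothetical and the `tree` string is a shortened, simplified excerpt of the tree from the question.

```python
import re

# Shortened, simplified excerpt of the tree from the question (illustration only).
tree = (
    "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,"
    "(Spa@Tp2g18720:0.0318934795022,"
    "Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05);"
)


def strip_gene_ids(newick: str) -> str:
    """Replace each '@gene_id:' with ':' so only the species code remains."""
    # "@" followed by as few characters as possible (lazy ".+?") up to the
    # first ":". re.sub replaces every match, like the /g flag in Perl.
    return re.sub(r"@.+?:", ":", newick)


print(strip_gene_ids(tree))
# (Esy:0.0726396855636,(Spa:0.0318934795022,Cpl:0.0273465005242):9.05326020871e-05);
```

Unlike the character-class pattern from the comment, this version keeps every branch length, so the result is still a tree with distances.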