Complex string filtering with python for gene tree phylogeny
4.6 years ago
mdgn ▴ 10

I have a long string that is a phylogenetic gene tree and I want to do a very specific filtering.

(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Basically, every x@y is species@gene_id information. I am trying to trim this down so that I only have x instead of x@y:

(Esy,Aar,(Spa,Cpl))...

I tried splitting the string first, but the string has different 'split points' for what I want to achieve: some x@y parts end with a , and others with a ). I searched for a solution and came across regular expressions, but I am new to Python and could not tell whether that is what I should be focusing on. I also thought about strip(), but it seems I would need to specify the characters to be stripped.

The main problem is that there is no solid 'pattern' for me to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before an @ character.

Is there a method that can do what I want? I would be really glad if you could help me out. Thanks in advance.


You should be able to use a regex to globally replace every match of @[^,)]+ with '' (an empty string), which removes these substrings.
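Since the question asks for Python, here is a minimal sketch of that replacement using re.sub. The shortened tree string and the variable names are made up for illustration; the pattern is the one suggested above. Note that it leaves the branch lengths of internal nodes (the ones following a closing parenthesis) in place, so a second pass may be needed if only the topology is wanted.

```python
import re

# Shortened, made-up example in the same species@gene_id:length format.
tree = "(Esy@ESY15_g1:0.07,Aar@AA_maker7399_1:0.13,(Spa@Tp2g18720:0.03,Cpl@CP2_g4:0.02):9.05e-05);"

# Delete "@" plus everything up to (but not including) the next "," or ")".
# re.sub replaces every match, so no extra "global" flag is needed.
species_only = re.sub(r"@[^,)]+", "", tree)

print(species_only)
# (Esy,Aar,(Spa,Cpl):9.05e-05);
```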

4.6 years ago
JC 13k
$ echo "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;" | perl -pe "s/\@.+?:/:/g"

(Esy:0.0726396855636,Aar:0.137507902808,((Spa:0.0318934795022,Cpl:0.0273465005242):9.05326020871e-05,(((Bst:0.0332592496158,((Aly:0.0328569260951,Ath:0.0391706378372):0.0205924636564,(Chi:0.0954469923893,Cru:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco:0.00823215335663,Hlo:0.0085462978729):0.0144626717872,Hla:0.0225079453622):0.0206478928557,Hse:0.048590776459):0.0372829371381):0.00859075940423,(Esa:0.0378509854703,Aal:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Explanation of perl -pe "s/\@.+?:/:/g":

  • perl -pe "code" -> runs the Perl code on every input line and prints each line after the operation
  • s/pattern/replacement/g -> searches for the pattern and replaces every match (g = global)
  • \@.+?: -> the pattern: a literal "@", followed by one or more of any character (".+"), matched non-greedily ("?") so that it stops at the first ":"; each match is replaced with ":"
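
For anyone who prefers to stay in Python, the same substitution can be sketched with re.sub. This is only an illustration on a shortened, made-up tree string, and the variable names are arbitrary. As in the Perl one-liner, the branch lengths are kept.

```python
import re

# Shortened, made-up example in the same species@gene_id:length format.
tree = "(Esy@ESY15_g1:0.07,(Spa@Tp2g18720:0.03,Cpl@CP2_g4:0.02):9.05e-05);"

# "@.+?:" is non-greedy: it matches from "@" up to the first ":" that follows,
# and each match is replaced by a single ":" so the branch length stays attached.
with_lengths = re.sub(r"@.+?:", ":", tree)

print(with_lengths)
# (Esy:0.07,(Spa:0.03,Cpl:0.02):9.05e-05);
```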

This does not match OP's requirement. You don't need to retain the part after the :, so you can essentially eliminate everything from the @ (inclusive) up to the next , or ).


Then we can do:

$ echo "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;" | perl -pe "s/\@.+?://g; s/:*\d+\.\d+//g; s/e-\d+//g"

(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))));
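
A possible Python counterpart, offered as a sketch rather than a tested pipeline: strip_to_species is a hypothetical helper name, and the example tree is shortened and made up. It assumes gene IDs never contain ":", "," or ")" and that every branch length is a plain or scientific-notation number directly after a ":". Anchoring the second pattern on the colon means digits inside species names would not be touched.

```python
import re

def strip_to_species(newick: str) -> str:
    """Reduce species@gene_id:length labels to bare species names."""
    # Step 1: drop "@gene_id", stopping before the ":" (or "," / ")").
    no_ids = re.sub(r"@[^:,)]+", "", newick)
    # Step 2: drop every ":length", including scientific notation such as 9.05e-05.
    return re.sub(r":[0-9.eE+-]+", "", no_ids)

tree = ("(Esy@ESY15_g1:0.07,(Spa@Tp2g18720:0.03,Cpl@CP2_g4:0.02):9.05e-05,"
        "(Esa@T1:0.03,Aal@A1:0.07):1.0e-06);")

print(strip_to_species(tree))
# (Esy,(Spa,Cpl),(Esa,Aal));
```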

You are correct, but I would still say this is a helpful answer, and I believe I can solve the rest from this step. It is now easier for me to just remove everything between each : and the next , or ). I need to improve my regex skills.
