I think this might work, but it's a sort of 'brute force' way to do it. I would maybe convert your trees to cladograms by removing the branch lengths with a regex that matches each colon and the branch length after it (in whatever your favourite regex language is). Then you could simply grep, or string search in some other manner, for (Spec3,Spec2) and you'll find all trees which contain that grouping pretty easily.
e.g. strip the branch lengths, colons included, from the file (probably not the most elegant regex):
Given your tree:
((Spec4:0.529207,(Spec3:0.0803395,Spec2:0.0124315)),Spec1:0,Spec1:0);
One could do:
sed -E 's/:[0-9]+(\.[0-9]+)?//g' test.tree
Yielding:
((Spec4,(Spec3,Spec2)),Spec1,Spec1);
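If you have a whole directory of trees, the same idea can be wrapped in a small function. This is only a sketch: `strip_all` and the `.topo` suffix are names I made up, it assumes a sed that accepts `-E` (GNU and BSD sed both do), and it assumes the trees carry no internal node labels or support values, which this regex would leave behind.

```shell
# Hypothetical helper: write a branch-length-free copy of each tree file
# next to the original as <file>.topo, leaving the input untouched.
strip_all() {
    for f in "$@"; do
        # Remove ":<number>", where the number may look like 0, 0.529207 or 1.2e-05.
        # Anchoring on the colon keeps digits inside taxon names (e.g. Spec10) intact.
        sed -E 's/:[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?//g' "$f" > "$f.topo"
    done
}

# Usage (assumed file layout): strip_all *.tree
```

Writing to a separate file rather than editing in place means you keep the branch lengths around in case you need them later.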
Then you can string search your yielded trees:
egrep -r -l "Spec(2|3),Spec(2|3)" .
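This will give you all the filenames where Species 3 and Species 2 are adjacent in the string (in either orientation). Be aware that adjacency is looser than the (Spec3,Spec2) grouping: a tree ending in ,Spec2,Spec3); also matches, as does Spec2,Spec30. Adding the escaped parentheses to the pattern, \(Spec(2|3),Spec(2|3)\), restricts it to the two-taxon clade.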
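If what you want is strictly the two-taxon clade rather than any two neighbouring names, one option is to anchor the pattern on the literal parentheses. A rough sketch, where `has_cherry` is a hypothetical name and the taxon names are assumed to contain no regex metacharacters (dots, brackets and so on):

```shell
# Hypothetical helper: succeed if the input contains the clade (A,B) or (B,A).
# Reads the files given after the two names, or stdin if none are given.
has_cherry() {
    a=$1; b=$2; shift 2
    # \( and \) are literal parentheses in an extended regex, so Spec2
    # cannot accidentally match Spec20, nor a bare ",Spec2,Spec3" at the root.
    grep -E -q "\(($a,$b|$b,$a)\)" "$@"
}

# Usage (assumed file name): has_cherry Spec2 Spec3 sed.tree && echo "clade present"
```

This only finds the pair when both are leaves of the same bifurcation; it won't find a clade that is written with a nested group inside it.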
If you want to keep the branch lengths in your trees because you're not just interested in topology, you could concoct a regex for use with grep:
egrep "Spec(2|3):(0?|[0-9]+\.[0-9]+),Spec(2|3):(0?|[0-9]+\.[0-9]+)" treefile.tree
But having to conjure that regex for every possible combination of topologies looks awful to me, so I'd be inclined to try it without the branch lengths.
I don't know how many topologies you're interested in finding in all your trees - this approach may not be feasible if it's a prohibitively large number.
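For a moderate number of groupings, you could keep the pairs in a text file and loop over them instead of typing each regex by hand. A sketch under some assumptions: `find_cherries` and `pairs.txt` are made-up names, each line of the pairs file is "A,B", the trees have already had their branch lengths stripped, and taxon names contain no regex metacharacters.

```shell
# Hypothetical helper: for each "A,B" pair read from stdin (one per line),
# list the files among the arguments that contain the clade (A,B) or (B,A).
find_cherries() {
    while IFS=, read -r a b; do
        [ -n "$a" ] && [ -n "$b" ] || continue   # skip blank or malformed lines
        printf '== (%s,%s)\n' "$a" "$b"
        # </dev/null stops grep from eating the pair list if no files were given;
        # || true keeps a pair with no hits from being treated as a failure.
        grep -E -l "\(($a,$b|$b,$a)\)" "$@" </dev/null || true
    done
}

# Usage (assumed file names): find_cherries *.tree.topo < pairs.txt
```

This still only covers two-taxon clades, so for larger or nested groupings a proper tree library is probably the saner route.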
Slightly more complex, if you'd like to see the match, and the file name, this is an option:
2 example sed-treated trees:
((Spec4,(Spec5,Spec6)),Spec2,Spec3);
((Spec4,(Spec3,Spec2)),Spec1,Spec1);
Passing a 'dummy filename' in the form of /dev/null tricks grep into printing the filename (as it thinks it's working on multiple files) along with the actual match itself, which it prints by default:
for file in *.tree ; do egrep "Spec(2|3),Spec(2|3)" "$file" /dev/null ; done
Would yield:
sed2.tree:((Spec4,(Spec5,Spec6)),Spec2,Spec3);
sed.tree:((Spec4,(Spec3,Spec2)),Spec1,Spec1);
With the appropriate string matches highlighted (if your terminal is configured for it).
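As an alternative to the /dev/null trick, greps that support the `-H` flag (GNU and BSD grep both do, though older POSIX specifications don't list it) will print the filename even for a single file. A small sketch, with `show_matches` being a hypothetical wrapper name:

```shell
# Hypothetical wrapper: print "filename:matching line" for every match,
# even when only one file is given.
show_matches() {
    pattern=$1; shift
    # -E: extended regex, -H: always prefix each match with the file name.
    grep -E -H "$pattern" "$@"
}

# Usage (assumed file names): show_matches 'Spec(2|3),Spec(2|3)' *.tree
```

With this there is no need for the for loop, since grep takes all the files in one go.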
I'm not aware of a tool to subset trees based on topology. Yes, a script/regex could help.
I am wondering if you have the images of the trees? If you do, maybe it would be interesting to try a deep learning / computer vision-based approach here?
Dear Khader Shameer, atm I don't have the images of the trees (but could get them). Thanks for your reply.