On CBioInfCpp.h as a C++ lib containing some functions for bioinformatics
Dear Sirs.
Though I am not a professional programmer, bioinformatics is a very interesting interdisciplinary field for me.
As I see it, Python is the "standard language" in this field.
But when I solved problems at rosalind.info, I used C++. As a result, a "lib of some functions" was born.
The lib contains 3 groups of functions. The first is input/output: functions that read and write vectors, matrices and graphs from or to a file with a single command, as in Python.
The second group is "Working with strings". It contains functions ranging from computing GC-content, Edit Distance etc. to finding all mutated strings in a given one.
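To give a feel for the kind of string helpers meant here, below is a minimal sketch of GC-content and Edit Distance. The names gcContent and editDistance are hypothetical and are not the library's actual signatures; the edit distance is the usual Levenshtein distance with unit costs, which is an assumption about what the library computes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Fraction of G and C characters in s (0.0 for an empty string).
// Hypothetical helper, not the library's API.
double gcContent(const std::string& s) {
    if (s.empty()) return 0.0;
    std::size_t gc = 0;
    for (char c : s) {
        if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
    }
    return static_cast<double>(gc) / static_cast<double>(s.size());
}

// Levenshtein edit distance with unit costs, keeping only two rows of the table.
// Hypothetical helper, not the library's API.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);  // match or substitute
            int del = prev[j] + 1;                                   // delete from a
            int ins = cur[j - 1] + 1;                                // insert into a
            cur[j] = std::min({sub, del, ins});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For example, gcContent("AGCT") gives 0.5 and editDistance("kitten", "sitting") gives 3.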
The third is "Working with graphs". A data structure called "Adjacency vector" is proposed. Note that, in the general case, vertices may be labelled with negative integers, and graphs may have multiple loops and multiple edges.
Some functions, such as Eulerian Cycle, Path finding, topological sorting etc., are implemented.
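As an illustration of working with such a structure, here is a sketch of topological sorting (Kahn's algorithm). It assumes that an unweighted "Adjacency vector" is a flat std::vector<int> in which every two consecutive integers describe one directed edge; that layout, and the name topologicalOrder, are assumptions made for this example, so please check the library's own description before relying on them. Vertices are stored in a std::map so that negative labels and multiple edges work.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

// Assumed layout: edges[2*i] -> edges[2*i + 1] is the i-th directed edge.
// Returns one topological order of the vertices, or an empty vector if the
// graph contains a cycle or loop (or the input has an odd number of entries).
std::vector<int> topologicalOrder(const std::vector<int>& edges) {
    if (edges.size() % 2 != 0) return {};
    std::map<int, std::vector<int>> successors;  // keyed by label, so negatives are fine
    std::map<int, int> inDegree;
    for (std::size_t i = 0; i + 1 < edges.size(); i += 2) {
        int from = edges[i], to = edges[i + 1];
        successors[from].push_back(to);
        successors[to];   // register sinks as vertices too
        inDegree[from];   // inserts 0 if the vertex is new
        ++inDegree[to];   // multiple edges are counted separately
    }
    std::queue<int> ready;
    for (const auto& kv : inDegree)
        if (kv.second == 0) ready.push(kv.first);
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front();
        ready.pop();
        order.push_back(v);
        for (int w : successors[v])
            if (--inDegree[w] == 0) ready.push(w);
    }
    if (order.size() != inDegree.size()) return {};  // some vertices sit on a cycle
    return order;
}
```

With the edges -1->2, 2->5, -1->5 and a second copy of -1->2, i.e. the vector {-1, 2, 2, 5, -1, 5, -1, 2}, the function returns -1, 2, 5.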
Might it be useful for some tasks?
I understand that this lib lacks a great many features. For example, it cannot yet work with bioinformatics databases, and I am not able to implement that on my own.
Freely distributed source code and info are here:
https://drive.google.com/open?id=1FQwsQm2kG_nTO45ab0yj52xtp6_B4IB2
and here: https://github.com/chernouhov/CBioInfCpp-0-
My profile at rosalind.info:
http://rosalind.info/users/chernouhov/
Best regards, Chernouhov Sergey
23.06.2019 update:
- Group of functions "FindIn" has been updated.
- Functions PairVectorCout, PairVectorFout have been updated.
- Group of functions "GraphCout" and "GraphFout" has been added. One may now "cout/fout" a graph given as an Adjacency vector to the screen or to a file line by line, one edge per line.
- Function "StrToCircular" has been added for finding the circular string of minimal length of the given one.
- Group of functions "MaxFlowGraph" has been added to help find the Maximal Flow, the paths of the maximal flow network and the max-flow min-cut in a graph.
- A data structure "Adjacency map" (a modification of the "Adjacency vector" data structure for holding graphs) has been added. Adjacency map allows quicker access to an edge’s weight, but it cannot work with multiple edges.
- Functions AdjVectorToAdjMap and AdjMapToAdjVector for converting an Adjacency vector to an Adjacency map and back have been added. Note that multiple edges will be joined together.
- Function TandemRepeatsFinding has been added. It is intended for finding tandem repeats in a given string, which may be useful for solving problems related to Microsatellite Instability etc.
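The idea behind the Adjacency map and the two conversions can be sketched roughly as follows. This is only an illustration under stated assumptions: a weighted Adjacency vector is taken to be a flat std::vector<int> of (from, to, weight) triples, the map is keyed by the (from, to) pair, and "joined together" is interpreted as summing the weights of multiple edges. The names AdjMap, toAdjMap and toAdjVector are hypothetical; the library's real functions may differ in all of these points.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "Adjacency map": (from, to) -> weight.
using AdjMap = std::map<std::pair<int, int>, int>;

// Assumed layout: weighted[3*i], weighted[3*i + 1], weighted[3*i + 2]
// are the start vertex, end vertex and weight of the i-th edge.
AdjMap toAdjMap(const std::vector<int>& weighted) {
    AdjMap m;
    for (std::size_t i = 0; i + 2 < weighted.size(); i += 3) {
        // Assumption: multiple edges are merged by adding their weights.
        m[std::make_pair(weighted[i], weighted[i + 1])] += weighted[i + 2];
    }
    return m;
}

// Back to the flat triple layout, one triple per distinct (from, to) pair,
// in the sorted order of the map keys.
std::vector<int> toAdjVector(const AdjMap& m) {
    std::vector<int> out;
    out.reserve(m.size() * 3);
    for (const auto& e : m) {
        out.push_back(e.first.first);
        out.push_back(e.first.second);
        out.push_back(e.second);
    }
    return out;
}
```

The map gives a logarithmic-time lookup of a weight by its edge, whereas the flat vector has to be scanned; the price is that the round trip vector -> map -> vector loses the individual multiple edges.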
14.07.2019 update:
- Function CIGAR1 has been added.
- Group of functions "GraphCout" and "GraphFout" has been updated (one may now "cout/fout" a graph given as either an Adjacency vector or an Adjacency map to the screen or to a file line by line, one edge per line).
- Function EditDistA has been added as an extended version of the function EditDist (it returns not only the value of the Edit Distance between 2 strings but also one possible version of the alignment itself).
09.08.2019 update:
- Group of functions "NBPaths" (for finding maximal non-branching paths in a graph, weighted or not, directed or not) has been added.
- Functions ConsStringQ1 and ConsStringQ2 for building a consensus string from a given collection of strings according to their quality have been added. Note that, because little test data was available, errors may be found here (please notify me if you find any).
31.08.2019 update:
- Function GenRandomUWGraph that generates a random unweighted graph (as its "Adjacency vector") has been added.
- Group of functions has been added for finding the collection of vertices of each strongly connected component of a directed graph and of each connected component of an undirected graph.
- Group of functions has been added for counting edge multiplicity in a graph given as an Adjacency vector.
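For the undirected case, collecting the vertices of each connected component can be sketched like this. As before, the flat layout (two consecutive integers per edge) and the name connectedComponents are assumptions for the example and not the library's actual interface; the search itself is an ordinary depth-first traversal with an explicit stack.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Assumed layout: edges[2*i] and edges[2*i + 1] are the endpoints of the
// i-th undirected edge. Returns one set of vertices per connected component,
// ordered by the smallest vertex label in each component.
std::vector<std::set<int>> connectedComponents(const std::vector<int>& edges) {
    std::map<int, std::vector<int>> neighbours;
    for (std::size_t i = 0; i + 1 < edges.size(); i += 2) {
        neighbours[edges[i]].push_back(edges[i + 1]);
        neighbours[edges[i + 1]].push_back(edges[i]);
    }
    std::set<int> seen;
    std::vector<std::set<int>> components;
    for (const auto& kv : neighbours) {
        if (seen.count(kv.first)) continue;  // already placed in a component
        std::set<int> component;
        std::vector<int> stack{kv.first};
        seen.insert(kv.first);
        while (!stack.empty()) {
            int v = stack.back();
            stack.pop_back();
            component.insert(v);
            for (int w : neighbours[v])
                if (seen.insert(w).second) stack.push_back(w);  // first visit only
        }
        components.push_back(component);
    }
    return components;
}
```

For the edges 1-2, 2-3, -4 - -5 and the loop 7-7, this yields the three components {-5, -4}, {1, 2, 3} and {7}.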
19.10.2019 update:
- Added group of functions AdjVectorToAdjMegaMap, AdjMegaMapToAdjVector to convert an Adjacency vector to/from an Adjacency mega-map (i.e. an extended version of Adjacency map that can contain graphs with different multiple edges).
- Updated group of functions GraphCout and GraphFout to deal with mega-maps.
03.11.2019 update:
- Group of functions Num updated.
- Function ScoreStringMatrix that counts the score (i.e. total number of mismatches) for a vector of strings s added.
- Function GPPM that generates a position probability matrix (PPM) added. Note that pseudocounts may be used (the formula (Ns+z)/(N+2*z) is implemented).
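The score and the position probability matrix mentioned above can be sketched as follows for DNA strings of equal length. The names baseIndex, columnCounts, scoreStrings and positionProbabilityMatrix are hypothetical, and the score is assumed to be the usual motif score: for each column, the number of strings that differ from the most frequent base. The pseudocount formula (Ns+z)/(N+2*z) is copied from the note above exactly as written; with a four-letter alphabet the columns then sum to 1 only for z = 0, so the normalisation should be checked against the library itself.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Index of a nucleotide in "ACGT", or -1 for any other character.
int baseIndex(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Per-column counts of A, C, G, T. Assumes all strings have the same length.
std::vector<std::array<int, 4>> columnCounts(const std::vector<std::string>& s) {
    std::size_t len = s.empty() ? 0 : s[0].size();
    std::vector<std::array<int, 4>> counts(len, std::array<int, 4>{{0, 0, 0, 0}});
    for (const std::string& row : s)
        for (std::size_t j = 0; j < len && j < row.size(); ++j) {
            int b = baseIndex(row[j]);
            if (b >= 0) ++counts[j][b];
        }
    return counts;
}

// Assumed score: sum over columns of (number of strings - count of the most common base).
int scoreStrings(const std::vector<std::string>& s) {
    int score = 0;
    for (const auto& col : columnCounts(s))
        score += static_cast<int>(s.size()) - *std::max_element(col.begin(), col.end());
    return score;
}

// ppm[j][b] = (Ns + z) / (N + 2*z), where Ns is the count of base b in column j,
// N is the number of strings and z is the pseudocount (formula as quoted in the post).
std::vector<std::array<double, 4>> positionProbabilityMatrix(
        const std::vector<std::string>& s, double z) {
    std::vector<std::array<double, 4>> ppm;
    double n = static_cast<double>(s.size());
    for (const auto& col : columnCounts(s)) {
        std::array<double, 4> p;
        for (int b = 0; b < 4; ++b) p[b] = (col[b] + z) / (n + 2.0 * z);
        ppm.push_back(p);
    }
    return ppm;
}
```

For the four strings ACGT, ACGA, TCGT and AAGT the score is 3, and without pseudocounts the first column has probability 0.75 for A and 0.25 for T.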
26.11.2019
For further updates please see here: A: CBioInfCpp.h as a C++ lib containing some functions for bioinformatics
For eliminating bottlenecks you might want to look into scipy/weave, since it enables you to embed C/C++ code directly into your Python scripts. See http://www.scipy.org/PerformancePython for details.
I'm assuming you know your stuff and have tried this, but whenever performance comes up, it's worth mentioning that profiling to identify bottlenecks can be very useful. If there is an obvious rate-limiting step, you may be able to extend Python and just write a C function for that one small piece of the puzzle, rather than switching wholesale to C.
If you need to parse some structured data, lex/yacc is your friend. For XML use libxml2; for ASN.1 use the NCBI asntool.
Thanks Chris, the originator of this question has been pointed to this thread - I'm sure they appreciate the comments about identifying bottlenecks as well.