I have a large amount of recorded data (perhaps hundreds of thousands of records) that I need to break down so that I can both classify it and generate "typical" data myself. Allow me to elaborate...
If I have the following data strings:
142T339G1P112S
164T797F5A498S
144T989B9B223T
155T928X9Z554T
... You may begin to deduce the following:
The fourth, eighth, tenth, and fourteenth characters may always be alphas, while the rest are numeric
The first character may always be a '1'
The fourth character may always be the letter 'T'
The fourteenth character may be confined to just being 'S' or 'T'
And so on...
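To make the kind of deduction I mean concrete, here is a naive Python sketch that tallies, for each character position, which character classes and which literal values appear. The names `char_class` and `profile` are made up for illustration, and the sketch assumes fixed-position records like the four samples above; it is not meant as a real solution, only as a picture of the "rules" I would like a library to find.

```python
from collections import Counter

def char_class(ch):
    """Coarse character class used for the per-position tally."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"

def profile(samples):
    """Tally string lengths and, per position, the classes and values seen."""
    length_counts = Counter(len(s) for s in samples)
    positions = []
    # zip(*samples) walks the strings column by column; it stops at the
    # shortest string, so this assumes roughly fixed-length records.
    for column in zip(*samples):
        positions.append({
            "classes": Counter(char_class(c) for c in column),
            "values": Counter(column),
        })
    return length_counts, positions

samples = ["142T339G1P112S", "164T797F5A498S",
           "144T989B9B223T", "155T928X9Z554T"]
lengths, positions = profile(samples)

# 1-based positions where only letters have been observed so far
alpha_positions = [i + 1 for i, p in enumerate(positions)
                   if set(p["classes"]) == {"alpha"}]
print(alpha_positions)                  # [4, 8, 10, 14]
print(sorted(positions[13]["values"]))  # ['S', 'T']
```

This only counts what has been seen; it says nothing yet about how much to trust each candidate rule.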
Some of these "rules" may evaporate when additional samples of real data are obtained; if you see a 15-character string, you have proof that a rule such as "all strings are 14 characters long" is wrong. However, if you have a sufficiently large sample of strings that are all exactly 14 characters long, you can begin to assume that "all strings are 14 characters long" and assign a numerical figure to your degree of confidence (under appropriate assumptions, chiefly that you're seeing a suitably random sample of all possible captured data). As you might expect, a person can do a lot of this classification by eye, but I'm not aware of any libraries or methods that would let a machine do it.
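As an example of the kind of confidence figure I have in mind: if a rule has held for every one of n independently and randomly drawn records, the standard zero-failure binomial bound gives the largest violation rate still consistent with that at a chosen confidence level (for 95% this is roughly 3/n, the so-called "rule of three"). The function name `max_violation_rate` is hypothetical, and the random-sampling assumption may well not hold for my data.

```python
def max_violation_rate(n_consistent, confidence=0.95):
    """Upper bound on the true violation rate of a rule that held in all
    n_consistent samples, assuming independent random sampling.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n_consistent <= 0:
        raise ValueError("need at least one observed sample")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_consistent)

# With 100,000 records that all obey the rule, the bound is close to 3/n.
bound = max_violation_rate(100_000)
print(f"{bound:.2e}")
```

So four samples say very little, while a few hundred thousand consistent records would justify treating a rule as very nearly always true, provided the sample really is representative.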
Is there a library that I can use in my code to do this type of classification for me, identifying "rules" with a specific degree of confidence, given a large set of captured data (much more complicated than the above...)?
At a guess, and according to this article, Python or Java (or maybe Perl or R) are the "common" languages most likely to have these kinds of tools, and perhaps certain bioinformatics libraries do as well. I don't care what language I have to use; I just need to tackle the problem in whatever way I can.
Any pointers or references would be very helpful. As you can probably guess, I'm having trouble describing the problem precisely, and there may be a set of relevant terms I can search for on Google to point me in the right direction.