I'm trying to compare a published data source to my own, but the data provided in the paper's supplement is in a very inconvenient format. There is definitely a pattern: systematic name, common name, function, then five separate expression values. The problem is that the fields are not separated by a consistent delimiter such as commas or semicolons, and the line breaks are irregular, so I can't rely on, say, every other line holding the expression values. How can I turn this stream of data into a delimited file or table? I'm familiar with Python, Unix, and R, so I'd appreciate recommendations using those. I was thinking of extracting the first word and the last five numbers of each record, but I'm a little lost as to how to go about it, since it's just a stream of data. Thanks.
Here is a sample of the data:
YLL053C unknown "unknown; similar to putative aquaporin Ypr192p, member of"
6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" 3.819486035
3.63787768 5.394170324 5.055785352 4.476829848
Here is what I want (I only really need the genes and expression values):
YLL053C 6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W 5.83089555 5.654319063 3.215216985 3.408089094 4.527130173
YKL096W 3.819486035 3.63787768 5.394170324 5.055785352 4.476829848
or
YLL053C; 6.58901734; 8.105915084; 6.149733501; 5.380500555; 6.55629162
YOL058W; 5.83089555; 5.654319063; 3.215216985; 3.408089094; 4.527130173
YKL096W; 3.819486035; 3.63787768; 5.394170324; 5.055785352; 4.476829848
You only show 3 lines here, but it looks like the ID you want is always in the first column and expression values are always the last columns. In this case a simple regular expression should do the trick.
Right, there is a pattern, but I'm unsure what kind of "simple regular expression" would work.
Something like (\S+)\s+.+\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)
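For what it's worth, here is a rough Python sketch of how a pattern along these lines might be applied to the whole stream rather than line by line. It is a variation, not the exact expression above: the description part is matched lazily with re.DOTALL so that it can span line breaks, and the five values are captured as one group. The names RECORD, extract_with_regex and sample are made up for illustration, and it assumes the values are plain non-negative decimals and that the description never contains a run of five decimals of its own.

```python
import re

# First token, then a lazily matched description (may span line breaks),
# then five whitespace-separated decimal numbers captured as one group.
# Assumption: values look like 6.58901734 (no sign, no exponent).
RECORD = re.compile(
    r"(\S+)\s.*?(?<=\s)((?:[0-9]+\.[0-9]+\s+){4}[0-9]+\.[0-9]+)",
    re.DOTALL,
)

def extract_with_regex(text):
    """Return one [gene, v1, ..., v5] list per record found in text."""
    return [[m.group(1)] + m.group(2).split() for m in RECORD.finditer(text)]

# Two records copied from the sample in the question.
sample = '''YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" 3.819486035
3.63787768 5.394170324 5.055785352 4.476829848
'''

for row in extract_with_regex(sample):
    print("\t".join(row))
```

Note that this takes the first run of five decimals after the gene name, so a description that ends in a decimal number right before the values would shift the result by one.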
I believe you would run into errors with this if there were numbers in the gene description; it's best to add a $ to the end of the pattern so that it matches the last five values. I believe this might also be more memory efficient than splitting first, but in this case the difference should be negligible.
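For comparison, the "split first" route could look roughly like the Python sketch below. It splits the whole stream on whitespace and treats the first non-numeric token after a run of at least five numbers as the start of the next record, keeping only the last five numbers of each run, so stray numbers inside a description should not matter. The names NUMBER, parse_records and SAMPLE are hypothetical, and the sketch assumes that every record ends with its five values and that a gene name never looks like a decimal number.

```python
import re

# Assumption: expression values are decimals such as 6.58901734,
# optionally with a leading minus sign.
NUMBER = re.compile(r"^-?\d+\.\d+$")

def parse_records(text, n_values=5):
    """Return one [gene, v1, ..., vN] list per record in the token stream."""
    rows = []
    gene = None      # first token of the record being read
    trailing = []    # most recent unbroken run of numeric tokens
    for token in text.split():
        if NUMBER.match(token):
            if gene is not None:
                trailing.append(token)
            continue
        # A non-numeric token after enough numbers starts a new record.
        if gene is not None and len(trailing) >= n_values:
            rows.append([gene] + trailing[-n_values:])
            gene = None
        if gene is None:
            gene = token
        trailing = []
    # Flush the final record.
    if gene is not None and len(trailing) >= n_values:
        rows.append([gene] + trailing[-n_values:])
    return rows

# The three records from the question.
SAMPLE = '''YLL053C unknown "unknown; similar to putative aquaporin Ypr192p, member of"
6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" 3.819486035
3.63787768 5.394170324 5.055785352 4.476829848
'''

for row in parse_records(SAMPLE):
    print("; ".join(row))
# YLL053C; 6.58901734; 8.105915084; 6.149733501; 5.380500555; 6.55629162
# YOL058W; 5.83089555; 5.654319063; 3.215216985; 3.408089094; 4.527130173
# YKL096W; 3.819486035; 3.63787768; 5.394170324; 5.055785352; 4.476829848
```

For a real file, the text could be read with open(path).read() and the rows written out with the csv module; for a supplement of this size, holding everything in memory should not be a concern.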