How To Separate Gene Ontology From Microarray Data When There Aren't Separators?
12.7 years ago
Jason ▴ 940

I'm trying to compare a published data set to my own; however, the data provided in the paper's supplement is in a very inconvenient format. There is definitely a pattern: systematic name, common name, function, then five separate expression values. The problem is that the fields are not separated by a distinguishing delimiter such as commas or semicolons, and the line breaks follow no pattern (it is not the case that, say, every other line holds expression values). My question is: how can I turn this stream of data into a table or a delimited file? I'm familiar with Python, Unix and R, so recommendations using those languages/programs would be appreciated. I was thinking of extracting the first word and the last five numbers of each record, but I'm a little lost as to how to go about it since it's just a stream of data. Thanks

Here is a sample of the data:

YLL053C unknown "unknown; similar to putative aquaporin Ypr192p, member of"
6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" 3.819486035
3.63787768 5.394170324 5.055785352 4.476829848

Here is what I want (I only really need the genes and expression values):

YLL053C    6.58901734    8.105915084  6.149733501 5.380500555  6.55629162
YOL058W  5.83089555    5.654319063  3.215216985  3.408089094  4.527130173
YKL096W   3.819486035  3.63787768   5.394170324  5.055785352  4.476829848

or

YLL053C; 6.58901734;    8.105915084;  6.149733501; 5.380500555;  6.55629162
YOL058W; 5.83089555;    5.654319063;  3.215216985;  3.408089094;  4.527130173
YKL096W;  3.819486035;  3.63787768;   5.394170324;  5.055785352;  4.476829848
microarray data analysis • 3.3k views

You only show 3 lines here, but it looks like the ID you want is always in the first column and expression values are always the last columns. In this case a simple regular expression should do the trick.


Right, there is a pattern, but I'm unsure what kind of "simple regular expression" would work.


Something like (\S+)\s+.+\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)\s+([0-9]\.[0-9]+)


I believe you would encounter errors with this if there were numbers in the gene description. It is best to add a $ to the end of the expression so that it matches the last five numbers. I believe this might also be more memory-efficient than splitting first, but in this case the difference should be negligible.
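
A minimal Python sketch of the anchored expression, assuming each record has already been joined onto a single line. The function name `parse_record` is hypothetical, and the number pattern `[0-9]+\.[0-9]+` (which also allows more than one digit before the decimal point) is an assumption about the data, not something the paper's supplement guarantees.

```python
import re

# One expression value: digits, a literal dot, digits (assumed number format).
NUM = r"([0-9]+\.[0-9]+)"

# First token, an optional description, then exactly five numbers
# anchored to the end of the line with $.
RECORD = re.compile(r"^(\S+)(?:\s+.*)?\s+" + r"\s+".join([NUM] * 5) + r"\s*$")


def parse_record(line):
    """Return (gene, [five floats]), or None if the line does not fit."""
    m = RECORD.match(line)
    if m is None:
        return None
    return m.group(1), [float(x) for x in m.groups()[1:]]


line = ("YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase "
        "5.83089555 5.654319063 3.215216985 3.408089094 4.527130173")
print(parse_record(line))
```

Because the five numeric groups must end at `$`, a number that happens to appear inside the description is left in the description rather than being counted as an expression value.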

12.7 years ago

I'm not 100% sure about your file format, but if each record is on a single row, then this should get you what you need:

awk -F" " '{print $1, $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF}' filename

where you replace filename with your file. This says: split each line on spaces and print the first column followed by the last five columns (the 4th from last through the last), where NF is the number of fields on the current line.


Thanks, I didn't realize you could do that with awk, but that didn't really do anything for me. When I use that script I get:

YLL053C putative aquaporin Ypr192p, member of"
6.58901734 6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 5.654319063 3.215216985 3.408089094 4.527130173 5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" awk: trying to access out of range field -2
input record number 6, file sample_table.txt
source line number 1

I'm not sure what this means. Maybe there isn't just one row?


Can you check whether it is a space or a tab between the identifier (YLL053C) and the rest of the text? Ideally, it should be organized as

<Identifier> <GO> <expr1> <expr2> <expr..>

separated by tabs (since the GO part can itself contain spaces). Then use awk/cut to get the first column and the expression columns. If the fields are separated by spaces rather than tabs, I would grab the first item (always the ID) and then the last five items (the expression values). If you know how to use awk or Perl, it should not take too long :)
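
A small Python sketch of the "first item plus last five items" idea, assuming one complete record per line. The helper name `id_and_expression` and its `n_values` parameter are hypothetical; the function splits on tabs when the line contains any and on runs of whitespace otherwise.

```python
def id_and_expression(line, n_values=5):
    """Return the first field and the last n_values fields of one record."""
    line = line.rstrip("\n")
    # Prefer tabs when present, since the description may contain spaces.
    fields = line.split("\t") if "\t" in line else line.split()
    if len(fields) < n_values + 1:
        raise ValueError(
            f"expected at least {n_values + 1} fields, got {len(fields)}"
        )
    return [fields[0]] + fields[-n_values:]


record = ("YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase "
          "5.83089555 5.654319063 3.215216985 3.408089094 4.527130173")
print(";".join(id_and_expression(record)))
```

Raising an error on short lines, instead of indexing blindly, makes wrapped records visible early.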


Can you post your file somewhere? I think the whitespace formatting is not being displayed properly here. I assumed it was all space-delimited.


Upvote for -F; I have been using 'BEGIN {FS=" "}' instead.

12.7 years ago
Qdjm 1.9k

The number format seems very consistent; I bet you could scan for [0-9]\.[0-9]+ to match the values uniquely. Also, it looks like a gene name follows every five numbers. So, after you grab the fifth number, you can count on the next token being the gene name.

So here's the pseudo-code:

i = 0
do
  GeneName[i] = next string token
  for j = 1 to 5
    ExpressionValue[i,j] = next token matching "[0-9]\.[0-9]+"
  end
  i = i+1
while (not at_file_end)
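
One way to turn the pseudo-code above into Python is sketched here. It treats the whole file as a stream of whitespace-separated tokens, so it does not matter where the line breaks fall. The name `parse_stream` is hypothetical, and the number pattern assumes that expression values always contain a decimal point; a description token that happens to look like such a number would be miscounted.

```python
import re

# A token counts as an expression value if it is digits, a dot, digits.
NUMBER = re.compile(r"^[0-9]+\.[0-9]+$")


def parse_stream(text, n_values=5):
    """Yield (gene, values) pairs from a whitespace-separated token stream."""
    gene, values = None, []
    for token in text.split():
        if gene is None:
            gene = token  # the first token of a record is the gene name
        elif NUMBER.match(token):
            values.append(token)
            if len(values) == n_values:
                yield gene, values  # record complete; start the next one
                gene, values = None, []
        # any other token is part of the description and is skipped


SAMPLE = '''YLL053C unknown "unknown; similar to putative aquaporin Ypr192p, member of"
6.58901734 8.105915084 6.149733501 5.380500555 6.55629162
YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase 5.83089555
5.654319063 3.215216985 3.408089094 4.527130173
YKL096W CWP1 cell wall protein "beta1,6glucan
acceptor" 3.819486035
3.63787768 5.394170324 5.055785352 4.476829848
'''

for gene, values in parse_stream(SAMPLE):
    print("\t".join([gene] + values))
```

The values are kept as strings so that the digits are written out exactly as they appear in the input. An incomplete record at the end of the input is silently dropped in this sketch.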
12.7 years ago
Georg Summer ▴ 140

From the sample, a " is the last thing before the numbers (the expression values).

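#include <iostream>
#include <string>

int main() {
  std::string line;
  while (std::getline(std::cin, line)) {
    // position of the last double quote on the line (npos if there is none)
    std::size_t end_of_text = line.rfind('"');
    if (end_of_text == std::string::npos) {
      std::cout << line << '\n';  // no quote: pass the line through unchanged
      continue;
    }

    std::string text = line.substr(0, end_of_text);
    std::string expression = line.substr(end_of_text + 1);

    // depending on what you want now:
    // substitute all blanks with \t or ;

    std::cout << text << expression << '\n';
  }
}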

All of the numbers are expression values; there are five in total. Is this in Python?


I believe that is C++.

12.7 years ago
Ashwin ▴ 110

cat test.txt | sed -e 's/\(.\{,7\}\).*[a-zA-Z\"]\s\([0-9]\.[0-9].*\)/\1\t\2/' | sed 's/\s/\t/g' > out.txt


I got the following error when I tried your sed code:

sed: 1: "s/(.{,7}).*[a-zA-Z ...": RE error: invalid repetition count(s)

I'm not very familiar with sed, could you break down what you are trying to do?


"\(.\{,7\}\)" = captures the identifier (up to the first 7 characters); this is \1 in "\1\t\2"

".*[a-zA-Z\"]\s" = matches text of any length that ends with a letter or a double quote followed by a whitespace character

"\([0-9]\.[0-9].*\)" = captures all expression values; this is \2 in "\1\t\2"

Basically, \( \) captures the text matched by the enclosed regex so that it can be used later. Can you try replacing the bounding single quotes ( ' ) with double quotes ( " )? For example, sed -e "s/expression//". With single quotes it worked on my machine. On a second note, I believe the file you are trying to transform doesn't follow the format uniformly. There are identifiers with empty expression values, which I guess you can safely ignore.
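
Since the question mentions Python, the same two substitutions can be sketched with the `re` module, which sidesteps differences between sed implementations. The "invalid repetition count(s)" error is likely caused by the interval `\{,7\}`, which has no lower bound; the sketch spells it `{0,7}`. The function name `id_and_values` is hypothetical, and the code assumes one complete record per line.

```python
import re

# Identifier (up to 7 characters), then anything up to the last letter or
# double quote that is followed by whitespace and a number; keep the rest.
PATTERN = re.compile(r'^(.{0,7}).*[a-zA-Z"]\s([0-9]\.[0-9].*)$')


def id_and_values(line):
    """Mimic the two sed substitutions on one complete record."""
    reduced = PATTERN.sub(r"\1\t\2", line.rstrip("\n"))
    # Second substitution: turn every whitespace character into a tab.
    return re.sub(r"\s", "\t", reduced)


record = ("YOL058W ARG1 arginine biosynthesis arginosuccinate synthetase "
          "5.83089555 5.654319063 3.215216985 3.408089094 4.527130173")
print(id_and_values(record))
```

As with the sed command, a line that does not match the first pattern is left as it is apart from the whitespace-to-tab conversion, so wrapped records still need to be joined beforehand.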
