I have the following question. I have many files, each holding one expression value per gene. There might be a missing value, and in that case I want to skip the sequence.
I need to combine them into one matrix. The gene ID is in the second column of each file.
Example files are:
WT_condition1
2 AAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAA
9 AAAAAAAAAAAAAAAAAAAA
1 AAAAAAAAAAAAAAAAAAAAA
10 AAAAAAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAAAAAA
WT_condition2
1 AAAAAAAAAAAAAAAAAA
4 AAAAAAAAAAAAAAAAAAA
10 AAAAAAAAAAAAAAAAAAAA
111 AAAAAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAAAAA
14 AAAAAAAAAAAAAAAAAAAAAAA
WT_condition3
12 AAAAAAAAAAAAAAAAAA
40 AAAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAAA
32 AAAAAAAAAAAAAAAAAAAAA
2 AAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAA
etc.
I would highly appreciate your help.
It's unclear to me which format you have and which format you want, could you explain again?
Here again is the format of the files and the desired output. Columns are separated by tabs, and some values are missing; these should be replaced by 0. Thank you again for your help. The lines may show up double-spaced because, for whatever reason, I cannot get them to display otherwise. In the actual files there are no blank lines between rows.
file1
2 AAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAA
9 AAAAAAAAAAAAAAAAAAAA
1 AAAAAAAAAAAAAAAAAAAAA
10 AAAAAAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAAAAAA
file2
10 AAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAA
23 AAAAAAAAAAAAAAAAAAAA
14 AAAAAAAAAAAAAAAAAAAAA
22 AAAAAAAAAAAAAAAAAAAAAA
30 AAAAAAAAAAAAAAAAAAAAAAA
file3
30 AAAAAAAAAAAAAAAAAA
15 AAAAAAAAAAAAAAAAAAAA
24 AAAAAAAAAAAAAAAAAAAAA
42 AAAAAAAAAAAAAAAAAAAAAA
29 AAAAAAAAAAAAAAAAAAAAAAA
file4
33 AAAAAAAAAAAAAAAAAAA
90 AAAAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAAAA
8 AAAAAAAAAAAAAAAAAAAAAA
2 AAAAAAAAAAAAAAAAAAAAAAA
output
ID file1 file2 file3 file4
AAAAAAAAAAAAAAAAAA 2 10 30 0
AAAAAAAAAAAAAAAAAAA 3 11 0 33
AAAAAAAAAAAAAAAAAAAA 9 23 15 90
AAAAAAAAAAAAAAAAAAAAA 1 14 24 11
AAAAAAAAAAAAAAAAAAAAAA 10 22 42 8
AAAAAAAAAAAAAAAAAAAAAAA 3 30 29 2
Do you know a bit of Python or R? It would be pretty damn straightforward ;-) (Definitely in other languages too, but I can't help you with those.)
For example, in Python you would build one dictionary per file, adding each line with key = gene ID and value = count. After doing this for each file, you'd create the union of all keys, iterate over it, and write out either the value for each file or a 0 if the key is not present. In R you'd have a few data frames that you can join, filling in blanks/NAs with 0s.
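To make the dictionary idea concrete, here is a minimal sketch, not a finished script. It assumes each file is tab-separated with an integer count in the first column and the gene ID in the second, and that a missing count should become 0. The function names (read_counts, build_matrix, write_matrix) are made up for illustration.

```python
import csv
import os
import sys


def read_counts(path):
    """Return {gene_id: count} for one file with lines 'count<TAB>gene_id'."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank or malformed lines that have no gene ID column.
            if len(row) < 2 or not row[1].strip():
                continue
            value = row[0].strip()
            gene_id = row[1].strip()
            # Assumption: counts are integers; an empty count field becomes 0.
            counts[gene_id] = int(value) if value else 0
    return counts


def build_matrix(paths):
    """Combine the files into a header plus one row per gene ID."""
    per_file = [read_counts(p) for p in paths]
    # Union of all gene IDs seen in any file, sorted for a stable order.
    gene_ids = sorted(set().union(*per_file))
    header = ["ID"] + [os.path.basename(p) for p in paths]
    # A gene that is absent from a file gets 0 in that file's column.
    rows = [[g] + [c.get(g, 0) for c in per_file] for g in gene_ids]
    return header, rows


def write_matrix(header, rows, out=sys.stdout):
    """Write the matrix as tab-separated text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


if __name__ == "__main__":
    header, rows = build_matrix(sys.argv[1:])
    write_matrix(header, rows)
```

Saved under a hypothetical name such as merge_counts.py, it could be run as `python merge_counts.py file1 file2 file3 file4 > matrix.tsv`. Holding one dictionary per file in memory is fine for files of this size; for very large inputs a different strategy may be needed.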
So do you have some experience, and do you think you can figure this out? You can always ask for help if you're stuck. The reward is far higher if you solve this yourself, and you'll probably learn some interesting stuff. The best way of learning something is to suck at it repeatedly and get better through failure. As we say in our lab: success comes from going from failure to failure without losing enthusiasm. If you prefer a complete script I can write that, too, but I'm afraid not today or tomorrow.
This is a bit what I was thinking. If you can NOT do that on your own (= straightforward)... what do you want to do with your results? How do you want to analyse them? So even if Wouter would be so nice as to write the script for you, I think you should start trying to do this yourself and learn how it works, or you will end up asking somebody for another script right after Wouter posts his, and so on and so on... and that is not going to get you anywhere anytime soon. If you really want to work with this kind of data, learn at least the basics of how to handle it.