compare many files based on ID
8.5 years ago
Chris ▴ 30

I have the following question. I have many files, each holding expression values for genes. There might be a missing value; in that case I want to skip the sequence.

I need to combine them into one matrix. The gene ID is in the second column of each file.

Example files are:

WT_condition1
2 AAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAA
9 AAAAAAAAAAAAAAAAAAAA
1 AAAAAAAAAAAAAAAAAAAAA
10 AAAAAAAAAAAAAAAAAAAAAA
3 AAAAAAAAAAAAAAAAAAAAAAA

WT_condition2
1 AAAAAAAAAAAAAAAAAA
4 AAAAAAAAAAAAAAAAAAA
10 AAAAAAAAAAAAAAAAAAAA
111 AAAAAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAAAAA
14 AAAAAAAAAAAAAAAAAAAAAAA

WT_condition3
12 AAAAAAAAAAAAAAAAAA
40 AAAAAAAAAAAAAAAAAAA
11 AAAAAAAAAAAAAAAAAAAA
32 AAAAAAAAAAAAAAAAAAAAA
2 AAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAA

(and so on for the remaining files)

I would highly appreciate your help.

RNA-Seq • 1.4k views

It's unclear to me which format you have and which format you want, could you explain again?


I am sending the format of the files and the desired output again. Columns are separated by tabs, and there are some missing values that should be replaced by 0. Thank you again for your help. The lines below are double-spaced because, for whatever reason, I cannot display them otherwise. In the actual files the lines are not double-spaced.

file1

2 AAAAAAAAAAAAAAAAAA

3 AAAAAAAAAAAAAAAAAAA

9 AAAAAAAAAAAAAAAAAAAA

1 AAAAAAAAAAAAAAAAAAAAA

10 AAAAAAAAAAAAAAAAAAAAAA

3 AAAAAAAAAAAAAAAAAAAAAAA

file2

10 AAAAAAAAAAAAAAAAAA

11 AAAAAAAAAAAAAAAAAAA

23 AAAAAAAAAAAAAAAAAAAA

14 AAAAAAAAAAAAAAAAAAAAA

22 AAAAAAAAAAAAAAAAAAAAAA

30 AAAAAAAAAAAAAAAAAAAAAAA

file3

30 AAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAA

15 AAAAAAAAAAAAAAAAAAAA

24 AAAAAAAAAAAAAAAAAAAAA

42 AAAAAAAAAAAAAAAAAAAAAA

29 AAAAAAAAAAAAAAAAAAAAAAA

file4

AAAAAAAAAAAAAAAAAA

33 AAAAAAAAAAAAAAAAAAA

90 AAAAAAAAAAAAAAAAAAAA

11 AAAAAAAAAAAAAAAAAAAAA

8 AAAAAAAAAAAAAAAAAAAAAA

2 AAAAAAAAAAAAAAAAAAAAAAA

output

ID file1 file2 file3 file4

AAAAAAAAAAAAAAAAAA 2 10 30 0

AAAAAAAAAAAAAAAAAAA 3 11 0 33

AAAAAAAAAAAAAAAAAAAA 9 23 15 90

AAAAAAAAAAAAAAAAAAAAA 1 14 24 11

AAAAAAAAAAAAAAAAAAAAAA 10 22 42 8

AAAAAAAAAAAAAAAAAAAAAAA 3 30 29 2


Do you know a bit of Python or R? This would be pretty damn straightforward ;-) (Definitely also in other languages, but I can't help you with those.)

For example, in Python you'd want one dictionary per file, to which you add each line with key = gene ID and value = count. After doing this for each file, you'd create the union of all keys, iterate over those, and write out either the value for each file or a 0 if it is not present. In R you'd have a few data frames that you can join, filling in blanks/NAs with 0s.
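To make the dictionary idea a bit more concrete, here is a minimal Python sketch of it, not a finished script. It assumes what was described above: tab-separated files with the count in the first column and the gene ID in the second, where a missing count means the count field is empty or absent. The function names (`read_counts`, `merge_counts`, `write_matrix`) are made up for illustration, and counts are kept as text so that non-integer expression values would pass through unchanged.

```python
import csv
from pathlib import Path


def read_counts(path):
    """Return {gene_id: count} for one file (assumed: count <TAB> gene ID)."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            gene_id = fields[-1].strip()
            if not gene_id:
                continue  # skip blank lines
            # A lone ID or an empty first field means the count is missing.
            value = fields[0].strip() if len(fields) > 1 else ""
            if value:
                counts[gene_id] = value
    return counts


def merge_counts(paths):
    """Build the header and rows of the combined matrix, one column per file."""
    per_file = [(Path(p).stem, read_counts(p)) for p in paths]
    # Union of all gene IDs seen with a count in at least one file.
    all_ids = sorted(set().union(*(counts for _, counts in per_file)))
    header = ["ID"] + [name for name, _ in per_file]
    rows = [
        [gene_id] + [counts.get(gene_id, "0") for _, counts in per_file]
        for gene_id in all_ids
    ]
    return header, rows


def write_matrix(paths, out):
    """Write the merged matrix as tab-separated text to an open file object."""
    header, rows = merge_counts(paths)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
```

You could call it as `write_matrix(["file1.txt", "file2.txt"], sys.stdout)` after `import sys` (the file names here are placeholders). Note that a gene whose count is missing in every file never makes it into the union, so it would not appear in the output at all.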

So do you have some experience and think you can figure this out? You can always ask for help if you're stuck. The reward is far higher if you solve this yourself, and you'll probably learn some interesting stuff. The best way of learning something is to suck at it repeatedly and get better through failure. As we say in our lab: success comes from going from failure to failure without losing enthusiasm. If you prefer a complete script I can write that, too, but I'm afraid not today or tomorrow.


This is roughly what I was thinking. If you can NOT do that on your own (= straightforward)... what do you want to do with your results? How do you want to analyse them? So even if Wouter would be so nice as to write the script for you, I think you should start trying to do that on your own and learn how to do this by yourself, or you will end up asking somebody for another script right after Wouter posts his, and so on and so on... and this is not going to get you anywhere anytime soon. If you really want to work with this kind of data, learn at least the basics of how to handle it.
