Hello,
Here is an awk command that will do it for you per chromosome, tinkuhim007:
cat test
Chr1 A 10
Chr1 B 13
Chr1 C 12
Chr2 D 12
Chr2 E 14
Chr2 F 11
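awk '{arr[$1","$2]=$3}   # key = "chromosome,name", value = column 3
END {
  for (char1 in arr) {
    for (char2 in arr) {
      split(char1, charArr1, ",")
      split(char2, charArr2, ",")
      # keep pairs of different entries on the same chromosome
      if ((char1 != char2) && (charArr1[1] == charArr2[1])) {
        print charArr1[1]"\t"charArr1[2]"\t"charArr2[2]"\t"(arr[char2]-arr[char1])
      }
    }
  }
}' test | sort -k1,1 -k2,2 -k3,3   # for-in order is unspecified in awk, so sort for stable output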
Chr1 A B 3
Chr1 A C 2
Chr1 B A -3
Chr1 B C -1
Chr1 C A -2
Chr1 C B 1
Chr2 D E 2
Chr2 D F -1
Chr2 E D -2
Chr2 E F -3
Chr2 F D 1
Chr2 F E 3
You didn't indicate a rule for the order of subtraction. If you don't want negative values anywhere in the output, you can just add an extra if statement (with result > 0, pairs whose difference is exactly 0 are dropped as well):
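awk '{arr[$1","$2]=$3}   # key = "chromosome,name", value = column 3
END {
  for (char1 in arr) {
    for (char2 in arr) {
      split(char1, charArr1, ",")
      split(char2, charArr2, ",")
      if ((char1 != char2) && (charArr1[1] == charArr2[1])) {
        result = arr[char2]-arr[char1]
        # print only the direction that gives a positive difference
        if (result > 0) {
          print charArr1[1]"\t"charArr1[2]"\t"charArr2[2]"\t"result
        }
      }
    }
  }
}' test | sort -k1,1 -k2,2 -k3,3   # for-in order is unspecified in awk, so sort for stable output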
Chr1 A B 3
Chr1 A C 2
Chr1 C B 1
Chr2 D E 2
Chr2 F D 1
Chr2 F E 3
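If the sign does not matter and each pair should be listed only once, one possible variant is sketched below. It is not one of the commands above: it keeps the rows in input order instead of looping with for-in, so the output order is predictable, and it prints the absolute difference. The names data, pairs, chr, gene and val are made up for this sketch, and the rows are the same as in the test file.

```shell
# Same rows as the "test" file above, held in a variable so the sketch is self-contained.
data='Chr1 A 10
Chr1 B 13
Chr1 C 12
Chr2 D 12
Chr2 E 14
Chr2 F 11'

pairs=$(printf '%s\n' "$data" | awk '
{ chr[NR] = $1; gene[NR] = $2; val[NR] = $3 }    # remember every row by line number
END {
    for (i = 1; i <= NR; i++) {
        for (j = i + 1; j <= NR; j++) {          # j > i, so each pair is visited once
            if (chr[i] == chr[j]) {              # same chromosome only
                d = val[j] - val[i]
                if (d < 0) d = -d                # absolute difference
                print chr[i] "\t" gene[i] "\t" gene[j] "\t" d
            }
        }
    }
}')

printf '%s\n' "$pairs"
```

For the six example rows this gives three lines per chromosome (A B, A C, B C and D E, D F, E F) rather than six.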
If you want to understand how this is working, then let me know.
Kevin
Your output assumes each member of column 2 is unique to the chromosome. I'm assuming column 2 has genes and column 3 has some sort of coordinates and you're looking to list intergenic distances of some sort?
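The uniqueness assumption matters because arr[$1","$2]=$3 silently overwrites an earlier row that has the same chromosome and name. A quick way to check for that before computing differences might look like the sketch below. The sample rows and the names data, dups and seen are hypothetical, chosen only to show a repeated entry.

```shell
# Hypothetical input in which "A" appears twice on Chr1.
data='Chr1 A 10
Chr1 B 13
Chr1 A 12
Chr2 A 11'

# Report each chromosome/name combination that occurs more than once.
# seen[key]++ returns the old count, so the test is true on the second occurrence only.
dups=$(printf '%s\n' "$data" | awk 'seen[$1 "," $2]++ == 1 { print $1 "\t" $2 }')

printf '%s\n' "$dups"
```

An empty result would mean column 2 is unique within each chromosome. The same name on a different chromosome (Chr2 A here) is not reported.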
Be careful what you wish for, because you may get it: the number of differences you want is given by the formula
C(n, k) = n! / (k! (n - k)!)
where n is the number of elements (A, B, C, ...) per chromosome and k = 2. For 10 elements on a chromosome you get 45 differences, for 100 elements per chromosome you get 4950 differences, and so on. My cell phone calculator cannot display the result for 1000 elements, as the number is already too big for it.
It's 499500 combinations for 1000 elements [C(n,2)]. I think we have ourselves an XY problem here.
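For k = 2 the count C(n,2) reduces to n(n - 1)/2, so no factorials have to be evaluated, which is probably where the calculator gave up. A small sketch to reproduce the numbers quoted above (the variable names counts and n are just for illustration):

```shell
# Number of unordered pairs, C(n,2) = n(n-1)/2, for the element counts mentioned above.
counts=$(awk 'BEGIN {
    split("10 100 1000", n, " ")
    for (i = 1; i <= 3; i++)
        printf "%d\t%d\n", n[i], n[i] * (n[i] - 1) / 2
}')

printf '%s\n' "$counts"
```

This prints 45, 4950 and 499500 next to the respective n. If both directions of subtraction are printed, as in the first awk command, the number of output lines per chromosome is twice that, n(n - 1).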