I would like to merge two files by based on one column (rs number). Not all rs numbers in file 1 are in file 2, and not all in file 2 are in file 1. However, I wanna keep ALL lines from file 1 in the output.
File 1 (example):
rs chr bp p
rs1 chr1 729679 0.9456
rs2 chr1 731718 0.7125
rs3 chr1 734349 0.7582
rs4 chr1 736289 0.7807
rs5 chr1 748878 0.3864
File 2 (example):
RSID CODING UTR PROMOTER DHS INTRON
rs2 0 0 0 0 1
rs3 0 0 0 0 1
rs1 0 0 1 1 0
rs4 0 0 0 1 0
rs6 0 1 0 1 0
In the case that an rs number is not found in file 2, I want to give this rs number the annotation: 0 0 0 0 0. so, hypothetically... based on my examples, this should be my output
Output (example):
rs chr bp p CODING UTR PROMOTER DHS INTRON
rs1 chr1 729679 0.9456 0 0 1 1 0
rs2 chr1 731718 0.7125 0 0 0 0 1
rs3 chr1 734349 0.7582 0 0 0 0 1
rs4 chr1 736289 0.7807 0 0 0 1 0
rs5 chr1 748878 0.3864 0 0 0 0 0
Merging files is fairly simple with a function like "join" but I have no idea how to keep the lines in file 1 without a column match with file 2, and in such a case, add the 0 0 0 0 0 annotation.
I really hope my question is clear, and any help is appreciated.
I stumbled over a join command without the need of an extra awk call:
It was said that the
-e
(Empty) option needs-o
(output sequence). The 0 in-o
is the joining field, followed by the file number and its column. Thus if file number 2 has a missing entry, the-e
argument is put on each field defined in-o 2.X
.cool, I didn't know this option
-o