Question

How to join two txt files in unix

0

4.8 years ago

xiaoyonf ▴ 60

Hi, I have 1000 txt files, each with two columns: a gene symbol column and a mutation status column. I want to join all of these files into one file that contains the gene symbol column first, followed by 1000 sample columns of mutation status. For example, I want to join the following input files:

txt file 1:

Gene Sample
A        ID1
B        ID1
D        ID1

txt file 2:

Gene Sample
B         ID2
C         ID2
E         ID2

txt file 3, ... txt file 1000

into the output file

Gene      ID1      ID2    ID3 ... ID1000
A          yes      NA      ...
B          yes      yes     ...
C          NA       yes     ...
D          yes      NA      ...
E          NA       yes     ...
...

I know the full_join solution in R using the dplyr package, but it needs to read all the files into R. Does anyone have a simple solution in Unix to do this?

Thanks a lot! Xiaoyong

gene snp R genome • 1.1k views

score 3 · Answer 1 · 2020-10-11

4.8 years ago

Pierre Lindenbaum 166k

Convert your files to a long format with one GENE/SAMPLE/VALUE record per line:

 awk '($1=="Gene"){SN=$2;next;} {printf("%s\t%s\t%s\n",$1,SN,$2);}'  input*

and pipe the output into datamash groupby.
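As a rough illustration of the two steps, here is a minimal sketch adapted to the modified layout in the question, where the header is always `Gene Sample` and the sample ID sits in the second column of each data row. It makes several assumptions that are not in the original answer: the value is the constant `yes`, the file names (`input1.txt`, `long.tsv`, `merged.tsv`) and the scratch directory are hypothetical, and the pivot step is done in plain awk instead of datamash so that the example has no extra dependency. GNU datamash also offers a `crosstab` operation that may be a closer fit than `groupby` for building this kind of wide table.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory and file names, used only for this demo.
workdir=$(mktemp -d)

# Two small inputs mirroring "txt file 1" and "txt file 2" from the question.
printf 'Gene\tSample\nA\tID1\nB\tID1\nD\tID1\n' > "$workdir/input1.txt"
printf 'Gene\tSample\nB\tID2\nC\tID2\nE\tID2\n' > "$workdir/input2.txt"

# Step 1: long format GENE/SAMPLE/VALUE.
# Header lines (first field "Gene") and lines with fewer than two fields are skipped.
# Assumption: a gene listed in a file counts as "yes" for that sample.
awk 'BEGIN { OFS = "\t" }
     NF >= 2 && $1 != "Gene" { print $1, $2, "yes" }' \
    "$workdir"/input*.txt > "$workdir/long.tsv"

# Step 2: pivot the long table into a wide one.
# Genes and samples are remembered in order of first appearance;
# any gene/sample pair that was never seen is filled with NA.
awk -F'\t' 'BEGIN { OFS = "\t" }
{
    if (!($1 in gseen)) { gseen[$1] = 1; genes[++ng] = $1 }
    if (!($2 in sseen)) { sseen[$2] = 1; samples[++ns] = $2 }
    val[$1 SUBSEP $2] = $3
}
END {
    line = "Gene"
    for (j = 1; j <= ns; j++) line = line OFS samples[j]
    print line
    for (i = 1; i <= ng; i++) {
        line = genes[i]
        for (j = 1; j <= ns; j++) {
            k = genes[i] SUBSEP samples[j]
            line = line OFS ((k in val) ? val[k] : "NA")
        }
        print line
    }
}' "$workdir/long.tsv" > "$workdir/wide_unsorted.tsv"

# Step 3: keep the header line first and sort the remaining rows by gene symbol.
{
    head -n 1 "$workdir/wide_unsorted.tsv"
    tail -n +2 "$workdir/wide_unsorted.tsv" | LC_ALL=C sort -k1,1
} > "$workdir/merged.tsv"

cat "$workdir/merged.tsv"
```

For the two demo files this should print a tab-separated table with the header `Gene ID1 ID2` and one row per gene from A to E, with `NA` wherever a gene is absent from a sample. The pivot step holds the whole table in memory, which is usually acceptable for gene-by-sample matrices of this size, but that is worth checking against your own data.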

0

Thank you so much, Pierre. I would appreciate it if you could explain the code in more detail, and how to pipe its output into datamash. It would be very helpful for me. Thanks.

4.8 years ago by xiaoyonf ▴ 60

0

Hi Pierre,

I really appreciate your response. I have modified my question to make it more precise. I haven't tried your solution yet, but I am afraid that it may need to be modified too. Thanks!

4.8 years ago by xiaoyonf ▴ 60