Question

Merging columns from multiple files and adding headers- Linux

0

Entering edit mode

3.8 years ago

tenten • 0

I have over 50 files that have the following format:

File1.tab

A 25
B 56
C 87
D 82
E 91
F 34

File2.tab

A 89
B 55
C 92
D 36
E 28
F 19

File3.tab

A 98
B 76
C 82
D 76
E 49
F 42

etc.

I would like the following:

The columns merged together based on the first column.
each column to have the file name as a header.
The values separated by a tab not space so when I open the merged file in excel it will have multiple columns (i.e column1= "row1 is empty, row2 is A and so on"; column2= "row1 is File1.tab, row2 is 25 and so on"; column3= "row1 is File2.tab, row2 is 89 and so on"; etc).

final.tab

  File1.tab File2.tab File3.tab
A    25        89        98
B    56        55        76
C    87        92        82
D    82        36        76
E    91        28        49
F    34        19        42

I tried searching for answers and followed the following command: from https://unix.stackexchange.com/questions/638521/merge-multiple-files-by-first-column

Print with headers: for the multiple files version:

$ hdr() { awk 'FNR==1{ print "\0", FILENAME }1' "$1"; }
$ join -a1 -a2 -e 1 -o auto <(hdr file1) <(hdr file2) >join.tmp
$ for file in rest_files*; do
     join -a1 -a2 -e 1 -o auto join.tmp <(hdr "$file") >join.tmp.1
     mv join.tmp.1 join.tmp
  done
$ tr -d '\0' <join.tmp >final.file

When I opened the output in excel, I had the following problems:

The headers were assigned twice in two rows
Columns from the first and second files were added twice and then the files was added again at the end.
All the values and headers were separated by a space and just assigned to one column in excel.

Here is what it looked like:

final.tab

 File1.tab File2.tab File3.tab join.temp
 File1.tab File2.tab File3.tab File1.tab
A 25 89 25 89 98 25
B 56 55 56 55 76 56
C 87 92 87 92 82 87
D 82 36 82 36 76 82
E 91 28 91 28 49 91
F 34 19 34 19 42 34

Can you fix the current code or provide me with a new one that will do I need?

linux merge join • 4.0k views

ADD COMMENT • link updated 3.8 years ago by bioinformatics2020 ▴ 830 • written 3.8 years ago by tenten • 0

score 2 · Answer 1 · 2021-09-01

Try csvtk join:

# add header
for f in *.tab; do
    csvtk add-header -t -n ,$f $f -o $f.tsv;
done

# join
$ csvtk join -t *.tsv
        File1.tab       File2.tab       File3.tab
A       25      89      98
B       56      55      76
C       87      92      82
D       82      36      76
E       91      28      49
F       34      19      42

# to markdown
$ csvtk join -t *.tsv | csvtk csv2md -t

	File1.tab	File2.tab	File3.tab
A	25	89	98
B	56	55	76
C	87	92	82
D	82	36	76
E	91	28	49
F	34	19	42

score 0 · Answer 2 · 2021-09-02

If you don't mind using R (or if you have it installed on your server.) Save the following code in the directory that has all of your .tab files and make sure you give it the .R file extension (my_file.R)

file_names <- list.files(pattern = "\\.tab")
files <- lapply(file_names, function(x) read.table(x, header = FALSE))
files <- suppressWarnings(Reduce(function(x, y) merge(x, y, by = "V1"), files))
colnames(files) <- c("Merge_Col", file_names)

Then, use the Rscript command to execute the file in the terminal using whatever you saved the file as.

Rscript my_file.R