Question

merge multiple files by commom column names

0

Entering edit mode

5.5 years ago

elhamidihay ▴ 30

I have multiple whitespace delimited files with a different number of columns. I would like to join them and keep the common column names.

test1_file

id SNP1 SNP2 SNP3
162 AA BB AA
172 AB BB AB

test2_file

id SNP1 SNP2 SNP3 SNP4
189 BB BB AA AB
1590 AA AB BB AA

test3_file

id SNP8 SNP2 SNP3 SNP9
182 AA BB AA AA
132 BB BB AA BB

desired output file:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA

my code: Its not working, and i would like to add more files than just two

import pandas as pd

test1_file = pd.read_csv('2014_07_03', delim_whitespace = True)
test2_file = pd.read_csv('2019_08_19', delim_whitespace = True)

f = pd.merge(left=test1_file, right=test2_file , how='left', on='MergeCol')
f.to_csv('outfile.csv')

pandas python • 3.9k views

ADD COMMENT • link updated 5.5 years ago by finswimmer 16k • written 5.5 years ago by elhamidihay ▴ 30

3

Entering edit mode

5.5 years ago

shenwei356 8.7k

A solution using parallel and csvtk:

# converting space-delimited to tab-delimited
l$ s *.txt | parallel 'csvtk space2tab {} > {}.tsv'

# retrieving colnames
$ ls *.txt.tsv | parallel 'csvtk headers -t {} | csvtk cut -t -f 2 > {}.colnames'

# common colnames
$ csvtk inter *.colnames
id
SNP3
SNP2

# cut by common colnames
$ ls *.txt.tsv | parallel 'csvtk cut -t -f $(csvtk inter *.colnames | paste -s -d ,) {} > {}.selected'

# concatenating
$ csvtk concat -t *.selected
id      SNP3    SNP2
162     AA      BB
172     AB      BB
189     AA      BB
1590    BB      AB
182     AA      BB
132     AA      BB

ADD COMMENT • link 5.5 years ago by shenwei356 8.7k

0

Entering edit mode

thank you for your help. Is there way to not have it print to the screen. i'm working with very large files.

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

1

Entering edit mode

Learn some basic knowledge of SHELL: I/O Redirection

ADD REPLY • link 5.5 years ago by shenwei356 8.7k

0

Entering edit mode

thanks. with real datasets. i get an error message saying:

csvtk: Argument list too long

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

2

Entering edit mode

5.5 years ago

h.mon 35k

Here is a solution with common unix tools.

Get number of files to be merged:

nfiles=$(ls -1 *_file | wc -l)

Get columns in common:

columns=$(head -q -n1 *_file \
  | tr " " "\n" \
  | sort \
  | uniq -c \
  | awk -v nf=$nfiles '$1 == nf {print $2}' \
  | tr "\n" " ")

Create awk script to print selected columns

cat << 'EOF' > t.awk
#https://unix.stackexchange.com/a/25144
BEGIN {
    split(cols,out," ")
}
NR==1 {
    for (i=1; i<=NF; i++)
        ix[$i] = i
}
NR>1 {
    for (i in out)
        printf "%s%s", $ix[out[i]], OFS
    print ""
}
EOF

Print header and the loop over files, printing selected columns

echo "$columns" >output.snps
for i in *_files
do
    awk -f t.awk -v cols="$columns" $i >>output.snps
done

ADD COMMENT • link 5.5 years ago by h.mon 35k

0

Entering edit mode

Hi,

Thanks for your help. it works with a small test data but with real datasets, it gives me an error in the last awk loop saying that

:"line 29: /bin/awk: Argument list too long"

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

0

Entering edit mode

What are the sizes (rows vs columns) of the real datasets?

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

the largest file has 2500 rows and 56000 columns

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

finswimmer · Accepted Answer · 2019-11-17

5

Entering edit mode

5.5 years ago

finswimmer 16k

Hello,

you don't want to "merge" in the terms how pandas use it. You want to append the data to each other and drop columns that have no values. You can use this script, like this: python merge.py test1_file test2_file test3_file

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import sys

import pandas


def get_args():
    parser = argparse.ArgumentParser(prog="merge.py", description="")
    parser.add_argument("--version", action="version", version="%(prog)s 0.1")

    parser.add_argument("input", nargs="+", help="input files")
    parser.add_argument(
        "-o",
        "--output",
        type=argparse.FileType("w"),
        default=sys.stdout,
        help="output file",
    )

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(1)

    args = parser.parse_args()

    return args


def main():
    args = get_args()
    result = pandas.DataFrame()

    for input_file in args.input:
        in_df = pandas.read_csv(input_file, sep=" ", header=0, dtype=object)
        result = result.append(in_df)

    result = result.set_index("id").dropna(axis="columns")

    result.to_csv(args.output, sep=" ")


if __name__ == "__main__":
    main()

ADD COMMENT • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

Thank you very much for your help. i tested it with the sample data above but it doesn't seem to be working.

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

0

Entering edit mode

Hello,

please add more detail about what "doesn't seem to be working".

Thanks.

fin swimmer

ADD REPLY • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

this is what i get:

  FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  sort=sort,
id


BB

BB
BB

ADD REPLY • link updated 5.5 years ago by finswimmer 16k • written 5.5 years ago by elhamidihay ▴ 30

0

Entering edit mode

The warning is just a warning. You can ignore it for now or do what it suggest.

But there must be something strange with your input. I've tested it with the data you gave above and that's the output I get:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA

ADD REPLY • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

Should the input files be a certain format? I get a syntax error.

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

0

Entering edit mode

The columns must be separated by whitespaces and not tabs. That's the only requirement.

I get a syntax error.

When ever you receive any kind of error or something "doesn't work": Please give us the details!

fin swimmer

ADD REPLY • link 5.5 years ago by finswimmer 16k

0

Entering edit mode

Thanks.The whitespace delimited fixed it.

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30

0

Entering edit mode

I VERY much appreciate your help

ADD REPLY • link 5.5 years ago by elhamidihay ▴ 30