merge multiple files by commom column names
3
0
Entering edit mode
5.0 years ago
elhamidihay ▴ 30

I have multiple whitespace delimited files with a different number of columns. I would like to join them and keep the common column names.

test1_file

id SNP1 SNP2 SNP3
162 AA BB AA
172 AB BB AB

test2_file

id SNP1 SNP2 SNP3 SNP4
189 BB BB AA AB
1590 AA AB BB AA

test3_file

id SNP8 SNP2 SNP3 SNP9
182 AA BB AA AA
132 BB BB AA BB

desired output file:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA

my code: Its not working, and i would like to add more files than just two

import pandas as pd

test1_file = pd.read_csv('2014_07_03', delim_whitespace = True)
test2_file = pd.read_csv('2019_08_19', delim_whitespace = True)

f = pd.merge(left=test1_file, right=test2_file , how='left', on='MergeCol')
f.to_csv('outfile.csv')
pandas python • 3.2k views
ADD COMMENT
5
Entering edit mode
5.0 years ago

Hello,

you don't want to "merge" in the terms how pandas use it. You want to append the data to each other and drop columns that have no values. You can use this script, like this: python merge.py test1_file test2_file test3_file

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import sys

import pandas


def get_args():
    parser = argparse.ArgumentParser(prog="merge.py", description="")
    parser.add_argument("--version", action="version", version="%(prog)s 0.1")

    parser.add_argument("input", nargs="+", help="input files")
    parser.add_argument(
        "-o",
        "--output",
        type=argparse.FileType("w"),
        default=sys.stdout,
        help="output file",
    )

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(1)

    args = parser.parse_args()

    return args


def main():
    args = get_args()
    result = pandas.DataFrame()

    for input_file in args.input:
        in_df = pandas.read_csv(input_file, sep=" ", header=0, dtype=object)
        result = result.append(in_df)

    result = result.set_index("id").dropna(axis="columns")

    result.to_csv(args.output, sep=" ")


if __name__ == "__main__":
    main()
ADD COMMENT
0
Entering edit mode

Thank you very much for your help. i tested it with the sample data above but it doesn't seem to be working.

ADD REPLY
0
Entering edit mode

Hello,

please add more detail about what "doesn't seem to be working".

Thanks.

fin swimmer

ADD REPLY
0
Entering edit mode

this is what i get:

  FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  sort=sort,
id


BB

BB
BB
ADD REPLY
0
Entering edit mode

The warning is just a warning. You can ignore it for now or do what it suggest.

But there must be something strange with your input. I've tested it with the data you gave above and that's the output I get:

id SNP2 SNP3
162 BB AA
172 BB AB
189 BB AA
1590 AB BB
182 BB AA
132 BB AA
ADD REPLY
0
Entering edit mode

Should the input files be a certain format? I get a syntax error.

ADD REPLY
0
Entering edit mode

The columns must be separated by whitespaces and not tabs. That's the only requirement.

I get a syntax error.

When ever you receive any kind of error or something "doesn't work": Please give us the details!

fin swimmer

ADD REPLY
0
Entering edit mode

Thanks.The whitespace delimited fixed it.

ADD REPLY
0
Entering edit mode

I VERY much appreciate your help

ADD REPLY
3
Entering edit mode
5.0 years ago

A solution using parallel and csvtk:

# converting space-delimited to tab-delimited
l$ s *.txt | parallel 'csvtk space2tab {} > {}.tsv'

# retrieving colnames
$ ls *.txt.tsv | parallel 'csvtk headers -t {} | csvtk cut -t -f 2 > {}.colnames'

# common colnames
$ csvtk inter *.colnames
id
SNP3
SNP2

# cut by common colnames
$ ls *.txt.tsv | parallel 'csvtk cut -t -f $(csvtk inter *.colnames | paste -s -d ,) {} > {}.selected'

# concatenating
$ csvtk concat -t *.selected
id      SNP3    SNP2
162     AA      BB
172     AB      BB
189     AA      BB
1590    BB      AB
182     AA      BB
132     AA      BB
ADD COMMENT
0
Entering edit mode

thank you for your help. Is there way to not have it print to the screen. i'm working with very large files.

ADD REPLY
1
Entering edit mode

Learn some basic knowledge of SHELL: I/O Redirection

ADD REPLY
0
Entering edit mode

thanks. with real datasets. i get an error message saying:

csvtk: Argument list too long

ADD REPLY
2
Entering edit mode
5.0 years ago
h.mon 35k

Here is a solution with common unix tools.

Get number of files to be merged:

nfiles=$(ls -1 *_file | wc -l)

Get columns in common:

columns=$(head -q -n1 *_file \
  | tr " " "\n" \
  | sort \
  | uniq -c \
  | awk -v nf=$nfiles '$1 == nf {print $2}' \
  | tr "\n" " ")

Create awk script to print selected columns

cat << 'EOF' > t.awk
#https://unix.stackexchange.com/a/25144
BEGIN {
    split(cols,out," ")
}
NR==1 {
    for (i=1; i<=NF; i++)
        ix[$i] = i
}
NR>1 {
    for (i in out)
        printf "%s%s", $ix[out[i]], OFS
    print ""
}
EOF

Print header and the loop over files, printing selected columns

echo "$columns" >output.snps
for i in *_files
do
    awk -f t.awk -v cols="$columns" $i >>output.snps
done
ADD COMMENT
0
Entering edit mode

Hi,

Thanks for your help. it works with a small test data but with real datasets, it gives me an error in the last awk loop saying that

:"line 29: /bin/awk: Argument list too long"

ADD REPLY
0
Entering edit mode

What are the sizes (rows vs columns) of the real datasets?

ADD REPLY
0
Entering edit mode

the largest file has 2500 rows and 56000 columns

ADD REPLY

Login before adding your answer.

Traffic: 2836 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6