Question

merge two files

0

Entering edit mode

11.0 years ago

biolab ★ 1.4k

Dear all,

I want to merge two data files. The first contains three columns($a, $c, $d), while the second contains two columns ($b, $c). I need to generate a file having four columns ($a, $b, $c, $d). I have problem with my perl script shown below. Could anyone help? Or there should be easier way. Thank you very much!

data1.txt

mir-a        gene1        33
mir-a        gene2        34
mir-a        gene3        89
mir-b        gene1        09
mir-b        gene3        33
mir-c        gene1        86
mir-c        gene2        20

data2.txt

group1        gene1
group1        gene3
group2        gene1
group3        gene1
group3        gene2

merged result should be：

mir-a        group1        gene1        33
mir-a        group2        gene1        33
mir-a        group3        gene1        33
mir-a        group3        gene2        34
mir-a        group1        gene3        89
mir-b        group1        gene1        09
mir-b        group2        gene1        09
mir-b        group3        gene1        09
mir-b        group1        gene3        33
mir-c        group1        gene1        86
mir-c        group2        gene1        86
mir-c        group3        gene1        86
mir-c        group3        gene2        20

my problematic perl script

#!/usr/bin/perl -w
use strict;

my $file1 = shift;
my $file2 = shift;

my (%hash1, %hash2, %hash_merge);

#read dat1.txt
open F1, $file1;
while(my $line1 = <F1>){
    chomp $line1;
    my ($a, $c, $d) = split/\s+/,$line1;
    $hash1{$a}{$c}{$d} = 1;
}
close F1;

#read dat2.txt
open F2, $file2;
while(my $line2 = <F2>){
    chomp $line2;
    my ($b, $c) = split/\s+/,$line2;
    $hash2{$b}{$c} = 1;
}
close F2;

#merge hash1 and hash2
foreach my $a (keys %hash1){
    foreach my $c (keys %{$hash1{$a}}{

        foreach my $b (keys %hash2){ #find column $b in dat2.txt
            next unless exists $hash2{$b}{$c};

            foreach my $d (keys %{$hash1{$a}{$c}}){
                $hash_merge{$a}{$b}{$c}{$d} = 1;
            }

        }

    }
}

#print out hash_merge
foreach my $a (keys %hash_merge){
    foreach my $b (keys %{$hash_merge{$a}}{
        foreach my $c (keys %{$hash_merge{$a}{$b}}{
            foreach my $d (keys %{$hash_merge{$a}{$b}{$c}}{
                print "$a\t$b\t$c\t$d\n";
            }
        }
    }
}

perl • 2.9k views

ADD COMMENT • link updated 3.7 years ago by Ram 45k • written 11.0 years ago by biolab ★ 1.4k

Ram · Answer 1 · 2014-08-21

If you are not settled with perl and you favor a simple solution, I would go for R:

d1<- read.table('data1.txt')
d2<- read.table('data2.txt')

merged<- merge(d1, d2, by= c('V2'))[, c(2, 1, 3, 4)]
write.table(merged, 'merged.txt', col.names= FALSE, row.names= FALSE, sep= '\t', quote= FALSE)

Merged file is going to be merged.txt:

mir-a    gene1    33    group1
mir-a    gene1    33    group2
mir-a    gene1    33    group3
mir-b    gene1    9    group1
mir-b    gene1    9    group2
mir-b    gene1    9    group3
mir-c    gene1    86    group1
mir-c    gene1    86    group2
mir-c    gene1    86    group3
mir-a    gene2    34    group3
mir-c    gene2    20    group3
mir-a    gene3    89    group1
mir-b    gene3    33    group1

This assuming RAM is not a limiting factor as R will load everything in memory. If speed as an issue, I would use the data.table package.

Even less verbose is unix join command:

join -1 2 -2 2 <(sort -k2,2 data1.txt) <(sort -k2,2 data2.txt)

But it's not as flexible as the R solution.

Ram · Answer 2 · 2014-08-21

Untested, but I think the following should work, by simplifying the keys used in your hash tables:

#!/usr/bin/perl

use strict;
use warnings;

my $fn1 = $ARGV[0];
my $fn2 = $ARGV[1];

# we assume that this character can always delimit mir-X and geneY names
my $delimiter = "%"; 

my $mirgenes;
open my $fh1, "<", $fn1 or die "could not open $fn1\n";
while (<$fh1>) {
    chomp;
    my ($mir, $gene, $count) = split("\t", $_);
    my $mirgene = $mir.$delimiter.$gene;
    $mirgenes->{$mirgene} = $count;
}
close $fh1;

my $groups;
open my $fh2, "<", $fn2 or die "could not open $fn2\n";
while (<$fh2>) {
    chomp;
    my ($group, $gene) = split("\t", $_);
    if (! defined $groups->{$gene}) {
        @{$groups->{$gene}} = ();
    }
    push(@{$groups->{$gene}}, $group);
}
close $fh2;

foreach my $mirgene (keys %{$mirgenes}) {
    my ($mir, $gene) = split($delimiter, $mirgene);
    my $count = $mirgenes->{$mirgene};
    foreach my $group (@{$groups->{$gene}}) {
        print STDOUT "$mir\t$group\t$gene\t$count\n";
    }
}

To use, perhaps:

$ ./mirgenegroup.pl data1.txt data2.txt | sort -k1,3 > answer.txt