merge two files
2
0
Entering edit mode
10.2 years ago
biolab ★ 1.4k

Dear all,

I want to merge two data files. The first contains three columns($a, $c, $d), while the second contains two columns ($b, $c). I need to generate a file having four columns ($a, $b, $c, $d). I have problem with my perl script shown below. Could anyone help? Or there should be easier way. Thank you very much!

data1.txt

mir-a        gene1        33
mir-a        gene2        34
mir-a        gene3        89
mir-b        gene1        09
mir-b        gene3        33
mir-c        gene1        86
mir-c        gene2        20

data2.txt

group1        gene1
group1        gene3
group2        gene1
group3        gene1
group3        gene2

merged result should be:

mir-a        group1        gene1        33
mir-a        group2        gene1        33
mir-a        group3        gene1        33
mir-a        group3        gene2        34
mir-a        group1        gene3        89
mir-b        group1        gene1        09
mir-b        group2        gene1        09
mir-b        group3        gene1        09
mir-b        group1        gene3        33
mir-c        group1        gene1        86
mir-c        group2        gene1        86
mir-c        group3        gene1        86
mir-c        group3        gene2        20

my problematic perl script

#!/usr/bin/perl -w
use strict;

my $file1 = shift;
my $file2 = shift;

my (%hash1, %hash2, %hash_merge);

#read dat1.txt
open F1, $file1;
while(my $line1 = <F1>){
    chomp $line1;
    my ($a, $c, $d) = split/\s+/,$line1;
    $hash1{$a}{$c}{$d} = 1;
}
close F1;

#read dat2.txt
open F2, $file2;
while(my $line2 = <F2>){
    chomp $line2;
    my ($b, $c) = split/\s+/,$line2;
    $hash2{$b}{$c} = 1;
}
close F2;

#merge hash1 and hash2
foreach my $a (keys %hash1){
    foreach my $c (keys %{$hash1{$a}}{

        foreach my $b (keys %hash2){ #find column $b in dat2.txt
            next unless exists $hash2{$b}{$c};

            foreach my $d (keys %{$hash1{$a}{$c}}){
                $hash_merge{$a}{$b}{$c}{$d} = 1;
            }

        }

    }
}

#print out hash_merge
foreach my $a (keys %hash_merge){
    foreach my $b (keys %{$hash_merge{$a}}{
        foreach my $c (keys %{$hash_merge{$a}{$b}}{
            foreach my $d (keys %{$hash_merge{$a}{$b}{$c}}{
                print "$a\t$b\t$c\t$d\n";
            }
        }
    }
}
perl • 2.7k views
ADD COMMENT
3
Entering edit mode
10.2 years ago

If you are not settled with perl and you favor a simple solution, I would go for R:

d1<- read.table('data1.txt')
d2<- read.table('data2.txt')

merged<- merge(d1, d2, by= c('V2'))[, c(2, 1, 3, 4)]
write.table(merged, 'merged.txt', col.names= FALSE, row.names= FALSE, sep= '\t', quote= FALSE)

Merged file is going to be merged.txt:

mir-a    gene1    33    group1
mir-a    gene1    33    group2
mir-a    gene1    33    group3
mir-b    gene1    9    group1
mir-b    gene1    9    group2
mir-b    gene1    9    group3
mir-c    gene1    86    group1
mir-c    gene1    86    group2
mir-c    gene1    86    group3
mir-a    gene2    34    group3
mir-c    gene2    20    group3
mir-a    gene3    89    group1
mir-b    gene3    33    group1

This assuming RAM is not a limiting factor as R will load everything in memory. If speed as an issue, I would use the data.table package.

Even less verbose is unix join command:

join -1 2 -2 2 <(sort -k2,2 data1.txt) <(sort -k2,2 data2.txt)

But it's not as flexible as the R solution.

ADD COMMENT
0
Entering edit mode

Alex and dariober, thanks a lot for your inputs. Really helpful!

ADD REPLY
2
Entering edit mode
10.2 years ago

Untested, but I think the following should work, by simplifying the keys used in your hash tables:

#!/usr/bin/perl

use strict;
use warnings;

my $fn1 = $ARGV[0];
my $fn2 = $ARGV[1];

# we assume that this character can always delimit mir-X and geneY names
my $delimiter = "%"; 

my $mirgenes;
open my $fh1, "<", $fn1 or die "could not open $fn1\n";
while (<$fh1>) {
    chomp;
    my ($mir, $gene, $count) = split("\t", $_);
    my $mirgene = $mir.$delimiter.$gene;
    $mirgenes->{$mirgene} = $count;
}
close $fh1;

my $groups;
open my $fh2, "<", $fn2 or die "could not open $fn2\n";
while (<$fh2>) {
    chomp;
    my ($group, $gene) = split("\t", $_);
    if (! defined $groups->{$gene}) {
        @{$groups->{$gene}} = ();
    }
    push(@{$groups->{$gene}}, $group);
}
close $fh2;

foreach my $mirgene (keys %{$mirgenes}) {
    my ($mir, $gene) = split($delimiter, $mirgene);
    my $count = $mirgenes->{$mirgene};
    foreach my $group (@{$groups->{$gene}}) {
        print STDOUT "$mir\t$group\t$gene\t$count\n";
    }
}

To use, perhaps:

$ ./mirgenegroup.pl data1.txt data2.txt | sort -k1,3 > answer.txt
ADD COMMENT

Login before adding your answer.

Traffic: 1003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6