How to combine two BED files using the same ID information
4.0 years ago
szp770 ▴ 10

Hi, I have two BED files. Each has four columns, and the fourth column holds a unique ID for each row; each file has thousands of rows. I want to combine rows from the two files when they share the same ID. The final output should look like this:

chr1    10028   10029    chr14   68314662        68314663     J00118:253:HJ2FTBBXX:3:2213:8491:13394

bed 1:

chr1    10028   10029   J00118:253:HJ2FTBBXX:3:2213:8491:13394
...

bed 2:

chr14   68314662        68314663        J00118:253:HJ2FTBBXX:3:2213:8491:13394
...
bed linux shell python • 1.7k views
4.0 years ago

Save the text below as combine_bed_by_id.pl, then run: perl combine_bed_by_id.pl bed1.txt bed2.txt

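use strict;
use warnings;

my ($bed1, $bed2) = @ARGV;
die "Usage: perl $0 bed1.txt bed2.txt\n" unless defined $bed2;

my %bed1 = read_bed($bed1);
my %bed2 = read_bed($bed2);

# Print only the IDs that are present in both files.
for my $id (sort keys %bed1){
    print "$bed1{$id}\t$bed2{$id}\t$id\n" if exists $bed2{$id};
}

# Read a BED file into a hash: ID (last column) => remaining columns joined by tabs.
sub read_bed{
    my $bed = shift;
    open my $in, '<', $bed or die "Cannot open $bed: $!\n";
    my %f;
    while(<$in>){
        chomp;
        my @temp = split;
        next unless @temp;    # skip blank lines
        my $id = pop @temp;
        my $info = join "\t", @temp;
        $f{$id} = $info;
    }
    close $in;
    return %f;
}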
4.0 years ago
join -t $'\t' -1 4 -2 4 <(sort -t $'\t' -k4,4 file1.bed) <(sort -t $'\t' -k4,4 file2.bed)

That's really succinct!


... and the simplest I would go for. Considering that join output provides the ID in the first column, here's a minimum modification to exactly match the desired output:

join -t $'\t' -1 4 -2 4 <(sort -k4,4 file1.bed) <(sort -k4,4 file2.bed) | perl -pe 's/(\S+)\t(.+)/$2\t$1/'

"here's a minimum modification to exactly match the desired output:"

You can use join's output formatting option, -o FORMAT, to achieve the same result ;-)
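
As a rough sketch of that suggestion: the -o list below names each output field as FILE.FIELD, with 0 standing for the join field, so the ID lands in the last column without a Perl post-processing step. The two sample files and the read_A/read_B/read_C IDs are made up for illustration, and the inputs are sorted into temporary files (rather than with process substitution) only so the snippet also runs under plain sh.

```shell
#!/bin/sh
# Hypothetical sample files, created only for illustration.
TAB=$(printf '\t')
tmp=$(mktemp -d)
printf 'chr1\t10028\t10029\tread_A\nchr2\t500\t501\tread_B\n' > "$tmp/file1.bed"
printf 'chr14\t68314662\t68314663\tread_A\nchr3\t700\t701\tread_C\n' > "$tmp/file2.bed"

# join needs both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -t "$TAB" -k4,4 "$tmp/file1.bed" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$TAB" -k4,4 "$tmp/file2.bed" > "$tmp/file2.sorted"

# -o lists the output fields as FILE.FIELD; 0 is the join field (the ID).
result=$(LC_ALL=C join -t "$TAB" -1 4 -2 4 -o 1.1,1.2,1.3,2.1,2.2,2.3,0 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Only read_A is present in both sample files, so a single combined row is printed, with the ID in the seventh column.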


Good to know. Thank you Pierre.


Hey, what if I add a 5th column to each file and still want to join on the same 4th column value while keeping the 5th column information in the final result? Thanks!


Pierre's answer would still work. It'll output columns 4, 1-3 and 5 of the first file, plus 1-3 and 5 of the second file. As Pierre mentioned, you may modify the column layout using the -o option. Here's an example that may help you understand how join's output format works.
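
A minimal sketch of both layouts for 5-column input, assuming tab-separated files; the file contents and the score_1/score_2 values in the fifth column are invented for this example. The first call shows join's default layout, the second uses -o to put the coordinates and fifth column of each file first and the ID last.

```shell
#!/bin/sh
TAB=$(printf '\t')
tmp=$(mktemp -d)
# Hypothetical 5-column inputs; the fifth column values are invented for this sketch.
printf 'chr1\t10028\t10029\tread_A\tscore_1\n' > "$tmp/file1.bed"
printf 'chr14\t68314662\t68314663\tread_A\tscore_2\n' > "$tmp/file2.bed"

LC_ALL=C sort -t "$TAB" -k4,4 "$tmp/file1.bed" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$TAB" -k4,4 "$tmp/file2.bed" > "$tmp/file2.sorted"

# Default layout: the join field first, then the remaining fields of each file.
default_out=$(LC_ALL=C join -t "$TAB" -1 4 -2 4 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")

# Custom layout: columns 1-3 and 5 of each file, with the ID (field 0) last.
custom_out=$(LC_ALL=C join -t "$TAB" -1 4 -2 4 \
    -o 1.1,1.2,1.3,1.5,2.1,2.2,2.3,2.5,0 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")

printf 'default: %s\n' "$default_out"
printf 'custom:  %s\n' "$custom_out"
rm -rf "$tmp"
```

Any other ordering can be written the same way by rearranging the FILE.FIELD entries in the -o list.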

4.0 years ago
perl -lane '$d{$F[3]} .= "@F[0..2] "; END {
 foreach $i (keys %d) { print $d{$i}.$i }
}' file1.bed file2.bed | awk '$5~/./'
4.0 years ago

Here's one that might be a little simpler to follow:

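$ sort -s -k4,4 A.bed B.bed | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

Here's how it works:

sort -s -k4,4 A.bed B.bed - sort the concatenation of A and B by the fourth column; -s keeps the sort stable, so for each ID the row from A.bed stays ahead of the row from B.bed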

paste -d "\t" - - - take every two lines from the output of sort and join them by a tab character

cut -f1-3,5-8 - take columns 1 through 3 and 5 through 8 of the output from paste

sort-bed - - sort the output lexicographically for downstream set operations
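
The first three steps can be tried on a toy example. This is only a sketch: the two sample files and the id_1/id_2 IDs are made up, both files are assumed to contain exactly the same IDs, and the final sort-bed step is left out so that the snippet does not depend on BEDOPS being installed. It also passes -s (stable sort, available in GNU and BSD sort) so that rows with equal IDs keep their input order and the A.bed row comes first in every pair; without it, ties are broken by comparing whole lines.

```shell
#!/bin/sh
TAB=$(printf '\t')
tmp=$(mktemp -d)
# Hypothetical inputs that share exactly the same two IDs.
printf 'chr1\t10028\t10029\tid_1\nchrX\t100\t101\tid_2\n' > "$tmp/A.bed"
printf 'chr14\t68314662\t68314663\tid_1\nchr2\t200\t201\tid_2\n' > "$tmp/B.bed"

# 1. Stable sort of the concatenated files by column 4 (A row before B row).
# 2. paste - - glues every two consecutive lines together with a tab.
# 3. cut drops column 4, the first of the two copies of the ID.
paired=$(LC_ALL=C sort -s -t "$TAB" -k4,4 "$tmp/A.bed" "$tmp/B.bed" \
    | paste - - \
    | cut -f1-3,5-8)
printf '%s\n' "$paired"
rm -rf "$tmp"
```

Each output row holds the A.bed coordinates, the B.bed coordinates, and the shared ID, in that order.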


That's really helpful, Thanks so much!


Any pure sorting solution will only work if both files contain the same IDs and nothing else, because it pairs every two consecutive lines regardless of their IDs. The IDs present in both files should therefore be selected before using this code. Here's an example nesting two cut | grep steps, one to detect the shared IDs and the other to print only the lines containing those IDs:

cut -f4 A.bed | grep -F -f - B.bed | cut -f4 | grep -h -F -f - A.bed B.bed \
| sort -s -k4,4 | paste -d "\t" - - | cut -f1-3,5-8 | sort-bed - > answer.bed

... although this is basically what the join solution does.
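
One caveat with the grep-based selection is that grep -F matches substrings, so an ID that happens to be a prefix of another ID may be selected by mistake. As an alternative sketch (not the approach used above), a single awk pass can match the fourth column exactly and print only the IDs found in both files. The sample files and the read_1/read_12/read_2 IDs are hypothetical, and the output follows the row order of the second file.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical inputs: read_1 (A only) is a prefix of read_12 (B only); read_2 is shared.
printf 'chr1\t10\t11\tread_1\nchr2\t20\t21\tread_2\n' > "$tmp/A.bed"
printf 'chr5\t50\t51\tread_12\nchr6\t60\t61\tread_2\n' > "$tmp/B.bed"

joined=$(awk -F '\t' -v OFS='\t' '
    # First file: remember columns 1-3 under the exact ID.
    NR == FNR { first[$4] = $1 OFS $2 OFS $3; next }
    # Second file: print a combined row only when the ID was seen in the first file.
    $4 in first { print first[$4], $1, $2, $3, $4 }
' "$tmp/A.bed" "$tmp/B.bed")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

Because the first file is held in memory, this assumes it fits comfortably in RAM, which should be the case for files with thousands of rows.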
