Question

Bedtools merge only part of feature that overlaps, not whole feature

7.3 years ago

Niek De Klein ★ 2.6k

I want to merge the genes in a BED file so that only the parts of the genes that overlap become new features. For example, I have this file:

1    pseudogene      gene    11869   14412   .       +       .       gene_id "ENSG00000223972";
1    pseudogene      gene    14363   29806   .       -       .       gene_id "ENSG00000227232";
1    lincRNA         gene    29554   31109   .       +       .       gene_id "ENSG00000243485";

which I can merge with

bedtools merge -c 1,2,3,4,5,6,7,8,9 -d -1 -o distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct

which gives me

1   11868   31109   1   lincRNA,pseudogene  gene    11869,14363,29554   14412,29806,31109   .   +,- .   gene_id "ENSG00000223972";, gene_id "ENSG00000227232",gene_id "ENSG00000243485";

But what I would like to have is something like this

1    pseudogene      gene    11869   14362    .       +       .       gene_id "ENSG00000223972";
1    pseudogene      gene    14363   14412    .       +,-       .       gene_id "ENSG00000223972";,gene_id "ENSG00000227232";
1    pseudogene      gene    14413   29553   .       -       .       gene_id "ENSG00000227232";
1    lincRNA,pseudogene         gene    29554   29806   .       +,-       .        gene_id "ENSG00000227232";,gene_id "ENSG00000243485";
1    lincRNA         gene    29807   31109   .       +       .       gene_id "ENSG00000243485";

So only the actually overlapping parts are made into separate features. Is this possible with bedtools or is there some other tool that does this?

bedtools merge bed • 2.2k views

updated 7.3 years ago by Alex Reynolds 36k • written 7.3 years ago by Niek De Klein ★ 2.6k

score 3 · Accepted Answer · 2017-08-17

See BEDOPS bedops --partition: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#partition-p-partition

You'll need to munge your input into a correctly-ordered and sort-bed-sorted BED file, run the partitioning, and then follow up with bedmap to apply annotation columns to partitioned intervals, and then re-munge into whatever format you're working with.

I can offer a rough sketch of what you need to do. You'll need to do some more work to get it to look exactly the way you want, but this should show the general principles, which you can expand upon for your input.

Start with your matrix file:

$ cat baz.mtx
1   pseudogene  gene    11869   14412   .   +   .   gene_id "ENSG00000223972";
1   pseudogene  gene    14363   29806   .   -   .   gene_id "ENSG00000227232";
1   lincRNA gene    29554   31109   .   +   .   gene_id "ENSG00000243485";

Convert it to BED:

$ awk -v OFS='\t' '{ print $1, $4, $5, $2"-"$3; }' baz.mtx | sort-bed - > baz.bed

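For demo purposes, I'm concatenating the second and third columns of your matrix file into something that can be treated as a pseudo-ID. You can use any sentinel character you want here to condense more annotation columns into the ID. Note that this command copies the coordinates unchanged. If your input is 1-based and inclusive (as GTF is), subtract 1 from the start column to get true 0-based, half-open BED coordinates.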
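If you want to carry more annotation through the ID column, a variant of the conversion could look like the sketch below. The function name `mtx_to_bed` and the `:` sentinel are my own choices for illustration, and the sketch assumes whitespace-separated input in which the tenth field is the quoted gene ID and the coordinates are 1-based. The rest of this answer continues with the simpler `pseudogene-gene` IDs.

```shell
# Hypothetical helper (not part of BEDOPS): pack biotype, feature, strand and
# gene ID into the BED ID column, using ":" as the sentinel.
# Assumes whitespace-separated input in which field 10 is the quoted gene ID.
mtx_to_bed() {
  awk -v OFS='\t' '{
    id = $10
    gsub(/[";]/, "", id)   # strip the quotes and the trailing semicolon
    # $4 - 1 shifts a 1-based start to a 0-based BED start; drop it if not wanted
    print $1, $4 - 1, $5, $2 ":" $3 ":" $7 ":" id
  }' "$@"
}
```

You would then run something like `mtx_to_bed baz.mtx | sort-bed - > baz.bed`. Stripping the semicolon matters because bedmap joins multiple IDs with `;` by default, as in the output further down.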

This is what the sorted BED file baz.bed looks like:

$ cat baz.bed
1   11869   14412   pseudogene-gene
1   14363   29806   pseudogene-gene
1   29554   31109   lincRNA-gene

Then partition the BED file and map its partitions back to itself, printing out the mock IDs:

$ bedops --partition baz.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - baz.bed
1   11869   14363   pseudogene-gene
1   14363   14412   pseudogene-gene
1   14412   29554   pseudogene-gene
1   29554   29806   lincRNA-gene;pseudogene-gene
1   29806   31109   lincRNA-gene
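To make explicit what the partition step computes, here is a rough awk-only sketch of the same idea: cut at every start and end coordinate, then keep the pieces covered by at least one input interval. This is only an illustration under my own assumptions (the function name `partition_bed` is made up, and the approach is quadratic in the number of intervals); it is not how BEDOPS implements `--partition`, and you should use `bedops` for real data.

```shell
# Sketch: split BED intervals at every start/end coordinate and keep only
# the pieces that are covered by at least one input interval.
partition_bed() {
  bed=$1
  # Every start and end is a potential breakpoint; sort them per chromosome.
  awk -v OFS='\t' '{ print $1, $2; print $1, $3 }' "$bed" |
    sort -k1,1 -k2,2n -u |
    awk -v OFS='\t' '
      # First input (the BED file): load all intervals into arrays.
      NR == FNR { c[NR] = $1; s[NR] = $2; e[NR] = $3; n = NR; next }
      # Second input (breakpoints): test each consecutive pair on a chromosome.
      FNR > 1 && $1 == pc {
        for (i = 1; i <= n; i++)
          if (c[i] == $1 && s[i] <= pp && e[i] >= $2) { print $1, pp, $2; break }
      }
      { pc = $1; pp = $2 }
    ' "$bed" -
}
```

Running `partition_bed baz.bed` on the three intervals above should give the same five coordinate triples as the `bedops --partition` output, without the ID column.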

You should be able to work out the logic from here as far as how you deal with the non-BED annotation columns and reshuffle columns into your non-BED format.
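As one possible way to do that last reshuffling step, the sketch below turns bedmap output back into the nine-column layout from the question. It rests on assumptions that are not in the commands above: each ID was packed as `biotype:feature:strand:gene_id` with `:` as the sentinel, the BED starts are 0-based, and the helper name `bed_to_mtx` is hypothetical.

```shell
# Hypothetical helper: turn bedmap output (chrom, start, end, IDs joined by ";")
# back into the nine-column layout of the question.
# Assumes each ID is biotype:feature:strand:gene_id and starts are 0-based.
bed_to_mtx() {
  awk -F'\t' -v OFS='\t' '
    # Distinct values of sub-field k across all IDs, in order of appearance.
    function distinct(ids, n, k,    i, parts, seen, out) {
      out = ""
      for (i = 1; i <= n; i++) {
        split(ids[i], parts, ":")
        if (!(parts[k] in seen)) {
          seen[parts[k]] = 1
          out = (out == "" ? "" : out ",") parts[k]
        }
      }
      return out
    }
    {
      n = split($4, ids, ";")
      m = split(distinct(ids, n, 4), genes, ",")
      attr = ""
      for (j = 1; j <= m; j++)
        attr = (attr == "" ? "" : attr ",") "gene_id \"" genes[j] "\";"
      # $2 + 1 converts the 0-based BED start back to a 1-based start.
      print $1, distinct(ids, n, 1), distinct(ids, n, 2), $2 + 1, $3, ".", distinct(ids, n, 3), ".", attr
    }
  ' "$@"
}
```

You would pipe the bedmap output straight into it, e.g. `bedops --partition baz.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - baz.bed | bed_to_mtx`. Partitions covered by two genes then get comma-joined biotypes, strands, and gene IDs, which is close to the layout asked for in the question.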