Bedtools merge only part of feature that overlaps, not whole feature
1
0
Entering edit mode
7.3 years ago
Niek De Klein ★ 2.6k

I want to merge the genes in a bedfile so that only the parts of the gene that overlap become a new feature. E.g., have this bedfile:

1    pseudogene      gene    11869   14412   .       +       .       gene_id "ENSG00000223972" 
1    pseudogene      gene    14363   29806   .       -       .       gene_id "ENSG00000227232"; 
1    lincRNA         gene    29554   31109   .       +       .       gene_id "ENSG00000243485";

which I can merge with

bedtools merge -c 1,2,3,4,5,6,7,8,9 -d -1 -o distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct

which gives me

1   11868   31109   1   lincRNA,pseudogene  gene    11869,14363,29554   14412,29806,31109   .   +,- .   gene_id "ENSG00000223972";, gene_id "ENSG00000227232",gene_id "ENSG00000243485";

But what I would like to have is something like this

1    pseudogene      gene    11869   14362    .       +       .       gene_id "ENSG00000223972";
1    pseudogene      gene    14363   14412    .       +       .       gene_id "ENSG00000223972";,gene_id "ENSG00000223972";
1    pseudogene      gene    14413   29553   .       +,-       .       gene_id "ENSG00000227232";
1    lincRNA,pseudogene         gene    29554   29806   .       +,-       .        gene_id "ENSG00000227232";,gene_id "ENSG00000243485";
1    lincRNA         gene    29807   31109   .       +       .       gene_id "ENSG00000243485";

So only the actually overlapping parts are made into separate features. Is this possible with bedtools or is there some other tool that does this?

bedtools merge bed • 2.2k views
ADD COMMENT
3
Entering edit mode
7.3 years ago

See BEDOPS bedops --partition: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html#partition-p-partition

You'll need to munge your input into a correctly-ordered and sort-bed-sorted BED file, run the partitioning, and then follow up with bedmap to apply annotation columns to partitioned intervals, and then re-munge into whatever format you're working with.

I can offer a rough sketch of what you need to do. You'll need to do some more work to get it to look exactly the way you want, but this should show the general principles, which you can expand upon for your input.

Start with your matrix file:

$ cat baz.mtx
1   pseudogene  gene    11869   14412   .   +   .   gene_id "ENSG00000223972";
1   pseudogene  gene    14363   29806   .   -   .   gene_id "ENSG00000227232";
1   lincRNA gene    29554   31109   .   +   .   gene_id "ENSG00000243485";

Convert it to BED:

$ awk -v OFS='\t' '{ print $1, $4, $5, $2"-"$3; }' baz.mtx | sort-bed - > baz.bed

For demo purposes, I'm concatenating the second and third columns of your matrix file into something that can be treated as a pseudo-ID. You can use any sentinel character you want here, to condense more annotation data columns into the ID.

This is what the sorted BED file baz.bed looks like:

$ cat baz.bed
1   11869   14412   pseudogene-gene
1   14363   29806   pseudogene-gene
1   29554   31109   lincRNA-gene

Then partition the BED file and map its partitions back to itself, printing out the mock IDs:

$ bedops --partition baz.bed | bedmap --echo --echo-map-id-uniq --delim '\t' - baz.bed
1   11869   14363   pseudogene-gene
1   14363   14412   pseudogene-gene
1   14412   29554   pseudogene-gene
1   29554   29806   lincRNA-gene;pseudogene-gene
1   29806   31109   lincRNA-gene

You should be able to work out the logic from here as far how you deal with the non-BED annotation columns and reshuffle columns into your non-BED format.

ADD COMMENT
0
Entering edit mode

Thanks for the detailed explanation! For others that find this question, this is how I went from ensemble GTF to BED with CHR, start, stop, ensemble gene ID:

awk '{if ($3 == "gene") print $0;}' Homo_sapiens.GRCh37.75.gtf \
    | bedtools sort -i stdin \
    | awk  '{print $1 "\t" $4 "\t" $5 "\t" $10}'  \
    | sed 's/"//g' - | sed 's/;//g' \
    > tmp.bed
bedops --partition tmp.bed \
    | bedmap \
        --echo \
        --echo-map-id-uniq \
        --delim '\t' \
        - tmp.bed \
        > Homo_sapiens.GRCh37.75.metaGenes.bed
rm tmp.bed

output:

1       11869   14363   ENSG00000223972
1       14363   14412   ENSG00000223972;ENSG00000227232
1       14412   29554   ENSG00000227232
1       29554   29806   ENSG00000227232;ENSG00000243485
ADD REPLY
1
Entering edit mode

Or you could just use BEDOPS gtf2bed:

$ awk '$3 == "gene"' Homo_sapiens.GRCh37.75.gtf | gtf2bed - | ... > tmp.bed

The gtf2bed call will create sorted BED.

ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6