I want to merge the genes in a bedfile so that only the parts of the genes that overlap become a new feature. E.g., I have this bedfile:
1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"
1 pseudogene gene 14363 29806 . - . gene_id "ENSG00000227232";
1 lincRNA gene 29554 31109 . + . gene_id "ENSG00000243485";
which I can merge with
bedtools merge -c 1,2,3,4,5,6,7,8,9 -d -1 -o distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct
which gives me
1 11868 31109 1 lincRNA,pseudogene gene 11869,14363,29554 14412,29806,31109 . +,- . gene_id "ENSG00000223972";, gene_id "ENSG00000227232",gene_id "ENSG00000243485";
But what I would like to have is something like this
1 pseudogene gene 11869 14362 . + . gene_id "ENSG00000223972";
1 pseudogene gene 14363 14412 . +,- . gene_id "ENSG00000223972";,gene_id "ENSG00000227232";
1 pseudogene gene 14413 29553 . - . gene_id "ENSG00000227232";
1 lincRNA,pseudogene gene 29554 29806 . +,- . gene_id "ENSG00000227232";,gene_id "ENSG00000243485";
1 lincRNA gene 29807 31109 . + . gene_id "ENSG00000243485";
So only the actually overlapping parts are made into separate features. Is this possible with bedtools or is there some other tool that does this?
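For anyone who wants to see the splitting logic spelled out, here is a minimal Python sketch of the partitioning described above. It is not a bedtools feature: the function name `partition` and the tuple layout are made up for illustration, and it assumes 1-based inclusive coordinates as in the example. Every start and every end+1 becomes a cut point, and each piece between two neighbouring cut points is annotated with the distinct values of the genes covering it.

```python
from collections import defaultdict


def partition(features):
    """Split features into disjoint segments (hypothetical helper).

    features: iterable of (chrom, start, end, strand, biotype, gene_id),
    with 1-based inclusive coordinates.
    Returns (chrom, start, end, strands, biotypes, gene_ids) tuples, where
    the last three fields are sorted, comma-joined distinct values.
    """
    by_chrom = defaultdict(list)
    for feat in features:
        by_chrom[feat[0]].append(feat)

    segments = []
    for chrom in sorted(by_chrom):
        feats = by_chrom[chrom]
        # Every start and every end+1 is a position where coverage can change.
        cuts = sorted({f[1] for f in feats} | {f[2] + 1 for f in feats})
        for left, right in zip(cuts, cuts[1:]):
            seg_start, seg_end = left, right - 1
            # Simple linear scan per segment; fine for a sketch, slow for huge files.
            covering = [f for f in feats if f[1] <= seg_start and f[2] >= seg_end]
            if not covering:
                continue  # gap between features, nothing to report
            strands = ",".join(sorted({f[3] for f in covering}))
            biotypes = ",".join(sorted({f[4] for f in covering}))
            gene_ids = ",".join(sorted({f[5] for f in covering}))
            segments.append((chrom, seg_start, seg_end, strands, biotypes, gene_ids))
    return segments


genes = [
    ("1", 11869, 14412, "+", "pseudogene", "ENSG00000223972"),
    ("1", 14363, 29806, "-", "pseudogene", "ENSG00000227232"),
    ("1", 29554, 31109, "+", "lincRNA", "ENSG00000243485"),
]
segments = partition(genes)
for seg in segments:
    print(*seg, sep="\t")
```

On the three example genes this yields five segments, with 14363-14412 and 29554-29806 carrying two gene IDs each. If you would rather stay with existing tools, BEDOPS has a `bedops --partition` operation that produces the disjoint pieces from sorted BED input; mapping the gene IDs back onto them is then a separate step.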
Thanks for the detailed explanation! For others who find this question, this is how I went from an Ensembl GTF to a BED file with chromosome, start, stop, and Ensembl gene ID:
output:
Or you could just use BEDOPS gtf2bed. The gtf2bed call will create sorted BED.
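To illustrate what such a conversion involves, here is a rough Python sketch. It is not the command used in the comments above and not what gtf2bed does internally; the function name `gtf_genes_to_bed` is made up, and it assumes a tab-separated GTF with the standard nine columns and a `gene_id "..."` attribute. It keeps only `gene` rows and shifts the start by one, since GTF is 1-based inclusive and BED is 0-based half-open.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_genes_to_bed(lines):
    """Return sorted 4-column BED lines (chrom, start, end, gene_id) for gene rows."""
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level features
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF start is 1-based; BED start is 0-based, end stays the same.
        records.append((fields[0], int(fields[3]) - 1, int(fields[4]), match.group(1)))
    # Sort by chromosome name (as text), then start, then end.
    records.sort()
    return ["\t".join(map(str, rec)) for rec in records]


example = [
    '1\tlincRNA\tgene\t29554\t31109\t.\t+\t.\tgene_id "ENSG00000243485";',
    '1\tpseudogene\tgene\t11869\t14412\t.\t+\t.\tgene_id "ENSG00000223972";',
    '1\tpseudogene\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972";',
]
bed_lines = gtf_genes_to_bed(example)
print("\n".join(bed_lines))
```

The start of 11868 for the first gene matches the 0-based start that shows up in the bedtools merge output in the question.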