I have a .bed file, data, that was obtained by concatenating two .bed files. This was done with the BEDOPS --everything option, so all four columns (chr, start, end, gene_ID) are preserved. For each gene ID, there are a few rows of coordinates that may or may not overlap.
I would like to merge the coordinates belonging to each gene separately: intervals that overlap by at least one bp should be merged, and intervals that do not overlap should remain separate. The merge should not treat all genes together in one pass.
I have tried bedtools merge and BEDOPS merge, but could not get this to work because they treat the whole file as a single set of intervals.
> data
chr1 206721 208928 ENSG00000951
chr1 207322 208145 ENSG00000951
chr1 312006 314918 ENSG00000885
chr1 312077 312277 ENSG00000885
chr1 313423 314611 ENSG00000885
chr1 315128 315716 ENSG00000885
chr1 235826 238431 ENSG00000082
chr1 242929 244929 ENSG00000627
chr1 247107 249107 ENSG00000627
chr1 249284 252043 ENSG00000627
The expected output would be like this:
> data.output
chr1 206721 208928 ENSG00000951
chr1 235826 238431 ENSG00000082
chr1 312006 314918 ENSG00000885
chr1 315128 315716 ENSG00000885
chr1 242929 244929 ENSG00000627
chr1 247107 249107 ENSG00000627
chr1 249284 252043 ENSG00000627
Thank you.
I really like the rationale behind using the 4th column as the chromosome and keeping the real chromosome as mapping information with the -c option. I love finding different ways of using tools that were designed for one job to achieve a different goal. The solution works well. Btw, I edited data.output to add ENSG00000082, which was mistakenly deleted. Could you briefly explain how you set this up for bedtools, and how it treats the columns differently?
Sure. In the first line I rearrange the BED file, using column 4 ($4) as column 1 ($1), so the gene name is now the chromosome name. That basically makes every gene a unique chromosome. BEDtools merges by chromosome and position. Since we used the gene name as the chromosome name, it will merge only within a gene, and only where there are overlaps. I moved the actual chromosome to $4 to keep it as the name, and after the merge simply switched the two again, moving $4 back to $1 and $1 back to $4. The first sort is necessary because BEDtools expects sorted input; the last sort is optional. Does that make sense to you?
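To make the logic concrete, here is a minimal pure-Python sketch of the same idea. It is not the bedtools pipeline itself, only an illustration of grouping by gene and then merging sorted intervals within each group. The function name merge_per_gene is hypothetical, and I am assuming strict overlap (at least one bp, as the question asks); as far as I know, bedtools merge with its default settings also merges book-ended intervals, so results could differ in that edge case. The sketch groups by (gene, chromosome) rather than by gene alone, which is a small deviation from the column-swap trick.

```python
from collections import defaultdict

def merge_per_gene(rows):
    """Merge overlapping intervals separately for each gene.

    rows: iterable of (chrom, start, end, gene) records.
    Hypothetical helper for illustration; not part of bedtools or BEDOPS.
    """
    # Group intervals by gene (and chromosome), mirroring the trick of
    # treating each gene as its own "chromosome".
    groups = defaultdict(list)
    for chrom, start, end, gene in rows:
        groups[(gene, chrom)].append((int(start), int(end)))

    merged = []
    for (gene, chrom), intervals in groups.items():
        intervals.sort()  # merging requires position-sorted input
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:
                # Assumption: strict overlap of at least 1 bp (half-open BED).
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, gene))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, gene))

    # Final sort by chromosome and position (optional, for readability).
    return sorted(merged, key=lambda r: (r[0], r[1], r[2]))

data = """\
chr1 206721 208928 ENSG00000951
chr1 207322 208145 ENSG00000951
chr1 312006 314918 ENSG00000885
chr1 312077 312277 ENSG00000885
chr1 313423 314611 ENSG00000885
chr1 315128 315716 ENSG00000885
chr1 235826 238431 ENSG00000082
chr1 242929 244929 ENSG00000627
chr1 247107 249107 ENSG00000627
chr1 249284 252043 ENSG00000627
"""

rows = [line.split() for line in data.strip().splitlines()]
result = merge_per_gene(rows)
for chrom, start, end, gene in result:
    print(chrom, start, end, gene, sep="\t")
```

On the example data this yields the same seven intervals as data.output, just ordered by start position instead of by gene.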
Yup! such a smart workaround:)