I have downloaded the genomic coordinates corresponding to coding sequences for the mm10 genome using BioMart, the file looks like this (the last two columns are the cds_start and cds_end, respectively)
I want to merge all overlapping coding regions for each gene SEPARATELY. That is, only merge overlapping regions between transcripts of the same gene. Do not overlap cds regions from different genes.
My current solution is the following, but is very slow:
for gene in `awk '{print $4}' FS="\t" OFS="\t" cds_genebody.bed | sort | uniq`; do
GENE=$gene
echo $gene
grep -w "$gene" cds_genebody.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk -v gene=$GENE '{print $0,gene}' FS="\t" OFS="\t" >> file.merged.bed
done
is it possible to do this in a fast way using bedtools or similar?
You can append the geneid to each of the chromosome names to create a pseudo-chromosome for each transcript. Merge based on the pseuod-chromosomes. Then remove the appended geneid to each pseudo-chromosome.
Hello Damian, thanks for the answer, but I don't understand what you propose. What is going to change from having a gene_id for each transcript to having an equivalent pseudo-chromosome id for each transcript? The procedure is going to be equally slow with my bash-based approach
Maybe I am not fully understanding what you are trying to do. Right now you are extracting bed entries for each gene and running bedtools merge individually? If you append geneID to each chromosome to create pseudo-chromosomes, you will only have to run bedtools merge once on the modified file. Then just process the resulting file and remove the pseudo-chromosome.
For example, let's say your file is:
If you run bedtools merge on this file, you'll merge transcripts from both gene01 and gene02, which is not what you want. You want to merge transcriptA-C of gene1 and transcriptA-C of gene02 separately.
If you convert your bedfile to this:
And run bedtools merge. Now bedtools will properly merge the genes separately, because the chromosomes are different. From the resulting merged file, you'll just have to remove the _gene0X tags from the chromosomes.
now I get it, very smart trick, thank you!!