Question

Merge overlapping CDS using bed tools

0

Entering edit mode

8.7 years ago

Nathaniel ▴ 120

I have downloaded the genomic coordinates corresponding to coding sequences for the mm10 genome using BioMart, the file looks like this (the last two columns are the cds_start and cds_end, respectively)

enter image description here

I want to merge all overlapping coding regions for each gene SEPARATELY. That is, only merge overlapping regions between transcripts of the same gene. Do not overlap cds regions from different genes.

My current solution is the following, but is very slow:

for gene in `awk '{print $4}' FS="\t" OFS="\t" cds_genebody.bed | sort | uniq`; do
GENE=$gene
echo $gene
grep -w "$gene" cds_genebody.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk -v gene=$GENE '{print $0,gene}' FS="\t" OFS="\t" >> file.merged.bed
done

is it possible to do this in a fast way using bedtools or similar?

bedtools • 3.1k views

ADD COMMENT • link updated 8.7 years ago by Alex Reynolds 36k • written 8.7 years ago by Nathaniel ▴ 120

1

Entering edit mode

You can append the geneid to each of the chromosome names to create a pseudo-chromosome for each transcript. Merge based on the pseuod-chromosomes. Then remove the appended geneid to each pseudo-chromosome.

ADD REPLY • link 8.7 years ago by Damian Kao 16k

0

Entering edit mode

Hello Damian, thanks for the answer, but I don't understand what you propose. What is going to change from having a gene_id for each transcript to having an equivalent pseudo-chromosome id for each transcript? The procedure is going to be equally slow with my bash-based approach

ADD REPLY • link 8.7 years ago by Nathaniel ▴ 120

1

Entering edit mode

Maybe I am not fully understanding what you are trying to do. Right now you are extracting bed entries for each gene and running bedtools merge individually? If you append geneID to each chromosome to create pseudo-chromosomes, you will only have to run bedtools merge once on the modified file. Then just process the resulting file and remove the pseudo-chromosome.

For example, let's say your file is:

chr1 100 200 transcriptA gene01
chr1 150 200 transcriptB gene01
chr1 10 200 transcriptC gene01

chr1 100 200 transcriptA gene02
chr1 150 200 transcriptB gene02
chr1 10 200 transcriptC gene02

If you run bedtools merge on this file, you'll merge transcripts from both gene01 and gene02, which is not what you want. You want to merge transcriptA-C of gene1 and transcriptA-C of gene02 separately.

If you convert your bedfile to this:

chr1_gene01 100 200 transcriptA gene01
chr1_gene01 150 200 transcriptB gene01
chr1_gene01 10 200 transcriptC gene01

chr1_gene02 100 200 transcriptA gene02
chr1_gene02 150 200 transcriptB gene02
chr1_gene02 10 200 transcriptC gene02

And run bedtools merge. Now bedtools will properly merge the genes separately, because the chromosomes are different. From the resulting merged file, you'll just have to remove the _gene0X tags from the chromosomes.

ADD REPLY • link 8.7 years ago by Damian Kao 16k

0

Entering edit mode

now I get it, very smart trick, thank you!!

ADD REPLY • link 8.7 years ago by Nathaniel ▴ 120

score 1 · Answer 1 · 2016-04-06

You don't need to sort every time, if your input is sorted; just sort once at the beginning. Since the output of grep will be in the same order as found in the input, there will be no need to pipe to sort on each loop iteration. The merged output will likely be out of order, so you can do a second sort at the end.

To speed up grep, you can also add the LANG variable and do a fixed string search, instead of searching against a regular expression pattern:

$ sort-bed cds_genebody.unsorted.bed > cds_genebody.bed
$ for gene in `cut -f4 cds_genebody.bed | sort | uniq`; do
    LANG=C
    GENE=$gene
    echo $gene
    grep -F "$gene" cds_genebody.bed | ... >> file.merged.unsorted.bed
done
$ sort-bed file.merged.unsorted.bed > file.merged.bed

You can also munge the chromosome name, but with a couple small changes the performance of your original approach can be improved a lot.

Another option is to parallelize the search work with GNU Parallel, since work done in each iteration of that for loop is independent of all the others. For example:

$ sort-bed cds_genebody.unsorted.bed > cds_genebody.bed
$ cut -f4 cds_genebody.bed \
    | sort \
    | uniq \
    | parallel --pipe "echo {} && grep -F {} cds_genebody.bed | bedops --merge - | awk -v gene={} '{print \$0, gene}' >> file.merged.unsorted.bed"
$ sort-bed file.merged.unsorted.bed > file.merged.bed

There's a great thread here with usage examples of GNU Parallel in bioinformatics settings.