Bedtools: merge all features with the same id
2
4
Entering edit mode
9.4 years ago
cyril-cros ▴ 950

I would like to use bedtools merge to collapse together all the features sharing a same gene_id in my bed file (which contains the annotation of various genes - the name column (4th one) also corresponds to the gene_id). Due to splicing, elements may be quite distant.... ​

chr14    49894259    49895806    ENSMUST00000053290    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49894873    49894876    ENSMUST00000053290    0.000000    -    mm10_ensGene    stop_codon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49894876    49895800    ENSMUST00000053290    0.000000    -    mm10_ensGene    CDS    0    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49895797    49895800    ENSMUST00000053290    0.000000    -    mm10_ensGene    start_codon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14    49901908    49901941    ENSMUST00000053290    0.000000    -    mm10_ensGene    exon    .    gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
annotation bedtools bed • 12k views
ADD COMMENT
1
Entering edit mode

What are you really trying to achieve in the end? For example, do you need coding gene lengths because you want to calculate transcript-size corrected gene expression values from RNA sequencing (aka [FRC]PKM?) In that case, have a look at this post.

ADD REPLY
0
Entering edit mode

Nice link. I wanted to collapse my annotation to a more convenient form to intersect a de novo annotation to a reference annotation with gene_ids (see How To Get Annotation For Bed File From Another Bed File). I do this to assign the official Ensembl ids to my new annotation. I don't really need the whole transcript...

However, some parts of my annotation have a rather insufficient coverage - your script will help me a lot to identify them.

ADD REPLY
0
Entering edit mode

Are you starting out with a GTF or a bed file? I think I have done something similar starting out with a GTF and using bedtools merge.

ADD REPLY
9
Entering edit mode
9.4 years ago

Hi- You don't say what to do with the collapsed features. This piece of code collapses the gene ids and counts how many features have been grouped together along with the min and max coordinates. Note it uses bedtools groupby not merge.

Sample data:

cat genes.bed
chr14    49894259    49895806    ENSMUST00000053290    0.000000    ...
chr14    49894873    49894876    ENSMUST00000053290    0.000000    ...
chr14    49894876    49895800    ENSMUST00000053291    0.000000    ...
chr14    49895797    49895800    ENSMUST00000053291    0.000000    ...
chr14    49901908    49901941    ENSMUST00000053291    0.000000    ...

Example output:

sort -k4,4 genes.bed \
| groupBy -g 1,4 -c 4,2,3 -o count,min,max \
| awk -v OFS='\t' '{print $1, $4, $5, $2, $3}'

chr14    49894259    49895806    ENSMUST00000053290    2
chr14    49894876    49901941    ENSMUST00000053291    3
ADD COMMENT
1
Entering edit mode

The code didn't work for me at first. The groupBy function summarized all lines of the bed file and the output file contained one entry which covered the whole genome. It took me a while to figure out that the problem was the recent bedtools version 2.26.0. When I switched back to version 2.25.0 everything worked as expected. Thanks for your solution anyway, it helped me a lot.

ADD REPLY
0
Entering edit mode

Thanks, it does the job nicely.

ADD REPLY
0
Entering edit mode

Thanks! That is exactly what I was looking for. Best

ADD REPLY
0
Entering edit mode

How is it possible to run this code on newer versions of bedtools without summarizing all lines of each ID to one?

ADD REPLY
1
Entering edit mode
9.4 years ago

Here's an untested way to do this with BEDOPS. Merging intervals in a BED file is "destructive" to input regions, so there's no need to parse out anything beyond the first four fields.

#!/bin/bash +x

inFn=$1
outDir=$2
tmpDir="/tmp/split"
mkdir -p $tmpDir
mkdir -p $outDir
awk -v tmpDir=$tmpDir '
    BEGIN {
        while (getline < "'"$inFn"'") {
            n = split($0, elems, "\t");
            chr = elems[1];
            start = elems[2];
            stop = elems[3];
            id = elems[4];
            out_fn = tmpDir"/"id".id.bed";
            print chr"\t"start"\t"stop >> out_fn;
        }
        close("'"$inFn"'");
    }
'
for $perIdFn in `ls $tmpDir/*.id.bed`;
do
    perIdBaseFn=$(basename "$perIdFn")
    bedops --merge $perIdFn > $outDir/$perIdBaseFn.merged.bed
done
rm -rf $tmpDir

Usage:

$ ./splitByIdAndMerge.bash unmerged_features.bed merged_features_directory
ADD COMMENT
0
Entering edit mode

Thanks for your help. My main issue was that I could not easily obtain a relation in the style of one gene, one large interval. I wanted to find which genes were included in a large bed file.

By only taking the start_codon feature, I have a unique location for each gene. Granted, I miss any small overlap... I will still try your solution and validate it later if it works. A simple python parser might also work if I sort by gene_id my entries...

ADD REPLY
0
Entering edit mode

You might look into BEDOPS bedmap. Given an interval and a set of features (both sorted per BEDOPS sort-bed), bedmap will report all the overlapping features for a given interval. You can then chop up results with Python, Perl, etc.

ADD REPLY

Login before adding your answer.

Traffic: 1029 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6