Bedtools merge using strand specificity?
1
0
Entering edit mode
9.5 years ago
cyril-cros ▴ 950

I have a de novo annotation file in gtf format, which I converted to a sorted bed file (see below).

I have been trying to use bedtools merge -s -d 10 -i outSampleSorted.bed but I get no output of any kind. The -s flag is to ensure I only fuse features on the same strand.

I am trying to merge all the features of a gene together. Thanks for your help!

[cyril@synapse tmp]$ head outSorted.bed 
chr1    3199741    3207317    0.0    -    sol    exon    .    locus_id "locus.00012320";type "3p_exon"
chr1    3199751    3207317    0.0    -    sol    exon    .    locus_id "locus.00012320";type "3p_exon"
chr1    3207317    3213438    0.0    -    sol    splice_jnct    .    color "#EE0000";locus_id "locus.00012320"
chr1    3213438    3216968    0.0    -    sol    exon    .    locus_id "locus.00012320";type "internal_exon"
chr1    3216968    3421701    0.0    -    sol    splice_jnct    .    color "#EE0000";locus_id "locus.00012320"
chr1    3322600    3323760    0.0    -    sol    exon    .    locus_id "locus.00012320";type "3p_exon"
chr1    3322750    3323760    0.0    -    sol    exon    .    locus_id "locus.00012320";type "3p_exon"
chr1    3323760    3421701    0.0    -    sol    splice_jnct    .    color "#EE0000";locus_id "locus.00012320"
chr1    3421701    3421901    0.0    -    sol    exon    .    locus_id "locus.00012320";type "internal_exon"
chr1    3421901    3670551    0.0    -    sol    splice_jnct    .    color "#EE0000";locus_id "locus.00012320"
bedtools • 2.5k views
ADD COMMENT
1
Entering edit mode

That's not a BED file, or GTF, or GFF. I imagine the fact that's breaking things...

ADD REPLY
0
Entering edit mode

I just tried using a file derived from cufflinks, using gtf2bed in the classic way and got the same mistake. But you are correct, it is not a BED file I am getting. I will try again with awk to get some form of correct bed format. The gtf I started from was in this format (which is valid).

chr4_JH584293_random    sol    exon    10822    11047    0.0    +    .    locus_id "locus.00000000";type "5p_exon"
chr4_JH584293_random    sol    exon    11191    11251    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12077    12246    0.0    +    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    sol    exon    12375    12489    0.0    +    .    locus_id "locus.00000000";type "internal_exon"

I get this:

chr4_JH584293_random    10821    11047    0.0    +    sol    exon    .    locus_id "locus.00000000";type "5p_exon"
chr4_JH584293_random    11190    11251    0.0    +    sol    exon    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    12076    12246    0.0    +    sol    exon    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    12374    12489    0.0    +    sol    exon    .    locus_id "locus.00000000";type "internal_exon"
chr4_JH584293_random    12642    12675    0.0    +    sol    exon    .    locus_id "locus.00000000";type "internal_exon"

I guess that the fields are not well recognized, especially the 'name' one.

ADD REPLY
0
Entering edit mode
9.5 years ago
cyril-cros ▴ 950

I did not have IDs formatted correctly for gtf2bed, who needs gene_name/gene_id/gene_transcript.

Solved using this:

# No more 'locus_id'
sed 's/locus_id/gene_id/g' foo.gtf > fooMod.gtf
# gene_id is recognized, I get a correct bed format
gtf2bed < fooMod.gtf > foo.bed 
# drop all columns after the strand - they can cause bugs
cut -f-6 foo.bed > fooMod.bed
# This works now
bedtools merge -s -d 10 -i fooMod.bed > merged.bed
ADD COMMENT

Login before adding your answer.

Traffic: 2441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6