Mergebed Cannot Keep Exon Intron Structure
2
0
Entering edit mode
10.7 years ago

enter image description hereHi, I have bed file and i am trying to merge the overlapping transcripts based on genomic coordinates into a single transcript. For this i used mergeBed but what i found was mergeBed basically make a long single contig without keeping the intron exon structure of the overlapping transcripts (see attached picture and example below). Is there a way to keep the structure? Also the output from the mergeBed basically is different from input bed file (see below again)![enter image description here][2] and so i am wondering is there a way to keep the structure of original bed file as well and still make an overalpping contig from the overalpped transcripts?

# Input file:
A01     1330942 1331155 comp293731_c0_seq1|m.7724       94      -       1330942 1331155 255,0,0 1       213     0
A01     1331135 1331557 comp94935_c0_seq2|m.8355        2       +       1331135 1331557 255,0,0 3       49,113,41       0,123,381

# output file
A01     1330942 1331557 comp293731_c0_seq1|m.7724;comp94935_c0_seq2|m.8355

Thanks Upendra

structure • 2.6k views
ADD COMMENT
0
Entering edit mode

Please post an example of "wanted output".

ADD REPLY
0
Entering edit mode

Ok all. I have edited my post to give more information of what i wanted. Basically i have bed file that have four overlapping transcripts (see Denovo filtered track in the picture) and all i wanted is to have a consensus/overlapping contig. For this i have used mergedBed/bedops tools but both of them gave me a bed file that have a single contig without preserving intron-exon structure (See merged Track above). Am i not thinking well here? I don't know if it can be done with any tool.

ADD REPLY
0
Entering edit mode

Can you write up a sample BED file that hints at the format of what you need, given the example images provided? It's likely the use of bedops and bedmap alone won't do what you need, but perhaps those two apps with awk may work. Hopefully this helps clarify the request for clarification.

ADD REPLY
0
Entering edit mode

This is what want

# input
A03     15742539        15744798        BRME_15388|m.2560       20      +       15742539        15744798        255,0,0 6       64,448,143,287,76,166   0,177,985,1203,1934,2093
A03     15742539        15744798        TCONS_00018124|m.13266  20      +       15742539        15744798        255,0,0 6       64,448,143,287,76,166   0,177,985,1203,1934,2093
A03     15742539        15745142        TCONS_00018125|m.13273  19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428
A03     15742539        15745142        comp263233_c0_seq2|m.10240      19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428

# merged
A03     15742539        15745142        BRME_15388|m.2560;TCONS_00018124|m.13266;TCONS_00018125|m.13273;comp263233_c0_seq2|m.10240

# desired bed
A03     15742539        15745142        comp263233_c0_seq2|m.10240      19      +       15742539        15745142        255,0,0 8       64,448,143,287,76,92,83,175     0,177,985,1203,1934,2093,2258,2428
ADD REPLY
0
Entering edit mode

The desired bed category looks like the fourth line of input.

ADD REPLY
0
Entering edit mode

That is right in this case. Basically all i wanted is for those overlapped transcripts pick the one that is longest without merging the introns and exons.

ADD REPLY
1
Entering edit mode
10.7 years ago

Okay, start by making a contigs file:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed > contigs.bed

Now use bedmap with both --echo-map and --echo-map-size to report two pieces of information: a list of mapped transcripts (which overlap the contig) and their associated lengths:

$ bedmap --echo-map --echo-map-size contigs.bed inputs.bed > perContigTranscriptsWithSizes.txt

For each contig, this format is as follows:

transcript-1 ; transcript-2 ; ... ; transcript-N | length-1 ; length-2 ; ... ; length-N

Note that the transcripts and lengths are separated from each other by the pipe character (|), while the lists of individual transcripts and lengths are separated by semi-colons (;).

We feed this result to awk to split the transcripts and lengths into separate arrays. We find the index of the longest length and report that transcript. We pass those transcripts to sort-bed to report a sorted BED file containing the longest-length transcripts in sorted order:

$ awk '{ \
    split($0, components, "|"); \
    transcriptsStr = components[1]; \
    lengthsStr = components[2]; \
    max = 0; \
    maxIdx = -1; \
    numLengths = split(lengthsStr, lengths, ";"); \
    for (idx = 1; idx <= numLengths; idx++) { \
        if (lengths[idx] > max) { \
            maxIdx = idx; \
            max = lengths[idx]; \
        } \
    } \
    split(transcriptsStr, transcripts, ";"); \
    maxTranscript = transcripts[maxIdx]; \
    print maxTranscript; \
}' perContigTranscriptsWithSizes.txt \
| sort-bed - > answer.bed

Putting this all together into one pipeline:

$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed \
    | bedmap --echo-map --echo-map-size - inputs.bed \
    | awk '{ \
        split($0, components, "|"); \
        transcriptsStr = components[1]; \
        lengthsStr = components[2]; \
        max = 0; \
        maxIdx = -1; \
        numLengths = split(lengthsStr, lengths, ";"); \
        for (idx = 1; idx <= numLengths; idx++) { \
            if (lengths[idx] > max) { \
                maxIdx = idx; \
                max = lengths[idx]; \
            } \
        } \
        split(transcriptsStr, transcripts, ";"); \
        maxTranscript = transcripts[maxIdx]; \
        print maxTranscript; \
    }' - \
    | sort-bed - > answer.bed
ADD COMMENT
1
Entering edit mode

Thank you so much for the pipeline.It worked wonderfully well. I have made few modifications. See below.

grep ";" perContigTranscriptsWithSizes.txt > overlapped.bed 

awk '{split($0, components, "|"); transcriptsStr = components[1]; lengthsStr = components[2]; max = 0; maxIdx = -1; numLengths = split(lengthsStr, lengths, ";"); for (idx = 1; idx <= numLengths; idx++) { if (lengths[idx] > max) {maxIdx = idx; max = lengths[idx];}} split(transcriptsStr, transcripts, ";"); maxTranscript = transcripts[maxIdx]; print maxTranscript;}' overlapped.bed > overlapped.filtered.bed 

sort-bed overlapped.filtered.bed > overlapped.filtered.sorted.bed 

cut -f 4 overlapped.filtered.bed > overlapped.filtered.genes 

grep -f overlapped.filtered.genes input.sorted.bed > overlapped.filtered.genes.bed 

grep -v -c ";" perContigTranscriptsWithSizes.txt > non_overlapped_final.bed 

cat non_overlapped_final.bed overlapped.filtered.genes.bed > final.bed

Hope this is ok. Thanks anyway for all the help with this. Now here is the new image after uploading final.bed file onto UCSC GB. Thanks

Upendra enter image description here

ADD REPLY

Login before adding your answer.

Traffic: 1655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6