Question

How to filter transcripts with overlapping exons?

0

Entering edit mode

10.0 years ago

upendrakumar.devisetty ▴ 400

When I was going through the bed file, I found that some of the entries define overlapping exons. One example is :

This is UCSC bed file format:

A01    5943412    5943616    Bra1000073    168    +    5943412    5943616    255,0,0    2    90,124    0,80

where you can see that the first exon goes from 0-90 and the second one from 80-124.

I am wondering is there a way to filter these kind of transcripts?

Thanks
Upendra

bed RNA-Seq • 4.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 10.0 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

10.0 years ago

Alex Reynolds 36k

I should first look up what the fields in a BED12 actually mean, instead of guessing wrongly. (I'm assuming your input is also BED12, per UCSC specification.)

Here's a text file called split.awk:

BEGIN {}
{
    chr = $1;
    start = $2;
    stop = $3;
    gene = $4;
    mis = $5;
    strand = $6;
    colour = $9;
    blockCount = $10;
    blockSizesStr = $11;
    blockStartsStr = $12;
    split(blockSizesStr, blockSizes, ",");
    split(blockStartsStr, blockStarts, ",");
    for (blockIdx = 1; blockIdx <= blockCount; ++blockIdx) {
        blockStart = blockStarts[blockIdx];
        blockSize = blockSizes[blockIdx];
        newStart = start + blockStart;
        newStop = newStart + blockSize;
        print chr"\t"newStart"\t"newStop"\t"gene"\t"mis"\t"strand"\t"start"\t"stop"\t"colour"\t"blockCount"\t"blockSizesStr"\t"blockStartsStr
    }
}
END {}

We can run the BED12 file through it:

$ awk -f split.awk some.bed12 | sort-bed - > answer.bed

I get the following answer with your input:

A01     5729384 5729665 Bra1000071      117     -       5729384 5730870 255,0,0 4       281,252,145,229 0,380,1030,1257
A01     5729764 5730016 Bra1000071      117     -       5729384 5730870 255,0,0 4       281,252,145,229 0,380,1030,1257
A01     5730414 5730559 Bra1000071      117     -       5729384 5730870 255,0,0 4       281,252,145,229 0,380,1030,1257
A01     5730641 5730870 Bra1000071      117     -       5729384 5730870 255,0,0 4       281,252,145,229 0,380,1030,1257

If you want to filter out overlapping elements, this could all be done in one line:

$ awk -f split.awk some.bed12 | sort-bed - | bedmap --count --echo --delim '\t' - | awk '$1==1' | cut -f2- > answer.bed

ADD COMMENT • link 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks a lot for your help. Learnt a lot during this exercise...

ADD REPLY • link 10.0 years ago by upendrakumar.devisetty ▴ 400

Ram · Accepted Answer · 2014-11-25

5

Entering edit mode

10.0 years ago

Alex Reynolds 36k

The first three fields are your overall chromosome, start and stop values.

Use awk to pull out the exon range fields, split on the comma value, and make a new BED file from the exon offsets and the overall values.

For your sample BED input above, you want something like this as output:

A01    5943412    5943502    Bra1000073    168    +    5943412    5943616    255,0,0    2    90,124    0,80
A01    5943492    5943616    Bra1000073    168    +    5943412    5943616    255,0,0    2    90,124    0,80

Once you have your BED file, run it through BEDOPS to strip overlapping elements:

$ sort-bed exons.bed > sorted_exons.bed
$ bedops -n sorted_exons.bed sorted_exons.bed > non_overlapping_sorted_exons.bed

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks Alex. Will try that....

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

Hi Alex, I tried to make a bed file as you said and used sort-bed to sort the bed file. But when I used bedops tool to extract the non overlapping bed it did not worked. Here is a subset of my bed file that contains both overlapping (Bra1000073) and non overlapping transcripts (Bra1000072, Bra1000074, Bra1000075).

A01     5731255 5731363 Bra1000072 94 - 5731255 5731906 255,0,0 2 108,82 0,569
A01     5731824 5731906 Bra1000072 94 - 5731255 5731906 255,0,0 2 108,82 0,569
A01     5943412 5943502 Bra1000073 168 + 5943412 5943616 255,0,0 2 90,124 0,80
A01     5943492 5943616 Bra1000073 168 + 5943412 5943616 255,0,0 2 90,124 0,80
A01     5999344 5999495 Bra1000074 18 + 5999344 5999805 255,0,0 2 151,48 0,413
A01     5999757 5999805 Bra1000074 18 + 5999344 5999805 255,0,0 2 151,48 0,413
A01     6092207 6092419 Bra1000075 218 - 6092207 6092600 255,0,0 2 212,101 0,292
A01     6092499 6092600 Bra1000075 218 - 6092207 6092600 255,0,0 2 212,101 0,292

Here is the command :

bedops -n sort_test3.bed  sort_test3.bed  > non_overlapping_sorted_exons.bed

The non_overlapping_sorted_exons.bed file is empty

Can you please try it and let me what is wrong with it.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by upendrakumar.devisetty ▴ 400

1

Entering edit mode

Do this, instead:

$ bedmap --echo-map sort_test3.bed | awk '{ if (split($0,a,";") == 1) print $0 }' > answer.bed

This gives me the answer:

A01    5731255    5731363    Bra1000072    94    -    5731255    5731906    255,0,0    2    108,82    0,569
A01    5731824    5731906    Bra1000072    94    -    5731255    5731906    255,0,0    2    108,82    0,569
A01    5999344    5999495    Bra1000074    18    +    5999344    5999805    255,0,0    2    151,48    0,413
A01    5999757    5999805    Bra1000074    18    +    5999344    5999805    255,0,0    2    151,48    0,413
A01    6092207    6092419    Bra1000075    218    -    6092207    6092600    255,0,0    2    212,101    0,292
A01    6092499    6092600    Bra1000075    218    -    6092207    6092600    255,0,0    2    212,101    0,292

ADD REPLY • link 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

It works. Great help. Much appreciated.

ADD REPLY • link 10.0 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

Another way to do it, which might be faster on very large inputs:

$ bedmap --count --echo --delim '\t' sort_test3.bed | awk '$1==1' | cut -f2- > answer.bed

ADD REPLY • link 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Alex, could you please share the awk script that you used to generate each line correspond to each exon. My python script works only well for 2 exons and anything more it does not work.

ADD REPLY • link 10.0 years ago by upendrakumar.devisetty ▴ 400

1

Entering edit mode

I didn't write a script, but it should be easy to do. Maybe you can use the following as a template and fill in the fields you need.

$ awk '{ \
    chr = $1; \
    start = $2; \
    stop = $3; \
    offsetStopStr = $11; \
    offsetStartStr = $12; \
    numStopOffsets = split(offsetStopStr, offsetStops, ","); \
    numStartOffsets = split(offsetStartStr, offsetStarts, ","); \
    for (offsetIdx = 1; offsetIdx <= numStartOffsets; ++offsetIdx) { \
        startOffset = offsetStarts[offsetIdx]; \
        stopOffset = offsetStops[offsetIdx]; \
        newStart = start + startOffset; \
        newStop = start + stopOffset; \
        print chr"\t"newStart"\t"newStop; \
    } \
}' input.bed12 > output.bed3

The logic in a Python script should be somewhat similar.

ADD REPLY • link 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks Alex for the awk script. I was able to modify my python script accordingly.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by upendrakumar.devisetty ▴ 400

0

Entering edit mode

Alex, the problem is not fixed yet. Here is the error...

awk '{chr = $1; start = $2; stop = $3; gene = $4; mis = $5; strand = $6; colour = $9; exons = $10; offsetStopStr = $11; offsetStartStr = $12; numStopOffsets = split(offsetStopStr, offsetStops, ","); numStartOffsets = split(offsetStartStr, offsetStarts, ","); for (offsetIdx = 1; offsetIdx <= numStartOffsets; ++offsetIdx) {startOffset = offsetStarts[offsetIdx]; stopOffset = offsetStops[offsetIdx]; newStart = start + startOffset; newStop = start + stopOffset; print chr"\t"newStart"\t"newStop"\t"gene"\t"mis"\t"strand"\t"start"\t"stop"\t"colour"\t"exons"\t"offsetStopStr"\t"offsetStartStr}}' velvet.test.bed > velvet.test.filtered.bed

# velvet.test.bed file

A01     5729384 5730870 Bra1000071      117     -       5729384 5730870 255,0,0 4       281,252,145,229 0,380,1030,1257

# velvet.test.filtered.bed file

A01    5729384    5729665    Bra1000071    117    -    5729384    5730870    255,0,0    4    281,252,145,229    0,380,1030,1257
A01    5729764    5729636    Bra1000071    117    -    5729384    5730870    255,0,0    4    281,252,145,229    0,380,1030,1257
A01    5730414    5729529    Bra1000071    117    -    5729384    5730870    255,0,0    4    281,252,145,229    0,380,1030,1257
A01    5730641    5729613    Bra1000071    117    -    5729384    5730870    255,0,0    4    281,252,145,229    0,380,1030,1257

# The error when i try to sort the file...

sort-bed velvet.test.filtered.bed > velvet.test.filtered.sorted.bed

Error on line 2 in velvet.test.filtered.bed. Genomic end coordinate is less than (or equal to) start coordinate.

Any ideas?

ADD REPLY • link 10.0 years ago by upendrakumar.devisetty ▴ 400