I should first look up what the fields in a BED12 actually mean, instead of guessing wrongly. (I'm assuming your input is also BED12, per UCSC specification.)
Here's a text file called split.awk
:
BEGIN {}
{
chr = $1;
start = $2;
stop = $3;
gene = $4;
mis = $5;
strand = $6;
colour = $9;
blockCount = $10;
blockSizesStr = $11;
blockStartsStr = $12;
split(blockSizesStr, blockSizes, ",");
split(blockStartsStr, blockStarts, ",");
for (blockIdx = 1; blockIdx <= blockCount; ++blockIdx) {
blockStart = blockStarts[blockIdx];
blockSize = blockSizes[blockIdx];
newStart = start + blockStart;
newStop = newStart + blockSize;
print chr"\t"newStart"\t"newStop"\t"gene"\t"mis"\t"strand"\t"start"\t"stop"\t"colour"\t"blockCount"\t"blockSizesStr"\t"blockStartsStr
}
}
END {}
We can run the BED12 file through it:
$ awk -f split.awk some.bed12 | sort-bed - > answer.bed
I get the following answer with your input:
A01 5729384 5729665 Bra1000071 117 - 5729384 5730870 255,0,0 4 281,252,145,229 0,380,1030,1257
A01 5729764 5730016 Bra1000071 117 - 5729384 5730870 255,0,0 4 281,252,145,229 0,380,1030,1257
A01 5730414 5730559 Bra1000071 117 - 5729384 5730870 255,0,0 4 281,252,145,229 0,380,1030,1257
A01 5730641 5730870 Bra1000071 117 - 5729384 5730870 255,0,0 4 281,252,145,229 0,380,1030,1257
If you want to filter out overlapping elements, this could all be done in one line:
$ awk -f split.awk some.bed12 | sort-bed - | bedmap --count --echo --delim '\t' - | awk '$1==1' | cut -f2- > answer.bed
Thanks Alex. Will try that....
Hi Alex, I tried to make a bed file as you said and used sort-bed to sort the bed file. But when I used bedops tool to extract the non overlapping bed it did not worked. Here is a subset of my bed file that contains both overlapping (Bra1000073) and non overlapping transcripts (Bra1000072, Bra1000074, Bra1000075).
Here is the command :
The
non_overlapping_sorted_exons.bed file
is emptyCan you please try it and let me what is wrong with it.
Do this, instead:
This gives me the answer:
It works. Great help. Much appreciated.
Another way to do it, which might be faster on very large inputs:
Alex, could you please share the awk script that you used to generate each line correspond to each exon. My python script works only well for 2 exons and anything more it does not work.
I didn't write a script, but it should be easy to do. Maybe you can use the following as a template and fill in the fields you need.
The logic in a Python script should be somewhat similar.
Thanks Alex for the awk script. I was able to modify my python script accordingly.
Alex, the problem is not fixed yet. Here is the error...
# velvet.test.bed file
# velvet.test.filtered.bed file
# The error when i try to sort the file...
Error on line 2 in velvet.test.filtered.bed. Genomic end coordinate is less than (or equal to) start coordinate.
Any ideas?