Okay, start by making a contigs file:
$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed > contigs.bed
Now use bedmap
with both --echo-map
and --echo-map-size
to report two pieces of information: a list of mapped transcripts (which overlap the contig) and their associated lengths:
$ bedmap --echo-map --echo-map-size contigs.bed inputs.bed > perContigTranscriptsWithSizes.txt
For each contig, this format is as follows:
transcript-1 ; transcript-2 ; ... ; transcript-N | length-1 ; length-2 ; ... ; length-N
Note that the transcripts and lengths are separated from each other by the pipe character (|
), while the lists of individual transcripts and lengths are separated by semi-colons (;
).
We feed this result to awk
to split the transcripts and lengths into separate arrays. We find the index of the longest length and report that transcript. We pass those transcripts to sort-bed
to report a sorted BED file containing the longest-length transcripts in sorted order:
$ awk '{ \
split($0, components, "|"); \
transcriptsStr = components[1]; \
lengthsStr = components[2]; \
max = 0; \
maxIdx = -1; \
numLengths = split(lengthsStr, lengths, ";"); \
for (idx = 1; idx <= numLengths; idx++) { \
if (lengths[idx] > max) { \
maxIdx = idx; \
max = lengths[idx]; \
} \
} \
split(transcriptsStr, transcripts, ";"); \
maxTranscript = transcripts[maxIdx]; \
print maxTranscript; \
}' perContigTranscriptsWithSizes.txt \
| sort-bed - > answer.bed
Putting this all together into one pipeline:
$ sort-bed inputs.unsorted.bed > inputs.bed
$ bedops --merge inputs.bed \
| bedmap --echo-map --echo-map-size - inputs.bed \
| awk '{ \
split($0, components, "|"); \
transcriptsStr = components[1]; \
lengthsStr = components[2]; \
max = 0; \
maxIdx = -1; \
numLengths = split(lengthsStr, lengths, ";"); \
for (idx = 1; idx <= numLengths; idx++) { \
if (lengths[idx] > max) { \
maxIdx = idx; \
max = lengths[idx]; \
} \
} \
split(transcriptsStr, transcripts, ";"); \
maxTranscript = transcripts[maxIdx]; \
print maxTranscript; \
}' - \
| sort-bed - > answer.bed
Please post an example of "wanted output".
Ok all. I have edited my post to give more information of what i wanted. Basically i have bed file that have four overlapping transcripts (see Denovo filtered track in the picture) and all i wanted is to have a consensus/overlapping contig. For this i have used mergedBed/bedops tools but both of them gave me a bed file that have a single contig without preserving intron-exon structure (See merged Track above). Am i not thinking well here? I don't know if it can be done with any tool.
Can you write up a sample BED file that hints at the format of what you need, given the example images provided? It's likely the use of
bedops
andbedmap
alone won't do what you need, but perhaps those two apps withawk
may work. Hopefully this helps clarify the request for clarification.This is what want
The
desired bed
category looks like the fourth line ofinput
.That is right in this case. Basically all i wanted is for those overlapped transcripts pick the one that is longest without merging the introns and exons.