I would like to introduce pairedBamToBed12, a tool that we created at RIKEN some time ago to represent our paired-end transcriptome data with one single line per pair.
It is based on bedtools; technically, it is a fork with the command pairedBamToBed12 added and all the other commands removed (which means that it could be merged back if there were enough interest). Our source code is on GitHub.
There is a bamtobed tool in bedtools, but it did not fit our needs: bamtobed -split leaves the forward (Read 1) and reverse (Read 2) reads on separate BED12 lines, and the BEDPE format output by bamtobed -bedpe does not support spliced alignments.
As a brief illustration of what it does:
Read 1:   >>>>>>>>>>>>
Read 2:                     <<<<<<<<<<<<<-----<<<<<<<
The pair: >>>>>>>>>>>>------>>>>>>>>>>>>>----->>>>>>>
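The merge pictured above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual pairedBamToBed12 code: the function name `pair_to_bed12`, the coordinates, and the choice of setting thickStart/thickEnd to the whole span are assumptions made for the example. The strand is taken from Read 1, and overlapping or abutting blocks from the two reads are fused.

```python
def pair_to_bed12(chrom, name, read1_strand, read1_blocks, read2_blocks, score=0):
    """Merge the aligned blocks of both mates into one BED12 line.

    Hypothetical helper for illustration; blocks are (start, end) tuples
    in 0-based, half-open coordinates, as in BED.
    """
    merged = []
    for start, end in sorted(read1_blocks + read2_blocks):
        if merged and start <= merged[-1][1]:
            # Overlapping or abutting blocks become a single block.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    chrom_start = merged[0][0]
    chrom_end = merged[-1][1]
    sizes = ",".join(str(e - s) for s, e in merged)
    starts = ",".join(str(s - chrom_start) for s, _ in merged)
    fields = [chrom, chrom_start, chrom_end, name, score, read1_strand,
              chrom_start, chrom_end,  # thickStart/thickEnd: assumption
              "0,0,0", len(merged), sizes, starts]
    return "\t".join(str(f) for f in fields)


# Same layout as the drawing: Read 1 is one block, Read 2 is spliced.
line = pair_to_bed12("chr1", "pair1", "+",
                     [(100, 112)], [(118, 131), (136, 143)])
print(line)
```

With these made-up coordinates the pair becomes a single line with three blocks of sizes 12, 13 and 7, starting at offsets 0, 18 and 36 from position 100.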
Perhaps the best way to see in more detail what the program does is to look at our regression tests. Our main use for it is to represent our paired-end CAGE data (CAGEscan) before uploading it to our home-made genome browser, Zenbu, which can display BED12 files either as conventional intervals or as quantitative coverage plots of the whole area or of the 5' or 3' end (the 5' end being particularly relevant for CAGE).
The main limitation of our approach is that it is strongly tied to proper pairing: in particular, it cannot represent transcripts spanning multiple chromosomes, as in the case of recombinations, viral insertions, trans-splicing, etc. That said, this is not a big problem for projects that do not require exploration of de novo transcript patterns. We are currently considering supporting the optional use of one read only in case of non-proper pairing, as a compromise workaround.
pairedBamToBed12 is Free software (GPL-2, like bedtools), and I would be excited if it had more users and developers, which is why I am writing this post :)
That said, if there is a superior solution, whether already implemented or not, I would be very interested in discussing it. In particular, I wonder whether in the long term, in order to support recombinations not represented in the reference genome used for alignment, it would be necessary to give up the simplicity of having one pair per line and switch to a different format such as GFF...
-- Charles Plessy, Tsurumi, Kanagawa, Japan (working at RIKEN, see population-transcriptomics.org).
I just released version 1.1, which adds a new option to match read names that differ after a given separator (for instance, if Read 1 and Read 2 got different flags added to the name field in the FASTQ files during quality control or other processing steps).
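To make the idea concrete, here is a minimal Python sketch of matching names on the part before a separator. It is not the code of the new option: the helper name `pair_key`, the example read names, and the choice of cutting at the first occurrence of the separator are all assumptions for illustration.

```python
def pair_key(read_name, separator=None):
    """Return the part of a read name used to match the two mates.

    Hypothetical helper: with no separator the full name is compared;
    otherwise everything from the first separator onwards is ignored.
    """
    if separator is None:
        return read_name
    return read_name.split(separator, 1)[0]


# Made-up names where a processing step appended different flags.
read1_name = "frag42#flagA"
read2_name = "frag42#flagB"

same_without_separator = pair_key(read1_name) == pair_key(read2_name)
same_with_separator = pair_key(read1_name, "#") == pair_key(read2_name, "#")
print(same_without_separator, same_with_separator)
```

Without a separator the two names are treated as different; with "#" as the separator both reduce to "frag42" and the mates can be paired.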
I just released version 1.2, which adds a new experimental option to correct for "G addition". Comments are welcome, especially on better ways to solve the problem.