Hi everyone,
Using Trimmomatic and then HISAT2, I have aligned 300 RNA fastq samples (NovaSeq6000, RNA sequencing, paired-end, 150bp sequencing).
I have found a percentage of overlapping paired end reads (read through) in the 300 .bam files. I found the overlaps in IGV and QORTS. This is a splicing analysis. The splice junctions covered by overlapping PE reads will be counted twice by downstream programs (confirmed by developers).
clipOverlap from bamutils works to merge those reads but it throws away all discordant reads (uniquely mapped). About 200,000 a file.
Does anyone know if bbmerge can merge the overlapping paired end reads from a .bam file ? Or, alternatively recommend another tool/technique to do this?
Thanks!
Probably not directly. But you could name sort the BAM file and feed it to
reformat.sh
that can extract the read pairs and then pipe them into BBMerge.You could go back to the fastq files and merge PE using FLASH, but this means you'll have to rerun the mapping procedure - not sure if this is really an option.
Thanks @Genomax and @liorglic for the suggestions. We decided to proceed on with our alignment and use clipOverlap from bamutils for read overlap removal (post alignment) from our BAM files.
However the clipOverlap (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap) manual states :
"It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped. Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped. If the --unmapped option is specified, then rather than clipping an entire read, it will be marked as unmapped."
We understand the forward strand extending past the end of the reverse strand, but are unable to visualise what the statement in bold is specifically describing? How does the reverse strand start before the forward strand, but then clip this region prior to the forward strand - it sounds as though it is not the overlapping segment?
We also observed the third rule in action where discordant reads with no overlap were entirely clipped. Any thoughts as to why they would perform this action on reads that have no overlap? We have a minimal amount, but are interested to understand their theory to better understand the software in general.
Once again thank you for your time and happy holidays!