Hello!
I have paired-end data with adapters that contain random sequences (UMIs). The problem is trimming the 3' end of the reads when the insert is shorter than the read length.
I would like to trim (or soft-clip) each read using the mapping coordinates of its mate (i.e. when Read1 End is greater than Read2 Start, trim/soft-clip the 3' end bases that extend beyond the real insert).
-------------|-->
<--|-------------
I could write code for this, but I was wondering if anyone knows of an existing tool that can trim reads by coordinate from a BAM file.
Thanks in advance,
Pau
You should be able to use bbmerge.sh, which is part of the BBMap suite, with the following option set to t
(it is false by default). You will get a merged representation of the read at the end, though.
Thanks GenoMax!
But I really need both reads clipped separately...
I think this program by @Pierre may fit the bill then: "A: Remove Soft Clipped Bases", or the second answer in the same thread.
You could also take the merged read from bbmerge and then pull the individual reads, compare, and clip them from the R1/R2 files using custom code (you will need to reverse-complement the R2 read).
I don't think my program is suitable for this task (it just removes the clipped bases).
I was going by this request in the original post.
Then the clipped BAM can be converted back to fastq after using your program?
Pierre's tool removes clipped bases... My bases would first need to be clipped (by insert size/coordinate).
Thanks for your suggestions GenoMax!
Using bbmerge as an intermediate step, and then pulling out the trimmed sequences by comparing them with the original FASTQs, could be an option, although having to read and align the data twice (merge step + aligning the sequences to the original FASTQs) might not be very efficient.
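For what it's worth, if the merged read is available, the clipping itself may not need a second alignment. A minimal sketch, assuming the insert is shorter than the read (so the merged sequence spans the whole fragment) and that R1/R2 are in their original FASTQ orientation. The function names here are hypothetical, not part of BBMap; the reverse complement is only needed if you want to compare R2 against the merged sequence.

```python
# Hypothetical helpers, not part of BBMap or any existing tool.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def clip_pair_to_merged(r1_seq, r1_qual, r2_seq, r2_qual, merged_seq):
    """Clip R1 and R2 (FASTQ orientation) to the merged fragment length.

    Assumption: when the insert is shorter than the read, the merged
    sequence covers exactly the fragment, so anything past len(merged_seq)
    in either read is adapter read-through. If the merged sequence is
    longer than the reads, slicing leaves them unchanged.
    """
    keep = len(merged_seq)
    return (r1_seq[:keep], r1_qual[:keep]), (r2_seq[:keep], r2_qual[:keep])


# Toy example: a 10 bp fragment sequenced with 12 bp reads.
fragment = "ACGTTGCAAG"
r1 = fragment + "TT"                      # 2 bases of read-through
r2 = reverse_complement(fragment) + "GG"  # 2 bases of read-through
(r1_clip, _), (r2_clip, _) = clip_pair_to_merged(r1, "I" * 12, r2, "I" * 12, fragment)
```

After clipping, R1 should match the start of the merged read and R2 the start of its reverse complement, apart from positions where the merger chose a consensus base.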
That's why I thought about using the mapping information against the reference genome... From the BAM file I could first subset the small fraction of reads that fall into the problematic scenario (by insert size). Then, if no tool is available to trim by insert size, I could iterate through this problematic subset and discard the last [read length - insert size] bases... Finally, I could merge the modified subset back into the original BAM.
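The trimming arithmetic for that last step could look roughly like this. It is only a sketch under a few assumptions: the absolute value of the BAM TLEN field reflects the true fragment length (this depends on the aligner), there are no indels or existing soft clips in the read, and sequences are given as stored in the BAM, where reverse-strand reads are already reverse-complemented, so their 3' end sits at the start of the stored string. The function names are hypothetical.

```python
def overhang_bases(read_length, template_length):
    """Number of 3' bases extending past the fragment (0 if none).

    template_length is the signed TLEN value; 0 means it is unavailable.
    """
    insert = abs(template_length)
    if insert == 0:
        return 0
    return max(0, read_length - insert)


def trim_3prime(seq, qual, n, is_reverse=False):
    """Drop n bases from the 3' end of a read as stored in a BAM record.

    Forward-strand reads lose bases from the end of the stored string;
    reverse-strand reads lose them from the start, because the BAM stores
    the reverse complement of the original read.
    """
    if n <= 0:
        return seq, qual
    if is_reverse:
        return seq[n:], qual[n:]
    return seq[:len(seq) - n], qual[:len(qual) - n]


# Toy example: 10 bp reads from a 7 bp fragment, so 3 bases to discard.
n = overhang_bases(read_length=10, template_length=-7)
fwd = trim_3prime("ACGTACGTAC", "IIIIIIIIII", n)
rev = trim_3prime("ACGTACGTAC", "IIIIIIIIII", n, is_reverse=True)
```

If the reads stay in the BAM, the alignment start and CIGAR of each modified record would also need updating (or the bases soft-clipped instead of removed), which this sketch does not cover.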