Hi there! I am trying to merge overlapping reads in a bam file to one in silico long read. An example:
chr1 1 4 ACAC
chr1 3 6 ACTT
chr1 6 9 TGGG
chr1 11 14 ACTG
The desired result is:
chr1 1 9 ACACTTGGG
chr1 11 14 ACTG
I tried bedtools merge
which gives me the start and end position of the merged reads, even the count and read identifiers of reads that overlap and were merged, but unfortunately not the actual merged read sequence. When trying to use -c 4 -o collapse -delim ""
the sequences unfortunately just get concatenated and therefore some bases are doubled, since they are not observed back to back, but overlapping.
I also read into getting consensus sequence from bam files and clipping of overlapping read pair mates, but all that is not what I want to achieve. Any help is appreciated! Thanks!
thanks rbagnall, but do you mean I should go back to the ref genome and get the sequence from there? unfortunately, that is not possible, since I actually work with the SNPs covered by the overlapping reads.
Can you get the reference genome to which your BAM file is aligned? I'm not sure how you could join together the sequence using only the example format from above.
Yes, I have the reference genome, and I know the heterozygous SNPs of it. However, those SNPs are captured by the short reads in my bam which I want to merge. Taking the start and end position of the merged reads and getting the respective sequence from the reference genome removes the information of covered SNPs in my bam.
Hi Julia, Have you had any progress in merging reads in the way you described? I have a similar problem and tried aftermerge (https://github.com/KoenHerten/aftermerge), but so far with no success.