Question

Merging overlapping mates in a BAM / SAM file into one read

0

Entering edit mode

4.1 years ago

lechu ▴ 20

Hi All, I need to merge every forward read of a PE sequencing experiment with its corresponding mate, whenever a pair is overlapping. I am aware of similar threads (e.g., this), but none provide a satisfactory solution. I know that there are tools to merge reads without alignment, but I would like to use alignment information to improve the fidelity of the merge. I tried aftermerge.py, but without success (errors, see below). I would be happy to hear about potential other ways to accomplish the task, or from someone who had more luck with aftermerge.

Aftermerge error 1:

[E::sam_parse1] CIGAR and query sequence are of different length 
[W::sam_read1] Parse error at line 70
[main_samview] truncated file.

Aftermerge error 2:

➜  test_map_full samtools view -h TI-2680-RTA10_S16_interleaved.fastq_slamdunk_mapped.bam | aftermerge.py - output.sam Traceback (most recent call last):   File "/Users/lecka48/scripts/aftermerge.py", line 219, in <module>
    (newseq, newqual, newmcigar) = mergeSequences(seq1, seq2, qual1, qual2, mcigar1, mcigar2, refstart, materead.pos)   File "/Users/lecka48/scripts/aftermerge.py", line 88, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   [Previous line repeated 67 more times]   File "/Users/lecka48/scripts/aftermerge.py", line 64, in mergeSequences
    if (mcigar1[0] == mcigar2[0]): IndexError: string index out of range

Finally, I'd like to highlight I am looking for a program that includes information about the alignment to reference genome (coordinates of each read) for merging. All methods described here Tools to merge overlapping paired-end reads seem to assemble the reads directly from fastq files, using the overlap between mates to do the job.

RNA-Seq • 1.3k views

ADD COMMENT • link 4.1 years ago by lechu ▴ 20

0

Entering edit mode

You should go back and check your alignments to see why there are differences in CIGAR and query. Try a different aligner out. Perhaps aftermerge requires an older format (SAM v.1.3?)

ADD REPLY • link 4.1 years ago by GenoMax 150k

0

Entering edit mode

Thanks Genomax for your help! The alignments are fine, the problem occurs with the SAM file produced by merging. First, it does not have a header, and second, the CIGARs are problematic:

> ERROR:INVALID_CIGAR   12012
> ERROR:MISMATCH_CIGAR_SEQ_LENGTH   22025
> ERROR:MISMATCH_FLAG_MATE_NEG_STRAND   1586 
> ERROR:MISSING_READ_GROUP  1
> ERROR:MISSING_SEQUENCE_DICTIONARY 50700
> WARNING:ADJACENT_INDEL_IN_CIGAR   7 
> WARNING:MISSING_TAG_NM    42768
> WARNING:RECORD_MISSING_READ_GROUP 63875

It might be that a different aligner would work, but I have to use NextGenMap for that. I may try bbmerge. The aim here is to deal with the pair overlaps to avoid counting variants twice. I can't believe this is not a more commonly used approach, and there is such a scarcity of tools.

ADD REPLY • link 4.1 years ago by lechu ▴ 20