Merging overlapping mates in a BAM / SAM file into one read
0
0
Entering edit mode
3.8 years ago
lechu ▴ 20

Hi All, I need to merge every forward read of a PE sequencing experiment with its corresponding mate, whenever a pair is overlapping. I am aware of similar threads (e.g., this), but none provide a satisfactory solution. I know that there are tools to merge reads without alignment, but I would like to use alignment information to improve the fidelity of the merge. I tried aftermerge.py, but without success (errors, see below). I would be happy to hear about potential other ways to accomplish the task, or from someone who had more luck with aftermerge.

Aftermerge error 1:

[E::sam_parse1] CIGAR and query sequence are of different length 
[W::sam_read1] Parse error at line 70
[main_samview] truncated file.

Aftermerge error 2:

➜  test_map_full samtools view -h TI-2680-RTA10_S16_interleaved.fastq_slamdunk_mapped.bam | aftermerge.py - output.sam Traceback (most recent call last):   File "/Users/lecka48/scripts/aftermerge.py", line 219, in <module>
    (newseq, newqual, newmcigar) = mergeSequences(seq1, seq2, qual1, qual2, mcigar1, mcigar2, refstart, materead.pos)   File "/Users/lecka48/scripts/aftermerge.py", line 88, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   File "/Users/lecka48/scripts/aftermerge.py", line 66, in mergeSequences
    (newseq, newqual, newmcigar) = mergeSequences(seq1[1:], seq2[1:], qual1[1:], qual2[1:], mcigar1[1:], mcigar2[1:], start1+1, start2+1)   [Previous line repeated 67 more times]   File "/Users/lecka48/scripts/aftermerge.py", line 64, in mergeSequences
    if (mcigar1[0] == mcigar2[0]): IndexError: string index out of range

Finally, I'd like to highlight I am looking for a program that includes information about the alignment to reference genome (coordinates of each read) for merging. All methods described here Tools to merge overlapping paired-end reads seem to assemble the reads directly from fastq files, using the overlap between mates to do the job.

RNA-Seq • 1.2k views
ADD COMMENT
0
Entering edit mode

You should go back and check your alignments to see why there are differences in CIGAR and query. Try a different aligner out. Perhaps aftermerge requires an older format (SAM v.1.3?)

ADD REPLY
0
Entering edit mode

Thanks Genomax for your help! The alignments are fine, the problem occurs with the SAM file produced by merging. First, it does not have a header, and second, the CIGARs are problematic:

> ERROR:INVALID_CIGAR   12012
> ERROR:MISMATCH_CIGAR_SEQ_LENGTH   22025
> ERROR:MISMATCH_FLAG_MATE_NEG_STRAND   1586 
> ERROR:MISSING_READ_GROUP  1
> ERROR:MISSING_SEQUENCE_DICTIONARY 50700
> WARNING:ADJACENT_INDEL_IN_CIGAR   7 
> WARNING:MISSING_TAG_NM    42768
> WARNING:RECORD_MISSING_READ_GROUP 63875

It might be that a different aligner would work, but I have to use NextGenMap for that. I may try bbmerge. The aim here is to deal with the pair overlaps to avoid counting variants twice. I can't believe this is not a more commonly used approach, and there is such a scarcity of tools.

ADD REPLY

Login before adding your answer.

Traffic: 1869 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6