bwa and bowtie2 bamfile format
8.1 years ago
nkinney06 ▴ 140

I'm using a couple programs to look at repetitive DNA in mapped reads (bamfiles): genotan and repeatseq. Both programs have publications and are designed to work on bamfiles.

The problem is that I'm getting segmentation faults from both programs on some of the bamfiles I would like to analyze. It's hard to tell for sure why some bamfiles run successfully while others seg fault, but there does appear to be a much greater tendency to seg fault on bamfiles created with bwa. I'm wondering whether there are any formatting differences between bwa and bowtie2 mapped bamfiles, and whether there is a way to repair the files that fail without remapping. Perhaps it's a whitespace or special character issue. Debugging the programs myself is probably infeasible, so I'm looking for any other solution. Thanks

bam bowtie2 bwa • 2.9k views

Try running your script with more memory. Segmentation faults can be caused by insufficient memory.

8.1 years ago
d-cameron ★ 2.9k

The SAM file format specification (which also defines the binary BAM equivalent) is quite flexible, so it is easy to write a SAM/BAM file that is valid according to the specification but that the program processing it treats as invalid input. Some examples include:

  • Writing multiple records for the same read (e.g. bwa does split-read alignment by default; if the downstream program expects one record per read, this could show up as a crash only when both split alignments fall in the same repeat).

  • Reads that align before the start or past the end of a chromosome.

  • Different interpretations of SAM flags (e.g. bwa sets the 0x2 flag, "each segment properly aligned according to the aligner", for paired reads that align to the same chromosome in the correct orientation, regardless of how far apart they align).

  • bwa hard clips split reads, whereas bowtie2 does not use the hard-clipping CIGAR operator.

  • The additional SAM tags written by bwa and bowtie2 are different.

In short, just because something is a valid SAM/BAM file, doesn't mean the downstream tools know what to do with it. My guess is that your programs are crashing on a particular input edge case that they weren't designed to handle.
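
As a rough illustration of how some of the cases above could be stripped out before handing a file to a fragile tool, here is a minimal Python sketch that works on SAM text. It is only a sketch under assumptions: the function names and the script name `filter_sam.py` are made up for this example, it drops secondary (0x100) and supplementary (0x800) records plus anything with a hard clip in the CIGAR, and there is no guarantee that these are the records your programs are choking on.

```python
import sys

# FLAG bits as defined in the SAM specification.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_record(line, drop_hard_clipped=True):
    """Return True if a SAM text line should be kept (hypothetical helper)."""
    if line.startswith("@"):
        return True  # header lines pass through untouched
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2 is FLAG
    cigar = fields[5]       # column 6 is CIGAR
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False  # extra record for a read that also has a primary line
    if drop_hard_clipped and "H" in cigar:
        return False  # hard-clipped alignment
    return True


def filter_sam(lines, drop_hard_clipped=True):
    """Return only the lines that keep_record accepts."""
    return [ln for ln in lines if keep_record(ln, drop_hard_clipped)]


# Assumed command-line use: python filter_sam.py in.sam filtered.sam
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        for sam_line in src:
            if keep_record(sam_line):
                dst.write(sam_line)
```

One way to use it would be to convert to SAM text first with `samtools view -h in.bam > in.sam`, run the script, and convert back with `samtools view -b -o filtered.bam filtered.sam`. If the crashes disappear on the filtered file, that at least narrows down which kind of record is responsible.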


This is potentially a very helpful lead. Do you happen to know of any filtering program? I'd like to simply remove some of the cases you mention from my bamfiles and see if I have better luck. I may try some of the available perl or python bam readers to quickly write something from scratch, but I'm not sure these tools will be sufficient, and writing a filter in C++ could be quite an undertaking. Thanks


Have you eliminated the other possible explanations (malformed BAMs or memory limits)? Filtering edge cases will not solve either of those problems.


I don't think it's a memory limitation, and I would have trouble adjusting this because the C source code is rather challenging. I could look into bam validation with the tool you mentioned, but it seems like the programs may be running until they "hit" one of the problem reads that cause a seg fault. It may not be too difficult to write a perl script to filter out a couple of the special cases mentioned here. There are also cases of bams where one program works but the other fails. It's a little frustrating since these programs are published, but at least I have some directions now.
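
If the programs really do run until they hit a single problem read, one way to find that read is to bisect on the number of records fed to the tool. The sketch below is only an illustration under assumptions: `first_failing_prefix` and `make_crash_test` are hypothetical names, `command` stands in for whatever wrapper runs the real tool on a file path (it may need to convert the temporary SAM to an indexed BAM first), and it assumes that once a prefix of the file crashes the tool, every longer prefix crashes too.

```python
import os
import subprocess
import tempfile


def first_failing_prefix(n_records, crashes):
    """Return the smallest k in 1..n_records with crashes(k) True, else None.

    crashes(k) must say whether the tool fails on the first k records.
    Assumes that if a prefix crashes, every longer prefix crashes as well.
    """
    if n_records == 0 or not crashes(n_records):
        return None
    lo, hi = 1, n_records  # invariant: crashes(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid        # the culprit is at or before record mid
        else:
            lo = mid + 1    # the culprit is after record mid
    return lo


def make_crash_test(header_lines, record_lines, command):
    """Build a crashes(k) function that runs `command` on the first k records.

    `command` is an assumed argument list; the temporary SAM path is appended.
    """
    def crashes(k):
        with tempfile.NamedTemporaryFile("w", suffix=".sam", delete=False) as tmp:
            tmp.writelines(header_lines)
            tmp.writelines(record_lines[:k])
            path = tmp.name
        try:
            result = subprocess.run(
                command + [path],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            # On POSIX a negative return code means the process was killed
            # by a signal (SIGSEGV for a segmentation fault).
            return result.returncode < 0
        finally:
            os.unlink(path)
    return crashes
```

The record at position `first_failing_prefix(len(records), crashes)` (counting from 1) is then the first one whose presence makes the tool fall over, and looking at its flag, CIGAR and position should show which edge case it is. If the crash depends on a combination of reads rather than one record, the monotonic assumption breaks and the result is only a hint.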

8.1 years ago

Probably a memory issue per @Ron, but you can run bamUtil's 'validate' or Picard's ValidateSamFile to assess your BAMs.
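
Before running a full validator, a very cheap check for the most common kind of malformed BAM (a truncated file) is to look for the BAM magic bytes at the start and the 28-byte BGZF end-of-file block at the end, both of which are defined in the SAM/BAM specification. The sketch below is a minimal Python version of that idea; `quick_check` is a hypothetical name, and passing it says nothing about whether the records in between are sane.

```python
import gzip
import os
import zlib

# The empty BGZF block that terminates every complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def quick_check(path):
    """Return a list of problems found in a BAM file (empty if none found)."""
    problems = []
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() < len(BGZF_EOF):
            return ["file is too short to be a BAM"]
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        if fh.read() != BGZF_EOF:
            problems.append("missing BGZF EOF marker (file may be truncated)")
    try:
        # BGZF is a series of gzip members, so gzip can read the first bytes.
        with gzip.open(path, "rb") as gz:
            if gz.read(4) != b"BAM\x01":
                problems.append("decompressed data does not start with BAM magic")
    except (OSError, EOFError, zlib.error):
        problems.append("start of file is not readable as gzip")
    return problems
```

An empty list only means the file looks complete; the validators mentioned above go much further and check the individual records.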
