Question

Does BWA mem ever discard or modify data

0

Entering edit mode

7.1 years ago

e.benn ▴ 110

I could not find a definite answer to this question. Does BWA mem modify the data it receives? I understand that it adds data (alignment of course, and more), but am wondering if the data in the fastq is guaranteed to survive.

The program takes in read names, a sequence string, and a quality string. Does it ever modify any of those?

The reason I ask is I am looking to compress our fastq data by running BWA, discarding as many unneeded sam columns and tags, and compressing to CRAM. I have run various tests, and it seems to be working fine, but wanted to know for sure.

[ed: clarify question and goals]

bwa • 4.2k views

ADD COMMENT • link 7.1 years ago by e.benn ▴ 110

2

Entering edit mode

It is not the right question - BWA produces a SAM format - so the question should be "Is the sequence information in a SAM format identical to the FASTQ sequence" - to which the answer is NO. The SAM format represents information on the forward strand, hence sequences mapping on the reverse strand will be reverse complemented in the SAM format.

Perhaps the question is whether the FASTQ information could always be reconstructed from the SAM file - to which is the answer is again that not always. If the alignment is reported hard clipped then the clipped sequence will not be in the SAM file.

ADD REPLY • link 7.1 years ago by Istvan Albert 102k

1

Entering edit mode

running BWA, discarding all but the name, sequence and quality

So why don't you archive fastq in the first place? I don't see the reason to align and then discard the work the aligner has done. If you really want to store unaligned CRAM instead of fastq then you can convert fastq to unaligned SAM with e.g. picard/FastqToSam.

ADD REPLY • link 7.1 years ago by dariober 15k

0

Entering edit mode

discarding all but the name, sequence and quality, and compressing with CRAM

In fact, he can't do what he wants, because CRAM is an alignment format, and the compression is based on the alignment to the reference genome. If he discards everything but the name, sequence and quality, he can't save as CRAM.

ADD REPLY • link 7.1 years ago by h.mon 35k

0

Entering edit mode

Yes very good, I also need to keep the alignment, and the 'which pair-end' metadata, and probably some more I have forgotten.

ADD REPLY • link 7.1 years ago by e.benn ▴ 110

5

Entering edit mode

7.1 years ago

andrew.j.skelton73 6.6k

What you might be looking for is uBAM - This is used in GATK's production pipeline to attach metadata to samples as early as possible, however the Broad's scale of data that they're analysing is far more than most places. The biggest caveat to this approach is outlined by Brian Bushnell (Author of bbmap), on that thread, in that gzipped fastq is much more practical in terms of (de)compression efficiency and resulting storage footprint, not to mention that by forcing your reads to the SAM spec, you may be losing original read names, and that can cause problems with tools that assume the original illumina read names (ie, paired end data could be parsed as single end).

I was thinking along the same lines as you a few years ago, and opted against uBAM and stuck with gzipped fastq, however saying that, I recall a thread about loseless fastq compression here. The tool (Alapy) that @Petr Ponomarenko outlined is available here, and there's a fantastic discussion on the thread of benchmarking and feature requests, worth a read.

Additional methodologies out there are binary fastq as @chris.bird mentioned:

@John's uQ tool
@Brian Bushnell's Clumpify tool (from bbmap)

Overall, my advice is to avoid uBAM, and check out lossless compression methods like alapy, uQ, or Clumpify, and see what best fits your needs, as there will be tradeoffs in terms of memory usage, and time to execute per file.

ADD COMMENT • link 7.1 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

Thanks Andrew for that detailed answer, I will have a close look at all this

ADD REPLY • link 7.1 years ago by e.benn ▴ 110

0

Entering edit mode

No matter what compressed storage format you use, the majority of the entropy in sequence data is actually in base quality scores, not the actual sequences. If you want 50%+ improvement over fq.gz then you'll be forced to discard information - base quality score binning will make a bigger difference to your storage footprint that your choice between fq.gz, bam and cram. Whether this level of loss is acceptable depend on your requirements.

ADD REPLY • link 7.1 years ago by d-cameron ★ 3.0k

0

Entering edit mode

7.1 years ago

chris.bird ▴ 10

it can clip bases, you'll want to relax the defaults

reads could also be filtered if they don't map to the reference genome

given that a bam file has more info than a fastq file, this probably won't work

search on binary fastq for a better solution

ADD COMMENT • link 7.1 years ago by chris.bird ▴ 10

0

Entering edit mode

it can clip bases, you'll want to relax the defaults

The bases are soft clipped so the information is still there. Only supplementary alignments are hard clipped but the information is in the primary alignment anyway.

reads could also be filtered if they don't map to the reference genome

Unmapped reads are retained in the bwa output. They just have the read unmapped SAM flag set.

given that a bam file has more info than a fastq file, this probably won't work

bwa alignment does not remove any data from the input fastq. Even the record ordering is retain in the bwa output.

ADD REPLY • link 7.1 years ago by d-cameron ★ 3.0k

score 3 · Accepted Answer · 2018-06-14

I believe you can keep all information from a fastq file if you use bwa mem with default parameters. The resulting sam will contain the unmapped reads, and only one copy of the mapped reads will contain the "primary" flag - additional mappings will be marked as "secondary". By default, bwa does soft clipping of primary reads, so all sequence and quality is kept. And even if original fastq read names are edited - removing /1 and /2, or whatever other convention used - the reads will contain the "first in pair" and "second in pair" flags, so read names can be recovered in full.

A great tool to check my claim is BamHash.

edit: this is not to say storing reads as CRAM files is a good idea, but some people think for long-term storage indeed it is. See CRAM vs BAM for references, and for some benchmarks:

https://www.uppmax.uu.se/support/user-guides/using-cram-to-compress-bam-files/

http://www.htslib.org/benchmarks/CRAM.html