Question

When to merge BAM or SAM files?

2.7 years ago

Thanh • 0

Hi, I'm new to bioinformatics and am working on raw WES data from NEPC tumor organoids and patient-derived xenografts. I have aligned the reads against the human genome, version 38. I am following the GDC workflow (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/) to call variants. There is a step where the BAM files are merged. What is the rationale behind merging the BAM files?

WES MergeSamFiles picard • 1.5k views

ADD COMMENT • link updated 2.7 years ago by Prash ▴ 280 • written 2.7 years ago by Thanh • 0

Thanh, it is likely that the BAM files need to be merged when they are sorted. This is done automatically when the merge/sort options are set, after which all files with the suffixes .0001.bam, .0002.bam, etc. are combined into one BAM file.

I also see no point in using read groups here unless you are doing WTSS.

ADD REPLY • link 2.7 years ago by Prash ▴ 280

Ram · Answer 1 · 2022-03-25

According to the link you provided,

Each read group is aligned to the reference genome separately and all read group alignments that belong to a single aliquot are merged.

It seems that each sample/aliquot was sequenced several times (e.g. different libraries, lanes, runs) and, as such, each batch of reads is assigned a different read group that is mapped separately. However, since all of these read groups come from the same aliquot, you want to merge them so you can use all the data in your downstream analyses.
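As a rough illustration of that merge step, the sketch below builds (but does not run) a Picard MergeSamFiles command for several per-read-group BAMs from one aliquot. The jar path and file names are hypothetical, and the options shown are a minimal assumed set, not the exact GDC invocation. It also assumes each input BAM already carries its own @RG header line, so the read group of every read is still identifiable after merging.

```python
from typing import List


def build_merge_command(read_group_bams: List[str], merged_bam: str,
                        picard_jar: str = "picard.jar") -> List[str]:
    """Return an argv list for Picard MergeSamFiles (KEY=value syntax).

    picard_jar and all file names are placeholders for illustration.
    """
    if not read_group_bams:
        raise ValueError("need at least one input BAM")
    cmd = ["java", "-jar", picard_jar, "MergeSamFiles"]
    # One I= argument per read-group BAM belonging to the same aliquot.
    cmd += [f"I={bam}" for bam in read_group_bams]
    # Write a single coordinate-sorted, indexed BAM for downstream steps.
    cmd += [f"O={merged_bam}", "SORT_ORDER=coordinate", "CREATE_INDEX=true"]
    return cmd


cmd = build_merge_command(
    ["aliquot1.rg1.bam", "aliquot1.rg2.bam"],  # hypothetical inputs
    "aliquot1.merged.bam",                     # hypothetical output
)
print(" ".join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where Java and Picard are installed.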

score 0 · Answer 2 · 2022-03-25

2.7 years ago

Diana G. ▴ 30

For some analyses you need a certain minimum number of reads in order to see what you are looking for. This is why the same sample is sequenced several times: so you can reach that number. The data can be merged at the FASTQ stage or later in your analysis pipeline. In this case, as you say, it is done at the BAM/SAM stage.
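A back-of-the-envelope sketch of why several runs get pooled: mean depth is roughly the total number of sequenced bases divided by the target size. The read counts, read length, and 35 Mb exome target below are made-up numbers, and the formula ignores duplicates, off-target reads, and unmapped reads.

```python
def mean_coverage(read_count: int, read_length: int, target_size: int) -> float:
    """Approximate mean depth: sequenced bases / target bases."""
    return read_count * read_length / target_size


# Hypothetical numbers, for illustration only.
runs = [10_000_000, 8_000_000, 5_000_000]  # reads per sequencing run
READ_LENGTH = 100                          # bases per read
TARGET = 35_000_000                        # exome target size in bases

per_run = [mean_coverage(n, READ_LENGTH, TARGET) for n in runs]
merged = mean_coverage(sum(runs), READ_LENGTH, TARGET)

for i, depth in enumerate(per_run, start=1):
    print(f"run {i}: ~{depth:.1f}x")
print(f"merged: ~{merged:.1f}x")
```

No single run reaches the depth of the merged data; pooling the runs simply adds their depths together.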

ADD COMMENT • link 2.7 years ago by Diana G. ▴ 30

A read group corresponds to some physical pool of reads. On older sequencing machines, WGS data (and even WXS data) required combining many read groups to reach the desired coverage. Newer platforms normally produce much larger read groups, so this is often unnecessary, or far fewer read groups are needed.

Another complication is that some sequencing centers are contracted to deliver a certain coverage in order to get paid. To avoid "wasting" reagents and sequencing capacity, they might intentionally make smaller "top-off" read groups to satisfy the total coverage requirement.

ADD REPLY • link 2.7 years ago by Zhenyu Zhang ★ 1.2k

This answers a different question ("What are read groups and when does merging them make sense?"). It does not relate to BAM files directly. I think this post belongs as a comment on Diana's answer so I will move it there. Please let me know if you can add some more context and make it a standalone answer.

ADD REPLY • link 2.7 years ago by Ram 44k