Hi,
I'm new to bioinformatics and am working on raw WES data from NEPC tumor organoids and patient-derived xenografts. I have aligned the reads against the human reference genome build 38 (GRCh38) and am following the GDC workflow:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
to call variants. There is a step where the bam files are merged. I was wondering what the rationale behind merging the BAM files is.
Thanh, it is likely that the BAM files need to be merged once they are sorted. This is done automatically when the merge/sort options are enabled, after which all files with the suffixes .0001.bam, .0002.bam, etc. are combined into one BAM file.
And I see no point in read groups here unless you are doing WTSS.
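To make the chunk naming concrete, here is a minimal Python sketch of how such numbered pieces could be gathered before merging. The `.0001.bam` pattern is taken from the post above; the helper name `group_chunks` and the file names are made up for illustration, and this is not what any particular tool does internally.

```python
import re
from collections import defaultdict

# Matches names such as "sample.0001.bam" (pattern taken from the post above).
CHUNK_RE = re.compile(r"^(?P<prefix>.+)\.(?P<index>\d{4})\.bam$")


def group_chunks(filenames):
    """Group numbered BAM chunks by prefix, ordered by chunk index.

    Hypothetical helper: files that do not carry a 4-digit chunk suffix
    are ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = CHUNK_RE.match(name)
        if match:
            index = int(match.group("index"))
            groups[match.group("prefix")].append((index, name))
    # Sort each group by chunk index and keep only the file names.
    return {
        prefix: [name for _, name in sorted(chunks)]
        for prefix, chunks in groups.items()
    }


# Made-up file names for illustration only.
files = ["tumor.0002.bam", "tumor.0001.bam", "normal.0001.bam", "tumor.bam"]
grouped = group_chunks(files)
for prefix, chunks in sorted(grouped.items()):
    print(prefix, "<-", chunks)
```

Each list in `grouped` is the set of pieces that would end up in a single BAM file.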
Each read group is aligned to the reference genome separately and all read group alignments that belong to a single aliquot are merged.
It seems that each sample/aliquot was sequenced several times (e.g. different libraries, lanes, runs), and each resulting batch of reads is assigned its own read group and mapped separately. Since all of these read groups come from the same aliquot, you merge them so that you can use all the data in your downstream analyses.
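As a rough sketch of that per-aliquot merge, the Python below groups per-read-group BAM files by aliquot and builds one `samtools merge` command for each aliquot that has more than one file. It assumes `samtools` as the merging tool (the GDC pipeline may well use something else) and that the inputs are already coordinate-sorted. The aliquot IDs, file names and the `merge_commands` helper are hypothetical.

```python
from collections import defaultdict


def merge_commands(read_group_bams):
    """Return one `samtools merge` argument list per aliquot.

    read_group_bams: iterable of (aliquot_id, bam_path) pairs, one pair
    per read-group alignment. Aliquots with a single BAM need no merge.
    """
    by_aliquot = defaultdict(list)
    for aliquot, bam in read_group_bams:
        by_aliquot[aliquot].append(bam)

    commands = []
    for aliquot, bams in sorted(by_aliquot.items()):
        if len(bams) > 1:
            # Form used: samtools merge <out.bam> <in1.bam> <in2.bam> ...
            commands.append(
                ["samtools", "merge", f"{aliquot}.merged.bam", *sorted(bams)]
            )
    return commands


# Made-up aliquots and file names for illustration only.
inputs = [
    ("organoid_A", "organoid_A.lane1.bam"),
    ("organoid_A", "organoid_A.lane2.bam"),
    ("pdx_B", "pdx_B.lane1.bam"),
]
commands = merge_commands(inputs)
for cmd in commands:
    print(" ".join(cmd))
```

The commands are only built here, not executed; with samtools installed, each list could be passed to `subprocess.run(cmd, check=True)`.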
written 2.7 years ago by FGV, updated 2.7 years ago by Ram
For some analyses you need a minimum number of reads in order to see what you are looking for. This is why the same sample is sequenced several times: so that you can reach that number. The files can be merged at the FASTQ stage or later in the analysis pipeline; in this case, as you say, as BAM/SAM files.
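A back-of-the-envelope version of that reasoning, as a Python sketch: mean coverage is approximated as reads × read length × on-target fraction ÷ target size. All numbers below (reads per run, read length, exome size, on-target fraction, desired coverage) are invented for illustration and are not taken from the question or from GDC.

```python
import math


def mean_coverage(n_reads, read_length, target_size_bp, on_target_fraction=1.0):
    """Approximate mean depth over the target region."""
    return n_reads * read_length * on_target_fraction / target_size_bp


def runs_needed(target_coverage, coverage_per_run):
    """Number of equally sized runs needed to reach the target coverage."""
    return math.ceil(target_coverage / coverage_per_run)


# Illustrative assumptions only: 20 M reads of 100 bp per run,
# a 50 Mb exome target, and 60% of bases landing on target.
per_run = mean_coverage(
    n_reads=20_000_000,
    read_length=100,
    target_size_bp=50_000_000,
    on_target_fraction=0.6,
)
runs = runs_needed(target_coverage=100, coverage_per_run=per_run)
print(f"coverage per run: about {per_run:.1f}x, runs needed for 100x: {runs}")
```

Under these assumptions a single run falls well short of the target depth, which is the situation where several runs of the same sample end up being merged.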
A read group corresponds to some physical pool of reads. On older sequencing machines, WGS data (and even WXS) required combining many read groups to reach the desired coverage. Newer platforms normally produce much larger read groups, so this is often unnecessary, or far fewer read groups are needed.
Another complication is that some sequencing centers are contracted to provide a certain coverage to get paid. In order not to "waste" their reagents and sequencing capacity, they might intentionally make smaller "top-off" read groups to satisfy the total coverage requirements.
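For reference, read groups are recorded in a SAM/BAM header as `@RG` lines, and read groups from the same sample share the same `SM` value. The Python sketch below just builds such lines to show the layout. The tag names (`ID`, `SM`, `LB`, `PU`, `PL`) come from the SAM format; the IDs, sample, library and flowcell names are made up, and `rg_header_line` is a hypothetical helper.

```python
def rg_header_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build a tab-separated @RG header line from a few common tags."""
    fields = [
        ("ID", rg_id),          # unique read group identifier
        ("SM", sample),         # sample name, shared across its read groups
        ("LB", library),        # library
        ("PU", platform_unit),  # platform unit, e.g. flowcell and lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Made-up values: one large read group plus a smaller "top-off" one.
lines = [
    rg_header_line("A.L1", "organoid_A", "lib1", "FLOWCELL1.1"),
    rg_header_line("A.L2", "organoid_A", "lib1", "FLOWCELL2.3"),
]
samples = {line.split("\tSM:")[1].split("\t")[0] for line in lines}
for line in lines:
    print(line)
```

Both lines carry the same `SM` value, which is what lets downstream tools treat the merged reads as one sample while still telling the read groups apart.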
This answers a different question ("What are read groups and when does merging them make sense?"). It does not relate to BAM files directly. I think this post belongs as a comment on Diana's answer so I will move it there. Please let me know if you can add some more context and make it a standalone answer.