Question

About GATK data-preprocessing workflow

0


5.9 years ago

9521ljh ▴ 50

I have FASTQ files that I want to convert to BAM files.

In the GATK pre-processing workflow, a uBAM (unmapped BAM) file is required because it carries metadata.

So I did the following:

Fastq -> BWA - mapped BAM

Fastq -> Picard - uBAM

uBAM + mapped BAM -> Picard - Merge
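The three steps above can be sketched as command lines. This is a rough Python sketch that only builds the command strings and runs nothing. The function name, file names, and read-group values are placeholders invented for illustration. The tool arguments follow the classic Picard `KEY=value` syntax; as far as I know, MergeBamAlignment expects the unmapped BAM in queryname order and a sequence dictionary next to the reference, so check the Picard documentation for your version.

```python
import shlex


def preprocessing_commands(fq1, fq2, ref, sample, picard_jar="picard.jar"):
    """Build (not run) the three commands of the FASTQ -> merged BAM workflow.

    All output file names and read-group values are hypothetical placeholders.
    """
    ubam = f"{sample}.unmapped.bam"
    aligned = f"{sample}.aligned.sam"
    merged = f"{sample}.merged.bam"
    picard = ["java", "-jar", picard_jar]

    # Step 1: FASTQ -> BWA -> mapped reads (bwa mem writes SAM to stdout)
    align = shlex.join(["bwa", "mem", ref, fq1, fq2]) + " > " + shlex.quote(aligned)

    # Step 2: FASTQ -> Picard FastqToSam -> uBAM carrying the read-group fields
    fastq_to_ubam = shlex.join(picard + [
        "FastqToSam",
        f"FASTQ={fq1}",
        f"FASTQ2={fq2}",
        f"OUTPUT={ubam}",
        f"SAMPLE_NAME={sample}",
        f"READ_GROUP_NAME={sample}_rg1",
        f"LIBRARY_NAME={sample}_lib1",
        "PLATFORM=ILLUMINA",
    ])

    # Step 3: uBAM + mapped reads -> Picard MergeBamAlignment
    merge = shlex.join(picard + [
        "MergeBamAlignment",
        f"UNMAPPED_BAM={ubam}",
        f"ALIGNED_BAM={aligned}",
        f"REFERENCE_SEQUENCE={ref}",
        f"OUTPUT={merged}",
    ])
    return [align, fastq_to_ubam, merge]


if __name__ == "__main__":
    for command in preprocessing_commands("r1.fq", "r2.fq", "ref.fa", "sampleA"):
        print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell or a workflow script.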

However, I really don't understand why this process is needed, because we can add metadata to a BAM with Picard (AddOrReplaceReadGroups) instead of using a uBAM.

I have already read this thread: https://gatkforums.broadinstitute.org/gatk/discussion/11694/why-is-converting-from-fastq-to-ubam-nesessary-before-preprocessing#latest

assembly next-gen GATK Preprocessing uBAM • 2.1k views

ADD COMMENT • link updated 5.9 years ago by benformatics 4.1k • written 5.9 years ago by 9521ljh ▴ 50

score 2 · Answer 1 · 2019-05-28

2


5.9 years ago

benformatics 4.1k

The metadata is not related to the read groups.

As the skywarrior person said in the post you linked:

BWA hardclips reads if there is a significant discordance between the best matching kmer and the read. These hardclips may end up costing you a particular structural variant or a true indel call. Merging unmapped bam and initial alignment restores the hardclips which I know of no solution for that in BWA parameters.

Thus you are not really losing metadata... you are potentially losing actual data from your original sequencing reads. This step may be unnecessary depending on the type of dataset you have (exome vs. whole genome), or if you don't care about certain structural variants and/or know that they aren't present in your dataset.
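To make the "losing actual data" point concrete: in SAM/BAM, hard-clipped bases (CIGAR op `H`) are not stored in the record's SEQ field, while soft-clipped bases (`S`) are. Here is a minimal Python sketch that counts both from a CIGAR string. The function name and the example CIGARs are made up for illustration, and this is not how Picard itself does the merge.

```python
import re

# One CIGAR operation: a length followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators whose bases are present in the SEQ field of the record.
STORED_IN_SEQ = set("MIS=X")


def clipped_bases(cigar):
    """Count bases stored in SEQ, soft-clipped, and hard-clipped for one CIGAR."""
    stored = soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in STORED_IN_SEQ:
            stored += n
        if op == "S":
            soft += n
        elif op == "H":
            hard += n
    return {"stored": stored, "soft_clipped": soft, "hard_clipped": hard}


if __name__ == "__main__":
    # A 100 bp read with 30 bases clipped at the start, shown both ways.
    print(clipped_bases("30H70M"))  # only 70 bases remain in the record
    print(clipped_bases("30S70M"))  # all 100 bases remain in the record
```

With `30H70M` only 70 of the 100 bases are left in the alignment record, which is why the original read in the uBAM is needed to get the other 30 back.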

ADD COMMENT • link 5.9 years ago by benformatics 4.1k

0


Thank you for the reply.

Could you give an example of the metadata? I just thought it was things like platform (Illumina), library, sample name...

But all of these are covered by the AddOrReplaceReadGroups options.
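For reference, those fields end up in an `@RG` header line of the SAM/BAM file (tags ID, SM, LB, PL, PU, matching the RGID, RGSM, RGLB, RGPL, RGPU options of AddOrReplaceReadGroups). A small Python sketch of what such a line looks like; the function name and all values are invented placeholders:

```python
def read_group_header(rg_id, sample, library, platform, platform_unit):
    """Format a SAM @RG header line from the usual read-group fields."""
    fields = [
        ("ID", rg_id),          # read group identifier
        ("SM", sample),         # sample name
        ("LB", library),        # library
        ("PL", platform),       # sequencing platform, e.g. ILLUMINA
        ("PU", platform_unit),  # platform unit (e.g. flowcell/lane)
    ]
    # SAM header fields are tab-separated TAG:VALUE pairs.
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


if __name__ == "__main__":
    print(read_group_header("rg1", "sampleA", "lib1", "ILLUMINA", "unit1"))
```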

ADD REPLY • link 5.9 years ago by 9521ljh ▴ 50

0


Yes, those are examples of metadata... but the issue here is that you are excluding the core of your data (i.e. the nucleotide sequence) because of an underlying aspect of the bwa software. This is completely independent of any metadata.

ADD REPLY • link 5.9 years ago by benformatics 4.1k