Question

In the NGS pipeline, why are reads sorted before marking duplicates?

3.7 years ago

ManuelDB ▴ 110

I am creating my own NGS pipeline (from Illumina FASTQ to VCF file). I am following the GATK best practices and the pipeline already in use in the clinical lab where I work.

I have seen that the FASTQ is converted to SAM (mapping included), and then the next lines of code are:

java -Xmx4000m "$javatmp" -jar "$picardpath" SortSam \
    INPUT=/home/mdb1c20/my_onw_NGS_pipeline/files/sam/1.sam \
    OUTPUT=/home/mdb1c20/my_onw_NGS_pipeline/files/bam/1_sorted.bam \
    SORT_ORDER=coordinate \
    COMPRESSION_LEVEL=5

After this, duplicates are marked, which I understand to mean keeping the best read and removing the duplicates.

My questions are:

Why are reads sorted? Efficiency? In the example given in the Picard documentation, this tool takes a SAM as input and returns a sorted SAM. Does this tool do the conversion itself? Can I convert SAM to BAM without sorting the reads?
Why are duplicate reads removed?
What does COMPRESSION_LEVEL really mean? I have seen that the higher this value is, the longer it takes, but do I lose data?

In general, is it just me, or did the Picard people not spend much time on documentation?

NGS • 944 views

updated 3.7 years ago by Carlo Yague 9.0k • written 3.7 years ago by ManuelDB ▴ 110

score 2 · Accepted Answer · 2021-11-14

SAM can be converted to BAM without sorting (for instance with the command samtools view -b input.sam -o output.bam). However, a coordinate-sorted BAM is normally required to mark duplicates (Picard MarkDuplicates also accepts queryname-sorted input), because it is much more efficient to identify the duplicates if they are next to each other (a consequence of the sorting, since duplicates have the same coordinates by definition).
Duplicates can be optical, biological, or PCR duplicates. Optical and PCR duplicates (see Duplicates on Illumina) are artifacts rather than independent observations and can cause spurious variant calls, so they are usually filtered out for that application.
Compression level (from 0 to 9) sets how strongly the data in the BAM file are compressed relative to the SAM file. A higher level gives a smaller BAM file, but it takes longer to compress. The compression is lossless, so there is absolutely no data loss.
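
To make the first and third points concrete, here is a small Python sketch. It is not how Picard is implemented. The reads are made-up tuples, and the duplicate key (chromosome, 5' position, strand) is a simplification of the real criteria, which also take the mate and the library into account. Python's zlib is used as a stand-in for the deflate-based compression inside BAM, and the function names are hypothetical.

```python
import zlib
from itertools import groupby

# Toy alignments: (read name, chromosome, 5' position, strand). All values are made up.
reads = [
    ("r1", "chr1", 500, "+"),
    ("r2", "chr2", 120, "-"),
    ("r3", "chr1", 100, "+"),
    ("r4", "chr1", 500, "+"),  # same chromosome, position and strand as r1
    ("r5", "chr1", 100, "-"),
    ("r6", "chr1", 500, "+"),  # same as r1 and r4
]


def duplicate_key(read):
    """Fields two reads must share to count as duplicates in this toy model."""
    _, chrom, pos, strand = read
    return (chrom, pos, strand)


def find_duplicate_sets(reads):
    """Sort by coordinate, then collect runs of adjacent reads sharing a key."""
    ordered = sorted(reads, key=duplicate_key)
    sets = []
    # After sorting, duplicates are neighbours, so one linear pass finds them.
    for _, run in groupby(ordered, key=duplicate_key):
        names = [name for name, *_ in run]
        if len(names) > 1:
            sets.append(names)
    return sets


def compression_round_trip(data, levels=(1, 5, 9)):
    """Compress at several levels and confirm decompression restores the input."""
    sizes = {}
    for level in levels:
        packed = zlib.compress(data, level)
        assert zlib.decompress(packed) == data  # lossless at every level
        sizes[level] = len(packed)
    return sizes


duplicate_sets = find_duplicate_sets(reads)
print(duplicate_sets)  # [['r1', 'r4', 'r6']]

# Repetitive SAM-like text, only to have something to compress.
sam_like = b"".join(b"read%d\tchr1\t%d\t60\t100M\n" % (i, 100 + i) for i in range(2000))
sizes = compression_round_trip(sam_like)
print(len(sam_like), sizes)
```

Without the sort, finding r1, r4 and r6 would mean comparing every read against every other read or keeping all keys in memory. With the sort, the duplicates are consecutive. The round-trip check inside compression_round_trip shows that the level only trades time for size: the decompressed bytes are identical at every level.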