I am creating my own NGS pipeline (from Illumina FASTQ to VCF file). I am following the GATK best practices and the pipeline already in use in the clinical lab where I work.
I have seen that the FASTQ is mapped and converted to SAM, and that the next lines of code are:
java -Xmx4000m "$javatmp" -jar "$picardpath" SortSam \
    INPUT=/home/mdb1c20/my_onw_NGS_pipeline/files/sam/1.sam \
    OUTPUT=/home/mdb1c20/my_onw_NGS_pipeline/files/bam/1_sorted.bam \
    SORT_ORDER=coordinate \
    COMPRESSION_LEVEL=5
After this, I have seen that duplicates are marked, which, as I understand it, means keeping the best read and removing the duplicates.
My questions are:
Why are reads sorted? For efficiency? In the Picard documentation, the example for this tool takes a SAM as input and returns a sorted SAM. Does this tool do the SAM-to-BAM conversion itself? Can I convert SAM to BAM without sorting the reads?
Why are duplicate reads removed?
What does COMPRESSION_LEVEL really mean? I have seen that the higher this value is, the longer it takes, but do I lose data?
In general, is it just me, or did the Picard people not spend much time on documentation?
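On the COMPRESSION_LEVEL question, one way to check for data loss would be a round-trip test like the sketch below. It rests on an assumption: BAM files are compressed with BGZF, which is DEFLATE-based, so plain gzip at two different levels serves here as a stand-in for Picard's setting. The toy file and all file names are made up for illustration; this does not call Picard itself.

```shell
# Assumption: gzip (DEFLATE) stands in for BAM's BGZF compression.
# All file names below are hypothetical.
tmpdir=$(mktemp -d)

# Build a small tab-separated file that loosely resembles SAM records.
for i in $(seq 1 2000); do
  printf 'read%s\t0\tchr1\t%s\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' "$i" "$i"
done > "$tmpdir/toy.sam"

# Compress the same input at a low and a high compression level.
gzip -1 -c "$tmpdir/toy.sam" > "$tmpdir/level1.gz"
gzip -9 -c "$tmpdir/toy.sam" > "$tmpdir/level9.gz"

# Decompress both and compare byte for byte against the original.
if gzip -dc "$tmpdir/level1.gz" | cmp -s - "$tmpdir/toy.sam" \
   && gzip -dc "$tmpdir/level9.gz" | cmp -s - "$tmpdir/toy.sam"; then
  roundtrip="identical"
else
  roundtrip="different"
fi

echo "round trip: $roundtrip"
echo "bytes at level 1: $(wc -c < "$tmpdir/level1.gz")"
echo "bytes at level 9: $(wc -c < "$tmpdir/level9.gz")"
```

If both round trips come back identical, the level only trades CPU time against file size, at least for gzip. The same kind of check could be run on real Picard output by writing the BAM at two levels and comparing the decoded records.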