Hello all,
First I'm sorry if this is a question for which there is already an answer but I haven't really been able to come up with a clear answer.
To align my reads I use a grch38 reference which contains lots of chrUn or decoy type contig.
After alignment I would like to remove these reads from the bam but also remove any reference to these contig in the header (because it seems that just having these contig in the header slows down some gatk tools). I had thought of ReorderSam but the problem is that it will sort the bam by coordinates, but for the rest, and in particular markduplicatespark the bam must be sorted by queryname so I would like to avoid it being sorted by coordinates to then have to re-sort it by queryname ...
Thank you in advance for any suggestions.
Quentin
this is usually a very bad idea. You should always keep the very same reference along all your workflow starting with the mapping of the reads.
to remove some contigs, just use
samtools view -L bed_of_contigs_you_want_to_keep.bed -O BAM -o out.bam in.bam
Thank you for the answer.
This is actually what I thought (I had used samtools view to remove unnecessary reads that mapped on these contigs but without touching the header) but GATK said "Yes, with more contigs, this can certainly slow down the job. I would recommend if possible removing those lines in the header. "
I don't really understand how this can slow down knowing that there is no longer any data on these contig (not only have I removed the reads, but in addition the variant calling is done by interval ...).
For info the tools that are very slow are GenomicsDBImport (run in more than 2 days for 6 gvcf on a sbatch jobs with 5 threads and 8Go of mem) and GenotypeGVCFs.
I will retry GenomicsDBImport with output database writing on the /tmp in the compute node instead of the /shared storage volume maybe it will be sufficient to speed up...
split your jobs by region/chromosome
Putting the output on the /tmp disk of the compute node instead of the /shared of the storage changes everything ... for both tools I went from 6 days to 15 hours. but for the next ones I keep the option to do by interval