Hi everyone,
I am attempting the GATK Mutect2 somatic mutation calling pipeline with MC38 CRC WES data, yet I do not have a "matched normal," as the WES data came from a cell line.
Thus, I was advised to use the GRCm38 (mm10) as the reference genome, use my MC38 WES data as the tumor sample, and use the C57BL/6J mouse reference genome as my "matched normal," since I have no other option.
1) Would this be considered a sound approach? Or would I get better results if I just ran Mutect2 without a matched normal?
2) Also, if the approach is fair, I was also wondering why my bwa-mem2 mem step is taking 8+ hours to run on the C57BL/6J genome, and how to speed it up. This is what I ran:
bwa-mem2 mem -t 4 -R "@RG\tID:MATHCED_NORMAL\tPL:ILLUMINA\tLB:ERR9880493" ${reference_genome} ${matched_normal}/ERR9880493_1.fastq.gz ${matched_normal}/ERR9880493_2.fastq.gz > ${aligned_reads}/ERR9880493.paired.sam
My file sizes are approximately 65 GB each. I apologize if this is a rudimentary question; I'm very new to bioinformatics. Any advice would be greatly appreciated.
Split the FASTQs into, say, 10 parts, run the mapping in parallel, pipe each job into samtools sort to produce a BAM, and merge the BAMs later.
Solid advice, thank you.
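The split, map, sort, and merge suggestion above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the chunk directory, the `part_001`-style chunk file names, the output paths, the thread counts, and the `SM` read-group tag are all assumptions that are not in the original post. The script only prints one align-and-sort command per chunk and does not run anything itself.

```shell
#!/usr/bin/env bash
# Scatter-gather sketch: print one align-and-sort command per FASTQ chunk.
# All paths, the chunk naming scheme, and the SM tag are illustrative assumptions.
set -eu

sample="ERR9880493"
reference_genome="${reference_genome:-GRCm38.fa}"   # assumed: bwa-mem2 index already built
chunk_dir="${chunk_dir:-chunks}"                     # assumed: holds the pre-split read pairs
out_dir="${out_dir:-aligned_chunks}"                 # assumed output directory
n_chunks="${n_chunks:-10}"
threads="${threads:-4}"

# Print the pipeline for chunk number $1 (1-based); nothing is executed.
chunk_cmd() {
  local part rg
  part=$(printf '%03d' "$1")
  # The literal \t is kept in the -R string; bwa-mem2 converts it to a tab.
  rg="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA\tLB:${sample}"
  printf "bwa-mem2 mem -t %s -R '%s' %s %s %s | samtools sort -@ 2 -o %s -\n" \
    "$threads" "$rg" "$reference_genome" \
    "${chunk_dir}/${sample}_1.part_${part}.fastq.gz" \
    "${chunk_dir}/${sample}_2.part_${part}.fastq.gz" \
    "${out_dir}/${sample}.part_${part}.sorted.bam"
}

# Print the full job list, one command per line.
make_jobs() {
  local i
  i=1
  while [ "$i" -le "$n_chunks" ]; do
    chunk_cmd "$i"
    i=$((i + 1))
  done
}

make_jobs

# Gather step, to run once every chunk has finished:
#   samtools merge -c -p -@ 4 ERR9880493.sorted.bam aligned_chunks/*.sorted.bam
#   samtools index ERR9880493.sorted.bam
```

The printed lines can be saved to a file and submitted as separate cluster jobs, or piped to a job runner such as GNU parallel (`make_jobs | parallel -j 3`). Running chunks side by side only helps if more cores are available than the 4 used in the original command. Piping straight into `samtools sort` also avoids writing the large intermediate SAM file, and `samtools merge -c -p` keeps a single `@RG` and `@PG` header entry when all chunks share the same IDs.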
Considering the file sizes and the fact that you are using only 4 cores, that runtime is normal.
yes