I have four FASTQ files from two paired-end lanes (lane 1: L1R1.fastq, L1R2.fastq; lane 2: L2R1, L2R2). I mapped them to the reference with "bwa mem" using the commands below.
./bwa mem -M -t 4 ref.fasta L1R1.fq L1R2.fq > D1.sam
./bwa mem -M -t 4 ref.fasta L2R1.fq L2R2.fq > D2.sam
Now I have two SAM files. My final goal is variant calling (SNPs, indels) and variant annotation for my non-model organism. My planned steps are:
- Convert SAM to BAM (and sort) using Picard or samtools. Which of these programs do you recommend? Is sorting needed at this step?
- Define a read group for each of the two BAM files separately (using Picard) and then merge them into one BAM file (big.bam).
- Sort big.bam. Is sorting still needed here if I already sorted the two BAM files (before merging) in step 1?
- Mark duplicates using Picard.
- Build the BAM index, then create realignment targets using GATK, and finally call variants.
Is this workflow a standard and correct way to reach my goal? Are the read group definition and the BAM merging done at the right steps? I read the workflow at https://gencore.bio.nyu.edu/variant-calling-pipeline/, but it works with a single BAM file and does not need a read group to be defined.
Additional question: D1.sam (147 G) and D2.sam (145 G) are big files, and merging them will create a very large file that I expect will be hard to handle. Can GATK and Picard handle it on a computer with 32 G of RAM?
Alternatively, you can merge the FASTQ files per read direction. Try to work with BAM files instead of SAM files; they take far less space.
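One way to avoid the large SAM files is to pipe the bwa mem output straight into samtools sort, so that only sorted BAM is written. The following is a minimal sketch under stated assumptions, not a tested pipeline: it assumes bwa and a recent samtools are on PATH, and the read group values (lane1, lane2, sample1, lib1), the thread and memory settings, and the output names D1.sorted.bam and D2.sorted.bam are all placeholders chosen for illustration.

```shell
# Build an @RG header line for "bwa mem -R".
# bwa expands the literal two-character sequence \t into a tab itself.
# The ID/sample/library values passed in are placeholders.
rg_line() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\\tPU:%s' "$1" "$2" "$3" "$1"
}

# Map one lane and write a coordinate-sorted BAM, with no intermediate SAM.
# Arguments: read-group ID, R1 FASTQ, R2 FASTQ, output BAM.
# In bash, consider "set -o pipefail" so that a bwa failure is not hidden.
map_lane() {
    bwa mem -M -t 4 -R "$(rg_line "$1" sample1 lib1)" ref.fasta "$2" "$3" \
        | samtools sort -@ 4 -m 1G -o "$4" -
}

# Usage (needs bwa, samtools, and the input files):
#   map_lane lane1 L1R1.fq L1R2.fq D1.sorted.bam
#   map_lane lane2 L2R1.fq L2R2.fq D2.sorted.bam
#   samtools merge -@ 4 big.bam D1.sorted.bam D2.sorted.bam
#   samtools index big.bam
```

With this layout each lane gets its own read group at mapping time, and samtools merge combines coordinate-sorted inputs into a coordinate-sorted output, so a separate sort of big.bam should not be needed before marking duplicates.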
I added two different read groups (RG) to the two BAM files, which correspond to one sample, and then merged them into one BAM file (using Picard). The merged BAM file (70 G) is smaller than the sum of the two original BAM files (bam1 = 48 G and bam2 = 49 G). Why? Is everything all right?
That's possible. BAM is compressed, and merging BAMs with similar sequences can lead to better compression, especially if they are sorted.
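File size alone does not show whether the merge kept everything, so a quick sanity check is to compare record counts instead. The sketch below assumes samtools is on PATH; the file names bam1.bam and bam2.bam are hypothetical stand-ins for the two per-lane files, and the helper names are made up for this example.

```shell
# Count all alignment records in a BAM file
# (this includes secondary and supplementary alignments).
count_records() {
    samtools view -c "$1"
}

# Succeed only if the merged count equals the sum of the two input counts.
# Arguments: count of first input, count of second input, count of merged file.
counts_match() {
    [ "$(( $1 + $2 ))" -eq "$3" ]
}

# Usage (needs samtools and the BAM files):
#   n1=$(count_records bam1.bam)
#   n2=$(count_records bam2.bam)
#   n=$(count_records big.bam)
#   counts_match "$n1" "$n2" "$n" && echo "merge kept every record"
#   samtools view -H big.bam | grep '^@RG'   # both read groups should be listed
```

If the counts match and both @RG lines appear in the header, the smaller size is most likely just better compression.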