Hi, I have two approaches which should give me same result but gave me different Method 1: Using BWA, I aligned my fastq with concatenated Plasmodium and human genome, then I used samtools to sort plasmodium and human reads in different files using bed tool.
Run BWA MEM Method 1
bwa mem -t 20 -M -R "@RG\tID:${prefix}\tLB:${prefix}\tPL:ILLUMINA\tSM:${prefix}\tPU:${prefix}" ${REF_DIR}/${REFERENCE} ${forward} ${reverse} > ${OUTDIR}/${prefix}_aligned.sam
# Convert SAM to BAM using samtools
samtools view -bS ${OUTDIR}/${prefix}_aligned.sam > ${OUTDIR}/${prefix}_aligned.bam
# Optionally remove SAM file
rm ${OUTDIR}/${prefix}_aligned.sam
# GATK CleanSam (Remove duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" CleanSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}_aligned.bam -O ${OUTDIR}/${prefix}.clean.bam
# GATK SortSam (Sort BAM file by coordinate)
gatk --java-options "-Xmx40g -Xms40g" SortSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.clean.bam -O ${OUTDIR}/${prefix}.sorted.bam -SO coordinate --CREATE_INDEX true
# GATK MarkDuplicates (Mark duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" MarkDuplicates -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.sorted.bam -O ${OUTDIR}/${prefix}.sorted.dup.bam -M ${OUTDIR}/${prefix}_dup_metrics.tsv
# Use samtools to filter by bed files (*Plasmodium* and Human)
samtools view -b -h ${OUTDIR}/${prefix}.sorted.dup.bam -T ${REF_DIR}/Pf3D7.fasta -L ${REF_DIR}/Pf3D7_core.bed > ${OUTDIR}/${prefix}.sorted.dup.pf.bam
samtools view -b -h ${OUTDIR}/${prefix}.sorted.dup.bam -T ${REF_DIR}/genome.fa -L ${REF_DIR}/human.bed > ${OUTDIR}/${prefix}.sorted.dup.hs.bam
My sample is blood sample (containing largely human genome with smaller parasite genome). I got 60M of reads for Plasmodium.
Method2: I used only Plasmodium genome to align because I need reads aligned to Plasmodium.
Run BWA MEM Method 2
bwa mem -t 20 -M -R "@RG\tID:${prefix}\tLB:${prefix}\tPL:ILLUMINA\tSM:${prefix}\tPU:${prefix}" ${REF_DIR}/${REFERENCE} ${forward} ${reverse} > ${OUTDIR}/${prefix}_aligned.sam
# Convert SAM to BAM using samtools
samtools view -bS ${OUTDIR}/${prefix}_aligned.sam > ${OUTDIR}/${prefix}_aligned.bam
# Optionally remove SAM file
rm ${OUTDIR}/${prefix}_aligned.sam
# GATK CleanSam (Remove duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" CleanSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}_aligned.bam -O ${OUTDIR}/${prefix}.clean.bam
# GATK SortSam (Sort BAM file by coordinate)
gatk --java-options "-Xmx40g -Xms40g" SortSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.clean.bam -O ${OUTDIR}/${prefix}.sorted.bam -SO coordinate --CREATE_INDEX true
# GATK MarkDuplicates (Mark duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" MarkDuplicates -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.sorted.bam -O ${OUTDIR}/${prefix}.sorted.dup.pf.bam -M ${OUTDIR}/${prefix}_dup_metrics.tsv
# Index the filtered BAM files
samtools index -bc ${OUTDIR}/${prefix}.sorted.dup.pf.bam
From these methods, I thought I would get same result From method 1 is got 60M .sorted.dup.pf.bam and grom method 2 I got .sorted.dup.bam 1.5GB. I want reasons why they are different and which should I follow, I need only parasite genome for further analysis.