BWA align in human and parasite genome, and sort them
1
0
Entering edit mode
10 days ago
Bender • 0

Hi, I have two approaches which should give me same result but gave me different Method 1: Using BWA, I aligned my fastq with concatenated Plasmodium and human genome, then I used samtools to sort plasmodium and human reads in different files using bed tool.

Run BWA MEM Method 1

bwa mem -t 20 -M -R "@RG\tID:${prefix}\tLB:${prefix}\tPL:ILLUMINA\tSM:${prefix}\tPU:${prefix}" ${REF_DIR}/${REFERENCE} ${forward} ${reverse} > ${OUTDIR}/${prefix}_aligned.sam

# Convert SAM to BAM using samtools
samtools view -bS ${OUTDIR}/${prefix}_aligned.sam > ${OUTDIR}/${prefix}_aligned.bam
# Optionally remove SAM file
rm ${OUTDIR}/${prefix}_aligned.sam

# GATK CleanSam (Remove duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" CleanSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}_aligned.bam -O ${OUTDIR}/${prefix}.clean.bam

# GATK SortSam (Sort BAM file by coordinate)
gatk --java-options "-Xmx40g -Xms40g" SortSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.clean.bam -O ${OUTDIR}/${prefix}.sorted.bam -SO coordinate --CREATE_INDEX true

# GATK MarkDuplicates (Mark duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" MarkDuplicates -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.sorted.bam -O ${OUTDIR}/${prefix}.sorted.dup.bam -M ${OUTDIR}/${prefix}_dup_metrics.tsv

# Use samtools to filter by bed files (*Plasmodium* and Human)
samtools view -b -h ${OUTDIR}/${prefix}.sorted.dup.bam -T ${REF_DIR}/Pf3D7.fasta -L ${REF_DIR}/Pf3D7_core.bed > ${OUTDIR}/${prefix}.sorted.dup.pf.bam
samtools view -b -h ${OUTDIR}/${prefix}.sorted.dup.bam -T ${REF_DIR}/genome.fa -L ${REF_DIR}/human.bed > ${OUTDIR}/${prefix}.sorted.dup.hs.bam

My sample is blood sample (containing largely human genome with smaller parasite genome). I got 60M of reads for Plasmodium.

Method2: I used only Plasmodium genome to align because I need reads aligned to Plasmodium.

Run BWA MEM Method 2

bwa mem -t 20 -M -R "@RG\tID:${prefix}\tLB:${prefix}\tPL:ILLUMINA\tSM:${prefix}\tPU:${prefix}" ${REF_DIR}/${REFERENCE} ${forward} ${reverse} > ${OUTDIR}/${prefix}_aligned.sam

# Convert SAM to BAM using samtools
samtools view -bS ${OUTDIR}/${prefix}_aligned.sam > ${OUTDIR}/${prefix}_aligned.bam
# Optionally remove SAM file
rm ${OUTDIR}/${prefix}_aligned.sam

# GATK CleanSam (Remove duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" CleanSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}_aligned.bam -O ${OUTDIR}/${prefix}.clean.bam

# GATK SortSam (Sort BAM file by coordinate)
gatk --java-options "-Xmx40g -Xms40g" SortSam -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.clean.bam -O ${OUTDIR}/${prefix}.sorted.bam -SO coordinate --CREATE_INDEX true

# GATK MarkDuplicates (Mark duplicate reads)
gatk --java-options "-Xmx40g -Xms40g" MarkDuplicates -R ${REF_DIR}/${REFERENCE} -I ${OUTDIR}/${prefix}.sorted.bam -O ${OUTDIR}/${prefix}.sorted.dup.pf.bam -M ${OUTDIR}/${prefix}_dup_metrics.tsv

# Index the filtered BAM files
samtools index -bc ${OUTDIR}/${prefix}.sorted.dup.pf.bam

From these methods, I thought I would get same result From method 1 is got 60M .sorted.dup.pf.bam and grom method 2 I got .sorted.dup.bam 1.5GB. I want reasons why they are different and which should I follow, I need only parasite genome for further analysis.

BWA • 239 views
ADD COMMENT
2
Entering edit mode
10 days ago
GenoMax 148k

I thought I would get same result From method 1 is got 60M .sorted.dup.pf.bam and grom method 2 I got .sorted.dup.bam 1.5GB. I want reasons why they are different and which should I follow,

This is because reads will multi-map across genomes due to sequence similarly by chance. Method you are using has no way to account for that. You could be very strict and go with 60M that are uniquely plasmodium or ...

You may want to look at bbsplit.sh from BBMap Suite instead. It allows you to handle reads that multi-map to both genomes using a logical decision and at the same time will bin reads into genome specific pools. Specific option is ambig2= that you can pay attention to.

ambiguous2=<best>   Set behavior only for reads that map ambiguously to multiple different references.
                    Normal 'ambiguous=' controls behavior on all ambiguous reads;
                    Ambiguous2 excludes reads that map ambiguously within a single reference.
                       best   (use the first best site)
                       toss   (consider unmapped)
                       all   (write a copy to the output for each reference to which it maps)
                       split   (write a copy to the AMBIGUOUS_ output for each reference to which it maps)
ADD COMMENT

Login before adding your answer.

Traffic: 1594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6