Hi, Dealing with samples (2x150bp) from an illumina WGS HiSeq experiment i´m trying to consolidate my pipeline by removing possible human contaminants from my samples (my test sample is a community Mock that i have ordered). Initial amplicons were 400bp long so i don´t expect reads (2x150bp) to overlap.
I was wondering if other have used bwa to remove reads mapped to the human genome.
My idea is to use the following:
bwa mem reads.R1.fq reads.R2.fa > output.sam
Then use samtools with -f4 tag to get unmapped reads and discard human contaminants.
samtools -f4 output.sam > unmapped_reads samtools -F4 output.sam > mapped_reads
I was wondering if this is suitable aproach or if i need to fine tune some parameters ? and how are you dealing with BWA mapped reads, meaning you might have reads partially mapped to the human genome that could be rescued ?? How to control for %identity ?
Thanks for your comments..
I feel like I am missing something really obvious, but can someone explain to me why a read with both mates unmapped (SAM flag 12) would have a not primary alignment (SAM flag 256)? I can see myself that few of these reads exist because adding the -F 256 to the comman does make a (very small) difference, but I do not understand why an unmapped read would have a not primary alignment (basically without having a primary one??). Thanks for any clarification
Ok, after reading extensively I understand. The confusion comes from the fact that some aligners and tools (such as Picard MarkDuplicates) assign "not primary alignment" flag (256), normally aimed at flagging alternative locations where a read can map, to "supplementary alignments" (flag 2048), which are chimeric/split reads having different parts of the read mapping to different locations in the genome. The latter do exist and we want to remove them when extracting exogenous reads not belonging to the host. CONCLUSION: Be aware of the behaviour of your aligner to know whether flag 256 or 2048 should be used to remove chimeric reads.