After generating a BAM, I am looking at which reads map to a known contaminant (the human host) and which don't (in theory, gut contents from across the tree of life).
From the BAM, I was thinking of extracting the paired-end reads with
samtools fastq -1 R1.fastq -2 R2.fastq -f 13 input.bam
The rationale is that -f (only include reads with all of the FLAGs in INT present) with 13 selects:
Bit  Value  Meaning
1    1      Read paired
3    4      Read unmapped
4    8      Mate unmapped
Thus, with reads mapped against the human host, the reads output by -f 13 (paired, unmapped, and mate unmapped; 1 + 4 + 8 = 13) are likely the reads I want, within the limits of how clean the human reference sequence itself is.
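To double-check the flag arithmetic, here is a minimal Python sketch of how an -f filter behaves on the FLAG column. The function name passes_f and the example FLAG values are illustrative assumptions, not part of samtools; the bit values are the standard SAM FLAG bits.

```python
# SAM FLAG bits used in this thread (standard SAM values).
PAIRED = 0x1         # 1: read paired
UNMAPPED = 0x4       # 4: read unmapped
MATE_UNMAPPED = 0x8  # 8: mate unmapped


def passes_f(flag: int, required: int) -> bool:
    """Hypothetical helper mimicking -f INT: keep a read only if ALL bits in `required` are set."""
    return (flag & required) == required


required = PAIRED | UNMAPPED | MATE_UNMAPPED
print(required)  # 13

# Example FLAG values (illustrative, not taken from a real BAM).
examples = {
    77: "paired, read unmapped, mate unmapped, first in pair",
    141: "paired, read unmapped, mate unmapped, second in pair",
    73: "paired, read mapped, mate unmapped, first in pair",
    133: "paired, read unmapped, mate mapped, second in pair",
    99: "paired, proper pair, both mapped, first in pair",
}
for flag, meaning in examples.items():
    print(flag, passes_f(flag, required), meaning)
```

Of these examples, only 77 and 141 pass, so pairs in which just one mate mapped (73, 133) are not included by -f 13.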
And to evaluate the FASTQ reads corresponding to human, I was thinking of -F 12 (-F only includes reads with none of the FLAGs in INT present), which keeps reads that are neither unmapped nor mate-unmapped. I am still somewhat in the dark about "proper pair" and whether reads whose mates are on different contigs might cause issues. To me, -F 13 would mean including only reads that are not 1, i.e. not paired, which might make things weird? Otherwise, going with paired and properly paired would be more straightforward than this not/unpaired business.
-f 13 would appear to be equivalent to the more common 77/141 (which add 64 and 128 to pick R1 and R2); -F 12, being a negative filter, might throw a bunch of unusual confounders my way...
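The difference between the two filters, and the 77/141 equivalence, can be checked with a small Python sketch. The helper names keeps_f and keeps_F are hypothetical stand-ins for the -f and -F logic as quoted above, and the FLAG values are illustrative rather than taken from a real BAM.

```python
def keeps_f(flag: int, mask: int) -> bool:
    """Hypothetical helper mimicking -f INT: all bits in `mask` must be set."""
    return (flag & mask) == mask


def keeps_F(flag: int, mask: int) -> bool:
    """Hypothetical helper mimicking -F INT: no bit in `mask` may be set."""
    return (flag & mask) == 0


# 77 and 141 are 13 plus the first-in-pair (64) or second-in-pair (128) bit.
assert 77 == 13 | 64
assert 141 == 13 | 128

# Columns: FLAG, kept by -F 12, kept by -F 13.
# 99/147: proper pair, both mapped; 65/129: paired, both mapped, not proper pair;
# 0: unpaired mapped read; 73/133: one mate unmapped; 77: both unmapped.
for flag in (99, 147, 65, 129, 0, 73, 133, 77):
    print(flag, keeps_F(flag, 12), keeps_F(flag, 13))
```

Under these assumptions, -F 12 keeps every read whose own alignment and mate are both mapped, whether or not the proper-pair bit is set (65/129, as is typically the case for mates on different contigs), and it also keeps unpaired mapped reads (0). -F 13 would drop every paired read, which is why it does not fit paired-end data.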
Sounds about right to me; I did something similar. I wouldn't worry about the paired flag at all (apart from filtering for both reads being unmapped): that flag is always set unless you have unpaired input data.
"Proper pairs" essentially means both reads are in the correct orientation and within the expected distance range, see this thread
I would suggest binning the reads using the human genome with bbsplit.sh from the BBMap suite.