I'm trying to understand the samtools flags -f
and -F
better...
In a scenario where I am trying to retrieve 'clean' bacterial reads from a metagenomic sample that contains host reads and bacterial reads, I first use bowtie2 to map the reads to an index made of the host genome then I use the below command to get everything that has been unmapped (i.e. everything that is NOT the host):
samtools view --threads 60 -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothReadsUnmapped.bam
Now, If I want to repeat this step but this time to get the mapped reads (i.e. everything that is NOT bacteria), what would be the opposite of -f 12 -F 256
? would it be -f 2 -F 12
? where -f 12 = output all reads unmapped and mates unmapped
and -F 256 = do not output reads that shows primary alignment
and -f 2 = output all reads that mapped in pairs
and -F 12 = do not output reads that were unmapped and mates that were unmapped
Please correct me if my understanding of those flags/numbers in these simple terms is inaccurate. Thanks.
I am not answering your question directly but want to point out that you can use
bbsplit.sh
(part of BBMap suite) to bin your reads against multiple genomes in one step. A different program calledseal.sh
can do this by k-mer matching. A guide for seal is available.Thanks for that - I'm familiar with
bbsplit.sh
but notseal.sh
. I plan to look at seal.sh more closely but do you happen to know if it would work well with a mixed metagenomic sample (i.e. prokaryotic + eukaryotic reads)?I don't think your -F256 is adding anything, since kind of by definition, unmapped reads won't have secondary alignments.
And of course, if you have genomes for your bacterial contamination, you need to eventually align to a combined genome of them and the host, because aligning to host alone is likely going to result in the aligner forcing some contaminant reads to align where they do not belong.
My understanding is that
-F 256
is avoiding adding mapped reads to the output...and so I was wondering what its 'opposite' would be. This is from an example I found here: https://sites.google.com/site/wiki4metagenomics/tools/short-read/remove-host-sequencesYou're right it is very likely that some reads will align where they shouldn't and that's why I'm going through a series of bowtie2 alignment trials while trying to wrap my head around these flags in simpler terms.
No, that's not what the 256 flag means. That's not even what your documentation says it means. But you could always try it with and without and count to see if the # of reads changes.
Perhaps this will help you, from samtools flagstat: http://www.htslib.org/doc/samtools-flagstat.html
secondary: 0x100 bit set
supplementary: 0x800 bit set
duplicates: 0x400 bit set
mapped: 0x4 bit not set
paired in sequencing: 0x1 bit set
read1: both 0x1 and 0x40 bits set
read2: both 0x1 and 0x80 bits set
properly paired: both 0x1 and 0x2 bits set and 0x4 bit not set
with itself and mate mapped: 0x1 bit set and neither 0x4 nor 0x8 bits set
singletons: both 0x1 and 0x8 bits set and bit 0x4 not set
For example, properly paired would be: