Question

Find human contaminants in metagenomics microbiome studies

0

Entering edit mode

7.9 years ago

David ▴ 240

Hi, Dealing with samples (2x150bp) from an illumina WGS HiSeq experiment i´m trying to consolidate my pipeline by removing possible human contaminants from my samples (my test sample is a community Mock that i have ordered). Initial amplicons were 400bp long so i don´t expect reads (2x150bp) to overlap.

I was wondering if other have used bwa to remove reads mapped to the human genome.

My idea is to use the following:

bwa mem reads.R1.fq reads.R2.fa > output.sam

Then use samtools with -f4 tag to get unmapped reads and discard human contaminants.

samtools -f4 output.sam > unmapped_reads samtools -F4 output.sam > mapped_reads

I was wondering if this is suitable aproach or if i need to fine tune some parameters ? and how are you dealing with BWA mapped reads, meaning you might have reads partially mapped to the human genome that could be rescued ?? How to control for %identity ?

Thanks for your comments..

bwa microbiome metagenomics wgs • 3.8k views

ADD COMMENT • link updated 2.6 years ago by dario.galanti92 ▴ 10 • written 7.9 years ago by David ▴ 240

score 1 · Answer 1 · 2016-12-24

1

Entering edit mode

7.9 years ago

GenoMax 147k

BBMap suite has a tool to do just this task. See this thread for more information.

ADD COMMENT • link 7.9 years ago by GenoMax 147k

score 0 · Answer 2 · 2016-12-24

0

Entering edit mode

7.9 years ago

shenwei356 8.7k

Here's a guide from http://www.metagenomics.wiki/:

Remove host sequences

Steps:

a) bowtie2 mapping: write all (mapped/unmapped) reads to a single .bam file
b) Samtools: use flags to extract unmapped reads 
c) bedtools: split paired-end reads into separated files

The main difference from your's:

# SAMtools SAM-flag filter: get unmapped pairs (both ends unmapped)
samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothEndsUnmapped.bam
-f 12     Extract only (-f) alignments with both reads unmapped: <read unmapped><mate unmapped>
-F 256   Do not(-F) extract alignments which are: <not primary alignment>

For myself, I didn't find the most suitable method. The method above removes too much reads when comparing the microbe composition (by diamond+megan) to that of binning software such as kaiju.

ADD COMMENT • link 7.9 years ago by shenwei356 8.7k

0

Entering edit mode

I feel like I am missing something really obvious, but can someone explain to me why a read with both mates unmapped (SAM flag 12) would have a not primary alignment (SAM flag 256)? I can see myself that few of these reads exist because adding the -F 256 to the comman does make a (very small) difference, but I do not understand why an unmapped read would have a not primary alignment (basically without having a primary one??). Thanks for any clarification

ADD REPLY • link 2.6 years ago by dario.galanti92 ▴ 10

0

Entering edit mode

Ok, after reading extensively I understand. The confusion comes from the fact that some aligners and tools (such as Picard MarkDuplicates) assign "not primary alignment" flag (256), normally aimed at flagging alternative locations where a read can map, to "supplementary alignments" (flag 2048), which are chimeric/split reads having different parts of the read mapping to different locations in the genome. The latter do exist and we want to remove them when extracting exogenous reads not belonging to the host. CONCLUSION: Be aware of the behaviour of your aligner to know whether flag 256 or 2048 should be used to remove chimeric reads.

ADD REPLY • link 2.6 years ago by dario.galanti92 ▴ 10

score 0 · Answer 3 · 2016-12-24

Thanks for your comments,

With samtools -f 12 and -F 256 you discard singletons where one pair is mapped and the other pair is unmapped which looks astonishing as singletons can be incorporated into the assembly process. Am i correct ?? Also you don´t have control over the percent identity ??

I ´m going to give bbmap a try as follows (will report result):

bbmap.sh -Xmx40g minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=db/hg19/ build=1 qtrim=rl trimq=10 untrim in=Sample.R1.non-matching.fa.gz in2=Sample.R2.non-matching.fa.gz outu=Sample.unmapped.fa.gz outm=Sample.mapped.fa.gz threads=10

It looks to me that the filtering of the samfile is critical here ?