Question

Detect contamination in bacterial WGS data

0

2.0 years ago

dante • 0

Hi all,

I am currently analysing a new set of Aeromonas WGS data. I used Kraken2 and MetaPhlAn 3 for taxonomic classification, and the results suggested that some of the samples contain two or more Aeromonas species. I then ran CheckM (lineage_wf and taxonomy_wf) to check for contamination, but its database does not cover Aeromonas species. FYI, my colleagues who did the wet-lab work told me that the isolates should have been purified.

In my opinion, the detection of multiple species (or contamination) by the three tools mentioned above may be due to incomplete databases that do not cover Aeromonas species well.

May I ask whether there is any other effective way to check whether microbial (not just Aeromonas) WGS data contain a mixed culture of different species?

Thank you.

WGS aeromonas contamination • 1.3k views

ADD COMMENT • link updated 2.0 years ago by Asaf 10k • written 2.0 years ago by dante • 0

2

Did you try to assemble it? If the assembly is good and no single-copy gene appears more than once, then what you see is probably the real genetics of the species you isolated.
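The single-copy gene check can be sketched roughly as follows. This is only an illustration of the idea, not how any particular tool computes its scores: the marker names, the `marker_copy_summary` helper, and the input format (one marker ID per hit found in the assembly) are all assumptions made for the example.

```python
from collections import Counter


def marker_copy_summary(marker_hits, expected_markers):
    """Summarise copy numbers of genes expected to be single-copy.

    marker_hits: iterable of marker IDs, one entry per hit in the assembly
                 (hypothetical input, e.g. parsed from a gene search).
    expected_markers: set of marker IDs expected exactly once per genome.
    """
    counts = Counter(m for m in marker_hits if m in expected_markers)
    missing = sorted(m for m in expected_markers if counts[m] == 0)
    duplicated = sorted(m for m in expected_markers if counts[m] > 1)
    n = len(expected_markers)
    return {
        # Share of expected markers found at least once.
        "completeness_pct": 100.0 * (n - len(missing)) / n,
        # Share of expected markers found more than once; a high value
        # hints at a mixed culture or a collapsed/duplicated assembly.
        "duplicated_pct": 100.0 * len(duplicated) / n,
        "missing": missing,
        "duplicated": duplicated,
    }


# Toy data: recA is found twice, atpD is not found at all.
hits = ["rpoB", "gyrB", "recA", "recA", "dnaK"]
expected = {"rpoB", "gyrB", "recA", "dnaK", "atpD"}
summary = marker_copy_summary(hits, expected)
print(summary)
```

A few duplicated markers can be genuine, so the duplicated share is a hint to look closer rather than proof of contamination.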

ADD REPLY • link 2.0 years ago by Asaf 10k

0

Hi @Asaf ,

Thank you for your advice. I am new to bioinformatics. I have checked the assemblies; the N50 and total length are fairly consistent.

May I ask how to find the single-copy genes? Which tool would you recommend?

Thank you!

Best, dante

ADD REPLY • link 2.0 years ago by dante • 0

0

You can use CheckM.

ADD REPLY • link 2.0 years ago by Asaf 10k

score 2 · Answer 1 · 2022-11-10

2

2.0 years ago

colindaven 7.0k

I would map against very close reference genomes, call SNPs, filter to high-confidence SNPs, and compare the SNP counts across the various references. You can also compare against public datasets to get a handle on how many SNPs to expect.

If it's significant contamination, say 50%, you might get a lot of "biallelic" SNPs, which should not occur in a haploid organism.
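As a rough sketch of that idea: in a pure haploid isolate, nearly all reads at a variant site should support the same allele, so sites with an intermediate alternate-allele fraction are suspicious. The function name, the depth cutoff, and the 0.1 to 0.9 window below are illustrative assumptions, not established thresholds.

```python
def mixed_site_fraction(sites, min_depth=20, low=0.1, high=0.9):
    """Fraction of well-covered variant sites with a mixed allele signal.

    sites: iterable of (ref_reads, alt_reads) tuples, one per variant site
           (hypothetical input, e.g. taken from per-site allele depths).
    """
    usable = 0
    mixed = 0
    for ref_reads, alt_reads in sites:
        depth = ref_reads + alt_reads
        if depth < min_depth:
            continue  # too few reads to judge the allele balance
        usable += 1
        alt_fraction = alt_reads / depth
        if low < alt_fraction < high:
            mixed += 1  # neither clearly reference nor clearly alternate
    return mixed / usable if usable else 0.0


# Toy data: two clean sites, two roughly 50/50 sites, one low-depth site.
sites = [(0, 40), (1, 45), (22, 20), (30, 28), (5, 3)]
print(mixed_site_fraction(sites))
```

Repeats and mis-mapped reads also produce intermediate fractions, so it helps to compare the value against samples that are known to be clean.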

We used to do this for P. aeruginosa datasets some time ago.

I wouldn't trust metagenomic programs at strain level, or sometimes even at species level. Of course some reads will be falsely attributed to other Aeromonas spp.

ADD COMMENT • link 2.0 years ago by colindaven 7.0k

0

Hi colindaven

Thank you for your advice. I am new to this field; could you please elaborate on the method?

What would be a good yet easy-to-use tool for SNP calling? Freebayes? Also, what is a good threshold for filtering high-confidence SNPs? Or does the threshold depend on the context?

To compare the numbers against various references, do you mean that we need to map against several close reference genomes, call SNPs against each, and compare the counts?

By the way, checking for "biallelic" SNPs in a haploid organism is a smart way to quickly see whether there is contamination in microbial WGS data. Thank you so much for your suggestion!

Best, Dante

ADD REPLY • link 2.0 years ago by dante • 0

0

Yes, exactly. Freebayes is easy; use multi-sample mode (pass several BAMs as input at once) to get a VCF with genotypes for each of your samples in one handy table.

Filter by the Freebayes quality score: check the distribution of scores and pick a cutoff that looks reasonable, visually checking the BAM files to confirm that the alignments are good (I haven't done this for bacteria for a while).
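A minimal sketch of looking at the score distribution, assuming the score lives in the QUAL column of a standard tab-separated VCF. The helper names, the toy records, and the candidate cutoffs are made up for illustration; the right cutoff depends on your data.

```python
def snp_quals(vcf_lines):
    """Collect QUAL values for SNP records from VCF text lines."""
    quals = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        # Treat a record as a SNP when REF and every ALT are one base long.
        is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
        if is_snp and qual != ".":
            quals.append(float(qual))
    return quals


def count_above(quals, threshold):
    """Number of SNPs whose QUAL is at least the threshold."""
    return sum(q >= threshold for q in quals)


# Toy VCF: three SNPs (one with a very low score) and one deletion.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t100\t.\tA\tG\t812.5\t.\t.",
    "contig1\t250\t.\tC\tT\t3.2\t.\t.",
    "contig1\t400\t.\tGA\tG\t950.0\t.\t.",
    "contig2\t75\t.\tT\tC\t640.0\t.\t.",
]
quals = snp_quals(example_vcf)
for threshold in (1, 30, 100):
    print(threshold, count_above(quals, threshold))
```

Running the same count against each reference gives the per-reference SNP numbers to compare.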

Then repeat for each close reference if you're keen.

Pipelines such as snippy (https://github.com/tseemann/snippy) might be interesting for you, though I haven't used it myself. There are others too.

We had contamination once: 3/38 reads had a lasR mutation, as I recall. As expected, we missed it bioinformatically on the first pass, but collaborators noticed the odd phenotype, reported it, and found the mutation. Other contaminations came from sequencing the wrong clone by mistake, which gave e.g. 50k more SNPs than expected.
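To see why a 3/38 signal is easy to miss, one can ask how likely it is to get at least 3 alternate reads out of 38 from sequencing error alone. The sketch below uses a plain binomial model with an assumed per-base error rate of 1%; both the model and the error rate are illustrative assumptions, not values from this dataset.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of at least k successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 3 alternate reads out of 38, assumed 1% chance of an error per read.
p_value = binomial_tail(3, 38, 0.01)
print(f"{p_value:.4f}")
```

At a single site this probability is small, but across a genome of several million positions many sites will reach it by chance, so a low-fraction variant like this does not stand out without other evidence such as a phenotype.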

ADD REPLY • link 2.0 years ago by colindaven 7.0k