Question

Contamination of human DNA in genome sequencing data. How to detect and eliminate?

1

7.2 years ago

polykoz ▴ 10

Hi guys!

I am looking for an automated way to detect and eliminate possible contamination of human NGS samples with human DNA. Say I have a genotype profile of a technician who runs NGS on a regular basis in a lab. How do I detect the reads that have possibly come from that technician in my NGS data? Is there a ready-made tool that can do that? I feel like the SNP-based method is the only chance to get rid of at least some contamination, but that won't be fast and easy (see the link below).

Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-Based Genotype Data

Thanks for help!

NGS contamination SNP genotype • 4.1k views

ADD COMMENT • link 7.2 years ago by polykoz ▴ 10

2

You can use bbsplit.sh from the BBMap suite with any number of genomes to bin your reads (see "BBSplit syntax for generating builds for the reference genome and how to call different builds"). You can choose how reads that multi-map across genomes are handled with the ambig2= option.

Note: I am assuming that "contamination with human DNA" means the samples of interest are not themselves human DNA.

ADD REPLY • link 7.2 years ago by GenoMax 148k

0

Thank you, GenoMax! Though I must have misled you a bit. I meant contamination of human NGS samples with human DNA. This is a somewhat tougher task than simple contamination with other species.

ADD REPLY • link 7.2 years ago by polykoz ▴ 10

1

Indeed. I don't think there is going to be any way to eliminate human contamination of the type you describe. Preventing that kind of contamination by meticulous lab practices is your best bet.

ADD REPLY • link 7.2 years ago by GenoMax 148k

1

We have used conpair https://github.com/nygenome/Conpair for cancer samples. I think it requires a normal sample as well, so it might not work for your task. At least, it was able to detect 2% contamination fairly well, but that might depend on read depth and the number of SNPs included. This tool does not look for one particular foreign genome like your technician's.

I do not think there is a reliable way to remove the contamination. That sounds close to impossible to me. Most reads would be identical between the samples; only reads covering a SNP can be positively identified as foreign.
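To make the "only reads covering a SNP" point concrete, here is a minimal sketch, assuming you already have called genotypes for both the sample and the technician plus a per-site list of observed bases. The function names, the dict layout, and the toy data are all made up for illustration; this is not how conpair or any other tool works.

```python
def informative_sites(sample_gt, technician_gt):
    """Sites where the sample is homozygous and the technician carries an
    allele the sample lacks. Genotypes: site -> tuple of two alleles."""
    sites = {}
    for site, (a1, a2) in sample_gt.items():
        if a1 != a2 or site not in technician_gt:
            continue  # heterozygous sample sites are not used in this sketch
        foreign = set(technician_gt[site]) - {a1}
        if foreign:
            sites[site] = foreign
    return sites


def count_foreign_reads(pileup, sites):
    """pileup: site -> list of observed bases.
    Returns site -> (reads carrying a foreign allele, total depth)."""
    result = {}
    for site, foreign in sites.items():
        bases = pileup.get(site, [])
        hits = sum(1 for b in bases if b in foreign)
        result[site] = (hits, len(bases))
    return result


# Toy, made-up data (hypothetical site names and bases)
sample_gt = {"rs1": ("A", "A"), "rs2": ("C", "T"), "rs3": ("G", "G")}
technician_gt = {"rs1": ("A", "G"), "rs2": ("C", "C"), "rs3": ("G", "G")}
pileup = {"rs1": list("AAAAAAAAAG"), "rs3": list("GGGGG")}

sites = informative_sites(sample_gt, technician_gt)
counts = count_foreign_reads(pileup, sites)
print(counts)
```

Only rs1 is informative here. A read carrying the foreign allele can still be a plain sequencing error, so counts like these are evidence to pool across many sites, not a basis for deleting individual reads.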

Even the best lab practices may sometimes fail, and the machines may also fail in unpredictable ways. It is good lab practice to check for contamination regardless.

You seem to single out your technician; I am curious why. In my experience, the most likely contamination would come from other samples. What technology do you use?

ADD REPLY • link 7.2 years ago by vegard nygaard ▴ 320

0

Thanks for the link! That could work. I am definitely gonna try it out, taking a proposed technician genome as a "normal sample".

The most likely contamination does usually come from neighboring samples; however, I am working with cell-free DNA samples from pregnant women, which are already "contaminated" with the fetal genome. I feel that comparing those samples with each other would be an even more challenging task. Of course, a cross-sample contamination check will be my next step, but I doubt it will succeed. Our lab practices are fairly good, and we minimized the possibility of screwing up samples by using Ion Chef (by Thermo) for automated library preparation and chip loading; however, you never know what can go wrong, and I feel that we need at least some means of screening for contamination.

ADD REPLY • link 7.2 years ago by polykoz ▴ 10

0

You aim to use conpair in a way it was not intended to be used. To clarify, "normal sample" in a cancer setting means non-cancerous cells from the same patient. So conpair uses blood and tumor, both from the same patient. If you provide your technician's DNA as "normal" and a patient's DNA as "tumor", conpair will report a lack of concordance, meaning that you have mixed up your samples. It may also report something on the contamination; however, I am not sure how meaningful this will be.

Also, conpair was designed to work on an exome or full genome using about 7000 SNPs. If your cell-free DNA assay only covers a limited set of amplicons, it may not work at all.

ADD REPLY • link 7.2 years ago by vegard nygaard ▴ 320

0

Hi, hopefully others have a solid answer for this. This is something of interest to us too... In our validation runs for a clinical NGS assay we included a contam_test that was a combination of 95% sample1 and 5% sample2. We hope to implement a check for low-level contamination at some point in the pipeline, but have not yet.

I've played around with VerifyBamID (same group as the link you posted) but don't have it working in our case yet. Another thought that may help: plot variant allele frequency by position in the genome. This should show contamination as a band outside of the expected 0, 0.5, and 1 bands.
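As a rough idea of where those extra bands would sit, here is a small sketch that just does the mixing arithmetic. It assumes one diploid sample mixed with one diploid contaminant at a fixed fraction and ignores noise entirely; the function name is made up.

```python
from itertools import product


def expected_vaf_bands(contam_fraction):
    """Expected variant allele fractions when a diploid sample is mixed with
    a single diploid contaminant. Genotypes are given as alt-allele dosage
    (0 = hom ref, 0.5 = het, 1 = hom alt)."""
    dosages = (0.0, 0.5, 1.0)
    bands = set()
    for sample, contaminant in product(dosages, repeat=2):
        vaf = (1 - contam_fraction) * sample + contam_fraction * contaminant
        bands.add(round(vaf, 4))
    return sorted(bands)


bands = expected_vaf_bands(0.05)  # the 95% / 5% mix
print(bands)
```

For a 95%/5% mix the expected fractions are 0, 0.025, 0.05, 0.475, 0.5, 0.525, 0.95, 0.975, and 1, so the contaminant shows up as bands only 2.5 to 5 percentage points away from the usual three, which is well inside normal sampling noise at modest depth.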

edit: sorry, my comment completely doesn't address removing the contaminating reads once detected...

ADD REPLY • link 7.2 years ago by Robert Sicko ▴ 630

0

Ouch - my take on this is that it's not going to be possible. Even the expected 0.0, 0.5, and 1.0 bands are not "bands" but really very fuzzy distributions.

You would need very deep, high-quality, very well-mapped data and advanced SNP callers. Heterozygotes in 5% contaminated data are going to be impossible to detect. In fact, with 5% contaminated data, I would just raise the threshold for heterozygous SNP calling from the usual 0.2 or similar to a higher value.

The best approach is going to be improving lab procedures. You could try to detect contamination % by looking at sites with a ploidy of 3-4, since they should not exist.
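One possible reading of the "ploidy of 3-4" idea, sketched under my own assumptions: if single reads span two nearby SNPs, a clean diploid sample can show at most two haplotypes there, so a third or fourth well-supported haplotype hints at a second genome. The function name, the `min_reads` cutoff, and the read data below are all hypothetical.

```python
from collections import Counter


def supported_haplotypes(read_haplotypes, min_reads=2):
    """Count haplotypes seen across reads that span two linked SNPs and keep
    those with at least min_reads supporting reads (a crude error filter)."""
    counts = Counter(read_haplotypes)
    return {hap: n for hap, n in counts.items() if n >= min_reads}


# Made-up reads: each string is the pair of alleles one read carries
reads = ["AC"] * 48 + ["GT"] * 47 + ["AT"] * 4 + ["GC"] * 1

haps = supported_haplotypes(reads, min_reads=2)
more_than_diploid = len(haps) > 2
print(haps, more_than_diploid)
```

The fraction of reads on the extra haplotypes gives a crude lower bound on the contamination level at that site. Note that maternal cell-free DNA would be expected to show a third haplotype from the fetus anyway, so this check on its own cannot separate the two.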

ADD REPLY • link 7.2 years ago by colindaven 7.0k

0

Thanks for the thoughts on this... neat idea on multiploidy sites

ADD REPLY • link 7.2 years ago by Robert Sicko ▴ 630

0

Yep, that's something to look at! Though we would still need to include sequencing errors in the model...
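A minimal sketch of what including sequencing errors could look like: pool the reads at sites where the sample is homozygous, count how many carry an unexpected allele, and ask how likely that count is from errors alone using a binomial tail. The 1% error rate and the read counts below are assumptions picked for illustration, not measured values.

```python
from math import exp, lgamma, log


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log
    space so that large n does not overflow."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total


ASSUMED_ERROR_RATE = 0.01  # assumption: per-base rate of seeing a wrong allele
depth = 2000               # pooled reads over homozygous sites (made up)

p_clean = binom_tail(22, depth, ASSUMED_ERROR_RATE)    # 22 unexpected reads
p_suspect = binom_tail(60, depth, ASSUMED_ERROR_RATE)  # 60 unexpected reads
print(p_clean, p_suspect)
```

With about 20 unexpected reads expected from errors alone, 22 is unremarkable while 60 is very unlikely without a second source of DNA. A real model would need a per-platform, per-base-change error rate rather than a single constant.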

ADD REPLY • link 7.2 years ago by polykoz ▴ 10