Shotgun metagenomics high read duplication rate - how high is too high?
3
1
5.2 years ago

I ran FastQC on a human gut metagenome sample and found a duplication rate of about 80%. Does this seem too high? I checked some environmental samples and saw approximately the same rates.

I've read papers that recommend de-duplicating reads before analysis because duplicates are most likely PCR artefacts. But I've also read papers that recommend keeping all reads, since high-abundance species will be sequenced deeply and some of their reads may legitimately be seen more than once. Any thoughts on the matter?
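To make the number concrete, here is a minimal sketch of one way to compute an exact-match duplication rate: the fraction of reads that are extra copies of a sequence already seen. This is an assumption about what "duplication rate" means here, not FastQC's own method; as far as I know FastQC estimates its figure from a subset of the file, so the two numbers will not match exactly. The function names and the toy reads are made up for illustration.

```python
from collections import Counter


def duplication_rate(sequences):
    """Return the fraction of reads that are extra copies of an already-seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Unique sequences / total reads is the fraction that survives exact de-duplication.
    return 1.0 - len(counts) / total


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line record in an uncompressed FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


# Toy example: 5 reads, 3 distinct sequences, so 2 of 5 reads are duplicates.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCGA"]
rate = duplication_rate(reads)
print(f"{rate:.0%}")
```

On a real file you would call `duplication_rate(read_fastq_sequences("sample.fastq"))`, where the file name is a placeholder. Holding every distinct sequence in memory may not be practical for a very large run.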

WMS shotgun metagenomics • 4.2k views
4
5.2 years ago

High read depth on small templates or, in your case, low-diversity metagenomes does indeed lead to high duplicate rates (see your answer to Josh Herr above). We see this on amplicon datasets all the time.

If you are mapping reads, I would run the analysis both with and without the duplicates. Remember to add a mapping quality filter afterwards too. Both analyses will give you useful information. Also check which parts of the genome the duplicates are coming from.

Typically, our pipeline removes duplicate and low-quality reads with several tools before analysis (mapping to human + bacterial genomes), and removes duplicates again after mapping (e.g. with Picard).

0

You bring up a good point about diversity and sequencing depth. I sequenced 30 Gbp for each of my gut samples. Gut samples are typically not very high diversity (100-200 species). So could it be that we're simply sequencing too deeply? I'm thinking I should subsample my FASTQ files down to 10 Gbp before deduplication.
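For reference, a rough sketch of what that subsampling could look like, assuming paired-end reads and a simple keep-or-drop decision per pair. Going from 30 Gbp to 10 Gbp means keeping about a third of the pairs. The function name and the toy read names are hypothetical, and because each pair is kept independently the output size is only approximately the requested fraction.

```python
import random


def subsample_pairs(r1_records, r2_records, fraction, seed=0):
    """Yield roughly `fraction` of the read pairs, keeping mates together."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    # One random draw per pair, so R1 and R2 never get out of sync.
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < fraction:
            yield rec1, rec2


# Toy stand-ins for parsed FASTQ records from the R1 and R2 files.
r1 = [f"read{i}/1" for i in range(3000)]
r2 = [f"read{i}/2" for i in range(3000)]

kept = list(subsample_pairs(r1, r2, fraction=10 / 30, seed=42))
print(len(kept), "of", len(r1), "pairs kept")
```

Sampling pairs rather than individual reads matters for anything downstream that expects both mates, such as mapping or assembly.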

2
5.2 years ago
Vitis ★ 2.6k

Sometimes a high duplication rate is the result of excessive PCR enrichment of libraries before sequencing (Illumina platform). I would imagine metagenomic samples are diverse enough that you are unlikely to have exhausted all the unique molecules in the sample. I'd suggest you take a look at the library prep protocol and adjust the number of PCR cycles used before submitting for sequencing, to see whether and how the duplication rate is affected.

1
5.2 years ago
Josh Herr 5.8k

When you say "duplication" rate, you mean read depth, right?

There are a few options here -- ideally you want high depth for assembly, but if you have too much data you'll max out on your assembly memory. Gut shotgun metagenome data is typically not very complex compared to soil, so I am surprised you are seeing the same read depth rates in environmental samples. Which publications did you see this in? I'm not surprised about the gut read depth.

To reduce high read depths for assembly, I'm going to point you to the khmer tool. Here's the documentation. (Disclaimer: I worked on this project briefly).

You'll want to map your reads back onto your assembly to establish a rank abundance curve for all the species / strains in your sample.
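As a rough illustration of that last step, here is a small sketch that turns per-contig mapped read counts into a rank abundance list. It assumes you already have read counts and contig lengths from your mapper; the contig names and numbers below are invented. Dividing by length is one common normalization, not the only one.

```python
def rank_abundance(mapped_reads, lengths):
    """Return (rank, name, relative abundance) tuples, most abundant first."""
    # Normalize by length so long contigs are not favored just for being long.
    density = {name: mapped_reads[name] / lengths[name] for name in mapped_reads}
    total = sum(density.values())
    if total == 0:
        return []
    ranked = sorted(density.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, value / total)
        for rank, (name, value) in enumerate(ranked, start=1)
    ]


# Hypothetical counts of mapped reads and contig lengths in bp.
mapped_reads = {"contig_A": 9000, "contig_B": 1000, "contig_C": 500}
lengths = {"contig_A": 3000, "contig_B": 1000, "contig_C": 250}

curve = rank_abundance(mapped_reads, lengths)
for rank, name, abundance in curve:
    print(rank, name, f"{abundance:.3f}")
```

Plotting abundance against rank on a log scale gives the usual rank abundance curve. Running this with and without duplicate reads in the counts shows how much de-duplication shifts the picture.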

0

Hi, high read depth won't be a problem for me. I'm referring to sequence duplication - distinct reads that have more than one copy. For example

