I ran FastQC on a human gut metagenome sample and found a duplication rate of about 80%. Does this seem too high? I checked some environmental samples and saw approximately the same rates.
I've read papers that recommend de-duplicating reads before analysis because duplicates are most likely PCR artefacts. But I've also read papers that recommend keeping all reads, since high-abundance species will be sequenced deeply and some of their reads may legitimately be seen more than once. Any thoughts on the matter?
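For a quick sanity check on that number, here is a minimal sketch that counts exact-sequence duplicates in a FASTQ file. It is a plain exact-match count, not FastQC's own estimator, so the two figures need not agree. The function names and the toy records are made up for illustration.

```python
from collections import Counter


def read_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_rate(sequences):
    """Fraction of reads that are extra copies of an already-seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


# Toy input: four reads, two distinct sequences.
fastq_lines = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
    "@r4", "ACGT", "+", "IIII",
]
rate = duplication_rate(read_sequences(fastq_lines))
print(f"exact-duplicate rate: {rate:.2f}")
```

On a real file you would pass an open file handle instead of the list. Note that this keeps every distinct sequence in memory, so it may not be practical on a full 30 Gbp run without subsampling first.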
High read depth on small templates or in your case low-diversity metagenomes does indeed lead to high duplicate rates (see your answer to Josh Herr above). We see this on amplicon datasets all the time.
If you are mapping reads, I would run the analysis both with and without the duplicates. Remember to apply a mapping quality filter afterwards too. Both versions of the analysis will be informative. Also check which parts of the genome the duplicates are coming from.
Typically, in our pipeline we remove duplicate and low-quality reads with several tools before analysis (mapping to human + bacterial genomes), and we remove duplicates again after mapping (e.g. with Picard).
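As a rough sketch of the "with and without duplicates, then filter on mapping quality" comparison, the snippet below filters SAM text records in plain Python. It assumes duplicates have already been marked by an upstream tool (the SAM duplicate flag bit is 0x400). The threshold of 20 and the function names are illustrative assumptions, not a recommendation from this thread.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"


def keep_alignment(sam_line, min_mapq=20, drop_duplicates=True):
    """Decide whether one SAM alignment line passes the filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if mapq < min_mapq:
        return False
    if drop_duplicates and flag & DUPLICATE_FLAG:
        return False
    return True


def filter_sam(lines, min_mapq=20, drop_duplicates=True):
    """Keep header lines and the alignments that pass the filters."""
    return [
        line for line in lines
        if line.startswith("@") or keep_alignment(line, min_mapq, drop_duplicates)
    ]


# Toy records: a good read, a marked duplicate, and a low-MAPQ read.
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t200\t5\t4M\t*\t0\t0\tTTTT\tIIII",
]
with_dups = filter_sam(sam_lines, drop_duplicates=False)
without_dups = filter_sam(sam_lines, drop_duplicates=True)
print(len(with_dups) - 1, "alignments kept with duplicates")
print(len(without_dups) - 1, "alignments kept without duplicates")
```

Running downstream analysis on both outputs lets you see how much the duplicates change the result.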
You bring up a good point about diversity and sequencing depth. I sequenced 30 Gbp for each of my gut samples. Gut samples are typically not very high diversity (100-200 species). So could it be that we're simply sequencing too deeply? I'm thinking I should subsample my FASTQ down to 10 Gbp before deduplication.
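Going from 30 to 10 Gbp means keeping roughly one read in three. A minimal sketch of random subsampling, assuming standard 4-line FASTQ records, could look like this. The function name and seed are illustrative, and the kept fraction is only approximate because each record is drawn independently.

```python
import random


def subsample_fastq(lines, fraction, seed=42):
    """Keep each 4-line FASTQ record with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = []
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            if rng.random() < fraction:
                kept.extend(record)
            record = []
    return kept


# Toy input: 300 identical-length records.
toy_fastq = []
for i in range(300):
    toy_fastq.extend([f"@read{i}", "ACGTACGT", "+", "IIIIIIII"])

subset = subsample_fastq(toy_fastq, fraction=10 / 30)
print(len(subset) // 4, "of", len(toy_fastq) // 4, "records kept")
```

For paired-end data, calling the function with the same seed on both mate files keeps the pairs in sync, provided both files hold the same records in the same order.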
Sometimes a high duplication rate results from excessive PCR enrichment of libraries before sequencing (Illumina platform). I would imagine metagenomic samples are diverse enough that it is unlikely you've exhausted all the unique molecules in the sample. I'd suggest taking a look at the library prep protocol and adjusting the number of PCR cycles used before submitting for sequencing, to see whether and how the duplication rate is affected.
When you say "duplication" rate, you mean read depth, right?
There are a few options here -- ideally you want high depth for assembly, but if you have too much data you'll max out your assembly memory. Gut shotgun metagenome data is typically not very complex compared to soil, so I am surprised you are seeing the same read depth rates in environmental samples. Which publications did you see this in? I'm not surprised about the gut read depth.
To reduce high read depths for assembly, I'm going to point you to the khmer tool. Here's the documentation. (Disclaimer: I worked on this project briefly).
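To give a feel for the kind of depth reduction meant here, below is a toy sketch of digital normalization: a read is kept only if the median count of its k-mers seen so far is under a cutoff. This is not khmer's code or API. It uses an exact in-memory dictionary, and the tiny `k` and `cutoff` values are chosen only so the toy example is easy to follow.

```python
from collections import defaultdict
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep reads whose median k-mer count so far is below `cutoff`."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            # Only reads that are kept contribute to the k-mer counts.
            for km in read_kmers:
                counts[km] += 1
    return kept


# Toy input: one sequence seen five times, another seen once.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
normalized = digital_normalize(reads)
print(len(normalized), "of", len(reads), "reads kept")
```

The deeply covered sequence is capped while the rare one is kept, which is the point for assembly memory. Because it flattens coverage, normalized data should not be used to estimate abundance.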
You'll want to map your reads back onto your assembly to establish a rank abundance curve for all the species / strains in your sample.
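Once you have mapped read counts per species or bin, the rank abundance curve is just those counts sorted and normalized. A minimal sketch follows. The bin names and counts are made-up placeholders, and it does not correct for genome or contig length, which you would probably want on real data.

```python
def rank_abundance(read_counts):
    """Return (rank, name, relative abundance) sorted from most to least abundant."""
    total = sum(read_counts.values())
    if total == 0:
        return []
    ranked = sorted(read_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, count / total)
        for rank, (name, count) in enumerate(ranked, start=1)
    ]


# Hypothetical mapped-read counts per bin.
mapped_reads = {"bin_b": 300, "bin_a": 600, "bin_c": 100}
curve = rank_abundance(mapped_reads)
for rank, name, fraction in curve:
    print(rank, name, f"{fraction:.2f}")
```

Plotting rank against relative abundance (often on a log scale) shows how strongly a few dominant organisms account for the sequencing depth.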