Dear all,
I've just received a set of paired-end metagenomic sequencing data, generated on the Illumina HiSeq 2000 platform (read length = 100 bp, insert size = 350 bp), and I have two questions about the QC steps (trimming and decontamination). I have no experience with this kind of data, so any help will be highly appreciated :)
Question #1: People in my lab suggested Trim Galore! as the trimming software, with default values except for a Phred score cutoff of 30 and a required overlap of at least 3 bp with the adapter sequence before trimming. Do these values make sense in a metagenomic context?
Question #2: I was thinking of using DeconSeq for sequence decontamination but, to the best of my understanding, the results depend heavily on the chosen parameters, namely the alignment identity and coverage thresholds, and three BWA-SW parameters: the chunk size of reads, the Z-best value, and the alignment score threshold. How should I select these to get the best results?
Thank you very much in advance!
Thank you very much for your answer, Brian! Problem is that now I have even more questions -- most of them may sound very dumb, sorry about that :)
Quality trimming: why don't you do quality-trimming prior to assembly? To avoid bias? I already know that I will be asked why I decreased the cutoff from (the suggested) Q30 to Q8, and why 8 is a good number. I mean, is there a rule of thumb for deciding what a good Phred score is, for instance by looking at the FastQC results? Do you suggest error-correcting? I am also asking Trim Galore to keep unpaired reads, which I will merge with the paired reads before assembly. Does that make sense to you?
Decontamination: we know that there is contamination from human DNA and from some other organisms that were sequenced alongside our samples (e.g., zebrafish), and we created the databases as explained on the DeconSeq website. I hadn't thought about synthetic contaminants, thanks for pointing this out. Can I use BBDuk to remove contamination from any organism? Do I do so by adding the organism to the ref parameter? Which BBDuk parameters should be set, and how?
I've also been told to remove exact duplicates (since they could be technical duplicates), and I am doing it with Clumpify from the BBMap package:
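(The command itself didn't survive here; a minimal sketch of what such a Clumpify call might look like, with placeholder filenames, interleaved input, and subs=0 so that only exact duplicates are collapsed:)

```shell
# Hypothetical reconstruction -- filenames are placeholders
# dedupe: remove duplicate reads; subs=0: require exact sequence matches
clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe subs=0
```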
does this step make sense to you?
Thank you again!
Assemblers are all at least somewhat robust to low quality (they have to be), and most take quality scores into account. So quality-trimming is typically not vital, though of course it depends on the assembler. And yes, as you surmise, the reason is mainly to avoid bias, but also to avoid data loss in low-coverage areas. If you have very deep, uniform coverage of an isolate and the platform has zero sequence-related quality bias, then absolutely, trimming to Q30 would be great! But that doesn't happen in practice with Illumina. Rather, the more closely your data approximates this, the more you benefit from aggressive quality-trimming, and vice-versa; since we assemble a lot of extreme-GC organisms, single-cell experiments, and metagenome experiments, we are probably hurt more by aggressive quality-trimming than a lab focused on GC-neutral isolates.
Q8 and Q12 are not special numbers with a theoretical reason for being useful. But based on our empirical tests, that seems to be the range where increasing the quality-score cutoff stops improving the assembly. Somewhere beyond that, assembly quality will start to decline. But this is very dependent on the data, assembler, genome, and available memory; if you keep running out of memory because low-quality data generates too many spurious kmers, then trimming above Q12 can help. On the other hand, with metagenomes or anything with very low-coverage areas that you still want to assemble, trimming beyond Q12 or so starts very obviously decreasing total genome recovery and contiguity.
In my testing, error-correction improves the results of most but not all assemblers. Specifically, it improves the results of all assemblers I've tested except for SoapDenovo/Megahit.
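(For a concrete example: BBTools includes kmer-based error correction via Tadpole. This is a hypothetical sketch with placeholder filenames, not the exact command from this thread:)

```shell
# Hypothetical example: kmer-based error correction with Tadpole
# mode=correct corrects reads instead of assembling them
tadpole.sh in=trimmed.fq.gz out=corrected.fq.gz mode=correct
```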
Not exactly. Feed the assembler both the paired and unpaired reads, but not merged together. Most assemblers have different input arguments for paired and singleton reads.
BBDuk does kmer-matching and is sufficient for short synthetic contaminants, e.g.:
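(The example command was lost from this post; a plausible sketch, with placeholder filenames, using the artifact and phiX references that ship in the BBMap resources directory:)

```shell
# Hypothetical example: kmer filtering of short synthetic contaminants
# k=31: match 31-mers against the artifact/phiX references
bbduk.sh in=reads.fq.gz out=filtered.fq.gz k=31 \
    ref=sequencing_artifacts.fa.gz,phix174_ill.ref.fa.gz
```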
It is not sufficient for large genomes that may share exact 31-mers with another genome. For that you need to use mapping against a masked reference, as indicated in this thread. A masked version of the human genome is available on my google drive (linked in that thread), and it has in fact been masked with the zebrafish genome, so actual zebrafish reads should not map to it.
That looks fine, assuming your input reads are interleaved. For paired reads in two files you need to use "in1= in2= out1= out2=". Also, there's not much reason to remove ALL duplicates unless you have a PCR-amplified library (which is not ideal for assembly). For an unamplified library, just remove the optical duplicates:
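(The command was again stripped here; a hedged sketch of optical-only duplicate removal with Clumpify, with placeholder filenames -- the dupedist value is platform-dependent:)

```shell
# Hypothetical example: remove only optical duplicates, allowing 2 mismatches
# dupedist sets the max pixel distance for "nearby" reads (platform-dependent)
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=r1_dd.fq.gz out2=r2_dd.fq.gz \
    dedupe optical dupedist=40 subs=2
```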
That restricts duplicate removal to nearby reads. The exact distance depends on the platform; what Illumina platform are you using? Also, this duplicate removal needs to be done BEFORE trimming, and I recommend allowing a couple of mismatches if you are only doing optical duplicate removal because the false-positive rate is incredibly low and allowing zero mismatches will enrich for reads with errors.
Hi Brian, thank you very much for your very detailed answer, much appreciated :)
I've discussed the project a bit more with my colleagues here in the lab, and it seems that the main objective is to extract the taxonomic profiles and eventually move to abundances; we will therefore opt for the HUMAnN2 pipeline, which, to the best of my understanding, does not require a prior assembly but relies on (QCed) short reads.
According to the HUMAnN paper (HUMAnN2 is still unpublished), "any step removing non-microbial DNA and low quality reads should be sufficient". On the other hand, I understand that, in this context too, aggressive trimming will increase data loss and thus decrease the diversity recovered. Now I guess I should dig into HUMAnN2 to understand whether longer contigs will work (I think so) and, mostly, whether they will yield better results -- and balance this against the computational effort.
Regarding some interesting points in your answer.
what assemblers have you tested?
We do have a PCR-amplified library. Should I then remove exact duplicates first (with no mismatches) and then optical duplicates allowing a couple of mismatches? We used Illumina HiSeq2000.
On a HiSeq 2000 I don't think you should have optical duplicates (and if you do find some, they should be very few).
Hi Brian,
after much more discussion, we decided to move from Trim Galore! to BBDuk and from DeconSeq to BBMap (60% faster, that's amazing!), as you suggested in your first answer! I have, however, a few more questions :$
My trimming command is as follows:
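(The command was lost from the page; a hedged reconstruction from the parameters described in the next sentence -- Q10 quality-trimming, minlen=60, singletons kept via outs= -- with placeholder filenames and typical BBDuk adapter-trimming settings:)

```shell
# Hypothetical reconstruction -- filenames are placeholders
# ktrim=r k=23 mink=11 hdist=1 tbo tpe: standard adapter-trimming settings
# qtrim=rl trimq=10: quality-trim both ends to Q10; minlen=60: length filter
# outs= keeps singleton reads whose mate was discarded
bbduk.sh in1=r1.fq.gz in2=r2.fq.gz out1=r1_trim.fq.gz out2=r2_trim.fq.gz \
    outs=singletons.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 \
    tbo tpe qtrim=rl trimq=10 minlen=60
```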
using the parameters you suggested, except for the Phred cutoff (set to 10, which seemed to us a good number between 8 and 12!) and minlength (set to 60, as suggested by the HUMAnN2 community). Unlike you, however, we would like to keep the singleton reads, and wouldn't merge the paired ones. My question is: how can I remove contaminants? Do these commands make any sense to you?
Where can I find the file "sequencing_artifacts.fa.gz" you used in your example? What does it include?
Also, my decontamination command is:
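(This command was also stripped; a hedged reconstruction based on the parameters from the linked thread -- the reference filename is a placeholder for the masked human genome mentioned earlier:)

```shell
# Hypothetical reconstruction -- reference/file names are placeholders
# minid/bwr/bw/quickmatch/fast/minhits: speed-tuned human-removal settings
# outu = unmapped (clean) reads; outm = reads mapping to the reference
bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 \
    ref=hg19_masked.fa.gz \
    in1=r1.fq.gz in2=r2.fq.gz \
    outu1=clean_r1.fq.gz outu2=clean_r2.fq.gz outm=human.fq.gz
```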
which uses your parameters from here: http://seqanswers.com/forums/showthread.php?t=42552, but I can't really understand what 'bw' is (I haven't found it in the list of BBMap parameters). Was there a reason you set maxindel to 3? I've seen higher values elsewhere (memory? time?). Also, how could I build and add another reference (e.g., zebrafish)?
Thank you very much and sorry for all these endless dumb questions!
That file is in the resources directory of the BBMap software. As I recall, parts of it may be contaminants seen (only?) at JGI. For decontamination you could use BBSplit: http://seqanswers.com/forums/showthread.php?t=41288 It allows you to specify multiple references to bin your reads against.
BBMap has been installed by our IT team, and the resources directory seems to be empty (I've downloaded phix174_ill.ref.fa.gz and adapters.fa from the GitHub, but I haven't found sequencing_artifacts.fa.gz there). Will investigate with them :)
I've had a quick read about BBSplit, but the difference from BBMap is not clear to me :| Will look into it, hoping to find my way :)
Thanks!
You can download the BBMap distribution from SourceForge (it's a small download). The resources directory is included in the distribution; copy the files you need to the server from there.
BBSplit allows multiple references to be used at the same time and gives you control over how to bin/handle reads whose mapping is ambiguous (e.g., include them in more than one genome, or toss them altogether, etc.).
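(A hedged sketch of such a BBSplit run, with placeholder filenames -- here reads matching either contaminant genome are binned per reference and the leftovers are kept as the clean output:)

```shell
# Hypothetical example: bin reads against two contaminant references at once
# basename uses % as a placeholder for each reference name
# ambiguous2=toss discards reads that map to more than one reference
# outu1/outu2 capture reads matching neither reference (the clean data)
bbsplit.sh in1=r1.fq.gz in2=r2.fq.gz ref=human.fa,zebrafish.fa \
    basename=binned_%.fq.gz outu1=clean_r1.fq.gz outu2=clean_r2.fq.gz \
    ambiguous2=toss
```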
Do you mean here? https://github.com/BioInfoTools/BBMap (https://github.com/BioInfoTools/BBMap/tree/master/resources)
@Brian has a repo on GitHub but he asks people to use the download from here.
I see, sorry! Thanks a lot!
I corrected my post above. Should have included the download link before.