Metagenomics data: trimming and decontamination
1
2
Entering edit mode
7.8 years ago
alesssia ▴ 580

Dear all,

I've just gotten a set of paired-end metagenomic sequencing data, performed on the Illumina HiSeq2000 platform (read length=100 bp, insert size=350 bp) and I have two questions about QC steps (trimming and decontamination) -- I have no experience on this kind of data and any help will be highly appreciated :)

Question #1: People in my lab suggested Trim Galore! as trimming software with default values, except for Phred score = 30 and overlap of at least 3bp with adapter sequence required to trim a sequence. Do these values make sense in a metagenomic context?

Question #2: I was thinking of using deconseq for the sequence decontamination, but, to the best of my understanding, the results are highly dependent on the chosen parameters, that are the percentage of alignment identity and coverage threshold, and three BWA-SW parameters: chunk size of reads, Z-best value, and alignment score threshold. How to select these in order to have the best results?

Thank you very much in advance!

metagenomics Trim Galore deconseq sequencing • 7.2k views
ADD COMMENT
2
Entering edit mode
7.8 years ago

People in my lab suggested Trim Galore! as trimming software with default values, except for Phred score = 30 and overlap of at least 3bp with adapter sequence required to trim a sequence. Do these values make sense in a metagenomic context?

No, not really. Quality-trimming to Q30 is far too aggressive; you'll lose a lot of good data and increase sequence-specific coverage bias, yielding an inferior assembly. Using a minimum adapter length of 3 when most of your reads are non-overlapping will also incur sequence-specific coverage bias - that's too short. You can much better trim short adapter sequences from paired reads with BBDuk's "tbo" flag, allowing trimming of even 1bp adapter sequence without incurring bias.

As for decontamination, what are you trying to remove? You can't decontaminate unless you know either what you are sequencing, or what the contaminant is, and with an arbitrary metagenome, you don't know either, other than that human and a few select other organisms are common contaminants in reagents.

I suggest you use the first two steps in this post for adapter-trimming and contaminant removal (of synthetic contaminants). The other steps may or may not be relevant in this case depending on the assembler you are using (they are targeted at Spades). Before you try removing other contaminants, it's worthwhile to figure out what the contaminants might be, but this post might help. I don't usually do quality-trimming prior to assembly, but when I do, it's typically at around Q8 to Q12, not higher.

ADD COMMENT
0
Entering edit mode

Thank you very much for your answer, Brian! Problem is that now I have even more questions -- most of them may sound very dumb, sorry about that :)

Quality trimming: why you don't do quality-trimming prior to assembly? To avoid bias? I already know that I will be asked why I decreased from (the suggested) Q30 to Q8, and why 8 is a good number. I mean, It there a thumb rule to decide what is good Phred score, for instance looking at fastQC results? Do you suggest error-correcting? I am also asking Trim Galore to keep unpaired reads, that I will merge with the paired reads before assembly. Does it make sense to you?

Decontamination: we know that there is human contamination and from some other organisms that were sequenced with our samples (e.g., zebrafish) and created the databases as explained on deconseq website. I haven't thought about synthetic contaminants, thanks for pointing this out. Can I use bbduk to remove contamination from any organism? Do I do so by adding the organism to the ref parameter? What (and how) BBduk parameters should be set?

I've also been told to remove exact duplicates (since they could be technical duplicates) and I am doing it with BBmap clumpify:

clumpify.sh in=in.fq  out=out.fq qin=33 dedupe subs=0 threads=4

does this step make sense to you?

Thank you again!

ADD REPLY
2
Entering edit mode

Quality trimming: why you don't do quality-trimming prior to assembly? To avoid bias? I already know that I will be asked why I decreased from (the suggested) Q30 to Q8, and why 8 is a good number.

Assemblers are all at least somewhat robust to low quality (they have to be), and most take into account quality scores. So quality-trimming is typically not vital, but of course, it depends on the assembler. And yes, as you surmise, the reason is mainly to avoid bias, but also to avoid data loss in low-coverage areas. If you have very deep, uniform coverage of an isolate and the platform has zero sequence-related quality bias, then absolutely, trimming to Q30 would be great! But that doesn't happen in practice, with Illumina. Rather, the more closely your data approximates this, the more you benefit from aggressive quality-trimming, and vice-versa; since we assemble a lot of extreme-GC organisms, single-cell experiments, and metagenome experiements, we are probably hurt more from aggressive quality-trimming than a lab focused on GC-neutral isolates.

Q8 and Q12 are not special numbers with a theoretical reason for being useful. But based on our empirical tests, that seems be the range where increasing the quality score cutoff stops improving the assembly. Sometime later, the assembly quality will start to decline. But this is very dependent on the data, assembler, genome, and memory available; if you keep running out of memory because of low quality data generating too many spurious kmers, then trimming above Q12 can help. On the other hand, with metagenomes or anything with very low coverage areas that you still want to assemble, trimming beyond Q12 or so starts very obviously decreasing the total genome recovery and contiguity.

Do you suggest error-correcting?

In my testing, error-correction improves the results of most but not all assemblers. Specifically, it improves the results of all assemblers I've tested except for SoapDenovo/Megahit.

I am also asking Trim Galore to keep unpaired reads, that I will merge with the paired reads before assembly. Does it make sense to you?

Not exactly. Feed the assembler both the paired and unpaired reads, but not merged together. Most assemblers have different input arguments for paired and singleton reads.

Can I use bbduk to remove contamination from any organism? Do I do so by adding the organism to the ref parameter? What (and how) BBduk parameters should be set?

BBDuk does kmer-matching and is sufficient for short synthetic contaminants, e.g.:

bbduk.sh in=trimmed.fq out=filtered.fq k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz

It is not sufficient for large genomes that may share exact 31-mers with another genome. For that you need to use mapping against a masked reference, as indicated in this thread. A masked version of the human genome is available on my google drive (linked in that thread), and it has in fact been masked with the zebrafish genome, so actual zebrafish reads should not map to it.

I've also been told to remove exact duplicates (since they could be technical duplicates) and I am doing it with BBmap clumpify:

That looks fine, assuming your input reads are interleaved. For paired reads in two files you need to use "in1= in2= out1= out2=". Also, there's not much reason to remove ALL duplicates unless you have a PCR-amplified library (which is not ideal for assembly). For an unamplified library, just remove the optical duplicates:

clumpify.sh in=in.fq out=out.fq qin=33 dedupe subs=0 threads=4 optical dist=40

That restricts duplicate removal to nearby reads. The exact distance depends on the platform; what Illumina platform are you using? Also, this duplicate removal needs to be done BEFORE trimming, and I recommend allowing a couple of mismatches if you are only doing optical duplicate removal because the false-positive rate is incredibly low and allowing zero mismatches will enrich for reads with errors.

ADD REPLY
0
Entering edit mode

Hi Brian, thank you very much for your very detailed answer, much appreciated :)

I've discussed a bit better the project with my colleagues here in the lab, and it seems that the main objective is to extract the taxonomic profiles and eventually move to abundances and, thus, to opt for the HUMAnN2 pipeline, that, to the best of my understanding, does not require a prior assembly but relies on (QCed) short reads.
According to the HUMAnN paper (HUMAnN2 is still unpublished) "any step removing non-microbial DNA and low quality reads should be sufficient". On the other hand, I've understood that, also in this context, an aggressive trimming will increase data loss and thus decrease the diversity recovery. Now I guess I should dig into HUMAnN2 to understand if longer contig will work (think so) and (mostly) whether they will yield to better results --and balance this with the computational efforts.

Regarding some interesting points in your answer.

In my testing, error-correction improves the results of most but not all assemblers. Specifically, it improves the results of all assemblers I've tested except for SoapDenovo/Megahit.

what assemblers have you tested?

there's not much reason to remove ALL duplicates unless you have a PCR-amplified library [...] just remove the optical duplicates: [...] That restricts duplicate removal to nearby reads. The exact distance depends on the platform; what Illumina platform are you using? [...] and I recommend allowing a couple of mismatches if you are only doing optical duplicate removal

We do have a PCR-amplified library. Should I then remove exact duplicates first (with no mismatches) and then optical duplicates allowing a couple of mismatches? We used Illumina HiSeq2000.

ADD REPLY
1
Entering edit mode

then optical duplicates allowing a couple of mismatches? We used Illumina HiSeq2000.

On a HiSeq 2K I don't think you should have optical duplicates (and if you do find some then they should be very small in number).

ADD REPLY
0
Entering edit mode

Hi Brian,

after much more talking, we decided to move from Trim Galore! to BBduk and from deconseq to BBmap (60% faster, that's amazing!), as you suggested on your first answer! I have, however, a few more questions :$

My trimming command is as following:

bbduk.sh -Xmx16g in=mysample_dedupe_R1.fq.gz in2=mysample_dedupe_R2.fq.gz out=mysample_trimmed_R1_tmp.fq out2=mysample_trimmed_R2_tmp.fq outs=mysample_trimmed_singletons.fq ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=10 minlength=60 ref=./resources/adapters.fa qin=33 threads=4 tbo tpe cf ow

using the parameters you suggested but phred (set to 10, that it seemed us a good number between 8 and 12!) and minlength (set to 60 as suggested by the Humann2 community). Differently from you, however, we would like to keep the singletons reads, and wouldn't merge the paired one. My question is: how can I remove contaminants? Do these commands make any sense to you?

bbduk.sh in=mysample_trimmed_R1_tmp.fq in2=mysample_trimmed_R2_tmp.fq out=$mysample_trimmed_R1.fq out2=mysample_trimmed_R2.fq k=31 ref=./resources/phix174_ill.ref.fa.gz threads=4 ow

bbduk.sh in=mysample_trimmed_singletons_tmp.fq out=mysample_trimmed_singletons.fq k=31 ref=./resources/phix174_ill.ref.fa.gz threads=4 how

Where can I find the file "sequencing_artifacts.fa.gz" you used in your example? What does it include?

Also, my decontamination command is:

bbmap.sh -Xmx32g in=mysample.fq outu=mysample_clean.fq outm=mysample_cont.fq minid=0.95 maxindel=3 bwr=0.16 bw=12 minhits=2 qtrim=rl trimq=10 path=./resources qin=33 interleaved=f threads=4 untrim quickmatch fast ow

that uses your parameters here: http://seqanswers.com/forums/showthread.php?t=42552, but I can't really understand what is 'bw' (I haven't found it on the list of BBmap parameters). Was it any reason because you set maxindel to 3? I saw higher values elsewhere (memory? time?). Also, how could I build and add another reference (e.g., zebrafish)?

Thank you very much and sorry for all these endless dumb questions!

ADD REPLY
2
Entering edit mode

Where can I find the file "sequencing_artifacts.fa.gz" you used in your example?

That file is in the resources directory in bbmap software. As I recall, parts of it may be contaminants seen (only?) at JGI.

For decontamination you could use bbsplit : http://seqanswers.com/forums/showthread.php?t=41288 It allows you to specify multiple references to bin your reads against.

ADD REPLY
0
Entering edit mode

BBmap has been installed by our IT team, and the resources directory seems to be empty (I've downloaded phix174_ill.ref.fa.gz and adapters.fa from the GitHub, but I haven't found sequencing_artifacts.fa.gz there). Will investigate with them :)

I've quickly read about bbsplit, but it is not clear to me the difference with bbmap :| Will look into it, hoping of finding my way :)

Thanks!

ADD REPLY
1
Entering edit mode

You can download bbmap distro from sourceforge (small download). The resources directory will be in the distribution. Copy the files you need to the server from there.

BBsplit allows multiple references to be used at the same time and gives you control over how you want bin/handle reads if the mapping is ambigous (e.g. include in more than one genome or toss altogether etc).

ADD REPLY
2
Entering edit mode

@Brian has a repo on GitHub but he asks people to use the download from here.

ADD REPLY
0
Entering edit mode

I see, sorry! Thanks a lot!

ADD REPLY
1
Entering edit mode

I corrected my post above. Should have included the download link before.

ADD REPLY

Login before adding your answer.

Traffic: 2013 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6