Which approaches and/or tools for detection and removal of contaminants?
7.3 years ago
Lucas Peres ▴ 80

Hello everyone.

I'm new to the forum and not a native English speaker, so I will try to be as clear as possible in my question. Forgive me for any mistakes.

Currently, I'm a CS student learning and working with bioinformatics in a biology lab (for roughly 3 months). I'm still learning to preprocess sequence data and I need a tool to detect and remove contaminants. I have seen tools like Trimmomatic and Trim Galore! for filtering and trimming primers and adapters, which are straightforward to remove since they appear at the ends of the reads, but my advisor wants me to find an approach to clear "intra-read" contaminants (to be clear, I don't mean low-quality bases, but foreign sequences that don't belong to the organism being sequenced), especially if such a tool exists on the Galaxy platform. I have found VecScreen, which hasn't served us well because the datasets are too big to be uploaded to a web-based tool.

The closest solutions I found were DeconSeq (http://deconseq.sourceforge.net/) and some approaches using Biopython and/or BioPerl to run BLAST+ alignments for detecting contaminants, along with other tools to clean the dataset (like Prinseq). So far, I haven't seen a Galaxy server with DeconSeq (if there is one), which is why I'm trying to use the standalone version. If anyone has used this tool, please tell me whether it fulfills its purpose well. If anyone knows another approach, I would be grateful to hear about it.

I know my question may be very basic, but I decided to open a post because I have deadlines to deliver some results and I don't want to waste time on something that won't serve me.

Anyway, I'm very open to advice and suggestions from someone more experienced who knows how to deal with contaminants!

Many thanks.

sequence blast sequencing • 3.8k views

BBMap's SendSketch is a fast way to screen raw reads for contaminants, much faster than BLAST. Usage:

#For nt:
sendsketch.sh in=reads.fq reads=400k

#For RefSeq:
sendsketch.sh in=reads.fq reads=400k refseq

Thank you very much! I didn't know BBMap, will take a look.


Could you refine your question? What do you mean by "contaminants"?


I mean a sequence of foreign origin that doesn't belong to the organism that was sequenced. More specifically, I'm seeking a solution to remove vector contaminants.


Thanks genomax and h.mon! I will take a look at BBTools and the other options mentioned by genomax. :)


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

7.3 years ago
GenoMax 148k

This is a non-Galaxy solution. If you wish to keep only the reads that belong to a known genome, you could use BBSplit from the BBMap suite. An example would look like this:

bbsplit.sh in=reads.fq ref=genome_interest.fa out=interesting.fq outu=clean.fq

interesting.fq = reads from the genome of interest, clean.fq = all other reads.

7.3 years ago
h.mon 35k

Is screening FASTQ files for vector sequences the only kind of contamination you are worried about? You can use BBDuk (from the BBTools suite, mentioned by genomax for BBSplit) or SeqyClean to simultaneously trim adapters and low-quality bases and remove contaminant reads: just use UniVec as the contaminant database. UniVec is the same database used by VecScreen. There are other options for FASTQ screening (FastQ Screen and MGA, for example); they generally rely on BWA or Bowtie to map reads to contaminant references. You have to set up your contaminant reference yourself, whatever it may be.
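A minimal sketch of the BBDuk route, assuming UniVec has already been downloaded as a FASTA file. All file names here are hypothetical, and k=31 and hdist=1 are only illustrative starting values, not tuned settings. Reads sharing a k-mer with UniVec go to outm=, everything else goes to out=, and stats= lists which reference sequences were hit:

```shell
#!/bin/sh
# Hypothetical file names; adjust to your own data.
READS=reads.fq              # raw reads to screen
UNIVEC=UniVec.fa            # UniVec FASTA, downloaded beforehand
CLEAN=clean.fq              # reads with no k-mer match to UniVec
HITS=vector_hits.fq         # reads sharing at least one k-mer with UniVec
STATS=vector_stats.txt      # per-reference hit summary

# Only run when the tool and both input files are actually present.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f "$READS" ] && [ -f "$UNIVEC" ]; then
    # k and hdist are illustrative values (assumption), not recommendations.
    bbduk.sh in="$READS" ref="$UNIVEC" out="$CLEAN" outm="$HITS" \
        k=31 hdist=1 stats="$STATS"
    STATUS=ran
else
    echo "Skipped: bbduk.sh, $READS or $UNIVEC not found" >&2
    STATUS=skipped
fi
```

Note that this discards whole reads that contain vector sequence. BBDuk also has k-mer trimming and masking modes (ktrim, kmask) if you would rather keep the remainder of each read.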

If you want to screen for other kinds of contaminants in other kinds of datasets, there are other options as well. Centrifuge, Kraken or Clark can be powerful and fast options for screening FASTQ files for all kinds of contaminants. And there are options for screening genome assemblies (such as blobtools).
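As a rough illustration of the taxonomic-classifier route, here is a sketch using Kraken 2 (an assumption on my part; any of the classifiers above would do). The database path and file names are hypothetical, and you need to build or download a database first. The report file summarises how many reads were assigned to each taxon, which is usually enough to spot unexpected organisms:

```shell
#!/bin/sh
# Hypothetical paths; adjust to your own setup.
READS=reads.fq                  # raw reads to screen
DB=/path/to/kraken2_db          # pre-built Kraken 2 database directory
REPORT=kraken_report.txt        # per-taxon summary
CALLS=kraken_calls.txt          # per-read classifications

# Only run when the tool, the database and the reads are actually present.
if command -v kraken2 >/dev/null 2>&1 && [ -d "$DB" ] && [ -f "$READS" ]; then
    kraken2 --db "$DB" --report "$REPORT" --output "$CALLS" "$READS"
    KRAKEN_STATUS=ran
else
    echo "Skipped: kraken2, $DB or $READS not found" >&2
    KRAKEN_STATUS=skipped
fi
```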
