Illumina Trimming Algorithm

5

Entering edit mode

15.0 years ago

Science_Robot ★ 1.1k

Has anyone written a script to trim fastq files? I read in another post somewhere that these scripts are a dime a dozen. So, could we share them to prevent people from eternally reinventing the wheel?

Here's mine so far, which is a modified version of a modified version of Richard Mott's trimming algorithm used in CLC Genomics Workbench:

def mott(dna, **kwargs):
    ''' Trims a sequence using modified Richard Mott algorithm '''
    try:
        limit = kwargs['limit']
    except KeyError:
        limit = 0.05
    tseq = []
    S = 0
    pos = 0
    for q, n in zip(dna.qual, dna.seq):
        pos += 1
        #if q == 'B': continue
        Q = ord(q) - 64
        p = 10**(Q/-10.0)
        s = S
        S += limit - p
        if S < 0: S = 0
        #print '%-3s %-3s %-3s %-3s %4.3f %6.3f %7.3f' % (pos, n, q, Q, p, limit - p, S),
        if s > S:
            break
        else:
            tseq.append(n)
    dna.seq = ''.join(tseq)
    dna.qual = dna.qual[0:len(dna.seq)]
    return dna

I should mention that dna is an object I've created which has some properties: seq is the sequence and qual is the quality scores (their ascii representation). This algorithm only works for Illumina 1.5 PHRED scores.

I'm using it in a pipeline for metagenome analysis of Illumina data. I'm also writing a median sliding window algorithm to see if that works better.

python trimming illumina • 20k views

ADD COMMENT • link updated 3.8 years ago by Jiacheng ▴ 60 • written 15.0 years ago by Science_Robot ★ 1.1k

6

Entering edit mode

what's wrong with the stuff in fastx toolkit?

ADD REPLY • link 15.0 years ago by brentp 24k

3

Entering edit mode

I'd also suggest that whilst the code golf is fine, and posting code to ask for help/suggestions is fine, dumping code snippets here when there are a dozen implementations elsewhere is not what BioStar should be used for. Stick it on github.

ADD REPLY • link 15.0 years ago by User 59 13k

0

Entering edit mode

I don't know who Richard Mott is but it would be helpful if you describe algorithms in English, e.g. "this trims all base pairs to the right of the first position where quality drops more than 'limit'"

ADD REPLY • link 15.0 years ago by Jeremy Leipzig 23k

0

Entering edit mode

It would be better to choose an existing software to do Illumina trimming rather than reinventing the wheel. Considering up-to-date precision and speed benchmarks, I'd recommend atria.

ADD REPLY • link 3.8 years ago by Jiacheng ▴ 60

17

Entering edit mode

14.3 years ago

Vince Buffalo ▴ 470

Our group (The UC Davis Bioinformatics Core) has some internal tools we've developed. The first is Scythe, an adapter trimmer. This uses the quality information in a FASTQ entry and a prior to decide whether a 3' substring is adapter. Very basically, it takes a naïve Bayesian approach to classifying 3'-end contaminants only. Because these are the most poor quality bases and most likely to be contaminated (especially as reads get longer and longer), Scythe is designed to specifically remove these contaminants. Removing other contaminants can be done with other tools. The prior can be set to different thresholds; I recommend using less to get a sense of the 3'-end adapter contaminant rate. If you're doing an assembly, you may desire very, very strict trimming, i.e. if the adapter contamination seems high, and the 3'-end adapter begins with GATC, removing all 3'-end GAT, GA, and GATC substrings (as well as all the longer more likely matches). We find this works well in our projects, but and feedback is welcome.

[Sickle] is a sliding window quality trimmer, designed to be used after Scythe. Unlike cutadapt and other tools, our pipelines remove adapter contaminants before quality trimming, as removing poor quality bases throws away any useful information that could be used in identifying a 3'-end adapter contaminant. Thus, our quality control system works by first looking at preliminary quality checks via my qrqc package, then running Scythe to remove adapters, then do quality trimming with Sickle, then do another qrqc run to see how the sequences have changed.

Sometimes Scythe seems a bit greedy, but upon further inspection it's almost always "too" greedy with poor quality sequences, which would be removed by Sickle anyways.

Nik (a member in our group) also wrote Sabre, a bardcode demultiplexing and trimming too (I think this is still alpha though).

Any feedback would be appreciated!

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 14.3 years ago by Vince Buffalo ▴ 470

0

Entering edit mode

Is there a page with the whole workflow for paired end data in code or pseudocode? In other words, the code for scythe, then the code needed for sickle, (and maybe sabre), followed by an assembler, eg VelvetOptimiser?

ADD REPLY • link 13.6 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Is there a page with the whole workflow for paired end data in code or pseudocode? In other words, the code for scythe, then the code needed for sickle, (and maybe sabre), followed by an assembler, eg VelvetOptimiser? Also some files that are mentioned like the adapter file are not well described, even though I have a good hint that it is a simple fasta file of the adapter sequences.

ADD REPLY • link 13.6 years ago by Lee Katz ★ 3.2k

14

Entering edit mode

15.0 years ago

Neilfws 49k

A quick Google search for "fastq trim" throws up:

FASTX-Toolkit - which includes FASTQ/A Trimmer
Working with FASTQ files in Biopython - with concise sample code for trimming
Trimming adaptors from short read sequences - another Biopython example
Mail list discussion on trimming FASTQ using EMBOSS trimest
FASTQ manipulation in Galaxy - includes a trimmer

"A dime a dozen" sounds about right :-) So by all means, share here, but people should be aware that there are are plenty of solutions "out there".

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.0 years ago by Neilfws 49k

1

Entering edit mode

"Trimming" is a catch-all term for several types of operation, including removal of low quality bases, unwanted sequence (e.g adapter/primer/vector/barcodes) and homopolymers (e.g. poly A), or a combination of these.

A universal trimmer would need parameters for different quality encodings, trim position limits and sequence matching algorithms, not to mention being very fast and reliable. I think this helps to explain the explosion of "local" solutions.

Expect more of them if people start to move to BAM format for unaligned reads.

ADD REPLY • link 15.0 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

Fastx has uneven documentation. The FastQ/A trimmer (fastx_trimmer) is not quality based. The fastq_quality_trimmer, is quality based, but is not actually mentioned on this page.

ADD REPLY • link updated 6.9 years ago by Ram 45k • written 15.0 years ago by Jeremy Leipzig 23k

6

Entering edit mode

13.6 years ago

lh3 33k

As audyyy's code specifically does quality trimming, I will comment on the quality trimming only.

Phred uses the so-called modifed Richard Mott's algorithm. The Phred documentation gives the implementation details. It trims both ends. The BWA trimming algorithm largely learned from phred with modifications for trimming from one end only. For quality-based trimming, the Phred or the BWA algorithm is preferred. Both of them has a single user defined parameter. They are fast (one-pass scan of the quality string) and fairly easy to implement (the core algorithm in 10 lines of code).

Fastx trims bases with quality below a threshold. Although for Illumina this algorithm may be okay, it is not so robust as the phred or the bwa algorithm. There are also a variety of window-based trimming methods. The problem with these algorithms is that they have more parameters and thus less flexible. My feeling is they are also harder to implement if we want to achieve the speed and quality of the phred or bwa algorithm (although I have not tried).

I have just added quality based trimming in seqtk. It uses the Phred algorithm to trim both ends. If after trimming, the read length is below 30 (by default), it tries to find a 30bp window with the highest sum of quality.

Seqtk is much faster than the existing stand-alone trimmers. It takes 46 CPU seconds to trim a fastq with 100 million lines, while fastx_quality_trimmer takes more than 500 CPU seconds - over 10 times slower. A trimmer in Python/Perl should be even slower as scripting languages are usually very inefficient to loop through each character in a string. You may try.

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 13.6 years ago by lh3 33k

2

Entering edit mode

13.6 years ago

Lee Katz ★ 3.2k

I wrote my own Perl subroutine. The arguments for the subroutine are the shuffled fastq file and a hash reference with options. The options are: numcpus, min_quality (for trimming only), bases_to_trim (max bases that will be trimmed on either side), min_avg_quality, and min_avg_quality_window (window of bases to average together). There are defaults for each of these options. It returns a reference to an array consisting of fastq paired entries (8 lines per array element).

The subroutine relies on two subroutines that you can write on your own: logmsg which prints out a status message, and LKUtils::getNumCPUs() which doesn't matter if you supply the number of CPUs.

So, I can't say that this is the best piece of work, and it lacks a way to take care of singletons like the UC Davis software. However, it looks like it is working, and it is kind of fast because of multithreading. Taking care of singletons in a large dataset could be a minor issue. Because I am not the best programmer ever and because I only tested it a little, any feedback is appreciated.

	use threads;
	use Thread::Queue;
	# use YourUtils qw/logmsg getNumCPUs/;
	my $seqs=qualityTrimFastq("raw.fastq",{numcpus=>4,min_avg_quality=>20});
	for(@$seqs){
	print $_; # => an entire 8-line entry ($id$seq\n+\n$qual\n, for each of the two reads in the pair)
	}

	# reads a shuffled paired end fastq and trims it
	# and cleans it
	# Author: Lee Katz <lkatz@cdc.gov>
	sub qualityTrimFastq($;$){
	my($fastq,$settings)=@_;
	$$settings{numcpus}\|\|=LKUtils::getNumCPUs();
	# trimming
	$$settings{min_quality}\|\|=30;
	$$settings{bases_to_trim}\|\|=10; # max number of bases that will be trimmed on either side
	# cleaning
	$$settings{min_avg_quality}\|\|=20;
	$$settings{min_avg_quality_window}\|\|=70;
	$$settings{min_length}\|\|=62; # twice hash length sounds good

	my (@t,$t);
	my $Q=Thread::Queue->new;
	for(0..$$settings{numcpus}-1){
	$t[$t++]=threads->create(sub{
	logmsg "Launching thread TID".threads->tid;
	my(@entryOut);
	while(defined(my $entry=$Q->dequeue)){
	my($id1,$seq1,undef,$qual1,$id2,$seq2,undef,$qual2)=split(/\s\n\s/,$entry);
	my %read1=(id=>$id1,seq=>$seq1,qual=>$qual1,length=>length($seq1));
	my %read2=(id=>$id2,seq=>$seq2,qual=>$qual2,length=>length($seq2));
	# quality trim: read 1, 3' read
	my($numToTrim3,$numToTrim5,@qual);
	$numToTrim3=$numToTrim5=0;
	@qual=map(ord($_)-33,split(//,$read1{qual}));
	for(my $i=0;$i<$$settings{bases_to_trim};$i++){
	$numToTrim3++ if($qual[$i]<$$settings{min_quality});
	last if($qual[$i]>=$$settings{min_quality});
	}

	$read1{seq}=substr($read1{seq},$numToTrim3,$read1{length}-$numToTrim3-$numToTrim5);
	$read1{qual}=substr($read1{qual},$numToTrim3,$read1{length}-$numToTrim3-$numToTrim5);
	# quality trim: read 1, 5' read
	$numToTrim3=$numToTrim5=0;
	@qual=map(ord($_)-33,split(//,$read2{qual}));
	for(my $i=0;$i<$$settings{bases_to_trim};$i++){
	$numToTrim3++ if($qual[$i]<$$settings{min_quality});
	last if($qual[$i]>=$$settings{min_quality});
	}
	for(my $i=$read2{length}-$$settings{bases_to_trim};$i<@qual;$i++){
	$numToTrim5++ if($qual[$i]<$$settings{min_quality});
	last if($qual[$i]>=$$settings{min_quality});
	}
	$read2{seq}=substr($read2{seq},$numToTrim3,$read2{length}-$numToTrim3-$numToTrim5);
	$read2{qual}=substr($read2{qual},$numToTrim3,$read2{length}-$numToTrim3-$numToTrim5);

	#cleaning stage
	# TODO keep singletons
	next if(length($read1{seq})<$$settings{min_length});
	next if(length($read2{seq})<$$settings{min_length});
	# avg quality
	my $qual_is_good=1;
	QUALITY_CHECK:
	for my $qualStr ($read1{qual},$read2{qual}){
	my $length=length($qualStr)-$$settings{min_avg_quality_window};;
	my @qual=map(ord($_)-33,split(//,$qualStr));
	for(my $i=0;$i<$length;$i++){
	my @subQual=@qual[$i..($$settings{min_avg_quality_window}+$i-1)];
	my $avgQual=sum(@subQual)/$$settings{min_avg_quality_window};
	if($avgQual<$$settings{min_avg_quality}){
	#print "$i: ".join(".",@subQual)."\n$avgQual <=> $$settings{min_avg_quality}\n";die;
	$qual_is_good=0;
	last QUALITY_CHECK;
	}
	}
	}
	next if(!$qual_is_good);

	my $entryOut="$read1{id}\n$read1{seq}\n+\n$read1{qual}\n$read2{id}\n$read2{seq}\n+\n$read2{qual}\n";
	push(@entryOut,$entry);
	}
	return \@entryOut;
	});
	}

	my $entryCount=0;
	open(FQ,'<',$fastq) or die "Could not open $fastq for reading: $!";
	while(my $entry=<FQ>){
	$entry.=<FQ> for(1..7); # 8 lines total for paired end entry
	$Q->enqueue($entry);
	$entryCount++;
	logmsg "Finished loading $entryCount pairs" if($entryCount%100000==0);
	#if($entryCount>99999){
	# logmsg "DEBUG: only $entryCount loaded";
	# last;
	#}
	}
	close FQ;
	$Q->enqueue(undef) for(1..$$settings{numcpus});

	# status update
	while(my $p=$Q->pending){
	logmsg "$p entries left to process";
	sleep 10;
	}

	my @entry;
	for(@t){
	my $tmp=$_->join;
	push(@entry,@$tmp);
	}

	my $numCleaned=$entryCount-@entry;
	logmsg "$numCleaned entries out of $entryCount removed due to low quality";
	return \@entry;
	}

view raw biostars_16336.pl hosted with ❤ by GitHub

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 13.6 years ago by Lee Katz ★ 3.2k

Login before adding your answer.