Has anyone written a script to trim fastq files? I read in another post somewhere that these scripts are a dime a dozen. So, could we share them to prevent people from eternally reinventing the wheel?
Here's mine so far, which is a modified version of a modified version of Richard Mott's trimming algorithm used in CLC Genomics Workbench:
def mott(dna, **kwargs):
''' Trims a sequence using modified Richard Mott algorithm '''
try:
limit = kwargs['limit']
except KeyError:
limit = 0.05
tseq = []
S = 0
pos = 0
for q, n in zip(dna.qual, dna.seq):
pos += 1
#if q == 'B': continue
Q = ord(q) - 64
p = 10**(Q/-10.0)
s = S
S += limit - p
if S < 0: S = 0
#print '%-3s %-3s %-3s %-3s %4.3f %6.3f %7.3f' % (pos, n, q, Q, p, limit - p, S),
if s > S:
break
else:
tseq.append(n)
dna.seq = ''.join(tseq)
dna.qual = dna.qual[0:len(dna.seq)]
return dna
I should mention that dna
is an object I've created which has some properties: seq
is the sequence and qual
is the quality scores (their ascii representation). This algorithm only works for Illumina 1.5 PHRED scores.
I'm using it in a pipeline for metagenome analysis of Illumina data. I'm also writing a median sliding window algorithm to see if that works better.
what's wrong with the stuff in fastx toolkit?
I'd also suggest that whilst the code golf is fine, and posting code to ask for help/suggestions is fine, dumping code snippets here when there are a dozen implementations elsewhere is not what BioStar should be used for. Stick it on github.
I don't know who Richard Mott is but it would be helpful if you describe algorithms in English, e.g. "this trims all base pairs to the right of the first position where quality drops more than 'limit'"
It would be better to choose an existing software to do Illumina trimming rather than reinventing the wheel. Considering up-to-date precision and speed benchmarks, I'd recommend atria.