Question

Trim Certain Nucleotides Off Fastq Reads

0

Entering edit mode

12.4 years ago

bbio ▴ 90

Let's say I have a fastq file with NGS reads and I want to trim off all trailing Cs, such that ACGTCCC becomes ACGT but ACGTCCG stays the same. I do not want to trim by quality or anything like that and the amount of Cs may differ from read to read. Are there any tools out there that are capable of doing this? Also, are there any potential problems I might run into when I then align these reads to a genome using BWA?

fastq next-gen trimming • 4.7k views

ADD COMMENT • link updated 9.3 years ago by Brian Bushnell 20k • written 12.4 years ago by bbio ▴ 90

0

Entering edit mode

If the poly-C tails are product of errors in sequencing (and maybe it's true), it's better to trim your reads before mapping.

ADD REPLY • link 12.4 years ago by JC 13k

score 2 · Answer 1 · 2012-11-15

2

Entering edit mode

12.4 years ago

Istvan Albert 102k

I don't know of a tool but it is something that one could write in python/perl/ruby, here is an attempt to do it looks like it would take 10 seconds per million of reads:

import re, string, itertools

# splitting pattern
patt = re.compile("C+$")

# open stream
stream = file("data.fq")

# strip ending new lines
stream = itertools.imap(string.strip, stream)

# loop over the file four lines at a time
for id in stream:
    seq  = stream.next()
    tmp  = stream.next()
    qual = stream.next()

    # attempt to split the string at pattern
    seq = re.split(patt, seq)[0]

    # resize quality string to match sequence
    qual = qual[:len(seq)]

    # output the trimmed fastq
    print id
    print seq
    print tmp
    print qual

ADD COMMENT • link 12.4 years ago by Istvan Albert 102k

0

Entering edit mode

benchmark with 1 million reads, 50b long, my Perl (v5.10.0) code vs your Python (v2.6.1) code

 $ time perl trim.pl < data.fq > out1
 real 0m10.211s
 user 0m6.857s
 sys 0m0.668s
 $ time python trim.py > out2
 real 0m15.798s
 user 0m13.552s
 sys 0m0.607s

Data used: https://dl.dropbox.com/u/386663/data.fq.gz (46 MB)

ADD REPLY • link 12.4 years ago by JC 13k

0

Entering edit mode

Tried to optimize this alas as I discovered python has innate disadvantage at the IO level. It seems that a simple code that strips each line and outputs it again runs almost three times slower in python than perl.

import sys
for line in sys.stdin:
    line = line.strip()
    print line

vs the equivalent:

while (<>) {
   chomp;
   print "$_\n";
}

will yield

$ time python ia2.py < data.fq  > a

real    0m4.222s
user    0m4.149s
sys     0m0.030s

$ time perl jc1.pl < data.fq  > b

real    0m1.554s
user    0m1.404s
sys     0m0.124s

so that's a starting disadvantage I cannot optimize nor can I overcome.

ADD REPLY • link 12.4 years ago by Istvan Albert 102k

0

Entering edit mode

Great, thank you both very much! I probably should have just written it myself in the first place, I was just wondering if there was a proper tool for this out there already.

ADD REPLY • link 12.4 years ago by bbio ▴ 90

0

Entering edit mode

Well, Perl won this one, last time you and Matted crushed it.

ADD REPLY • link 12.4 years ago by JC 13k

score 2 · Answer 2 · 2012-11-15

Perl option:

#!/usr/bin/perl

use strict;
use warnings;

my $nl  = 0;
my $pos = 0;
my $len = 0;
while (<>) {
    chomp;
    $nl++;
    if ($nl == 2) {
        if (m/C+$/) {
            $pos = $-[0];
            $len = $+[0] - $-[0];
            substr($_, $pos, $len) = '';
        }
        else {
            $len = 0;
        }
    }
    elsif ($nl == 4) {
        if ($len > 0) {
            substr($_, $pos, $len) = '';
        }
        $nl = 0;
   }
   print "$_\n";
}

score 1 · Answer 3 · 2015-12-14

1

Entering edit mode

9.3 years ago

Brian Bushnell 20k

Using BBDuk:

bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=31 mink=1 rcomp=f literal=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

real    0m1.464s
user    0m5.039s
sys     0m0.781s

ADD COMMENT • link 9.3 years ago by Brian Bushnell 20k