!/usr/bin/perl

Question

Whole-Exome Data Cleanup / Filtering

2

Entering edit mode

13.7 years ago

Khader Shameer 18k

What tools do you you use to clean up your whole-exome paired end reads: I am running a FastQC check for various parameters followed by a set of fastx_* commands as follows:

cat fastq_maq_ill2sanger/P1_R1_sanger.fastq fastqc|\
fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC |\
fastx_clipper -Q 33 -l 20 -v -a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC |\
fastq_quality_trimmer -Q 33 -t 20 -l 20 -v |\
fastx_artifacts_filter -Q 33 -v |\
fastq_quality_filter -Q 33 -q 20 -p 50 -v -o fastx_out_clean/P1_R1_fastx.fastq;

Then again run the FastQC to make sure the reads are fine.

Is this looks adequate for whole exome paired end reads ? Am I missing any other parameters or options ? This step takes around 2 hours in our server per sample, Is there any faster methods to do the clean up.

exome next-gen sequencing fastx • 5.5k views

ADD COMMENT • link updated 8.6 years ago by Biostar 20 • written 13.7 years ago by Khader Shameer 18k

0

Entering edit mode

2 hours per read? :-) hope you don't have any deadlines in the next few hundred years.

ADD REPLY • link 13.7 years ago by brentp 24k

0

Entering edit mode

Oops. I mean per "sample" :), now corrected in the question. !

ADD REPLY • link 13.7 years ago by Khader Shameer 18k

0

Entering edit mode

2 CPU hours? I think that is a reasonable speed, at least way faster than mapping.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

It only takes 2 hours for your samples? We do something very similar (fastqc, fastx_clipper, fastq_quality_trimmer, fastqc again), and I've found that for our samples these steps can take up to 3X longer than alignment with BWA. It's on the order of 12 hours per sample for our Hiseq read sets.

ADD REPLY • link 13.7 years ago by Jeremy Schwartzentruber ▴ 20

Ram · Answer 1 · 2011-05-09

We do the same thing but add a pairFixer script. Basically it goes through the pair files twice to fix the pairs. The first time to see all the headers(just the common part) that are seen twice and a second time to create these files: pair1 pair2 singles

I also added a -M12 to the clipper part to tolerate non-full length adapters but maybe 12 is a bit too short.

[HTML][HTML]

!/usr/bin/perl

use strict;

my $outName = $ARGV[0]; my $pair1 = $ARGV[1]; my $pair2 = $ARGV[2]; my $pair1FH; my $pair2FH; my $pair1outFH; my $pair2outFH; my $pairSingleFH;

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; my @pairedFiles = ($pair1FH,$pair2FH);

my %readCount; my $totalRecords = 0; for my $pairFH (@pairedFiles) { while (my $head = <$pairFH>) { <$pairFH>; <$pairFH>; <$pairFH>; $totalRecords++; my $key = substr($head, 0, length($head)-3); my $value = $readCount{$key}; if(defined $value) { $value++; $readCount{$key} = $value; } else { $readCount{$key} = 1; } } close($pairFH); }

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; @pairedFiles = ($pair1FH,$pair2FH); open ($pair1outFH, ">$outName.pair1.fastq") || die "Cannot Open pair out 1 File"; open ($pair2outFH, ">$outName.pair2.fastq") || die "Cannot Open pair out 2 File"; open ($pairSingleFH, ">$outName.single.fastq") || die "Cannot Open single File"; my @outputFiles = ($pair1outFH,$pair2outFH);

my $idx=0; my $records=0; my $pairs=0; my $single=0; print "Copying reads\n"; for my $pairFH (@pairedFiles) { my $outFH = $outputFiles[$idx]; $idx++;

while(my $head = <$pairFH>) { my $seq = <$pairFH>; my $head2 = <$pairFH>; my $qual = <$pairFH>;

my $key = substr($head, 0, length($head)-3);
my $value = $readCount{$key};
my $fh;
$records++;
if($value == 1) {
  $fh = $pairSingleFH;
  $single++;
}
elsif($value > 1) {
  $fh = $outFH;
  $pairs++;
}
else {
  die "No value found for: $key\n";
}

print $fh $head;
print $fh $seq;
print $fh $head2;
print $fh $qual;

print "\rTotal: $totalRecords, Records: $records, Pairs: $pairs, Single: $single ";

} close($outFH); close($pairFH); } close($pairSingleFH); print "Total: $totalRecords, Records: $records, Pairs: $pairs, Single: $single\n"; [HTML] [HTML]

score 2 · Answer 2 · 2011-05-06

2

Entering edit mode

13.7 years ago

brentp 24k

Haibao Tang has a program that will accept a FASTA file of adaptors and trim those all in one go. It does some sort of local alignment on the adaptors so it will be more thorough tan the fastx_clipper and it can also trim low quality reads.

ADD COMMENT • link 13.7 years ago by brentp 24k

0

Entering edit mode

Thanks Brent !!

ADD REPLY • link 13.7 years ago by Khader Shameer 18k

score 1 · Answer 3 · 2011-05-06

1

Entering edit mode

13.7 years ago

Jeremy Leipzig 23k

i think those fastx_clipper commands are not pair-friendly - i.e. if one of your sequences is "all adapter" you will orphan the mate and mess up the pairing.

ADD COMMENT • link 13.7 years ago by Jeremy Leipzig 23k

0

Entering edit mode

That's a valid point Jeremy, I am getting "weird pairing" warning while running bwa sampe using fastx_* processed fastq files.

ADD REPLY • link 13.7 years ago by Khader Shameer 18k

0

Entering edit mode

Actually I do not think it's fastx's fault. What do you get if you just feed the raw reads to bwa?

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

@lh3: raw reads runs without problem. I am not able to run bwa sampe (v 0.5.9-r16) using the fastx_ generated fastq files. bwa is keep running for several hours without creating any output. I found this post and tried with an older version (v 0.5.5), then it was running but showing the weird pairing. I tried bwa sampe runs several time, once it ran for about 24 hrs, finally I quit the run. Because of this issue am now re-checking my previous steps. Any thoughts on why bwa sampe not running on fastx_ ?

ADD REPLY • link 13.7 years ago by Khader Shameer 18k

0

Entering edit mode

I see. If without fastx* bwa runs fine, it should be fastx's problem. I have never used fastx, so do not know what was happening.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Jeremy: Do you have any suggestion on a program that can deal with the pairing issues here ?

ADD REPLY • link 13.6 years ago by Khader Shameer 18k