Question

Removing reads shorter than 10kb from FASTQ files

0

7.5 years ago

Ric ▴ 440

Hi, how can I remove reads shorter than 10kb from FASTQ files? I found how to do it for FASTA files but not for FASTQ files:

bioawk -c fastx '{ if(length($seq) > 10000) { print ">"$name; print $seq }}' bwa/unmapped_${output}.fasta > bwa/unmapped_${output}-gt-10000.fasta

Thank you in advance.

fastq • 6.0k views

ADD COMMENT • link updated 6.9 years ago by Alex Reynolds 36k • written 7.5 years ago by Ric ▴ 440

1

Wow.

I am trying to think of an experiment/analysis that would be strictly benefited by removing reads under 10kbp, and I'm drawing blanks. I don't work with Nanopore often, but I do often work with PacBio, and... well... no, I can't imagine a scenario. There are scenarios in which software for long, low-accuracy reads gives better results when people throw away short reads. But that is always a flaw in the software (in which case, you should complain and demand better software, rather than throwing away data); and by "short", I mean <500bp or so. I think, if you throw away 10kbp reads for any experiment because they are too short, you're doing it wrong.

ADD REPLY • link 7.5 years ago by Brian Bushnell 20k

0

Hi, I removed contaminants such as chloroplast sequence from my PacBio reads, and some of the reads are very short afterwards. Do you think in this case it is a good idea to remove reads which are shorter than 500kb or 1000kb?

ADD REPLY • link 7.5 years ago by Ric ▴ 440

0

That depends on what you are doing, but generally no (I assume you mean bp, not kbp).

ADD REPLY • link 7.5 years ago by Brian Bushnell 20k
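Whether a cutoff is reasonable depends largely on how much data it would discard, so it may help to check that before filtering. A minimal sketch, assuming strict four-line FASTQ records on standard input (no wrapped sequence lines); `summarize_cutoff` is a hypothetical helper name, not part of any tool mentioned in this thread:

```shell
# Report how many reads and bases fall at or below a length cutoff.
# Assumption: strict 4-line FASTQ records on stdin.
# summarize_cutoff is a hypothetical name used for illustration only.
summarize_cutoff() {
    cutoff="${1:-10000}"
    awk -v cutoff="$cutoff" '
        NR % 4 == 2 {
            n = length($0)              # sequence length of this read
            reads++; bases += n
            if (n <= cutoff) { short_reads++; short_bases += n }
        }
        END {
            printf "reads\t%d\nbases\t%d\nreads_removed\t%d\nbases_removed\t%d\n", reads, bases, short_reads, short_bases
        }
    '
}
```

Something like `summarize_cutoff 500 < file.fq` would then print the total read and base counts next to what a 500bp cutoff would remove.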

score 1 · Answer 1 · 2018-02-05

Via Essential AWK Commands for Next Generation Sequence Analysis:

$ awk 'NR%4==1{a=$0} NR%4==2{b=$0} NR%4==3{c=$0} NR%4==0&&length(b)>10000{print a"\n"b"\n"c"\n"$0;}' file.fq > result.fq

If your FASTQ is compressed:

$ gunzip -c file.fqz | awk 'NR%4==1{a=$0} NR%4==2{b=$0} NR%4==3{c=$0} NR%4==0&&length(b)>10000{print a"\n"b"\n"c"\n"$0;}' - | gzip -c - > result.fqz
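The same filter can be wrapped in a small shell function so the length threshold is a parameter instead of being hard-coded. This is only a sketch: it assumes strict four-line FASTQ records, and `filter_fastq_by_length` is a hypothetical name chosen for illustration. Like the one-liners above, it keeps reads strictly longer than the threshold.

```shell
# Keep only FASTQ records whose sequence is longer than a minimum length.
# Assumption: strict 4-line records (no wrapped sequence or quality lines).
# filter_fastq_by_length is a hypothetical helper name, not an existing tool.
filter_fastq_by_length() {
    min_len="${1:-10000}"
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }     # @read_id line
        NR % 4 == 2 { seq = $0 }        # sequence line
        NR % 4 == 3 { plus = $0 }       # "+" separator line
        NR % 4 == 0 && length(seq) > min {
            print header; print seq; print plus; print $0
        }
    '
}
```

It reads from standard input and writes to standard output, so it should slot into either pipeline above, for example `filter_fastq_by_length 10000 < file.fq > result.fq` or `gunzip -c file.fqz | filter_fastq_by_length 10000 | gzip -c > result.fqz`.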

score 0 · Answer 2 · 2017-07-06

0

7.5 years ago

badribio ▴ 290

Check these two threads: "Filtering Fastq Sequences Based On Lengths" and "fastq file on the basis of read length".

ADD COMMENT • link 7.5 years ago by badribio ▴ 290

score 0 · Answer 3 · 2017-07-06

0

7.5 years ago

WouterDeCoster 47k

You can do this using my NanoFilt tool. It's written for Oxford Nanopore sequencing data, but there is no reason it wouldn't work for other FASTQ data. There is more information on filtering on my blog, Gigabase or gigabyte.

ADD COMMENT • link 7.5 years ago by WouterDeCoster 47k