Question

Convert FASTQ to FASTA, filter, then convert filtered FASTA back into FASTQ

0

Entering edit mode

10.6 years ago

melonpwn ▴ 10

Hi all,

I have a fastq file that I want to filter with seqclean to remove low complexity reads. However, seqclean only accepts fasta as input, so I convert the fastq to fasta, and then filter it with seqclean. But then I want to convert the filtered fasta back into a fastq file with the quality scores from the original fastq file, but only for the reads in the filtered fasta.

Is there a tool that will allow me to do something like this?

Thanks in advance.

sequencing genome seqclean conversion • 6.8k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by melonpwn ▴ 10

2

Entering edit mode

10.6 years ago

Pierre Lindenbaum 166k

(not tested)

linearize the fastq and sort on name:

gunzip -c file.fastq.gz | paste - - - -  | LC_ALL=C sort -t '\t' -k1,1 > file1

create the fastq and process with your tool, linearize the fasta, remove the heading '>'

awk -F '\t' '{printf(">%s\t%s\n",$1,$2);}' file1 | tool | awk '/^>/ { print("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' | cut -c2- | LC_ALL=C sort -t '\t' -k1,1 > file2

join both files and restore a fastq file

join -t '\t' -1 1 -2 1 file1 file2 | cut -f (required-columns)| | tr "\t" "\n" > output.fastq

ADD COMMENT • link 10.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

This is neat but I don't think this approach would work. :)

The "tool" would have to read/write tab-delimited sequence files so you'd likely have to write that, and the last step just restores the original fastq record, correct? If that is the case then you would just be undoing the work in the "tool" part of the pipeline. I could be misunderstanding the last part though, so apologies if I'm not correct.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by SES 8.6k

0

Entering edit mode

10.6 years ago

Biomonika (Noolean) 3.2k

yes, get all fasta headers from filtered files

grep ">" filtered.fasta >ids

and use seqtk to extract them from original fastq files

seqtk subseq original.fastq ids

This assumes that all you headers are unique.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 10.6 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

This did not work, just gave me an empty file.

ADD REPLY • link 10.6 years ago by melonpwn ▴ 10

0

Entering edit mode

"This did not work, just gave me an empty file." -> because What Are The Most Common Stupid Mistakes In Bioinformatics??

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Not the grep part, the seqtk part.

When I run seqtk subseq original.fastq ids (and the ids are in fact there), it returns nothing.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by melonpwn ▴ 10

0

Entering edit mode

Figured it out -- just remove all the '>' and it will work.

ADD REPLY • link 10.6 years ago by melonpwn ▴ 10

0

Entering edit mode

Sorry, forgot about that.

ADD REPLY • link 10.6 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

This just pulls the records from the FASTQ file right? In this case, you've wasted your time doing the trimming/filtering if you just go back to the original record.

ADD REPLY • link 10.6 years ago by SES 8.6k

0

Entering edit mode

Correct. I understood from the question that no trimming is performed, just filtering of whole reads.

ADD REPLY • link 10.6 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

No worries. I just didn't want melonpwn to think that the command was doing exactly what they wanted just because it "worked" and produced some output.

ADD REPLY • link 10.6 years ago by SES 8.6k

0

Entering edit mode

10.6 years ago

Brian Bushnell 20k

You can do that with BBTools like this:

bbmask.sh in=reads.fq out=masked.fq mle=t ke=5 window=80 entropy=0.75
bbduk.sh in=masked.fq out=final.fq maxns=10

The first step will convert low-complexity sequences to N, and the second step will remove reads containing at least 10 Ns. To fine-tune the complexity limit, you can adjust window and entropy. The higher the entropy setting, the more will be masked; the window is the area over which entropy is calculated and must be less than the read length.

I have another tool (filterbyname.sh) that can be used to filter a fastq file to retain only the reads present in another (e.g. fasta) file, but the current public version only allows filtering by a list of names. The internal version now supports filtering based on another file of reads but I don't know when I'll release it; probably in a few days.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 10.6 years ago by Brian Bushnell 20k

0

Entering edit mode

This gives me the following error:

Input is being processed as unpaired
Started output streams: 0.013 seconds.
Exception in thread "main" java.lang.NullPointerException
        at jgi.BBDukF.spawnProcessThreads(BBDukF.java:1414)
        at jgi.BBDukF.process2(BBDukF.java:780)
        at jgi.BBDukF.process(BBDukF.java:716)
        at jgi.BBDukF.main(BBDukF.java:65)

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.6 years ago by melonpwn ▴ 10

0

Entering edit mode

Sorry, that was a bug in version 33.92. It works in the latest version, 34.00.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Brian Bushnell 20k

Ram · Accepted Answer · 2014-12-04

There are obviously a number of ways to get there (I also wrote a script for going from trimmed fasta back to fastq which is here), but I recommend you rethink using SeqClean. PRINSEQ is a much better tool, it will do low complexity and quality filtering and produce a beautiful HTML file with plots of the results. That is the single reason to use this software. Most trimming tools complete silently and leave you wondering what happened, then you have to spend time summarizing the results yourself. Nonsense! Just use PRINSEQ and you'll always have a record of your data and the trimming results (explained in the manual). Also, I don't think SeqClean will respect paired-end data, but PRINSEQ will.