Convert FASTQ to FASTA, filter, then convert filtered FASTA back into FASTQ
4
0
Entering edit mode
10.1 years ago
melonpwn ▴ 10

Hi all,

I have a fastq file that I want to filter with seqclean to remove low complexity reads. However, seqclean only accepts fasta as input, so I convert the fastq to fasta, and then filter it with seqclean. But then I want to convert the filtered fasta back into a fastq file with the quality scores from the original fastq file, but only for the reads in the filtered fasta.

Is there a tool that will allow me to do something like this?

Thanks in advance.

sequencing genome seqclean conversion • 6.0k views
ADD COMMENT
2
Entering edit mode
10.1 years ago
SES 8.6k

There are obviously a number of ways to get there (I also wrote a script for going from trimmed fasta back to fastq which is here), but I recommend you rethink using SeqClean. PRINSEQ is a much better tool, it will do low complexity and quality filtering and produce a beautiful HTML file with plots of the results. That is the single reason to use this software. Most trimming tools complete silently and leave you wondering what happened, then you have to spend time summarizing the results yourself. Nonsense! Just use PRINSEQ and you'll always have a record of your data and the trimming results (explained in the manual). Also, I don't think SeqClean will respect paired-end data, but PRINSEQ will.

ADD COMMENT
2
Entering edit mode
10.1 years ago

(not tested)

linearize the fastq and sort on name:

gunzip -c file.fastq.gz | paste - - - -  | LC_ALL=C sort -t '\t' -k1,1 > file1

create the fastq and process with your tool, linearize the fasta, remove the heading '>'

awk -F '\t' '{printf(">%s\t%s\n",$1,$2);}' file1 | tool | awk '/^>/ { print("\n%s\t",$0);next;} {printf("%s",$0);} END {printf("\n");}' | cut -c2- | LC_ALL=C sort -t '\t' -k1,1 > file2

join both files and restore a fastq file

join -t '\t' -1 1 -2 1 file1 file2 | cut -f (required-columns)| | tr "\t" "\n" > output.fastq
ADD COMMENT
0
Entering edit mode

This is neat but I don't think this approach would work. :)

The "tool" would have to read/write tab-delimited sequence files so you'd likely have to write that, and the last step just restores the original fastq record, correct? If that is the case then you would just be undoing the work in the "tool" part of the pipeline. I could be misunderstanding the last part though, so apologies if I'm not correct.

ADD REPLY
0
Entering edit mode
10.1 years ago

yes, get all fasta headers from filtered files

grep ">" filtered.fasta >ids

and use seqtk to extract them from original fastq files

seqtk subseq original.fastq ids

This assumes that all you headers are unique.

ADD COMMENT
0
Entering edit mode

This did not work, just gave me an empty file.

ADD REPLY
0
Entering edit mode

"This did not work, just gave me an empty file." -> because What Are The Most Common Stupid Mistakes In Bioinformatics??

ADD REPLY
0
Entering edit mode

Not the grep part, the seqtk part.

When I run seqtk subseq original.fastq ids (and the ids are in fact there), it returns nothing.

ADD REPLY
0
Entering edit mode

Figured it out -- just remove all the '>' and it will work.

ADD REPLY
0
Entering edit mode

Sorry, forgot about that.

ADD REPLY
0
Entering edit mode

This just pulls the records from the FASTQ file right? In this case, you've wasted your time doing the trimming/filtering if you just go back to the original record.

ADD REPLY
0
Entering edit mode

Correct. I understood from the question that no trimming is performed, just filtering of whole reads.

ADD REPLY
0
Entering edit mode

No worries. I just didn't want melonpwn to think that the command was doing exactly what they wanted just because it "worked" and produced some output.

ADD REPLY
0
Entering edit mode
10.1 years ago

You can do that with BBTools like this:

bbmask.sh in=reads.fq out=masked.fq mle=t ke=5 window=80 entropy=0.75
bbduk.sh in=masked.fq out=final.fq maxns=10

The first step will convert low-complexity sequences to N, and the second step will remove reads containing at least 10 Ns. To fine-tune the complexity limit, you can adjust window and entropy. The higher the entropy setting, the more will be masked; the window is the area over which entropy is calculated and must be less than the read length.

I have another tool (filterbyname.sh) that can be used to filter a fastq file to retain only the reads present in another (e.g. fasta) file, but the current public version only allows filtering by a list of names. The internal version now supports filtering based on another file of reads but I don't know when I'll release it; probably in a few days.

ADD COMMENT
0
Entering edit mode

This gives me the following error:

Input is being processed as unpaired
Started output streams: 0.013 seconds.
Exception in thread "main" java.lang.NullPointerException
        at jgi.BBDukF.spawnProcessThreads(BBDukF.java:1414)
        at jgi.BBDukF.process2(BBDukF.java:780)
        at jgi.BBDukF.process(BBDukF.java:716)
        at jgi.BBDukF.main(BBDukF.java:65)
ADD REPLY
0
Entering edit mode

Sorry, that was a bug in version 33.92. It works in the latest version, 34.00.

ADD REPLY

Login before adding your answer.

Traffic: 3498 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6