Dear All,
Am I right to use the following script to transform a fastq file (named test.fastq) to a fasta file? THANKS a lot!
#!/usr/bin/perl
use strict;
use Bio::SeqIO;
my $in=Bio::SeqIO->new(-file=>"test.fastq",-format=>'fastq');
my $out=Bio::SeqIO->new(-file=>">test.fasta",-format=>'fasta');
while(my $seq=$in->next_seq()){
$out->write_seq($seq);
}
In the list, seqtk, bioawk and seqret work with multi-line fastq; the rest don't. If you just want to use the standard unix tools, rtliu's sed solution is preferred, both short and efficient. It should be noted that file t.fq is put in /dev/shm and the results are written to /dev/null. In real applications, I/O may take more wall-clock time than CPU. In addition, frequently the sequence file is gzip'd. For seqtk, decompression takes more CPU time than parsing fastq.
Additional comments:
SES observed that seqret was faster than Irsan's command. At my hand, seqret is always slower than most. Is it because of version, locale or something else?
I have not tried the native bioperl parser. Probably it is much slower. Using bioperl on large fastq is discouraged.
I do agree 4-line fastq is much more convenient for many people. However, fastq is never "defined" to consist of 4 lines. When it was first introduced at the Sanger Institute for ~1kb capillary reads, it allows multiple lines.
Tools in many ancient unix distributions (e.g. older AIX) do not work with long lines. I was working with a couple of such unix even in 2005. I guess this is why older outputs/formats, such as fasta, blast and genbank/embl, used short lines. This is not a concern any more nowadays.
To convert multi-line fastq to 4-line (or multi-line fasta to 2-line fasta): seqtk seq -l0 multi-line.fq > 4-line.fq
Thanks for the evaluation. I have tried the latest emboss on the same old CentOS server. Its speed is pretty much the same. I guess these two command lines are sensitive to the system configuration. As to the "paste" command line, could you confirm that the locale is set to "C"?
I checked and my locale on both machines is set to UTF-8, which I read is the default for most linux systems. I also read somewhere that this affects the sort order, but I'm uncertain about what that means exactly. EDIT: Here's what the manpage of sort says: "** WARNING ** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values." I guess that means sort could behave differently on different systems. This is interesting, I'll keep this information in mind.
You may set "export LC_ALL=C" and then see if the speed changes. Sometimes, locale has a dramatic hit to performance on text processing. For example, regex matching on UTF-8 is much slower than on ASCII.
Don't use awk/shell to parse fastq. None of those solutions work for multi-line records (except seqtk). You can save yourself a lot of trouble by using a toolkit like EMBOSS for this:
seqret -sequence reads.fastq -outseq reads.fasta
Since you were trying for a BioPerl solution, here is the same thing using the bioperl-run package:
If you are learning your way around NGS data, I recommend using the established tookits. BioPerl is helpful, but not for parsing fastq data because it's just too slow.
EDIT: fix link that broke due to conversion to the new website.
ADD COMMENT
• link
updated 5.2 years ago by
Ram
44k
•
written 11.2 years ago by
SES
8.6k
0
Entering edit mode
"Don't use awk/shell to parse fastq. None of those solutions work for multi-line records " I do agree. But (1) no dependency (2)most of the time, the fastq have 4 lines.
I agree on the first point, BioPerl and EMBOSS are both really large. But, if someone is learning to parse these files, I don't think handing them a long shell pipeline is the best way to learn. Learning these tools allows you spend less time writing and debugging basic parsing code so you can focus on higher level problems. (And, BioPerl was a part of the question).
(1) By definition one fastq entry consists of 4 lines and (2) for simple tasks like converting fastq files to fasta files I absolutely recommend using command line pipes for learning and understanding the formats. Using and blindly trusting huge packages is dangerous, especially when you are new to the field. BUT, of course, one needs to check check check and double-check the results. But that's bioinformatics!
1) In my experience, we can't assume this about formats. The first line of the FASTQ definition on wikipedia says, "A FASTQ file normally uses four lines per sequence." Normally, but not always. The convention for FASTA is for the sequence part to be less than 80 columns wide, and that's what many applications (like EMBOSS) will generate with FASTQ, as well. I wish it was always 4 lines, that would be so much easier. 2) I'd be interested in what others think, but I would say learn a scripting language, not shell, for parsing. It's almost 2014 and modern scripting languages are way more expressive and nearly as fast as compiled languages for many tasks. If you do everything with these pipes you will be debugging the next time you use a command, and you will look at it and wonder what in the world it does. I used to do that but wouldn't advise it. Also, that seqret command I posted is 2X faster than Irsan's answer, and much, much shorter. It's not as fast as Pierre's answer, his is 10X faster than Irsan's, because he has super-shell skills.
In your experience, you can't assume this about formats. :) But how did you learn that? I'm 100% with you... in bioinformatics the file formats are not really defined. But you will never learn that by using toolkits. If you want to learn that by shell pipes, or a scripting language doesn't matter in my view. I use shell scripting to fastly check things, or for fastq to fasta conversions and scripting languages for mostly all other tasks... Depends on what you need. 10x faster is nice, if you need a script more often. Just to check something, I don't care about speed.
I learned about seeing multi-line fastq from getting data from collaborators. A lot of biologists rely on tools like Geneious, CLC, and other desktop applications. These often produce files that break all parsers by inserting extra information or characters that only that application can read, or by doing things like breaking the lines of all sequence data at a certain length. I found this because I kept getting errors from a sequence reading class I wrote that was not recognizing the format as fastq. This is very frustrating, but I decided to keep it in mind and not assume, instead of wasting more time the future. I don't think speed is most important either, I'd rather go with methods that are easy to remember and use. For doing simple things, it does not matter what your approach is, but I don't see the point of making a procedure slower and more complicated.
That solution is slower (2-10X slower for ~10 million sequences) than all the other answers. That doesn't really matter unless you have like 100 files to convert, then it would be silly to use something that you know will take hours longer. By complicated, I mean the command is longer (than seqtk or seqret) and requires five programs as opposed to one. I'm not disagreeing about using shell for changing formats, those are crucial skills to have. Though, I just have a hard time saying that shell is the better approach in this case, given those two points. Also, the original question was about how to get a bioperl script working, so I don't think giving a shell command is the most specific answer. I think the question has been answered about 10 times now, so we are getting off topic talking about what we would do, but hopefully that helps, too. :) Cheers.
Very late indeed, as there already is an almost identical answer, posted years ago (here is the direct link to the answer). The difference being your answer has sed 's/^@/>/', while the posted has sed 's/^@/>/g'.
What does your script do that other, more well-tested tools don't? As in, why should someone use this script instead of, say, a simple awk or sed or bioawk (Or anything from lh3's benchmarked answer)? The script won't work with compressed files, there is no performance optimization - to me that seems like a list of downsides/risks with no gain.
The upsides: 1. it can do reads trimming at the same time, which is frequently required in small RNA fields
2. a functional command is short
3. it's in python
But I have to admit that Irsan's answer is good enough.
How well tested and benchmarked are these functionalities? How is the tool being in python an advantage when tools written in C/C++ are a lot more performant in File IO?
EDIT: I apologize for coming off a little too confrontational. I think it would help to test and benchmark your tool to showcase how fast and/or accurate it is.
Thanks for contributing! It's probably helpful to potential users if you can provide some more documentation, usage, installation, dependencies, advantages,... in your post here.
why so complicated ?
Why so complicated? :)
seqkit fq2fa in.fastq.gz -o out.fasta
https://bioinf.shenwei.me/seqkit/usage/#fq2fa
and 2.4 years later...
Hi Pierre, I recently learned that the '@' from a fastq file is trouble when it's left in the fasta file for alignment... https://github.com/samtools/samtools/issues/773
So maybe your oneliner could use another update almost two years later... :)
Hi Wouter,
Here's an update with '@' removal and multi-threaded unzipping thrown in as a bonus. https://zlib.net/pigz/
Common problem for new folk, no need for a perl script to do this, built in commands like that posted by Pierre to use awk are good.
Or this might be a little easier to remember / type:
http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage
Can we also apply this for whole folder. so that all file can we converted at once..
Thanks
Sandeep