Question

Tool: FASTA and FASTQ tools

27

10.0 years ago

Kamil ★ 2.3k

Many developers have created tools for manipulating FASTA and FASTQ files. Here is a list of publicly available projects, grouped by implementation language:

Java

http://jgi.doe.gov/data-and-tools/bbtools/
- BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. It is written in Java and works on any platform supporting Java, including Linux, MacOS, and Microsoft Windows; there are no dependencies other than Java (version 7 or higher). Program descriptions and options are shown when running the shell scripts with no parameters.

Go

https://github.com/shenwei356/seqkit
- SeqKit is a cross-platform, ultrafast, and practical toolkit for FASTA/Q manipulation that helps researchers complete a wide range of FASTA/Q file processing tasks. It supports plain or gzip-compressed input and output from either standard streams or files, so it can easily be used in command-line pipes.

C/C++

https://github.com/lh3/seqtk
- Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
https://github.com/dcjones/fastq-tools
- This package provides a number of small and efficient programs to perform common tasks with high throughput sequencing data in the FASTQ format. All of the programs work with typical FASTQ files as well as gzipped FASTQ files.
https://github.com/lh3/bioawk
- Bioawk is an extension to Brian Kernighan's awk, adding support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
https://github.com/agordon/fastx_toolkit
- The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
https://github.com/alastair-droop/fqtools
- fqtools is a software suite for fast processing of FASTQ files.

Python

https://github.com/fhcrc/seqmagick
- Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments.
https://github.com/mdshw5/pyfaidx
- pyfaidx: Efficient pythonic random access to FASTA subsequences
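
For anyone wondering what "random access to FASTA subsequences" means in practice, here is a minimal sketch of the faidx-style idea in plain Python. This is not pyfaidx's implementation or API; `build_index` and `fetch` are made-up names for illustration, and the sketch assumes every sequence line in a record (except the last) has the same length.

```python
import io


def build_index(handle):
    """Scan a FASTA file opened in binary mode and return
    {name: [sequence length, offset of first base, bases per line, bytes per line]}."""
    index = {}
    name = None
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            # Record name is the first word after ">" (assumes it is not empty).
            name = line[1:].split()[0].decode()
            index[name] = [0, offset + len(line), 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                entry[2] = bases      # bases on a full line
                entry[3] = len(line)  # same line including the newline byte(s)
            entry[0] += bases
        offset += len(line)
    return index


def fetch(handle, index, name, start, end):
    """Return bases [start, end) of record `name` (0-based, half-open)
    by seeking straight to the right byte instead of reading the whole file."""
    length, first, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Every full line before a position costs bytes_per_line bytes on disk.
    begin = first + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    stop = first + (end // bases_per_line) * bytes_per_line + end % bases_per_line
    handle.seek(begin)
    chunk = handle.read(stop - begin)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# In-memory stand-in for a FASTA file on disk.
handle = io.BytesIO(b">chr1 test\nACGTAC\nGTACGT\nAC\n>chr2\nTTTT\n")
index = build_index(handle)
print(fetch(handle, index, "chr1", 4, 9))  # prints ACGTA
```

The index is built once in a single pass; after that, each lookup is one seek and one short read, which is why this approach scales to genome-sized files.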

Perl

https://code.google.com/p/biopieces/
- The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task.
https://github.com/tjparnell/biotoolbox
- The Bio::ToolBox libraries provide an abstraction layer over a variety of different specialized BioPerl-style modules. For example, there is a special emphasis on the collection of data values for defined genomic coordinate regions, regardless of whether the values come from a GFF database, Bam file, BigWig file, etc.
https://code.google.com/p/ea-utils/
- Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any FASTQs.
https://github.com/sjackman/fastascripts
- Manipulate FASTA files.
https://github.com/tlawrence3/FAST
- The FAST Analysis of Sequences Toolbox (FAST) is a set of Unix tools (for example fasgrep, fascut, fashead and fastr) for sequence bioinformatics modeled after the Unix textutils (such as grep, cut, head, tr, etc). FAST workflows are designed for "inline" (serial) processing of flatfile biological sequence record databases per-sequence, rather than per-line, through Unix command pipelines. The default data exchange format is multifasta (specifically, a restriction of BioPerl FastA format). FAST tools expose the power of Perl and BioPerl for sequence analysis to non-programmers in an easy-to-learn command-line paradigm.

Cpp FASTA Python FASTQ • 14k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by Kamil ★ 2.3k

1

Also, fastx-toolkit, bioawk, and a gazillion other tools - it's crazy how many of these are around!

ADD REPLY • link 10.0 years ago by Ram 45k

1

And FAST (perl). One is bound to fail when taking up such a task.

ADD REPLY • link 10.0 years ago by h.mon 35k

1

Java - BBMap needs to be added to this list.

ADD REPLY • link 8.8 years ago by GenoMax 151k

0

I'm trying to get bioawk on Ubuntu using "sudo apt-get install bioawk", but it says it can't find a package named bioawk. How could I install this?

ADD REPLY • link 8.8 years ago by beneficii ▴ 60

0

You can download the code (or use git clone https://github.com/lh3/bioawk.git), then go into the resulting folder (bioawk-master if you downloaded the zip, bioawk if you used git clone) and type make. That will compile the program. You can then copy the bioawk executable to a directory in your $PATH (/usr/local/bin should work).

ADD REPLY • link 8.8 years ago by GenoMax 151k

Ram · Answer 1 · 2015-06-03

2

10.0 years ago

Tariq Daouda ▴ 220

Python

There's also the parsers module of pyGeno. It supports FASTA, FASTQ, VCF, GTF and CSV files, with an emphasis on simple and convenient interfaces.
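
To give a feel for what a simple parsing interface can look like, here is a minimal FASTQ reader sketch in plain Python. It is not pyGeno's API; `read_fastq` is a made-up name, and the sketch assumes the common four-line record layout (no wrapped sequence or quality lines).

```python
import io


def read_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ text lines.
    Assumes each record is exactly four lines: @name, sequence, +, quality."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        # Default to "" so a truncated file raises ValueError below.
        seq = next(it, "").rstrip("\r\n")
        plus = next(it, "").rstrip("\r\n")
        qual = next(it, "").rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ in: " + header)
        yield header[1:], seq, qual


# In-memory stand-in for an open FASTQ file.
demo = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGG\n+r2\n#!\n")
for name, seq, qual in read_fastq(demo):
    print(name, seq, qual)
```

Because the function only needs an iterable of lines, the same code works on a plain file object or on a gzip-compressed file opened in text mode with `gzip.open(path, "rt")`.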

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 10.0 years ago by Tariq Daouda ▴ 220