9.5 years ago
Kamil
Many developers have created tools for manipulating FASTA and FASTQ files. This is a list of publicly available projects, grouped by implementation language:
Java
- http://jgi.doe.gov/data-and-tools/bbtools/
- BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. It is written in Java and works on any platform supporting Java, including Linux, MacOS, and Microsoft Windows; there are no dependencies other than Java (version 7 or higher). Program descriptions and options are shown when running the shell scripts with no parameters.
Go
- https://github.com/shenwei356/seqkit
- SeqKit is a cross-platform, ultrafast, and practical toolkit for FASTA/Q manipulation that makes a wide range of FASTA/Q file processing tasks easy for researchers. The toolkit supports plain or gzip-compressed input and output from either standard streams or files, so it can easily be used in command-line pipes.
C/C++
- https://github.com/lh3/seqtk
- Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
- https://github.com/dcjones/fastq-tools
- This package provides a number of small and efficient programs to perform common tasks with high throughput sequencing data in the FASTQ format. All of the programs work with typical FASTQ files as well as gzipped FASTQ files.
- https://github.com/lh3/bioawk
- Bioawk is an extension to Brian Kernighan's awk, adding support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
- https://github.com/agordon/fastx_toolkit
- The FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files.
- https://github.com/alastair-droop/fqtools
- fqtools is a software suite for fast processing of FASTQ files.
Python
- https://github.com/fhcrc/seqmagick
- Seqmagick is a kickass little utility built in the spirit of imagemagick to expose the file format conversion in Biopython in a convenient way. Instead of having a big mess of scripts, there is one that takes arguments.
- https://github.com/mdshw5/pyfaidx
- pyfaidx: Efficient pythonic random access to FASTA subsequences
Perl
- https://code.google.com/p/biopieces/
- The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task.
- https://github.com/tjparnell/biotoolbox
- The Bio::ToolBox libraries provide an abstraction layer over a variety of different specialized BioPerl-style modules. For example, there is a special emphasis on the collection of data values for defined genomic coordinate regions, regardless of whether the values come from a GFF database, Bam file, BigWig file, etc.
- https://code.google.com/p/ea-utils/
- Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any FASTQs.
- https://github.com/sjackman/fastascripts
- Manipulate FASTA files.
- https://github.com/tlawrence3/FAST
- The FAST Analysis of Sequences Toolbox (FAST) is a set of Unix tools (for example fasgrep, fascut, fashead and fastr) for sequence bioinformatics modeled after the Unix textutils (such as grep, cut, head, tr, etc). FAST workflows are designed for "inline" (serial) processing of flatfile biological sequence record databases per-sequence, rather than per-line, through Unix command pipelines. The default data exchange format is multifasta (specifically, a restriction of BioPerl FastA format). FAST tools expose the power of Perl and BioPerl for sequence analysis to non-programmers in an easy-to-learn command-line paradigm.
Also, fastx-toolkit, bioawk, and a gazillion other tools - it's crazy how many of these are around!
And FAST (perl). One is bound to fail when taking up such a task.
Java - BBMap needs to be added to this list.
I'm trying to get bioawk on Ubuntu using "sudo apt-get install bioawk", but it says it can't find a package named bioawk. How could I install this?
You can download the code (or use git clone https://github.com/lh3/bioawk.git), then go into the bioawk folder (it is named bioawk-master if you downloaded the ZIP archive instead of cloning) and type make. That will compile the program. You can then copy the bioawk executable to a directory in your $PATH (/usr/local/bin should work).
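The steps above can be collected into one shell function, as a rough sketch. The function name install_bioawk and its optional destination argument are made up for illustration. It assumes git, make, a C compiler, bison (or another yacc) and the zlib headers are already installed; on Ubuntu that is probably something like sudo apt-get install build-essential bison zlib1g-dev, but check your own system.

```shell
# Hypothetical helper: clone, build and install bioawk.
# Usage: install_bioawk [destination_dir]   (default: /usr/local/bin)
install_bioawk() {
    dest="${1:-/usr/local/bin}"

    # git clone creates a folder named "bioawk" in the current directory
    git clone https://github.com/lh3/bioawk.git || return 1

    # Build inside a subshell so the caller's working directory is unchanged
    (cd bioawk && make) || return 1

    # Copy the executable into a directory on $PATH;
    # only use sudo when the destination is not writable by the current user
    if [ -w "$dest" ]; then
        cp bioawk/bioawk "$dest"/
    else
        sudo cp bioawk/bioawk "$dest"/
    fi
}
```

Paste the function into your shell and run install_bioawk (or install_bioawk "$HOME/bin" to avoid sudo, provided that directory exists and is in your $PATH). Afterwards, command -v bioawk should print the installed location.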