Tool:SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang
2
19
Entering edit mode
8.2 years ago

Hi all,

I'd like to introduce you a cross-platform and every fast FASTA/Q toolkit, Seqkit, written in Golang.

Introduction

Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly.

SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.

I had used SeqKit to solved some problems raised by Biostars users in simple and efficient ways. For examples:

Benchmarks

SeqKit uses author's lightweight and high-performance bioinformatics packages bio for FASTA/Q parsing, which has high performance close to the famous C lib klib (kseq.h).

benchmark

FASTA manipulations benchmark.5tests.tsv.png

FASTQ manipulations benchmark.5tests.tsv.C.png

Subcommands

Sequence and subsequence

  • seq transform sequences (revserse, complement, extract ID...)
  • subseq get subsequences by region/gtf/bed, including flanking sequences
  • sliding sliding sequences, circular genome supported
  • stat simple statistics of FASTA files
  • faidx create FASTA index file

Format conversion

  • fx2tab covert FASTA/Q to tabular format (and length/GC content/GC skew)
  • tab2fx covert tabular format to FASTA/Q format
  • fq2fa covert FASTQ to FASTA

Searching

  • grep search sequences by pattern(s) of name or sequence motifs
  • locate locate subsequences/motifs

Set operations

  • rmdup remove duplicated sequences by id/name/sequence
  • common find common sequences of multiple files by id/name/sequence
  • split split sequences into files by id/seq region/size/parts
  • sample sample sequences by number or proportion
  • head print first N FASTA/Q records

Edit

  • replace replace name/sequence by regular expression
  • rename rename duplicated IDs

Ordering

  • shuffle shuffle sequences
  • sort sort sequences by id/name/sequence

Misc

  • version print version information and check for update
FASTQ SeqKit FASTA • 14k views
ADD COMMENT
0
Entering edit mode

I just used seqkit to make a shell wrapper to take fasta length distribution that I wanted to share in order to let you know that how useful this (seqkit) could be.

Script

#shell wrapper around seqkit to calculate length distribution of fasta sequences

# Usage 
# [cmd_prompt]$ source length_dist.sh <fasta_file.fa>

# Requires
# seqkit: https://bioinf.shenwei.me/seqkit/

echo -ne "Length < 200:\t"
cat $1 |  seqkit seq -M 199 | seqkit stat -T | grep -v file | cut -f 4

echo -ne "Length >= 200 && <= 300:\t"
cat $1 |  seqkit seq -m 200 -M 300 | seqkit stat -T | grep -v file | cut -f 4

for i in $(seq 300 100 900); do 
  echo -ne "Length > $i && <= $(expr $i + 100):\t" 
  cat $1 |  seqkit seq -m $(expr $i + 1) -M $(expr $i + 100) | seqkit stat -T | grep -v file | cut -f 4
done

echo -ne "Length > 1000 && <=5000:\t"
cat $1 |  seqkit seq -m 1001 -M 5000 | seqkit stat -T | grep -v file | cut -f 4

echo -ne "Length >5000:\t" 
cat $1 |  seqkit seq -m 5000 | seqkit stat -T | grep -v file | cut -f 4

Output

[user]$ sh  length_range_stats.sh file.fasta 
Length < 200:   457
Length >= 200 && <= 300:    2458
Length > 300 && <= 400: 2746
Length > 400 && <= 500: 2815
Length > 500 && <= 600: 2537
Length > 600 && <= 700: 2366
Length > 700 && <= 800: 2123
Length > 800 && <= 900: 1953
Length > 900 && <= 1000:    1905
Length > 1000 && <=5000:    17672
Length >5000:   215

Ofcourse, there is scope for improvement and can be modified according to requirements. For me, that was required!

Thank you my friend Wei

ADD REPLY
1
Entering edit mode

How about outputting sequence lengths and ploting using other tools

$ seqkit fx2tab hairpin.fa -l | csvtk cut -t -f 4 | csvtk plot hist -H > hist.png

image

Or

# https://github.com/derwiki/histogram
$ seqkit fx2tab hairpin.fa -l | csvtk cut -t -f 4 | python histogram.py                      
['histogram.py']
28645 items, 12 keys
0-99           17670================================================================================
100-199        9717 ===========================================
200-299        907  ====
300-399        210  
400-499        102  
500-599        20   
600-699        8    
700-799        3    
800-899        2    
900-999        3    
1000-1099      2    
1100-1199      1
ADD REPLY
0
Entering edit mode

That's even better !!

ADD REPLY
0
Entering edit mode

Hi,

I am trying to extract sequences from a gzipped fastq file(17GB) using sequence ID list in a text file (2.8GB) using the following:

seqkit grep --pattern-file id.txt raw-reads.fastq.gz > subset.fastq.gz

However, the resulting subset.fastq.gz file is empty. Could you please tell how to deal with such huge files? Or is the command is incorrect in the first place?

ADD REPLY
0
Entering edit mode

Can you post the output of head -6 id.txt?

ADD REPLY
0
Entering edit mode

head -6 id.txt

@D00723:299:CCRTLANXX:1:1101:1281:1987 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1301:1993 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1660:1986 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1769:1980 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1755:1982 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:2165:1989 2:N:0:1

ADD REPLY
0
Entering edit mode

you need remove the leading symbol @ by

sed 's/^@//' id.txt > id2.txt
ADD REPLY
0
Entering edit mode

Hello, I am trying to transform some fastq.gz files to their reverse complement, but when I run the command

seqkit seq hairpin.fastq.gz -r -p

I get the output printing to the console. Is that what this function is supposed to do? I want to edit the file itself.

Thank you

ADD REPLY
0
Entering edit mode

Reposting because I accidentally posted as a reply: I am trying to transform some fastq.gz files to their reverse complement, but when I run the command

seqkit seq hairpin.fastq.gz -r -p

I get the output printing to the console. Is that what this function is supposed to do? I want to edit the file itself.

Thank you

ADD REPLY
0
Entering edit mode

Why would you want to edit the input file itself? Most programs will not allow you to edit the input file with output. Why not write the results to a new file and use that?

ADD REPLY
0
Entering edit mode

I found out my reads were the reverse complement of what I needed to run my analysis (in a program downstream of this step). I did send the output to a new file (with >), which messed up the fastq format, but did have the reverse complements.

ADD REPLY
0
Entering edit mode

Can you try running seqkit seq -rp hairpin.fastq.gz > rev_comp.fastq? You should provide program options before input and output files. I just tested this out and encountered no fastq corruption.

You can also use reformat.sh from BBMap suite to do this.

reformat.sh in=hairpin.fastq.gz out=rev_comp.fastq.gz rcomp=t
ADD REPLY
0
Entering edit mode

Thank you @GenoMax! This helps tremendously! The seqkit command worked perfectly!

ADD REPLY
1
Entering edit mode
8.2 years ago
John 13k

I've only tested it briefly, specifically just the stat module, but it seems to give odd results.

  1. Non-aminoacid/nucleiacid characters like spaces are included in all the lengths
  2. Comments aren't parsed out
  3. It doesn't recognize ";" as being a legitimate alternative to ">" when starting an entry

The FASTA format is really awful so I do not blame you - however, if anyone should get parsing FASTA right it's tools like this :) I've had big problems trying to linearize FASTA files with comments in the past, so this is something i'd be interested in seeing fixed :)

In terms of the User Interface you've created, I think it's the best i've seen so far - for ordered and logical! Great job!

ADD COMMENT
2
Entering edit mode

Thanks for your comments John,

  1. Since SeqKit is modulized, the stat command does not clean spaces and other characters, which can be done by seqkit seq --remove-gaps. So you can use seqkit seq -g seqs.fa | seqkit stat. I can also let it work in the way you want.
  2. I don't know what kind of comments you mean? comment after the sequence identifier in the heading line or lines starting with #.
  3. I do not know ";" could be a alternative to ">" :(. ">" is the most common format.
ADD REPLY
1
Entering edit mode

1) Ah, so that is the linearize function. Very cool :) Particularly since your binaries have no dependencies. I think i will use this a lot.

2) Comments look like:

>GENE1
AGCTGTCTC
ACAGTCTAG ;this is a comment
CAGTACCCA

You never see comments in anything produced by a NGS machine, but using comments used to be quite popular back in my old lab when FASTA was used to store SNPs. Comments stored the reference base (or variant) using the format above, but I guess VCF put an end to all that. Still, it was frustrating as hell to parse properly, which is why toolkits like this would have been so useful back when I had to do that.

3) Yes '>' is the most common format, and honestly should be the only format, however I have seen plenty of FASTA files over the years with ";" :( I thought maybe it would be a simple thing to add to your parser module.

Two more points: I just tried using "seq --validate-seq" on some strings with strange characters, but it didn't seem to do anything? I was expecting a sort of 'PASS/FAIL' type output. Second thing, and this just a request and maybe I should have put it on the github page, but it would be nice if the stat module could count the occurences of each character seen in the sequence, for example:

A  - 342342
C  - 435342
G  - 445231
T  - 331459
\n - 4000
F  - 2

This would help people know if there are any characters that shouldn't be in the sequence that could cause problems later on, for example the 'F' above. Or maybe you just look for odd characters in --validate-seq and it prints out the line number the odd characters are on? That would be one small feature i'd really appreciate, and maybe others would too..? I dont know.

ADD REPLY
2
Entering edit mode
  1. sed -r 's/;.+//g' could be used to delete the comments.
  2. Supporting ";" is easy, I'll add this.
  3. seq --validate-seq. Here's the mechanism.
    1. If the sequence type not explicitly given by -t (--seq-type), SeqKit will guess it according to the first record. Supported sequence types include: DNA, RNA, Protein and Unlimit. So if you give some records of which the first record has some strange characters, SeqKit thinks all the records are "unlimit" formats. Therefore, no warning came out.
    2. Alternatively, you can set the sequence type, so it will validate the sequences according to given seq type.
  4. For the character stats. Because a FASTA/Q file may contains multiple records, we choose to show bases occurences stats for EVERY record by seqkit fx2tab.

    1. seqkit fx2tab --alphabet converts FASTA/Q formats to tabular formats, and prints alphabet of every records at last column. e.g.

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab --header-line -a
      #name   seq     qual    alphabet
      dna     actg            acgt
      bad dna actgo           acgot
      

      You can locate the record with help of my another tool, csvtk

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab -a | csvtk --tabs --no-header-row  grep --fields 4 --use-regexp --ignore-case --pattern '[^ACTGN]+'
      bad dna actgo           acgot
      
    2. seqkit fx2tab --base-content or seqkit fx2tab -B can outputs base countents not occurences.

      $ echo -e ">dna\nactg\n>dna2\nACGNTC" | seqkit fx2tab --header-line  -B a -B c -B g -B t
      #name   seq     qual    a       c       g       t
      dna     actg            25.00   25.00   25.00   25.00
      dna2    ACGNTC          16.67   33.33   16.67   16.67
      
ADD REPLY
1
Entering edit mode

Best answer I could have possibly got - great work Wei! Thank you so much :)

ADD REPLY
1
Entering edit mode

Bug of csvtk fixed in v0.3.8.1, John you can try it now.

ADD REPLY
2
Entering edit mode

Hey Wei, i’m really sorry, I re-read the FASTA documentation today and i’m just flat out wrong about “;” being an alternative to “>”. I don't know where I got it into my head that ";" was an alternative to ">", probably Wikipedia. It’s actually only the very first line of the FASTA file, containing a header, than can start with “;”. From my reading of the docs, you can’t even have a header for each sequence entry, it’s literally just 1 per file.

Really sorry for the lost time. If you send me a paypal link or similar i’ll buy you a beer.

ADD REPLY
1
Entering edit mode

Don't worry, I haven't changed the code yet.

ADD REPLY
0
Entering edit mode
4.1 years ago

Collecting all the previous answer and giving information for somebody starting from scratch. This plot is for doing a read length distribution if you have a fasta file.

1) Go to the terminal and install the required programs (I like use conda)

conda create -n seqkit_env

conda activate seqkit_env

conda install -c bioconda/label/cf201901 seqkit

conda install -c bioconda csvtk

2) Then type next command (replace file.fasta with your fasta file)

seqkit fx2tab file.fasta -l | csvtk cut -t -f 4 | csvtk plot hist -H > hist.png

3) Then see your graph in hist.png

Best

ADD COMMENT

Login before adding your answer.

Traffic: 1252 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6