Sequence length from Fasta
7
6
Entering edit mode
10.1 years ago
bongbang ▴ 90

Is there a ready-made command-line program that can output sequence lengths from a fasta file? Like so:

YKR054C,4092
YLR106C,4910
...

I've seen examples of people trying to roll their own using awk, but I would rather not do that if I can avoid it. Between Emboss, Biopython, Bioperl, etc there must be something that can do this, right? What I'm looking for would be the equivalent of what blast_formatter can do for blast hits, something that can extract simple information that's already there in a format specified by user. Thank you.

fasta • 76k views
ADD COMMENT
1
Entering edit mode

"but I would rather not do that if I can avoid it" why ?

ADD REPLY
1
Entering edit mode

Because I looked at the syntax for awk and it seemed rather messy (it didn't help that the name reminds me of "awkward"). I thought I was going to save myself time, but now it appears that awk is probably the most efficient way.

ADD REPLY
0
Entering edit mode
ADD REPLY
27
Entering edit mode
10.1 years ago
Ram 44k

Heng Li's bioawk might be a good compromise between the simplicity of awk and readability of programming languages.

bioawk -c fastx '{ print $name, length($seq) }' < sequences.fa
ADD COMMENT
1
Entering edit mode

That's exactly what I was looking for. If you make that an answer, with the missing quotation mark added, I'll accept it. Thanks.

ADD REPLY
0
Entering edit mode

It's done :)

ADD REPLY
44
Entering edit mode
10.1 years ago

Why not samtools faidx sample.fa and then cut -f1-2 sample.fa.fai? The fasta index file will give you sequence name and length in the first two columns, and you probably already have this file indexed.

ADD COMMENT
11
Entering edit mode
6.1 years ago
mg ▴ 290

Using SeqKit's fx2tab:

seqkit fx2tab --length --name --header-line  foo.fasta

This prints sequence name and length with empty columns for sequences and qualities.

ADD COMMENT
0
Entering edit mode

seqkit really is the best

ADD REPLY
4
Entering edit mode
10.1 years ago
Renesh ★ 2.2k

You can make your own code for this,

Copy and paste following code in fasta_len.py file:

import sys
from Bio import SeqIO

FastaFile = open(sys.argv[1], 'rU')

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    print name, seqLen

FastaFile.close()

Make this file executable by:

chmod u+x fasta_len.py

and run as,

./fasta_len.py file.fasta

Note: you should have installed biopython

ADD COMMENT
1
Entering edit mode
import sys
from Bio.SeqIO.FastaIO import SimpleFastaParser

FastaFile = open(sys.argv[1], 'rU')
LenFile = open('./data/lengths.csv', 'w')

for name, seq in SimpleFastaParser(FastaFile):
    seqLen = len(seq)
    LenFile.write(name + ',' + str(seqLen) + '\n')

FastaFile.close()
LenFile.close()

This is somewhat more efficient for large files http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:low-level-fasta-fastq

ADD REPLY
0
Entering edit mode

I was hoping to avoid something like that (which is really not too burdensome at all, I know). I now realize that sometimes writing your own script can actually take less time than looking for a canned solution. (Doesn't always work out that way, though.)

ADD REPLY
0
Entering edit mode

Works like a charm. Thank you!

ADD REPLY
3
Entering edit mode
10.1 years ago
hpmcwill ★ 1.2k

Assuming you already have EMBOSS installed you could use 'infoseq' to get the format you desire. For example:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length sequences.fa
2AAA_HUMAN,589
AAPK1_PIG,385
AAPK1_MOUSE,559
AAPK1_PONAB,554
AAPK1_HUMAN,559
AAPK1_RAT,559
...

Depending on the identifier format used in your source fasta file you may need to choose between the appropriate -name, -accession, -gi or -seqversion option for the identifier column, or use the "-sformat pearson" option to maintain the original source identifier. For example to return the original identifiers when they are in db|acc|id format:

$ infoseq -auto -nocolumns -delimiter ',' -only -noheading -name -length -sformat pearson sequences.fa
sp|P30153|2AAA_HUMAN,589
sp|Q09136|AAPK1_PIG,385
sp|Q5EG47|AAPK1_MOUSE,559
sp|Q5RDH5|AAPK1_PONAB,554
sp|Q13131|AAPK1_HUMAN,559
sp|P54645|AAPK1_RAT,559
...
ADD COMMENT
0
Entering edit mode
10.1 years ago
onuralp ▴ 190

Granted that this is a ridiculous way to do this, I thought it might be useful to demonstrate the algorithmic steps and how they are implemented using Perl. You can think about how to modify and improve various steps, and write your own scripts.

Make sure that you have Perl installed on your computer.

C:\Users\bongbang\Desktop\perl -v

Copy and paste the following code into a text file and save it as calculateLength.pl

#!/usr/bin/perl
use warnings;
use strict;

# Point to where your input fasta file is located
my $fasta_file = "/Users/bongbang/Desktop/my_sequences.fasta";

# Declare that you want to read this file, and print the error message
# in case that the input fasta file does not exist at the specified path
open(FASTA, "<", $fasta_file) or die("Probably wrong path: $fasta_file\n");

# We will linearize the sequences into this hash
my %singleLineSequences;

# Initialize a variable to store sequence ids
my $sequence_id;

# Read the fasta file line by line
while(<FASTA>){
    my $line = $_; chomp($line);

    # if this is a new sequence with the header at the beginnning
    # Extract sequence id using regular expression (\S+) and
    # store it into the variable $sequence_id

    if ($line =~ m/^>(\S+)/){

        $sequence_id = $1; # e.g., YKR054C            

        # Reserve an entry in your hash
        # $sequence_id = YKR054C
        # $singleLineSequences{'YKR054C'} = ""
        # No sequence has been added yet, hence the empty value
        $singleLineSequences{$sequence_id} = ""; 

        }

    # if the line is not a header but part of a sequence, 
    # append this part to the corresponding sequence id in
    # your hash entry

    else {

        # paste the current line to the end of the sequence
        $singleLineSequences{$sequence_id} = $singleLineSequences{$sequence_id} . $line;

        }
   }

# Now that your hash contains single line sequences, you can simply
# loop over each sequence in your hash, determine the sequence length,
# and print it out

foreach my $sequence_entry (keys %singleLineSequences){

    # grab a hash entry and store it into a variable
    my $currentSequence = $singleLineSequences{$sequence_entry};

    # determine length of the sequence
    my $lengthSequence = length($currentSequence);

    # print the result: id,length
    print $sequence_entry . "," . $lengthSequence . "\n";
}

Go into the directory including your script, and execute it as follows:

C:\Users\bongbang\Desktop\perl calculateLength.pl
ADD COMMENT
0
Entering edit mode

Why not use BioPerl to achieve both module reuse as well as algorithm elicitation?

ADD REPLY
0
Entering edit mode

Hi,

I aim to identify the number of alanines as a function of the sequence length. Is there a way to count the number of alanines in each sequence along with the sequence length? An output that looks something like:

sp|P0A9Z1|GLNB_ECOLI,112,A=10

sp|P62517|OPGH_ECOLI,847,A=23

sp|P76290|CMOA_ECOLI,247,A=41

Thanks in advance.

ADD REPLY
0
Entering edit mode

What have you tried?

ADD REPLY
0
Entering edit mode
3 months ago
alejandrogzi ▴ 140

Hi all,

check this: https://github.com/alejandrogzi/chromsize

Usage

Usage: chromsize --fasta <FASTA> --output <OUTPUT> [-t <THREADS>]

Arguments:
    -f, --fasta <FASTA>: FASTA file
    -o, --output <OUTPUT>: path to chrom.sizes

Options:
    -t, --threads <THREADS>: number of threads [default: your max ncpus]
    --help: print help
    --version: print version

build in rust, the simplest and fastest way to get them. Works as a binary, library and is ported to python. Here is the benchmark with a lot of tools:

benchmark

ADD COMMENT

Login before adding your answer.

Traffic: 2711 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6