How to count fasta sequences efficiently using (or not ) biopython
6
2
Entering edit mode
6.8 years ago

This is not a very memory friendly way of counting sequences from a multi fasta, any ideas to improve this?

generator = SeqIO.parse("test_fasta.fasta","fasta")
sizes = [len(rec) for rec in SeqIO.parse("test_fasta.fasta", "fasta")]

I'm avoiding using tools like grep since I want to make this more portable

biopython python fasta • 20k views
ADD COMMENT
0
Entering edit mode

How to cont fasta

Count/concatenate/check length?

ADD REPLY
0
Entering edit mode

Count as in the description, also could have guessed from the example. PD title updated

ADD REPLY
0
Entering edit mode

bioawk? github.com/lh3/bioawk

ADD REPLY
6
Entering edit mode
6.8 years ago

Standard Python will be faster than BioPython:

fh = open("test_fasta.fasta")
n = 0
for line in fh:
    if line.startswith(">"):
        n += 1
fh.close()

or shorter and possibly faster:

num = len([1 for line in open("test_fasta.fasta") if line.startswith(">")])
ADD COMMENT
6
Entering edit mode
6.8 years ago
tiago211287 ★ 1.5k

Count the number of sequences in 1 fasta file:

grep -c ">" file.fasta

Count the number of sequences in several fasta files:

find /home/folder/to/file/ -name "*.fasta" | parallel grep -Hc ">" {}

Get the length of each instance of a fasta file:

grep -v ">" 180110.fasta | awk '{print length}'
ADD COMMENT
1
Entering edit mode
6.8 years ago
yhoogstrate ▴ 150

You could take a look at this python lib: https://github.com/mdshw5/pyfaidx. It makes use of .fai files or may generate one. It is also compatible with gziped fasta's.

ADD COMMENT
1
Entering edit mode
2.9 years ago

Using grep in Python

import subprocess 
subprocess.call(['grep', '-c', '>', 'test_fasta.fasta'])
ADD COMMENT
0
Entering edit mode
2.9 years ago
onestop_data ▴ 330

I think using pysam is the fastest way to read a FASTA/FASTQ file - especially large files. You can use it to read the FASTA file and count the number of sequences - https://onestopdataanalysis.com/read-fasta-file-python/

ADD COMMENT
0
Entering edit mode
12 months ago
Mark ★ 1.6k

According to this SO post: https://stackoverflow.com/questions/393053/length-of-generator-output

This is the fastest:

len(list(generator))

This assumes that each fasta sequence is a record, so counting the records should equal the number of sequences. One issue is that this method will consume your generator.

ADD COMMENT

Login before adding your answer.

Traffic: 1961 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6