Why are there different record counts when using grep vs biopython's SeqIO.parse()?

0

Entering edit mode

8.1 years ago

ahtmatrix • 0

I'm trying to figure out the disparity between the 2 record counts when I'm using grep and this simple Biopython record counter script. I'm using the following gbk file: ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/RNA/rna.gbk.gz

When using grep -c "//" rna.gbk, I get 176432 records. When using the following script, I get 176426 records.

from Bio import SeqIO
import sys
import os

filename = sys.argv[1]

count = 0

if filename.endswith('.gbk'):
    filetype = "genbank"
elif filename.endswith('.fasta'):
    filetype = "fasta"

for record in SeqIO.parse(filename, filetype):
    count = count + 1


print("There were " + str(count) + " records in file " + filename)

So there's a 6 record difference between the 2 methods. Why is this?

EDIT: using grep -c ACCESSION rna.gbk gives 176426 records and so does grep LOCUS rna.gbk | grep -c RNA

biopython grep SeqIO • 2.3k views

ADD COMMENT • link 8.1 years ago by ahtmatrix • 0

1

Entering edit mode

Maybe the file has some duplicates? grep LOCUS rna.gbk | grep -c RNA will be consistent with BioPython.

ADD REPLY • link 8.1 years ago by Eric Lim ★ 2.2k

0

Entering edit mode

how does that command work? its grabbing all LOCUS and then piping back to grep to grab RNA within LOCUS?