Counting Contigs in Fasta/GBK File
4.8 years ago

I want to check the number of contigs present in a FASTA or GBK file. I am aware that tools such as CheckM can report this, but is there a direct way to count the contigs in a sequence file with Python or Biopython?

checkm contigs genome • 5.0k views

You can try the basic utilities in *nix.


Like with grep commands, etc.?


An easy grep solution to count the entries in a GenBank file is to count the LOCUS lines:

grep -c "^LOCUS" multigenbank.gb

For a multifasta, you can use ^> instead of LOCUS as you have noted.
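
Since the question asked for Python, here is a minimal sketch of the same line-counting idea using only the standard library. The function name `count_record_lines` and the format keys are hypothetical, chosen for illustration. Like the grep commands, it assumes every record starts with `>` (FASTA) or `LOCUS` (GenBank) at the beginning of a line.

```python
def count_record_lines(path, fmt):
    """Count record-start lines, mirroring grep -c '^>' / grep -c '^LOCUS'.

    fmt is a hypothetical key: 'fasta' or 'genbank'.
    """
    prefixes = {"fasta": ">", "genbank": "LOCUS "}
    prefix = prefixes[fmt]
    with open(path) as handle:
        # Stream the file line by line so large assemblies are not held in memory
        return sum(1 for line in handle if line.startswith(prefix))


if __name__ == "__main__":
    import sys

    # Usage: python count_contigs.py fasta Streptomyces_sp_12.fna
    print(count_record_lines(sys.argv[2], sys.argv[1]))
```

This has the same caveat as grep: it trusts the file layout and does no real parsing.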

4.8 years ago

grep, sed, awk etc. Something like this:

$ cat test.fa 
>a
atgc
>b
atgc
>c
atgc

$ awk '/^>/ {a++} END {print "number of sequences in this file: " a}' test.fa
number of sequences in this file: 3

Yeah, I just tried this command and it worked for determining the number of contigs per file (just change the file extension as needed):

Individual File:

$ grep -c "^>" Streptomyces_sp_12.fna

Multiple Files :

$ grep -c "^>" *.fna


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

4.8 years ago
Joe 21k

Easy in BioPython.

 from Bio import SeqIO
 # Load every record (one per contig) into a list, then count them
 recs = list(SeqIO.parse('genbank.gbk', 'genbank'))
 print(len(recs))

This could be more memory efficient with an iterator, but this is a quick and easy way.

This is likely a more robust solution too, since the *nix approaches assume you know your files very well and that they contain no nasty surprises.
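
For the iterator version mentioned above, a rough sketch could look like the following. It assumes Biopython is installed; `count_contigs` is a made-up helper name, and the format string is whatever `SeqIO.parse` accepts (for example `'fasta'` or `'genbank'`).

```python
from Bio import SeqIO


def count_contigs(path, fmt):
    """Count records without keeping them all in memory.

    SeqIO.parse yields one record at a time, so we only tally them.
    """
    return sum(1 for _ in SeqIO.parse(path, fmt))


if __name__ == "__main__":
    import sys

    # Usage: python count_contigs.py genbank.gbk genbank
    print(count_contigs(sys.argv[1], sys.argv[2]))
```

Because each record is discarded as soon as it is counted, memory use stays roughly constant even for assemblies with many contigs.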
