Question

Biopython Not Indexing (Or Parsing) Full .Sff File, Why?

11.2 years ago

Matt ▴ 30

Please examine the following code (I've changed some things for privacy).

>>> from Bio import SeqIO
>>> reads = SeqIO.index("/somefile.sff", "sff")
>>> print(len(reads))
81234

This shows that 81,234 records were indexed.

However, the run statistics from the lab split the SFF file into two sections. The first section, region 1, has 81,234 reads. The second section, regions 7-9, has 49,876 reads.

When I try to read the file into a dictionary, I get this:

>>> reads = SeqIO.to_dict(SeqIO.parse("somefile.sff", "sff"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 672, in to_dict
    for record in sequences:
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 541, in parse
    for r in i:
  File ".../python2.7/site-packages/Bio/SeqIO/SffIO.py", line 882, in SffIterator
    raise ValueError("Additional data at end of SFF file")
ValueError: Additional data at end of SFF file

The only thing I can think of is that Biopython expects regions 2-6 to be present, and since they don't appear to exist in this file, it just blows up. I do have the .fna and .qual files to work with too, if need be, but I would really like to use just the .sff file.

biopython 454 python • 3.7k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 11.2 years ago by Matt ▴ 30

Sounds strange, as though maybe two SFF files have been blindly concatenated together (Biopython would read the first file and then complain about the unexpected second bit). SFF files should be merged with the Roche tools, not simply concatenated.

Can you share the SFF file (privately)?

ADD REPLY • link 11.2 years ago by Peter 6.0k

Hi, Peter. Thanks for the input. The data isn't mine, so I'd have to ask the PI I'm working with, but I think sharing is unlikely. When I look up the header information I get: (header_length 1640, index_offset 437421456, index_length 1690220, number_of_reads 81234, number_of_flows_per_read 1600). If there are multiple regions, will the header show the number of records for just the first region, or should it show the total for all of them?
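For reference, header values like these can be read with a few lines of standard-library Python. This is only a sketch based on my understanding of the SFF common header layout (big-endian, 31 fixed bytes at the start of the file). `read_sff_header` is a made-up helper, not a Biopython function, and the demo bytes are synthetic: the key_length and flowgram format code in them are arbitrary assumptions.

```python
import struct

# Fixed-size start of the SFF common header (big-endian, 31 bytes):
# magic, 4 version bytes, index_offset, index_length, number_of_reads,
# header_length, key_length, number_of_flows_per_read, flowgram_format_code.
SFF_HEADER_FMT = ">4s4BQIIHHHB"


def read_sff_header(data, offset=0):
    """Decode the fixed part of an SFF header found at `offset` in `data`."""
    fields = struct.unpack_from(SFF_HEADER_FMT, data, offset)
    magic, version = fields[0], fields[1:5]
    if magic != b".sff" or version != (0, 0, 0, 1):
        raise ValueError("No SFF header at offset %d" % offset)
    names = ("index_offset", "index_length", "number_of_reads",
             "header_length", "key_length", "number_of_flows_per_read",
             "flowgram_format_code")
    return dict(zip(names, fields[5:]))


# Synthetic header built from the values quoted above; key_length=4 and
# flowgram_format_code=1 are assumptions made only for this demo.
demo = struct.pack(SFF_HEADER_FMT, b".sff", 0, 0, 0, 1,
                   437421456, 1690220, 81234, 1640, 4, 1600, 1)
header = read_sff_header(demo)
print(header["number_of_reads"], header["index_offset"])
```

On a real file, the same helper could be given the first 31 bytes read in binary mode. As far as I understand the format, number_of_reads describes only the file this header belongs to, so reads in anything appended after it would not be counted there.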

ADD REPLY • link 11.2 years ago by Matt ▴ 30

There shouldn't be 'multiple regions'; there should be one and only one index block. From the description, I think your SFF file is invalid and was formed by concatenating multiple files. If you could send me the file privately, I would be able to verify that and split the file into two self-contained SFF files that could be read individually.

ADD REPLY • link 11.2 years ago by Peter 6.0k

Peter, thanks for offering to help us out. I'll send you a private message.

ADD REPLY • link 11.2 years ago by Matt ▴ 30

Ram · Answer 1 · 2013-09-15

Thanks, Matt, for sharing a sample file with me. This confirmed, as I had guessed, that it is actually several SFF format files concatenated together, which is not allowed under the original SFF definition, but perhaps Roche are extending it?

$ strings example.sff | grep -c "\.sff"
4

The next Biopython release, 1.63, will give clearer error messages (pending any further information about why these files exist).

For example,

>>> from Bio import SeqIO
>>> d = SeqIO.index("example.sff", "sff")
Traceback (most recent call last):
...
ValueError: Your SFF file is invalid, post index 4 byte null padding region ended '.sff' which could be the start of a concatenated SFF file? See offset 439111676

And,

>>> from Bio import SeqIO
>>> count = 0
>>> for r in SeqIO.parse("example.sff", "sff"): count += 1
Traceback (most recent call last):
...
ValueError: Your SFF file is invalid, post index 4 byte null padding region ended '.sff' which could be the start of a concatenated SFF file? See offset 439111676
>>> count
84475
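As a quick sanity check, and assuming the header values Matt quoted in the comments describe the same file as example.sff, the offset in the error message is exactly where the first file's index block ends:

```python
# Header values quoted in the comments above.
index_offset = 437421456
index_length = 1690220

# Offset reported in the ValueError message.
reported_offset = 439111676

end_of_index = index_offset + index_length
print(end_of_index == reported_offset)
```

That is consistent with a second SFF file starting immediately after the first file's index.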

If there is any clear information about this from Roche and it is a deliberate extension to the file format, then I'd hope to extend the Biopython SFF support to handle this. In the short term, you will need to split the file into separate, traditional SFF files (by looking for the marker string ".sff") and parse each one individually.
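A minimal sketch of that split, using only the standard library. It assumes each embedded file starts with the ".sff" magic number followed by the version bytes 0, 0, 0, 1, which is stricter than searching for ".sff" alone, since that string could also occur elsewhere in the file. `find_sff_starts` and `split_concatenated_sff` are hypothetical helper names, and the demo input is a synthetic stand-in, not real SFF data.

```python
SFF_MAGIC = b".sff\x00\x00\x00\x01"  # magic number plus version 1


def find_sff_starts(data):
    """Return offsets in `data` where an SFF header appears to start."""
    starts = []
    pos = data.find(SFF_MAGIC)
    while pos != -1:
        starts.append(pos)
        pos = data.find(SFF_MAGIC, pos + 1)
    return starts


def split_concatenated_sff(data):
    """Cut `data` at each apparent header start; returns a list of bytes."""
    starts = find_sff_starts(data)
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


# Tiny synthetic stand-in for two concatenated files (not real SFF content).
fake = SFF_MAGIC + b"first file body" + SFF_MAGIC + b"second"
parts = split_concatenated_sff(fake)
print(find_sff_starts(fake), [len(p) for p in parts])
```

Each piece could then be written out in binary mode and read with SeqIO.index or SeqIO.parse as in the question. The eight-byte marker could still occur by chance inside read data, so a more careful version would validate each candidate header, or use index_offset plus index_length from the preceding header, before cutting. For a file of several hundred megabytes, memory-mapping it rather than reading it all at once may also be preferable.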