Please examine the following code (I've changed some things for privacy).
from Bio import SeqIO
reads = SeqIO.index("/somefile.sff", "sff")
print(len(reads))
81234
This shows that I have 81,234 records indexed.
However, the run statistics from the lab split the SFF file into two sections. The first section, region 1, has 81,234 reads. The second section, regions 7-9, has 49,876 reads.
When I try to read the file into a dictionary, I get this:
reads = SeqIO.to_dict(SeqIO.parse("somefile.sff", "sff"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 672, in to_dict
    for record in sequences:
  File ".../python2.7/site-packages/Bio/SeqIO/__init__.py", line 541, in parse
    for r in i:
  File ".../python2.7/site-packages/Bio/SeqIO/SffIO.py", line 882, in SffIterator
    raise ValueError("Additional data at end of SFF file")
ValueError: Additional data at end of SFF file
The only thing I can think of is that perhaps Biopython is expecting regions 2-6 to exist, and since they don't appear to be in this file, it just blows up. I do have the .fna and .qual files to work with too, if need be, but I would really like to just use the .sff file.
Sounds strange, as though maybe two SFF files have been blindly concatenated together (Biopython would read the first file and then complain about the unexpected second bit). SFF files should be merged with the Roche tools, not simply concatenated.
Can you share the SFF file (privately)?
Hi, Peter. Thanks for the input. The data isn't mine, so I'd have to ask the PI I'm working with about that, but I would think sharing is probably unlikely. When I look up the header information I get: (header_length 1640, index_offset 437421456, index_length 1690220, number_of_reads 81234, number_of_flows_per_read 1600). If there are multiple regions, will the header show the number of records for just the first, or for all of them?
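For anyone wanting to reproduce those header values without Biopython: the fixed part of the SFF common header is 31 bytes of big-endian fields, so it can be decoded with the standard library. This is only a minimal sketch under that assumption, and parse_sff_header and read_sff_header are hypothetical helper names, not Biopython functions. The header describes a single SFF file, so if the file really is two files stuck together, number_of_reads would be expected to cover only the first one, which would fit the 81,234 seen here.

```python
import struct

# Fixed-size part of the SFF common header, big-endian, 31 bytes in total:
# magic (4s), version (4B), index_offset (Q), index_length (I),
# number_of_reads (I), header_length (H), key_length (H),
# number_of_flows_per_read (H), flowgram_format_code (B)
SFF_HEADER_FMT = ">4s4BQIIHHHB"
SFF_HEADER_SIZE = struct.calcsize(SFF_HEADER_FMT)


def parse_sff_header(data):
    """Decode the fixed part of an SFF common header from raw bytes."""
    if len(data) < SFF_HEADER_SIZE:
        raise ValueError("Need at least %d bytes of header" % SFF_HEADER_SIZE)
    fields = struct.unpack(SFF_HEADER_FMT, data[:SFF_HEADER_SIZE])
    if fields[0] != b".sff":
        raise ValueError("Not an SFF file (bad magic number)")
    return {
        "version": fields[1:5],
        "index_offset": fields[5],
        "index_length": fields[6],
        "number_of_reads": fields[7],
        "header_length": fields[8],
        "key_length": fields[9],
        "number_of_flows_per_read": fields[10],
        "flowgram_format_code": fields[11],
    }


def read_sff_header(path):
    """Hypothetical convenience wrapper: decode the header of a file on disk."""
    with open(path, "rb") as handle:
        return parse_sff_header(handle.read(SFF_HEADER_SIZE))
```

Calling read_sff_header("somefile.sff") should return a dictionary with the same fields as quoted above.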
There shouldn't be 'multiple regions'; there should be one and only one index block. From the description, I think your SFF file is invalid and was formed by concatenating multiple files. If you could send me the file privately I would be able to verify that, and split the file into two self-contained SFF files which could be read individually.
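The concatenation idea can also be checked locally without sharing the data. The sketch below is a rough check, not a validator. It ASSUMES the index block is the last thing in the first file (it need not be in every SFF file) and that it is padded to a multiple of 8 bytes. It then looks for a second ".sff" magic number at that position. The function names are hypothetical.

```python
import struct

SFF_MAGIC = b".sff"
SFF_HEADER_FMT = ">4s4BQIIHHHB"  # fixed part of the SFF common header
SFF_HEADER_SIZE = struct.calcsize(SFF_HEADER_FMT)


def end_of_first_sff(header_bytes):
    """Return the offset where the first SFF file should end.

    ASSUMPTION: the index block comes last and is padded to 8 bytes.
    """
    fields = struct.unpack(SFF_HEADER_FMT, header_bytes[:SFF_HEADER_SIZE])
    magic, index_offset, index_length = fields[0], fields[5], fields[6]
    if magic != SFF_MAGIC:
        raise ValueError("Not an SFF file (bad magic number)")
    if index_offset == 0 or index_length == 0:
        raise ValueError("No index block recorded; this shortcut cannot be used")
    end = index_offset + index_length
    return end + (-end) % 8  # round up to the next multiple of 8


def find_concatenated_sff(path):
    """Return the offset of a second embedded SFF file, or None if not found."""
    with open(path, "rb") as handle:
        boundary = end_of_first_sff(handle.read(SFF_HEADER_SIZE))
        handle.seek(boundary)
        if handle.read(4) == SFF_MAGIC:
            return boundary
    return None
```

With the header values quoted above, the boundary would be 437421456 + 1690220 = 439111676, rounded up to 439111680. If find_concatenated_sff("somefile.sff") returns that offset, the bytes before it and the bytes from it onward could in principle be written out as two separate SFF files, though the results should still be checked by parsing each one.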
Peter, thanks for offering to help us out. I'll send you a private message.