Why does Biopython's SeqIO.parse() truncate sequences when it encounters repeated bases?
10 months ago
joreamayarom ▴ 140

I'm looping through a FASTQ file using the method described in this Stack Overflow post.

import gzip
from Bio import SeqIO

sampleid = 'G4G3811_S77'
with gzip.open('filename.fastq.gz', 'rt') as forward_fastq:
    for forward_record in SeqIO.parse(forward_fastq, 'fastq'):
        print(sampleid + " " + forward_record.id)
        print(forward_record.seq)
        print(len(forward_record.seq))

All sequences in my fastq.gz are 151 bp; however, sequences that contain repeated bases get truncated at a seemingly random place before the repeated segment. Is there a known explanation for this, or is this a bug? I checked Biopython's documentation and cannot find an answer. Is there a way to stop Biopython from truncating sequences? For reference, this is the output of the script above.

G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:13450:1010
ACAAAGCAAAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATGCTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTCTGGAAAAGATTCTGCTGACAAAATGCCTTT
151
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:9370:1040
GCGTGGTTTGGGAGGATTCTCA
22
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:10220:1050
ACAAAGCAGAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATACTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTGTGGAAAAGATTCTGCTGACTAATTGCCTTT
151
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:10850:1050
TGTGCTTATTTCCCTTTTTTTCTTTGCC
28
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:3240:1060
TGTGCTTATTTCCCTTTTTTTCTTTGCC
28

And these are the same sequences in the original file:

@FS10002148:5:BSB09416-2528:1:1101:13450:1010 1:N:0:77
ACAAAGCAAAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATGCTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTCTGGAAAAGATTCTGCTGACAAAATGCCTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:9370:1040 1:N:0:77
GCGTGGTTTGGGAGGATTCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
,FF,FFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFF:FFFFFF:FFFFFF:FF,FFFFFF:,FF,F:FFFF:F,F,FFF,F:,,FFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:10220:1050 1:N:0:77
ACAAAGCAGAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATACTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTGTGGAAAAGATTCTGCTGACTAATTGCCTTT
+
FFF,FFFF:FFFFFF,,:F:,FF:F,,:FFFFFF,FFFFF::,:F,F,FFF,FFFF,FFFFFFFF,F,FFFFFFFF,FF:,F:,FFF,,F,,FFFFFF,FFFFFFF,FF:FF,FFF,F::,,F:FFFFFF,,FFFFF,F:,FF,:FFFF:F
@FS10002148:5:BSB09416-2528:1:1101:10850:1050 1:N:0:77
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF,,FFFFFFFF:FFFFFFFF,F:F,F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:3240:1060 1:N:0:77
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTGGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,F:FFFFFF:F:FFFFF:FFFF::::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
biopython • 566 views

I cannot reproduce this issue on my laptop. I copy-pasted your FASTQ content into smpl.fq and gzipped it to smpl.fq.gz:

import gzip
from Bio import SeqIO

sampleid = 'xyz'
with gzip.open('./smpl.fq.gz', 'rt') as fw:
    for forward_record in SeqIO.parse(fw, 'fastq'):
        print(sampleid + " " + forward_record.id)
        print(forward_record.seq)
        print(len(forward_record.seq))

xyz FS10002148:5:BSB09416-2528:1:1101:13450:1010
ACAAAGCAAAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATGCTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTCTGGAAAAGATTCTGCTGACAAAATGCCTTT
151
xyz FS10002148:5:BSB09416-2528:1:1101:9370:1040
GCGTGGTTTGGGAGGATTCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
151
xyz FS10002148:5:BSB09416-2528:1:1101:10220:1050
ACAAAGCAGAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATACTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTGTGGAAAAGATTCTGCTGACTAATTGCCTTT
151
xyz FS10002148:5:BSB09416-2528:1:1101:10850:1050
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
151
xyz FS10002148:5:BSB09416-2528:1:1101:3240:1060
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTGGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
151

I think you might be low on memory. Can you try your code on the first 20000 lines of the FASTQ file and see if that works?
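If it helps, here is a rough sketch of how that check could be done without Biopython at all, so you can see whether the short reads are already short in the file or only after parsing. The function name `fastq_lengths` is made up for this example, and it assumes a plain four-lines-per-record FASTQ (no wrapped sequence lines), which is what your excerpt looks like.

```python
import gzip
from itertools import islice


def fastq_lengths(path, max_lines=20000):
    """Yield (read_id, seq_len, qual_len) for records in the first max_lines lines.

    Hypothetical helper: assumes each record is exactly four lines
    (header, sequence, '+', quality) with no line wrapping.
    """
    with gzip.open(path, "rt") as handle:
        # Only the first max_lines lines are read, so memory use stays small.
        lines = [line.rstrip("\r\n") for line in islice(handle, max_lines)]
    # Ignore a trailing partial record if max_lines is not a multiple of 4.
    usable = len(lines) - len(lines) % 4
    for i in range(0, usable, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Drop the leading '@' and keep the ID up to the first whitespace.
        yield header[1:].split()[0], len(seq), len(qual)


# Example use (the file name is taken from the question):
# for read_id, seq_len, qual_len in fastq_lengths("filename.fastq.gz"):
#     if seq_len != 151 or seq_len != qual_len:
#         print(read_id, seq_len, qual_len)
```

If this reports 151 for the reads that SeqIO.parse() shows as 22 or 28 bases, the file itself is fine and the problem is somewhere in the environment rather than in the data.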


I agree with Ram: there's nothing in Biopython itself that should be driving this behaviour.

It's either a hardware limitation as above, or you may need to recompile/reinstall Biopython.
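Before reinstalling, it may be worth noting which Biopython version is actually installed in the environment that runs the script. A minimal sketch using only the standard library follows; `installed_version` is a name invented for this example, and it assumes the package was installed under the distribution name `biopython` (as it is on PyPI).

```python
from importlib import metadata


def installed_version(package="biopython"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this interpreter's environment.
        return None


print("biopython:", installed_version())
```

If this prints `None`, or a different version than expected, the script is probably running under a different Python environment from the one where Biopython was installed.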
