10 months ago
joreamayarom
I'm looping through a FASTQ file using the method stated in this Stack Overflow post.
import gzip
from Bio import SeqIO
sampleid = 'G4G3811_S77'
with gzip.open('filename.fastq.gz', 'rt') as forward_fastq:
    for forward_record in SeqIO.parse(forward_fastq, 'fastq'):
        print(sampleid + " " + forward_record.id)
        print(forward_record.seq)
        print(len(forward_record.seq))
All sequences in my fastq.gz are 151 bp; however, sequences that contain repeated bases get truncated at a random (?) place before hitting the repeated segment. Is there a known explanation for this, or is this a bug? I checked Biopython's documentation and cannot find an answer. Is there a way to get Biopython not to truncate sequences? For reference, this is the output of the script provided above.
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:13450:1010
ACAAAGCAAAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATGCTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTCTGGAAAAGATTCTGCTGACAAAATGCCTTT
151
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:9370:1040
GCGTGGTTTGGGAGGATTCTCA
22
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:10220:1050
ACAAAGCAGAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATACTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTGTGGAAAAGATTCTGCTGACTAATTGCCTTT
151
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:10850:1050
TGTGCTTATTTCCCTTTTTTTCTTTGCC
28
G4G3811_S77 FS10002148:5:BSB09416-2528:1:1101:3240:1060
TGTGCTTATTTCCCTTTTTTTCTTTGCC
28
And these are the same sequences in the original file:
@FS10002148:5:BSB09416-2528:1:1101:13450:1010 1:N:0:77
ACAAAGCAAAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATGCTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTCTGGAAAAGATTCTGCTGACAAAATGCCTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:9370:1040 1:N:0:77
GCGTGGTTTGGGAGGATTCTCACTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
,FF,FFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFF:FFFFFF:FFFFFF:FF,FFFFFF:,FF,F:FFFF:F,F,FFF,F:,,FFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:10220:1050 1:N:0:77
ACAAAGCAGAAAAGCCCGCTGGAAAAGGATCCCCTTCAACTTTGCAAACCCCAGGAAGTTCTTCAGGTGCCTCTCTTCATACTGTTGGACCTAATCAAGGTGGACTATCTCAAGGTCTTTGTGGAAAAGATTCTGCTGACTAATTGCCTTT
+
FFF,FFFF:FFFFFF,,:F:,FF:F,,:FFFFFF,FFFFF::,:F,F,FFF,FFFF,FFFFFFFF,F,FFFFFFFF,FF:,F:,FFF,,F,,FFFFFF,FFFFFFF,FF:FF,FFF,F::,,F:FFFFFF,,FFFFF,F:,FF,:FFFF:F
@FS10002148:5:BSB09416-2528:1:1101:10850:1050 1:N:0:77
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF,,FFFFFFFF:FFFFFFFF,F:F,F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10002148:5:BSB09416-2528:1:1101:3240:1060 1:N:0:77
TGTGCTTATTTCCCTTTTTTTCTTTGCCCTGTCTCTATACACATCTCCGAGCCCACGAGACATGACAGCACATCTGGTATGCCGTCTTCTGCTTGAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,,F:FFFFFF:F:FFFFF:FFFF::::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I cannot reproduce this issue on my laptop. I copy-pasted your FASTQ content as smpl.fq and gzipped it to smpl.fq.gz.
I think you might be low on memory - can you try your code on the first 20000 lines of the FQ and see if that works?
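If it helps, here is a minimal sketch for making that smaller test file: it copies the first 20000 lines (5000 four-line records) of the gzipped FASTQ into a new gzipped file. The function name write_fastq_head and the output file name are placeholders I made up for illustration; only the standard-library gzip and itertools modules are used.

```python
import gzip
from itertools import islice


def write_fastq_head(src, dst, n_lines=20000):
    """Copy the first n_lines lines of a gzipped FASTQ into a new gzipped file.

    Lines are copied byte-for-byte (binary mode), so nothing is re-encoded.
    Returns the number of lines actually written, which is smaller than
    n_lines if the input file is shorter.
    """
    if n_lines % 4:
        # A FASTQ record is four lines here, so avoid cutting a record in half.
        raise ValueError("n_lines should be a multiple of 4")
    written = 0
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        for line in islice(fin, n_lines):
            fout.write(line)
            written += 1
    return written
```

Calling write_fastq_head('filename.fastq.gz', 'subset.fastq.gz') and then pointing the original script at subset.fastq.gz should show whether the truncation still happens on a small input.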
Agree with Ram - there's nothing within Biopython in general that should be driving this behaviour.
It's either a hardware limitation as above, or you may need to recompile/reinstall Biopython.
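One way to separate the file from the parser is to measure the read lengths without Biopython at all. The sketch below is only an illustration: fastq_lengths is a made-up helper name, and it assumes plain four-line FASTQ records (no wrapped sequence lines), as in the excerpt posted above. If it also reports 22 or 28 for the affected reads, the truncation is in the file being opened rather than in Biopython.

```python
import gzip
from itertools import islice


def fastq_lengths(path):
    """Yield (read_id, seq_len, qual_len) for each record of a gzipped FASTQ.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, 4))
            if not record:
                break  # clean end of file
            if len(record) < 4:
                raise ValueError("incomplete record at end of file")
            header, seq, _plus, qual = (line.rstrip("\r\n") for line in record)
            # Drop the leading '@' and keep the ID up to the first space.
            read_id = header[1:].split(" ", 1)[0]
            yield read_id, len(seq), len(qual)
```

Looping over fastq_lengths('filename.fastq.gz') and printing every record whose sequence length is not 151, or differs from its quality length, gives a list that can be compared directly with the lengths printed by the SeqIO loop.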