Here's another beginner BioPython question from me...
I'm running some genome assemblies for someone who has some new Illumina sequence data and also had done some sequencing a few years ago. They have some Sanger and 454 sequences (a couple thousand sequences with a couple thousand base pairs for each) and everything is in fasta format. They are trusted contigs and I'd like to use the data to help bridge assembly gaps of Illumina sequencing data (from the same extracted DNA that was kept frozen for a few years -- same bacterial strain). There are no quality files for the older data and if they ever existed they are long gone from the data directories.
What I want to do is create FASTQ files for the old Sanger and 454 FASTA files with a phred score of 40 -- I think that would be a I
under PHRED-33.
(I know there are assemblers out there that take fasta files inputs along with fastq -- but my in-house makefile and pipeline does not and I really would like to incorporate these sequences in our current workflow without a lot of changes).
There are lots of resources for converting from fasta to fastq WITH a qual file -- see here, here, here, here, and here; but I've looked all over but can't find anything for converting without a qual file.
Here is a public perl script that does exactly what I want it to do -- but I can't get it to work for me and I keep getting an error.
As an exercise, I quickly wrote the script below, but I can't get it to provide the quality score as the length of the sequence. The quality string needs to be the same length as the input sequence string. Anyone have any hints for me? Any help is appreciated.
Here's my code so far:
#!/usr/bin/env python
"""
Convert FASTA to FASTQ file with a static quality read
Usage:
$ ./fasta_to_fastq NAME.fasta NAME.fastq
"""
# import libraries
import sys, os
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
# Get inputs
fa_path = sys.argv[1]
fq_path = sys.argv[2]
phred_quality = "I"
# Check inputs
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)
# make fastq
with open(fq_path, "w") as handle:
records = PairedFastaQualIterator(open(fa_path), print("phred_quality"))
count = SeqIO.write(records, handle, "fastq")
# print fastq
print "Converted %i records" % count
Pretty sure Q40 is an "I" (73).
Thanks Matt Shirley, that was my first gut feeling, but I eventually thought wrong -- fixed my mistake.