Question

BioPython: convert fasta to fastq without quality score input file

5

Entering edit mode

10.6 years ago

Josh Herr 5.8k

Here's another beginner BioPython question from me...

I'm running some genome assemblies for someone who has some new Illumina sequence data and also had done some sequencing a few years ago. They have some Sanger and 454 sequences (a couple thousand sequences with a couple thousand base pairs for each) and everything is in fasta format. They are trusted contigs and I'd like to use the data to help bridge assembly gaps of Illumina sequencing data (from the same extracted DNA that was kept frozen for a few years -- same bacterial strain). There are no quality files for the older data and if they ever existed they are long gone from the data directories.

What I want to do is create FASTQ files for the old Sanger and 454 FASTA files with a phred score of 40 -- I think that would be a I under PHRED-33.

(I know there are assemblers out there that take fasta files inputs along with fastq -- but my in-house makefile and pipeline does not and I really would like to incorporate these sequences in our current workflow without a lot of changes).

There are lots of resources for converting from fasta to fastq WITH a qual file -- see here, here, here, here, and here; but I've looked all over but can't find anything for converting without a qual file.

Here is a public perl script that does exactly what I want it to do -- but I can't get it to work for me and I keep getting an error.

As an exercise, I quickly wrote the script below, but I can't get it to provide the quality score as the length of the sequence. The quality string needs to be the same length as the input sequence string. Anyone have any hints for me? Any help is appreciated.

Here's my code so far:

#!/usr/bin/env python

"""
Convert FASTA to FASTQ file with a static quality read

Usage:
$ ./fasta_to_fastq NAME.fasta NAME.fastq
"""

# import libraries
import sys, os
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator

# Get inputs
fa_path = sys.argv[1]
fq_path = sys.argv[2]
phred_quality = "I"

# Check inputs
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)

# make fastq
with open(fq_path, "w") as handle:
    records = PairedFastaQualIterator(open(fa_path), print("phred_quality"))
    count = SeqIO.write(records, handle, "fastq")

# print fastq
print "Converted %i records" % count

parsing phred fasta BioPython fastq • 23k views

ADD COMMENT • link updated 18 months ago by capemaster • 0 • written 10.6 years ago by Josh Herr 5.8k

0

Entering edit mode

Pretty sure Q40 is an "I" (73).

ADD REPLY • link 10.6 years ago by Matt Shirley 10k

0

Entering edit mode

Thanks Matt Shirley, that was my first gut feeling, but I eventually thought wrong -- fixed my mistake.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Josh Herr 5.8k

Ram · Accepted Answer · 2014-05-06

8

Entering edit mode

10.6 years ago

Matt Shirley 10k

ADD COMMENT • link 10.6 years ago by Matt Shirley 10k

1

Entering edit mode

this is a better solution as it runs as a command line tool

ADD REPLY • link 10.6 years ago by Istvan Albert 102k

1

Entering edit mode

Thanks Matt Shirley, this is awesome! Much appreciated!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Josh Herr 5.8k

0

Entering edit mode

Hello Matt Shirley,

I used your code and it worked great! I have a question though, I am trying to now import these files to Geneious and it is asking "What quality encoding is used by these files?" The three options are: 1. Sanger/ Illumina 1.8+, 2. Solexa/ Illumina 1.0-1.2 or 3. Illumina 1.3-1.7. Thank you for any help you can give me!

ADD REPLY • link 8.6 years ago by lizgzara ▴ 10

1

Entering edit mode

I would choose Sanger since this is the standard for all new Illumina sequencing, and a Sanger scaled Phred quality score of 40 equals an error probability of 1 in 10,000. The Solexa and Illumina <1.8 quality score encoding uses a different range of ASCII characters, and so you'll want to stay away from those.

ADD REPLY • link 8.6 years ago by Matt Shirley 10k

Ram · Accepted Answer · 2014-05-06

4

Entering edit mode

10.6 years ago

Istvan Albert 102k

The solution is to create a record within each loop and assign qualities to that. This has some explanations: http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("NACGTACGTA")
rec = SeqRecord(seq, id="foo")
rec.letter_annotations["phred_quality"] = [40] * len(seq)
recs = [ rec ]
handle = file("out.fq", "wt")
SeqIO.write(recs, handle, "fastq")

Better yet do it with bioawk:

$ more fasta.awk 
{ 
  qual = ""
  for (i=0; I<length($seq); i++){
    qual = qual "I"
  }
  printf ("@%s\n%s\n+\n%s\n", $name, $seq, qual)
}

then run it as

cat test.fa | awk -c fastx -f fasta.awk

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 10.6 years ago by Istvan Albert 102k

2

Entering edit mode

And for those of us who have only "vanilla" awk:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq

Command corrected as suggested by h.mon (see below).

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.6 years ago by Frédéric Mahé ★ 3.2k

1

Entering edit mode

very cool indeed, but I guess for a correct fastq output the command should be:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.2 years ago by h.mon 35k

0

Entering edit mode

Oops, you are right h.mon, I used the wrong symbols (> and @ instead of @ and +).

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 9.2 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

The command with gawk is:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print $1"\n"$2"\n+\n"gensub(/./, "I", "g", $2)}'

ADD REPLY • link 18 months ago by capemaster • 0

0

Entering edit mode

Thank you, Frédéric Mahé, once again your skills come to the rescue!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Josh Herr 5.8k

0

Entering edit mode

Thank you Istvan Albert, both the python solution and the bioawk solution worked. I have to spend more time with both, but I keep forgetting to use bioawk. Thanks again!

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Josh Herr 5.8k

0

Entering edit mode

You've beaten me to the punch with an explanation, so I'm just trying out Gist embedding.

ADD REPLY • link 10.6 years ago by Matt Shirley 10k

Ram · Accepted Answer · 2014-05-06

4

Entering edit mode

10.6 years ago

Peter 6.0k

Since Matt & Istvan have given you SeqRecord based solutions using SeqIO, here's a light weight solution using strings:

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open("input.fasta") as in_handle:
    with open("output.fastq", "w") as out_handle:
        for title, seq in SimpleFastaParser(in_handle):
            out_handle.write("@%s\n%s\n+\n%s\n" \
                             % (title, seq, "I" * len(seq)))

Note that faking quality scores like this is probably not a good idea, and to be used with caution.

ADD COMMENT • link 9.6 years ago by Peter 6.0k

0

Entering edit mode

Thank you, Peter, much appreciated. I need to spend some time with SeqIO.

I have thought long and hard about faking the quality scores and I'm not completely convinced it's the way to go. I'm actually running the assemblies (as a test) with just the Illumina data, the Illumina data with Sanger/454 fasta input and with Illumina data with Sanger/454 in "faked" fastq. I'm going to compare integrity to the reference material we have. Maybe I will have more on that soon.

Thanks again for your help and your open source tools. Cheers.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by Josh Herr 5.8k