Question

How to trim ngs sequences according to the phred score using python/biopython

0

Entering edit mode

10.6 years ago

linlinli99013 ▴ 10

For fastq file, .letter_annotations['phred_quality'] will give you the quality score of each base in your sequence.

For example:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCAATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//EG3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:

It gave (output formatted, folded and highlight added by @Ram for easy eyeballing):

[34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38,
 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39,
 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39,
 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36,
 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38, 19, 18, 18,    #trim starting at this 19,18,18 triplet
 37, 38, 39, 39, 38, 38, 38, 38, 14, 27, 32, 28, 30, 35, 35, 37,
 38, 37, 14, 14, 14, 28, 14, 30, 16, 16, 37, 38, 39, 38, 12, 29,
 12, 31, 35, 13, 31, 13, 12, 13, 13, 25, 14, 13, 12, 25, 31, 31,
 34, 37, 14, 15, 25, 37, 37, 37, 37, 38, 15, 26, 25]

I intend to trim the sequence from 3 consecutive Q<20 bases. How do I write the biopython program?

Thanks

next-gen biopython • 7.6k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 10.6 years ago by linlinli99013 ▴ 10

0

Entering edit mode

pl: What I want to do is to keep the sequence part with high scores [34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36, 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38] and delete the rest part.

ADD REPLY • link 10.6 years ago by linlinli99013 ▴ 10

Ram · Answer 1 · 2014-05-10

2

Entering edit mode

10.6 years ago

Peter 6.0k

Given a SeqRecord in the variable record, get the scores using scores = record.letter_annotations['phred_quality'] and find the index of the three poor quality scores, say i, then trim the SeqRecord object to the point using record[:i] and write that trimmed record out.

See also the trimming examples in [the Tutorial][]1

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.6 years ago by Peter 6.0k

score 0 · Answer 2 · 2014-05-09

Why reinvent the wheel? Why not use SeqTK/Trimmomatic?

EDIT: I see the problem now - these algorithms might not allow trimming in such ups-and-downs Q-score cases. Just to be sure, have you explored all QC/trimming software available?

EDIT: Trimmomatic has a SLIDINGWINDOW option that might be useful.

Ram · Answer 3 · 2014-05-10

0

Entering edit mode

10.6 years ago

linlinli99013 ▴ 10

Thank you all for the helpful answers. I tried the following lines, but they didn't work well. It trimmed one sequence many times at different low quality spots (3 consecutive Q<20 bases). I don't know how to improve it.

from Bio import SeqIO

trimseq=[]

for rec in SeqIO.parse('C:\work in SF\short.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    for i in range(len(read_quals)):
        if read_quals[i]<=20 and read_quals[i+1]<=20 and read_quals[i+2]<=20:
           trimseq.append(rec[:i])
        else:
            trimseq.append(rec)

ADD COMMENT • link updated 21 months ago by Ram 44k • written 10.6 years ago by linlinli99013 ▴ 10

1

Entering edit mode

You need to break out of your for loop the first time the conditions are fulfilled - add break after trimseq.append(rec[:i]). Alternatively, this is a bit more streamlined:

read_quals = rec.letter_annotations['phred_quality']
count = 0
for i, q in enumerate(read_quals):
    if q <= 20:
        count += 1
        if count == 3:
            break 
    elif count:
        count = 0
print(read_quals[:i-2])

ADD REPLY • link updated 21 months ago by Ram 44k • written 10.6 years ago by george.ry ★ 1.2k

0

Entering edit mode

Thanks a lot for your reply. That did help. But when there are more than one sequences. It didn't work.

for rec in SeqIO.parse('C:\...\shortone.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count=0
    for i, q in enumerate(read_quals):
        if q<=20:
            count +=1
            if count==3:
                break
                    print read_quals[:i-2]
        elif count:
            count=0
            print read_quals

shortone.fastq is the following:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCA
ATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//E
G3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:
@M02339:45:000000000-A8L62:1:1101:11582:2214 1:N:0:5
TTCCAGCACCGGGCAGGCGTCAGCCCCTATACTTCACCTTTCGGTTTCGCAGAGACCTGTGTTTTTGATAAACAGTC
GCTTGGGCCTATTCACTGCGGCCTACTCTCGTAGGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCT
TAACGATGGTTCTCTCGCTCACCTTAGAATTCTCTTCCCAACTACCTGTGTCGGTTTGCGGTACGGGCACCTCTTATCTCG
ATAGCGGCTTTTCTT
+
CCCCDFFFFDEEGGGGGGGGGGGHHGGHGHHHHHHHHHHHHHGGGGGHGGGGGHHHGHHHHGHHHGGGHHHHH
HHHHGGGHGGHHHGHHHHHHHHHGGGGGHHHHHHGGGGGHHHGGGGHHHHHHGGGGGHHHGGGGGGHHHHHHH
HG-AGGHHHHHHHGGHGGHHHHHHGGGGGGGGGGGGGGGGGGGHGGGGGGGGGGGGFFFFFFFFFFFHFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

ADD REPLY • link updated 21 months ago by Ram 44k • written 10.6 years ago by linlinli99013 ▴ 10

1

Entering edit mode

Sorry that wasn't really complete code, it was just a quick thing to help you on your way. You don't actually say what doesn't work, though I assume it's just down to a strange flow through the code. Hard to tell, as the indentation is very strange - Python wouldn't let you indent the print statement after the break without throwing an error, so I assume this isn't how you have it on your machine. In any case, being after the break, that line will never print anything anyway. Something like the following should be more complete - you can replace the print statements at the bottom with something useful.

for rec in SeqIO.parse(filename, 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        print read_quals
    else:
        print read_quals[:i-2]

ADD REPLY • link updated 21 months ago by Ram 44k • written 10.6 years ago by george.ry ★ 1.2k

Ram · Answer 4 · 2014-05-15

Final solution by george.ry. It works very well! Thanks a lot.

from Bio import SeqIO
trimseq=[]
for rec in SeqIO.parse('C:\work in SF\short.fastq', 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        trimseq.append(rec)
    else:
        trimseq.append(rec[:i-2])

print "Saved %i reads" % len(trimseq)
output_handle=open("C:\work in SF\qtrim_seqs.fastq", "w")
SeqIO.write(trimseq, output_handle, "fastq")
output_handle.close()