How To Pull Out Qual From Multi Line Fastq?
2
1
Entering edit mode
13.0 years ago
Zhshqzyc ▴ 520

From http://biostar.stackexchange.com/questions/1389/how-to-generate-a-consensus-fasta-sequence-from-sam-tools-pileup We use awk and perl to get fasta by

awk '/^@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\\+//" > chr1.fasta

The sample data

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA  
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ

Now I want to extract the qual for each chromosome. But the qual score for each chromosome all beginning with "+", how can I extract them?

Thanks.

fastq quality • 5.2k views
ADD COMMENT
0
Entering edit mode

It quality doesn't wrap lines can't you just grep for "+"?

ADD REPLY
2
Entering edit mode
13.0 years ago

Here is quick and extremely dirty python script to pull out fastq quality not making assumptions about line wrapping:

import sys

inFile = open(sys.argv[1],'r')

header = ''
qual = ''

seqs = False
quals = False
for line in inFile:
    if line[0] == "@":
        if header != '':
            print ">" + header
            print qual
        header = line[1:].strip()
        quals = False
        qual = ''
    elif line[0] == "+":
        quals = True
    else:
        if quals:
            qual += line.strip()

print ">" + header
print qual

Save as yourName.py and run it by: python yourName.py yourFile.fastq > output.quals

edit * There was a mistake in the code where the qual variable wasn't being reset.

edit **

Here is the script to get just one entry:

import sys

inFile = open(sys.argv[1],'r')
queryHeader = sys.argv[2]

header = ''
qual = ''

seqs = False
quals = False
for line in inFile:
    if line[0] == "@":
        if header != '':
            if header == queryHeader:
                print ">" + header
                print qual
                sys.exit()
        header = line[1:].strip()
        quals = False
        qual = ''
    elif line[0] == "+":
        quals = True
    else:
        if quals:
            qual += line.strip()

print ">" + header
print qual

Save it as yourName.py. Use it by: python yourName.py yourFile.fastq "entry name"

ADD COMMENT
0
Entering edit mode

My fastq file is 6GB, I got the quals file is more than 12GB. Is it right?

ADD REPLY
0
Entering edit mode

No, it is 37.8GB. Could you please double check your code?

ADD REPLY
0
Entering edit mode

Oops, there was a mistake. I've fixed it.

ADD REPLY
0
Entering edit mode

Now I got:There are many returns, is it correct?

chr2 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ]]]`G]]`````cfTcffcfcOilifffEciiiififfcccccccfccffil Ullllliorrrrooiiooorueuuuuruxuuuxxrrrroorrrrrolllliilloooororoololoroollllllllllliilllrr QQQQQQQHQQQQQQQQQQQQQQQQTTTTQQQQQHQNNNNNNNNNNNNNKKKKKKKKHHK

NNKKKNTTTTWWWQWZZZZT`]ccccifcllllilllooorrooooorlflllooooil

]]]WWWWWTTTTTTTQWZZ]Z]]```]]Z]Z]]]]]]]]]cccfcliioouuuuuuu

{xxuuuuuruxxxx{{xx{xuuurrorlllloooorrrrooooooooouorrloouuuu

ADD REPLY
0
Entering edit mode

updated: If I only want to get qual for chromosome 1, how to change code?

ADD REPLY
0
Entering edit mode

I've added the code to just get one entry.

ADD REPLY
0
Entering edit mode

I've added the scrip for getting just one entry. Make sure you typed the name of the entry correctly with quotes if there are spacing. If you didn't type it correctly, it will just give you the last entry.

ADD REPLY
0
Entering edit mode

Many many many thanks.

ADD REPLY
0
Entering edit mode

line[0] can be '+' for one of the qual lines if the phred33 score is 10. So maybe this code may break in some cases.

0 1 2 3 4 01234567890123456789012345678901234567890 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI

ADD REPLY
0
Entering edit mode

line[0] can be '+' for one of the qual lines if the phred33 score is 10. So maybe this code may break in some cases.

ADD REPLY
0
Entering edit mode

Also @ is ASCII 64 which is 64 - 33 = PHRED Q 31. The quality line can contain and begin with both + and @.

ADD REPLY
1
Entering edit mode
13.0 years ago
Haluk ▴ 190

fq2fa tool extracts qual scores from fastq file.

Here is the link : http://www.phyloware.com/Phyloware/XSTK.html

ADD COMMENT

Login before adding your answer.

Traffic: 1808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6