Question

Parse Fastq File - Pad Reads With N'S

1

Entering edit mode

13.5 years ago

Steffi ▴ 590

I would like to parse a fastq file as follows: Nominal read length is 100, but some reads are shorter than others. In that case I want to pad these reads with N's and their corresponding quality values as well. So in the end all reads should have the same length. Any idea how to do this most quickly? Shell scripting or perl? Or does somebody knows a tool who already offers something along these lines? Thanx!

fastq parsing • 7.6k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 13.5 years ago by Steffi ▴ 590

1

Entering edit mode

I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.

ADD REPLY • link 13.5 years ago by lh3 33k

1

Entering edit mode

this is just nuts, just contact the RUM people about this issue

ADD REPLY • link 13.5 years ago by Jeremy Leipzig 23k

0

Entering edit mode

Is Python OK for you? Do you have a particular quality score in mind for the padded N bases?

ADD REPLY • link 13.5 years ago by Peter 6.0k

0

Entering edit mode

Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.

ADD REPLY • link 13.5 years ago by Steffi ▴ 590

0

Entering edit mode

Ah sorry, I did not answer the quality score question: I think it does not matter, just put anything (valid).

ADD REPLY • link 13.5 years ago by Steffi ▴ 590

0

Entering edit mode

I have done this, but til they have checked this I just want to go on with the data analysis...

ADD REPLY • link 13.5 years ago by Steffi ▴ 590

0

Entering edit mode

Can this be done with fasta files?

ADD REPLY • link 8.4 years ago by Longshotx ▴ 70

Ram · Answer 1 · 2012-02-08

8

Entering edit mode

13.5 years ago

Istvan Albert 102k

As long as the sequences in the fastq file do not span multiple lines you can get by with a very elegant and fast python solution that uses iterators. This avoids all conditionals and other complexities, it could look like this:

import itertools, string, sys

# strip newlines from the standard input
stream = itertools.imap(string.strip, sys.stdin)

# this works only if sequences span only one line
for head in stream:

    # advance the stream
    seq  = stream.next()
    tmp  = stream.next()
    qual = stream.next()

    # this is how much to pad
    size = 100 - len(seq)

    # pad each line
    seq  = seq  + "N" * size
    qual = qual + "2" * size

    # print the joined elements
    print "\n".join( (head, seq, tmp, qual) )

ADD COMMENT • link 13.5 years ago by Istvan Albert 102k

1

Entering edit mode

Very short and elegant solution! Only 4 lines at a time are loaded in memory? It reads from the stdin, so I can use it like this? python awesome.py < myfile.fastq > mypaddedfile.fastq

ADD REPLY • link 13.5 years ago by Frédéric Mahé ★ 3.3k

0

Entering edit mode

yes to all, perhaps I should add it to the description

ADD REPLY • link 13.5 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks! I have used it and it worked perfectly fine and very fast!

ADD REPLY • link 13.5 years ago by Steffi ▴ 590

0

Entering edit mode

This is wonderful. Thanks! A note to anyone else who might run into the same thing I did: I had to change size to 101.

ADD REPLY • link 13.4 years ago by User 2383 ▴ 10

0

Entering edit mode

This is an old thread but still.. any reason why you pad the quality line with "2"s? Wouldn't the lowest quality score "!" make more sense here?

ADD REPLY • link 9.9 years ago by aebmad ▴ 60

0

Entering edit mode

I forgot the exact reason and it may not be applicable here - I think the rationale in general when generating qualities is that some tools ignore bases that are very low in quality so we need to avoid making them too low. But then why not make them maximal? There is a reason there too, a quality such as say E can come from different types of encoding, so using that makes the encoding the file ambiguous so that too may be a problem. Picking 2 seems like a good choice between indicating Sanger encoding and avoiding bases being ignored.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.9 years ago by Istvan Albert 102k

score 3 · Answer 2 · 2012-02-08

3

Entering edit mode

13.5 years ago

Rok ▴ 190

GNU Awk version:

awk 'NR%4==0 {print t $0}; NR%4==2 {while(a++<100-length($0)){s=s "Q"; t=t "N"};print s $0}; NR%2==1 {print}' file.fastq

You can define different characters for replacement in sequence line and quality score line(N and Q above).

GNU Sed version (do not use, really slow):

sed -e :a -e '2~4 s/^.\{1,99\}$/N&/;ta' -e '4~4 s/^.\{1,99\}$/Q&/;ta' file.fastq

ADD COMMENT • link 13.5 years ago by Rok ▴ 190

0

Entering edit mode

this is neat, we have kind of a code-golf going on

ADD REPLY • link 13.5 years ago by Istvan Albert 102k

0

Entering edit mode

I thought awk version would need less time than Python version, but on 2 million lines they needed the same time.

ADD REPLY • link 13.5 years ago by Rok ▴ 190

0

Entering edit mode

I would have thought the other way around since awk concatenates one by one - real life tests are often surprising

ADD REPLY • link 13.5 years ago by Istvan Albert 102k

0

Entering edit mode

Is the awk version correct? It pads at the beginning of the line, not at the end. Plus, quality and sequence paddings seem to be inverted. Rok can you edit your post, please. I propose this corrected solution: awk 'NR%4==0 {print $0 t}; NR%4==2 {while(a++<100-length($0)){s=s "N"; t=t "Q"};print $0 s}; NR%2==1 {print}' file.fastq

ADD REPLY • link 13.5 years ago by Frédéric Mahé ★ 3.3k

score 3 · Answer 3 · 2012-02-08

3

Entering edit mode

13.5 years ago

Gustavo ▴ 530

No Perl version yet? :)

#!/bin/env perl
$target = 100;
while (<>) {
    if (/^(\@|\+)/) {
        print;
        $_ = <>;
        chomp;
        print $_, ($1 eq '@' ? 'N' : '2') x ($target-length()), "\n";
    }
}

ADD COMMENT • link 13.5 years ago by Gustavo ▴ 530

Ram · Answer 4 · 2012-02-08

Here's something in python. Doesn't require biopython. It's pretty ugly, but should get the job done:

import sys

inFile = open(sys.argv[1],'r')

header = seq = qual = ''
seqs = quals = False

for line in inFile:
    if line[0] == "@":
        if header != '':
            if len(seq) < 100:
                pad = 100-len(seq)
                seq += 'N' * pad
                qual += '>' * pad

            print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

        header = line[1:].strip()
        quals = False
        seqs = True
        qual = seq = ''
    elif line[0] == "+":
        if quals:
            qual += line.strip()
        else:
            quals = True
            seqs = False
    else:
        if quals:
            qual += line.strip()
        else:
            seq += line.strip()

if len(seq) < 100:
    pad = 100-len(seq)
    seq += 'N' * pad
    qual += '>' * pad

print "@" + header + "\n" + seq + "\n" + "+" + header + "\n" + qual

Use it so:

python theScript.py yourFile.fastq > output.fastq

Ram · Answer 5 · 2012-02-08

Here's a pure Bash solution. The code could be factorized and a little bit simplified, but it works.

#! /bin/bash
# name: pad_fastq.sh
# usage: bash pad_fastq.sh myfile.fastq

header=""
seq=""
plus=""
qual=""

while read line ; do
    if [[ ${line:0:1} == "@" ]] ; then
        header="${line}"
    elif [[ ${line:0:1} == "+" ]] ; then
        plus="${line}"
    elif [[ "${plus}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" ">" ; done)
            qual="${line}${padding}"
        else
            qual="${line}"            
        fi
        echo -e "${header}\n${seq}\n${plus}\n${qual}"
        header=""
        seq=""
        plus=""
        qual=""
    elif [[ "${header}" ]] && [[ "${line}" ]] ; then
        length=${#line}
        if [[ "${length}" -lt 100 ]] ; then
            diff=$(( 100 - length ))
            padding=$(for((i=1 ; i<=diff ; i++)) ; do printf "%s" "N" ; done)
            seq="${line}${padding}"
        else
            seq="$line"            
        fi
    fi
done < "${1}"

exit 0