I would like to parse a fastq file as follows:
Nominal read length is 100, but some reads are shorter than others. In that case I want to pad these reads with N's and their corresponding quality values as well. So in the end all reads should have the same length.
Any idea how to do this most quickly? Shell scripting or perl?
Or does somebody knows a tool who already offers something along these lines?
Thanx!
ADD COMMENT
• link
updated 2.3 years ago by
Ram
44k
•
written 12.9 years ago by
Steffi
▴
580
1
Entering edit mode
I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.
Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.
As long as the sequences in the fastq file do not span multiple lines you can get by with a very elegant and fast python solution that uses iterators. This avoids all conditionals and other complexities, it could look like this:
import itertools, string, sys
# strip newlines from the standard input
stream = itertools.imap(string.strip, sys.stdin)
# this works only if sequences span only one line
for head in stream:
# advance the stream
seq = stream.next()
tmp = stream.next()
qual = stream.next()
# this is how much to pad
size = 100 - len(seq)
# pad each line
seq = seq + "N" * size
qual = qual + "2" * size
# print the joined elements
print "\n".join( (head, seq, tmp, qual) )
Very short and elegant solution! Only 4 lines at a time are loaded in memory? It reads from the stdin, so I can use it like this? python awesome.py < myfile.fastq > mypaddedfile.fastq
I forgot the exact reason and it may not be applicable here - I think the rationale in general when generating qualities is that some tools ignore bases that are very low in quality so we need to avoid making them too low. But then why not make them maximal? There is a reason there too, a quality such as say E can come from different types of encoding, so using that makes the encoding the file ambiguous so that too may be a problem. Picking 2 seems like a good choice between indicating Sanger encoding and avoiding bases being ignored.
Is the awk version correct? It pads at the beginning of the line, not at the end. Plus, quality and sequence paddings seem to be inverted. Rok can you edit your post, please. I propose this corrected solution: awk 'NR%4==0 {print $0 t}; NR%4==2 {while(a++<100-length($0)){s=s "N"; t=t "Q"};print $0 s}; NR%2==1 {print}' file.fastq
I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.
this is just nuts, just contact the RUM people about this issue
Is Python OK for you? Do you have a particular quality score in mind for the padded N bases?
Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.
Ah sorry, I did not answer the quality score question: I think it does not matter, just put anything (valid).
I have done this, but til they have checked this I just want to go on with the data analysis...
Can this be done with fasta files?