I would like to parse a fastq file as follows:
Nominal read length is 100, but some reads are shorter than others. In that case I want to pad these reads with N's and their corresponding quality values as well. So in the end all reads should have the same length.
Any idea how to do this most quickly? Shell scripting or perl?
Or does somebody knows a tool who already offers something along these lines?
Thanx!
I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.
Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.
As long as the sequences in the fastq file do not span multiple lines you can get by with a very elegant and fast python solution that uses iterators. This avoids all conditionals and other complexities, it could look like this:
import itertools, string, sys
# strip newlines from the standard input
stream = itertools.imap(string.strip, sys.stdin)# this works only if sequences span only one lineforheadin stream:
# advance the streamseq= stream.next()
tmp = stream.next()
qual = stream.next()# this is how much to pad
size = 100 - len(seq)# pad each lineseq=seq + "N" * size
qual = qual + "2" * size
# print the joined elements
print "\n".join((head, seq, tmp, qual))
Very short and elegant solution! Only 4 lines at a time are loaded in memory? It reads from the stdin, so I can use it like this? python awesome.py < myfile.fastq > mypaddedfile.fastq
I forgot the exact reason and it may not be applicable here - I think the rationale in general when generating qualities is that some tools ignore bases that are very low in quality so we need to avoid making them too low. But then why not make them maximal? There is a reason there too, a quality such as say E can come from different types of encoding, so using that makes the encoding the file ambiguous so that too may be a problem. Picking 2 seems like a good choice between indicating Sanger encoding and avoiding bases being ignored.
Is the awk version correct? It pads at the beginning of the line, not at the end. Plus, quality and sequence paddings seem to be inverted. Rok can you edit your post, please. I propose this corrected solution: awk 'NR%4==0 {print $0 t}; NR%4==2 {while(a++<100-length($0)){s=s "N"; t=t "Q"};print $0 s}; NR%2==1 {print}' file.fastq
Here's something in python. Doesn't require biopython. It's pretty ugly, but should get the job done:
import sys
inFile =open(sys.argv[1],'r')
header = seq = qual =''
seqs = quals =Falsefor line in inFile:if line[0]=="@":if header !='':iflen(seq)<100:
pad =100-len(seq)
seq +='N'* pad
qual +='>'* pad
print"@"+ header +"\n"+ seq +"\n"+"+"+ header +"\n"+ qual
header = line[1:].strip()
quals =False
seqs =True
qual = seq =''elif line[0]=="+":if quals:
qual += line.strip()else:
quals =True
seqs =Falseelse:if quals:
qual += line.strip()else:
seq += line.strip()iflen(seq)<100:
pad =100-len(seq)
seq +='N'* pad
qual +='>'* pad
print"@"+ header +"\n"+ seq +"\n"+"+"+ header +"\n"+ qual
I would recommend not to overuse shell scripts for complex tasks that cannot be solved with several unix commands. I used to do that for some time, but realized that it is a bad idea. Perl/Python is more portable, more convenient, cleaner and perhaps faster.
this is just nuts, just contact the RUM people about this issue
Is Python OK for you? Do you have a particular quality score in mind for the padded N bases?
Actually I would prefer a cute shell script ;) But I know perl/python would be much better. The shorter reads have different lengths, I have to fill them up to their original length, so 100.
Ah sorry, I did not answer the quality score question: I think it does not matter, just put anything (valid).
I have done this, but til they have checked this I just want to go on with the data analysis...
Can this be done with fasta files?