Hello,
I am trying to remove low-quality bases and 'N's from a FASTQ sequence file. As a beginner Perl programmer I know how to handle FASTA files, but I am not good enough with FASTQ yet, so I can only remove the base but not its quality value. Basically, I am unable to work out the logic for doing it.
@FCC4WLAACXX:3:1101:1697:1896#GCCAATAT/1
NTCGAGGACCTTGGTTGAGCC
+
BP\ccceegggggiiiiiiih
Say I remove the N at the beginning; how do I then remove the B (its quality value) too?
Can anybody help me with this? A Perl solution would be very helpful. Thanks in advance, I appreciate your time.
*takes deep breath* Why reinvent the wheel? Why not use an existing Fastq QC tool?
Sir, I agree with your point, but I am on an assignment. I have to write the code myself, and I may have to adapt it to suit my needs. :(
Then don't you think this is the perfect time to learn? With this kind of assignment you will be forced to learn quickly on your own. Look at regular expressions in Perl. Try some code to do what you want. Try to debug the errors you get. Even if you can't fix them, post the code you have tried so far. Then people will help you develop your code.
Exactly this. Also, first off, I don't see why you'd pick the base first: quality trimming logic says you'd filter by the quality score first. Any working code you manage to write in this assignment may seem to work fine on small files, but will definitely face scalability problems and in all probability will have invisible bugs. Unless you're working on a memory-optimized proof of concept for a trimming tool (which I don't think you are), you're better off using existing tools.
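To make the "filter by the quality score first" point concrete, here is a minimal sketch in Python (chosen only for illustration; the thread asks for Perl, where ord and substr would do the same job). The function name trim_by_quality, the threshold of 20, and the offset of 64 are all assumptions of mine: the quality characters in the example read go up to 'i', which looks like the older Illumina Phred+64 encoding, but check what your own data uses.

```python
def trim_by_quality(seq, qual, min_q=20, offset=64):
    """Clip low-quality bases from both ends of a read.

    offset=64 and min_q=20 are assumed defaults, not values from the thread.
    The same start/end indices are applied to both strings, so the
    sequence and its quality string always stay the same length.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")

    # Decode each quality character into a numeric score.
    scores = [ord(c) - offset for c in qual]

    # Walk inwards from the 5' end until a good base is found.
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1

    # Walk inwards from the 3' end, never crossing 'start'.
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1

    return seq[start:end], qual[start:end]


# The read from the question (raw string so the backslash is kept literally).
seq, qual = trim_by_quality("NTCGAGGACCTTGGTTGAGCC", r"BP\ccceegggggiiiiiiih")
print(seq)
print(qual)
```

Under these assumptions the leading N goes away as a side effect, because its quality character B decodes to a very low score. The T after it is clipped as well, since P decodes to 16, which is under the assumed threshold.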
Sir, basically I can't work out the logic to correlate the same position (the 1st base in the example above) on two separate lines, i.e. the 2nd and 4th. As for learning, I do know a good amount of regex. I said I can handle FASTA files and tab-separated files, but not FASTQ! And well, I did come up with something primitive, but where do I post the code: in the "add reply" section or the "add comment" section? Please tell me if you can.
You're looking at manipulating identical positions in two character arrays (the sequence and quality score strings) in a size-4 array of character arrays (the FASTQ entry). read_as_char_2d_array[1][i] is the i-th base in the first read's sequence, and read_as_char_2d_array[3][i] is the corresponding quality score. That's the logic.
Again, in this case, logic does not matter. Optimized implementation is paramount to usability. Please, please, PLEASE do not write your own FASTQ trimmer. I can say with confidence that you are not equipped to tackle this challenge yet.
Side note: read_as_char_2d_array is a variable name I chose arbitrarily, don't give it too much thought. I have a habit of using really long variable names.
Thank you @Ram for the clarification. Yes, you are right about me not being equipped to tackle this challenge, and that's why I came here :) I'll post the code here; see if I am getting somewhere!
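For anyone landing here later, the index correspondence described above (position i on line 2 matches position i on line 4) can be sketched roughly like this. It is in Python purely to show the logic; in Perl the same idea would be two substr calls with the same offset and length. The function name clip_n_ends is made up for this example, and it only clips Ns at the ends of the read, not in the middle.

```python
def clip_n_ends(record):
    """Clip leading/trailing N bases from one FASTQ record.

    'record' is assumed to be a list of the four lines of a single entry:
    [header, sequence, '+' line, quality]. Indices 1 and 3 are the two
    strings that must be cut at identical positions.
    """
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")

    # Number of Ns at the 5' end = how much lstrip removed.
    start = len(seq) - len(seq.lstrip("Nn"))
    # Position just past the last non-N base; max() covers an all-N read.
    end = max(start, len(seq.rstrip("Nn")))

    # Apply the SAME slice to both strings so they stay aligned.
    return [header, seq[start:end], plus, qual[start:end]]


# The record from the question (raw string keeps the backslash literal).
record = [
    "@FCC4WLAACXX:3:1101:1697:1896#GCCAATAT/1",
    "NTCGAGGACCTTGGTTGAGCC",
    "+",
    r"BP\ccceegggggiiiiiiih",
]
for line in clip_n_ends(record):
    print(line)
```

The point is that you never edit the sequence on its own: work out the start and end positions once, then cut both strings with them. This is only a sketch of the logic for a single record and makes no attempt at being fast on large files.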