Retrieving the base and its quality from a FASTQ file
8.9 years ago
SJ Basu ▴ 60

Hello,

I am trying to remove low-quality bases and 'N's from a FASTQ sequence file. As a beginner Perl programmer I know how to handle FASTA files, but I am not as comfortable with FASTQ, so I can remove a base but not its quality value. Basically, I can't work out the logic for doing it.

@FCC4WLAACXX:3:1101:1697:1896#GCCAATAT/1
NTCGAGGACCTTGGTTGAGCC
+
BP\ccceegggggiiiiiiih

Say I remove the N at the beginning; how do I then remove the B (its quality value) too?

Can anybody help me with this? A Perl solution would be very helpful. Thanks in advance, I appreciate your time.

RNA-Seq perl sequencing • 2.5k views

*takes deep breath* Why reinvent the wheel? Why not use an existing Fastq QC tool?


Sir, I agree with your point, but this is an assignment: I have to write the code myself, and I may have to adapt it to suit my needs. :(


Then don't you think this is the perfect time to learn? With these kinds of assignments you are forced to learn quickly on your own. Look at regular expressions in Perl. Try some code to do what you want, and try to debug the errors you get. Even if you can't fix them, post the code you have tried so far; then people will help you develop it.

ADD REPLY
0
Entering edit mode

Exactly this. Also, first off, I don't see how you'd pick the base first; quality trimming logic says you'd filter by the quality score first. Any code you manage to write for this assignment may seem to work fine on small files, but it will definitely face scalability problems and in all probability will have invisible bugs. Unless you're working on a memory-optimized proof of concept for trimming tools (which I don't think you are), you're better off using existing tools.
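For illustration only, here is a minimal Python sketch of the quality-first idea: decode each quality character to a score, then trim both ends of the read while the score is below a threshold, cutting the sequence and quality strings at the same positions. The function name `quality_trim`, the threshold of 20, and the Phred+64 offset are assumptions (the offset is inferred from characters such as `h` in the example read), and this is not a substitute for an established trimmer.

```python
def quality_trim(seq, qual, min_q=20, offset=64):
    """Trim low-quality bases from both ends of a read.

    Hypothetical helper: min_q and offset are illustrative assumptions.
    Returns the trimmed (sequence, quality) pair, cut at identical positions.
    """
    # Decode each quality character into a numeric score.
    scores = [ord(c) - offset for c in qual]
    start = 0
    # Guard against a malformed record whose two strings differ in length.
    end = min(len(seq), len(qual))
    # Walk inwards from the 5' end while the quality is too low.
    while start < end and scores[start] < min_q:
        start += 1
    # Walk inwards from the 3' end while the quality is too low.
    while end > start and scores[end - 1] < min_q:
        end -= 1
    # The same slice is applied to both strings, so they stay in sync.
    return seq[start:end], qual[start:end]


# The read from the question. With offset 64, 'B' decodes to 2 and 'P' to 16,
# so the first two positions fall below the threshold of 20.
seq = "NTCGAGGACCTTGGTTGAGCC"
qual = "BP\\ccceegggggiiiiiiih"
trimmed_seq, trimmed_qual = quality_trim(seq, qual)
print(trimmed_seq)
print(trimmed_qual)
```

Because the decision is made on the quality string and the cut is applied to both strings, the leading N goes away together with its B without ever looking at the base itself.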


Sir, basically I can't form the logic to correlate the same position (the 1st base in the example above) on two separate lines, i.e. the 2nd and the 4th. As for learning, I do know a good amount of regex. As I said, I can handle FASTA files and tab-separated files, but not FASTQ! I did come up with something primitive, though. Where do I post the code: in the "add reply" section or the "add comment" section? Please tell me if you can.


You're looking at manipulating identical positions in two character arrays (the sequence and quality score strings) in a size-4 array of character arrays (the fastq entry). read_as_char_2d_array[1][i] is the ith base in the read's sequence, read_as_char_2d_array[3][i] is the corresponding quality score. That's the logic.

Again, in this case, logic does not matter. Optimized implementation is paramount to usability. Please, please PLEASE do not write your own fastq trimmer. I can say with confidence that you are not equipped to tackle this challenge yet.

Side note: read_as_char_2d_array is a variable name I chose arbitrarily, don't give it too much thought. I have a habit of using really long variable names.
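The indexing described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `read_fastq` and `trim_leading_n` are made up here, each record is assumed to span exactly four lines with no wrapping or blank lines, and only leading N bases are handled. The same idea carries over to Perl by applying `substr` with the same offset to both strings.

```python
import io


def read_fastq(handle):
    """Yield FASTQ entries as lists of four lines.

    Hypothetical reader: assumes exactly four lines per record.
    Index 1 is the sequence, index 3 is the quality string.
    """
    while True:
        entry = [handle.readline().rstrip("\n") for _ in range(4)]
        if not entry[0]:  # end of input
            return
        yield entry


def trim_leading_n(entry):
    """Drop leading 'N' bases and the quality characters at the same positions."""
    header, seq, plus, qual = entry
    i = 0
    # entry[1][i] is the i-th base; entry[3][i] is its quality character.
    while i < len(seq) and seq[i] == "N":
        i += 1
    # Cut both strings at the same index so they stay aligned.
    return [header, seq[i:], plus, qual[i:]]


# The read from the question, fed through an in-memory file handle.
fastq_text = "\n".join([
    "@FCC4WLAACXX:3:1101:1697:1896#GCCAATAT/1",
    "NTCGAGGACCTTGGTTGAGCC",
    "+",
    "BP\\ccceegggggiiiiiiih",
]) + "\n"

trimmed = [trim_leading_n(e) for e in read_fastq(io.StringIO(fastq_text))]
for entry in trimmed:
    print("\n".join(entry))
```

Reading one record at a time through a generator keeps memory use flat regardless of file size, but it does nothing about the other pitfalls mentioned above, so treat it as a way to understand the logic, not as a finished trimmer.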


Thank you @Ram for the clarification. Yes, you are right that I am not equipped to tackle this challenge, and that's why I came here :) I'll post the code here so you can see whether I am getting somewhere!

8.9 years ago

You can convert this Python script into Perl: Illumina Trimming Algorithm


Well, ma'am, I really appreciate your help, but I am a Perl beginner myself, and on top of that, interpreting the aforementioned Python script is really difficult! :(


Uff. Write some code and post what you have tried.


I'd remove the "uff" :) Everyone here gets it, believe me :)
