Here I have a few reads with their corresponding quality scores. To my understanding, I translated the quality scores from this fastq file with ord() in Python and converted them to 0s and 1s on the condition that if ord(char) <= 53 it becomes 1, otherwise 0 (roughly the sketch after the reads below). With this method I got all 0s, so does that mean that none of the reads require any trimming? This is, however, just test data. I have a much bigger fastq file, and what if from that file I get something like 111111001101111000000000000...? Is there any rule or condition I should follow for when to trim the ends of a read?
(PS: It's a school project; we need to understand how trimming works before using an existing tool.)
@HWI-EAS384_0000:2:1:1444:905#0/1
NTGTAAAGTTCGATGAGTATTTGCTTTATGGGAGAAATATCCAGCGTTTAGAAAATGTAATTTCAAGGTTACAAC
+HWI-EAS384_0000:2:1:1444:905#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:1629:903#0/1
NCAACACTTTCTGAATATGCCTTCAAAACGTGTATCATGTTGATAAATGCAATATTCCATTTCCCAACAGTGACT
+HWI-EAS384_0000:2:1:1629:903#0/1
BGGKOIJIKJ[YY[Y__________BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:1838:908#0/1
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAAG
+HWI-EAS384_0000:2:1:1838:908#0/1
BKKQKNQNNLWWXWWYYYYYYYYYYXXXXX[[[[[VVVNVTTWRRYYYYY_____BBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2067:910#0/1
NGAAATTTACAAAGAAGAACACGTAATATATTCATAAACGGGGAATTTTCATCAATGGAGACAAAAAATGTCGAC
+HWI-EAS384_0000:2:1:2067:910#0/1
BIIEENNJJN____YIJLKOQQTTNQWNTN_____YYY[Y____W[[Y[[___W_BBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2279:904#0/1
NAATCGTTCTGTTAAATCAATATTCATAAAAGGCACAAATTCATTATCGTTAATTTTTGAACTATGAAGTAATAC
+HWI-EAS384_0000:2:1:2279:904#0/1
BJJNNWWTQT_____WWWWRVTWVWY[YTYOOVVVQQNNQ_____NOROOLIJJQ____Y___W_YWYYYVPVTT
@HWI-EAS384_0000:2:1:2329:907#0/1
NCAGACAGTTCCTTATTTCTGTTCGACTGACTGAAAATTGACTTTTCTACTAGATTTTTCTAATACTTAACTTTG
+HWI-EAS384_0000:2:1:2329:907#0/1
BKHOGJINQLYYYYYYYQQY_____TVVVVXXXRVIJNLK_____YYQQYTPTMT[Y[[[QQ______Y______
@HWI-EAS384_0000:2:1:2464:909#0/1
NTTTAGCCTGGCCCATGGTTCCCAAAAAGCAATACAAAGCTTGGGTCAACTCCAGCCCAGGGTGACCAGAACCCC
+HWI-EAS384_0000:2:1:2464:909#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2603:919#0/1
NTCGTTGCACCATTGCTTTTTGAAAAAGAATGAGTCGACTTTACGAGTTCAATTTAAAGCACAAATTTTTGCACA
+HWI-EAS384_0000:2:1:2603:919#0/1
BRRRRVVWTV_V_____________WVWQQQ________Y_____PVVVWIKQKJXRVXX___V_[[[[[_____
@HWI-EAS384_0000:2:1:2755:912#0/1
NCGAGGGGAAAGGATAAGAAACTTGATCTCACGCCGGAGAAAATAGCAGCCCAGGCTTTTGTCATCTATTTCGGT
+HWI-EAS384_0000:2:1:2755:912#0/1
BQLLNROMJP_____YY[[[QQ___BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
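For clarity, here is a minimal sketch of the conversion I described, next to a plain Phred decoding for comparison (whether the offset should be 33 or 64 for this file is something I have not verified):

def binarize(qual_string, cutoff=53):
    """My rule: 1 if the raw ASCII code is <= cutoff, otherwise 0."""
    return [1 if ord(c) <= cutoff else 0 for c in qual_string]

def phred_scores(qual_string, offset=64):
    """Convert quality characters to Phred scores.
    Offset 33 = Sanger/Illumina 1.8+, offset 64 = older Illumina;
    which one applies here is an assumption on my part."""
    return [ord(c) - offset for c in qual_string]

qual = "BGGKOIJIKJ[YY[Y_____"    # start of the second quality string above
print(binarize(qual))            # all 0s: every character is above chr(53) == '5'
print(phred_scores(qual))        # e.g. 'B' -> 2, '[' -> 27, '_' -> 31 with offset 64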
An approach you can use is a sliding window: if the average quality in a window of N (say 5) nucleotides drops below a cutoff M, you trim the read from that point on. This prevents 'internal' trimming when just one base has a lower quality. Trimmomatic offers this as the SLIDINGWINDOW option.
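A minimal sketch of that idea in Python; the window size, cutoff and the Phred+33 offset used for the demo string are illustrative assumptions, not recommended values:

def sliding_window_trim(seq, qual_string, window=5, cutoff=20, offset=33):
    """Scan from the 5' end and cut the read at the first window whose
    average Phred quality falls below the cutoff.
    Offset 33 is assumed here; older Illumina files use offset 64."""
    quals = [ord(c) - offset for c in qual_string]
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < cutoff:
            return seq[:i], qual_string[:i]   # cut from the failing window onward
    return seq, qual_string                    # no window dropped below the cutoff

seq  = "ACGTACGTACGTACGTACGT"                  # hypothetical read
qual = "IIIIIIIIIIII555555##"                  # high quality, then a degrading 3' end
print(sliding_window_trim(seq, qual))          # -> ('ACGTACGTACGTA', 'IIIIIIIIIIII5')

A real tool would also handle reads shorter than the window and discard reads that become too short after trimming.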
Why not use an existing tool on your original fastq files?
It's a school project; we need to understand how trimming works before using an existing tool.
I don't see the point of switching from Phred quality scores to your 0/1 scores. If you want to trim your sequences, you can use a dedicated tool such as fastp.
It's a school project; we need to understand how trimming works before using an existing tool.
There can be a few different possibilities for how these scores are encoded, depending on how old the data is. Are you using a simple rule where, as soon as you encounter a 0, you trim the rest of the read to the end, or are you going to use something more sophisticated like a sliding-window average?
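On the encoding question, a rough check of which ASCII codes actually occur in the quality strings can hint at the offset; the cutoffs below are a common heuristic, not a guarantee:

def guess_offset(qual_strings):
    """Guess the quality offset from the characters present:
    codes below ';' (59) only occur in Phred+33 data, while strings
    that never go below '@' (64) suggest older Phred+64 data."""
    codes = {ord(c) for q in qual_strings for c in q}
    if min(codes) < 59:
        return 33
    if min(codes) >= 64:
        return 64
    return None          # ambiguous -- check the instrument/pipeline version

quals = ["BBBBBBBBBB", "BGGKOIJIKJ[YY[Y_____"]   # taken from the reads above
print(guess_offset(quals))                       # -> 64 for these reads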