I am fairly new to PacBio data analysis and I have a question:
Why do I need to extract the HiFi reads? What do the other reads, those with a mean quality below Q20, represent? Should I simply ignore them? If I include them in the analysis, will the results still be reliable?
This really all depends on what you plan to do with the data downstream. Most current downstream applications expect HiFi (>=Q20) data. For instance, if you're going to generate a _de novo_ assembly with hifiasm or call small variants with DeepVariant, including the <Q20 reads will cause problems with accuracy, memory usage, and runtime. For detecting structural variation with pbsv, if you use the correct parameters, you might get some added value from the <Q20 reads.
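To make the Q20 cutoff concrete, here is a minimal sketch of how a predicted read accuracy maps to a Phred-scaled quality, and how reads could be split at the threshold. It assumes you have already pulled a per-read accuracy value (for example from the `rq` tag in a CCS BAM) into plain Python tuples; the function names and the example reads are made up for illustration and are not part of any PacBio tool.

```python
import math


def accuracy_to_phred(rq):
    """Convert a predicted read accuracy (0..1) to a Phred-scaled quality."""
    if not 0.0 <= rq <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if rq == 1.0:
        # A perfect predicted accuracy has no finite Phred value.
        return math.inf
    return -10.0 * math.log10(1.0 - rq)


def split_by_quality(reads, min_q=20):
    """Split (name, accuracy) pairs into reads at or above min_q and reads below it."""
    # Q20 corresponds to an error probability of 0.01, i.e. accuracy >= 0.99.
    # Comparing accuracies avoids rounding issues from the log at the boundary.
    min_accuracy = 1.0 - 10 ** (-min_q / 10)
    hifi = [name for name, rq in reads if rq >= min_accuracy]
    low = [name for name, rq in reads if rq < min_accuracy]
    return hifi, low


# Hypothetical reads: (read name, predicted accuracy)
reads = [("read_a", 0.9995), ("read_b", 0.95), ("read_c", 0.999), ("read_d", 0.90)]
hifi, low = split_by_quality(reads)

for name, rq in reads:
    print(f"{name}: accuracy={rq}, Q={accuracy_to_phred(rq):.1f}")
print("HiFi (>=Q20):", hifi)
print("Below Q20:", low)
```

In this sketch, the reads that end up in `low` are the ones you would leave out of an assembly or small-variant workflow and only consider for something like structural variant detection.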
Thanks for your answer! But why are there so many reads with Q<20? I wasn't able to find this information. Also, I would like to look at the repeat regions, and I think that if I include the Q<20 reads I will run into memory usage problems.