In the best practices for short variant discovery from DNA sequencing, there is no mention of read cleaning (removing adapters, low-quality reads, short reads, etc.). I wonder whether that is because such a step is not necessary, or because the reads are assumed to be already cleaned?
If it's not necessary, is it because the bad reads will be filtered out after the mapping step?
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
SUBMIT ANSWER is for new answers to original question.
Any good bioinformatician will run a QC before doing anything and perform cleaning if needed. I think it is omitted because of that, and because recent Illumina protocols return cleaned reads. I can't remember the last time I had to clean Illumina data since the HiSeq 2000.
That is only true if your sequencing facility pre-processes/trims the data (generally noticeable when not all reads are as long as the number of cycles sequenced). I believe that if the sequencing provider uses BaseSpace for processing, adapter trimming may be done by default.
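If you want a quick way to check the "not all reads = length of sequencing" hint, something like the following may help. It is only a rough sketch: it assumes an uncompressed FASTQ with exactly four lines per record, and the function names are made up for illustration. Mixed lengths suggest, but do not prove, that trimming was done upstream.

```python
from collections import Counter
import io

def read_length_counts(fastq_handle):
    """Count read lengths in a 4-line-per-record FASTQ text stream."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # second line of each record is the sequence
            counts[len(line.rstrip())] += 1
    return counts

def looks_trimmed(counts):
    """More than one distinct read length hints at upstream trimming."""
    return len(counts) > 1

# Tiny in-memory example; for a real file, pass open("reads.fastq") instead
# (the file name is hypothetical).
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
)
counts = read_length_counts(example)
print(dict(counts), looks_trimmed(counts))
```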
Most aligners will soft-clip the parts of a read that do not map, so that may be one of the reasons people seem to omit scanning/trimming of data. Trimming is a must, though, if you are planning to do any de novo analysis at any point in the data life cycle.
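To see how much an aligner actually soft-clipped, you can look at the CIGAR string of each SAM record: soft-clipped bases stay in the record's sequence but are marked with `S`. Here is a minimal sketch in plain Python (no pysam or other library assumed; the function name is hypothetical) that pulls the clip lengths at both ends of a CIGAR string.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_bases(cigar):
    """Return (left_clip, right_clip) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside the soft clips, so drop them first.
    ops = [o for o in ops if o[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

print(soft_clipped_bases("5S90M6S"))
print(soft_clipped_bases("101M"))
```

Summing these over a BAM/SAM file gives a rough idea of how much adapter or junk sequence the aligner is quietly ignoring.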
Thanks @genomax for the insight! I'm checking out the fastp tool, but am a little worried that it may be a bit excessive in cleaning and correction, since it not only removes parts of the reads but also makes changes to the remaining bases.
"since it not only removes parts of the reads but also makes changes to the remaining bases"
A scan/trim program should match and delete only bases that match a set of reference sequences (adapters) that you provide, within parameters that you control (e.g. mismatches allowed, initial match length, etc.). No data should be changed otherwise. The corresponding Q-score characters are also removed to keep the fastq record intact. If you choose to do Q-score-based trimming, that too will only delete bases, not change any of the bases that remain.
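To make the "delete only, never change" point concrete, here is a toy sketch of 3' adapter trimming and Q-score tail trimming. It is not how any particular trimmer is implemented; the function names, the default parameters, and the Phred+33 encoding are all assumptions for illustration. Note that both functions only slice the sequence and quality strings, so the bases that remain are untouched.

```python
def find_adapter(seq, adapter, min_overlap=3, max_mismatch_rate=0.1):
    """Return the index where a 3' adapter (possibly partial) starts, or None."""
    for start in range(len(seq) - min_overlap + 1):
        n = min(len(adapter), len(seq) - start)  # bases that can be compared
        mismatches = sum(
            1 for a, b in zip(seq[start:start + n], adapter[:n]) if a != b
        )
        if mismatches <= int(max_mismatch_rate * n):
            return start
    return None

def trim_adapter(seq, qual, adapter, **kwargs):
    """Cut the read (and its Q-score string) at the adapter start, if found."""
    pos = find_adapter(seq, adapter, **kwargs)
    return (seq, qual) if pos is None else (seq[:pos], qual[:pos])

def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

read = "ACGTACGTTTGGAGATCGGAAG"   # last 10 bases are the start of the adapter
qual = "I" * len(read)
print(trim_adapter(read, qual, "AGATCGGAAGAGC"))
print(trim_low_quality_tail("ACGTAC", "IIII##"))
```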
From what I read briefly, it seems that fastp, for paired-end sequencing, will change a base to the corresponding base in the other read of the pair if that base has a much higher quality score.
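For anyone unsure what that kind of correction means, here is a toy sketch of the general idea. This is not fastp's actual algorithm: the function names, the fixed `offset` argument, the quality-gap threshold of 20, and the Phred+33 encoding are all assumptions made for illustration. It only touches positions where the two reads overlap and disagree.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def correct_overlap(r1, q1, r2, q2, offset, min_gap=20, phred=33):
    """Fix mismatches in the R1/R2 overlap using the higher-quality base.

    offset is the index in r1 where revcomp(r2) starts (assumed known, >= 0).
    A base is replaced only when the mate's base is at least min_gap Phred
    units better. Returns the corrected (r1, r2).
    """
    r2rc, q2rev = list(revcomp(r2)), q2[::-1]
    r1 = list(r1)
    for i in range(offset, len(r1)):
        j = i - offset
        if j >= len(r2rc):
            break
        if r1[i] != r2rc[j]:
            a, b = ord(q1[i]) - phred, ord(q2rev[j]) - phred
            if b - a >= min_gap:
                r1[i] = r2rc[j]      # trust R2 at this position
            elif a - b >= min_gap:
                r2rc[j] = r1[i]      # trust R1 at this position
    return "".join(r1), revcomp("".join(r2rc))

# 14 bp fragment ACGTACGTACGGTT sequenced with 10 cycles from each end;
# R1 carries a low-quality error (G->T) at index 6, inside the overlap.
r1, q1 = "ACGTACTTAC", "IIIIII#III"
r2, q2 = "AACCGTACGT", "IIIIIIIIII"
print(correct_overlap(r1, q1, r2, q2, offset=4))
```

Bases outside the overlap are never modified, which is why this only matters when R1 and R2 actually overlap.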
Paired-end reads overlap only when one deliberately designs them to do so (e.g. 16S sequencing) or when the insert happens to be shorter than the number of cycles sequenced. Even then, a special read-merging program needs to be used to do that merging (not sure if fastp does read merging as an option).
Does this apply in your case? Otherwise there should be no overlap between R1/R2.
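A back-of-the-envelope way to answer that for your own library is sketched below. It assumes "insert" means the fragment between the adapters and that both reads are sequenced for the same number of cycles; the function name is made up.

```python
def pair_overlap(insert_len, read_len):
    """Return (overlap_bp, adapter_readthrough_bp_per_read) for one read pair.

    R1 covers [0, min(read_len, insert_len)) of the insert and
    R2 covers [max(0, insert_len - read_len), insert_len).
    """
    r1_end = min(read_len, insert_len)
    r2_start = max(0, insert_len - read_len)
    overlap = max(0, r1_end - r2_start)
    readthrough = max(0, read_len - insert_len)  # cycles that run into adapter
    return overlap, readthrough

print(pair_overlap(500, 150))  # long insert: no overlap
print(pair_overlap(250, 150))  # insert shorter than 2 x read length
print(pair_overlap(100, 150))  # insert shorter than one read
```

So with 2 x 150 reads, overlap only starts to appear once inserts drop below 300 bp, and adapter read-through once they drop below 150 bp.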
Thanks for the comment, @JC! Even if those bad reads are kept, won't they be filtered out after mapping?
Is there any study demonstrating a difference in variant calling accuracy with and without the cleaning step?
Here I'd like to distinguish QC from cleaning: QC only inspects the data, whereas cleaning changes the raw data.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.
SUBMIT ANSWER is for new answers to original question.
See my additional comment below @JC's answer.