It seems that I have misunderstood the paper (Awkward).
For NGS gDNA sequencing (assembly/mapping), I normally trim off the adaptor sequences when they are present. This paper (https://www.nature.com/articles/nsmb.2660) removed all the reads that aligned to adaptors.
I am confused about the motivation for removing all the reads that aligned to adaptors. Presumably, it leaves a low read count after QC.
I am not sure whether read length, platform, or library type determines whether to remove just the adaptor sequences or the whole reads. Could anyone advise, please? Thank you.
That is pretty unconventional. What if the adapter content is only about 10 bp on a 100 bp read? No aligner in the world would align those 10 bp, even in local mode; the match is just too short. Removing whole reads would only make sense if the entire read was basically adapter, and that should not happen in a proper library. I would simply trim the adapters off, as everyone else does. If a read is too short after trimming, discard it; most trimmers have a minimum-length option.
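To make the trim-then-discard idea concrete, here is a minimal sketch in Python. It is not how any particular trimmer works internally: the function names, the `min_overlap` and `min_length` parameters, and the adapter string are all assumptions chosen for illustration, and it only handles exact matches at the 3' end.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read (exact matching only; illustrative)."""
    # Case 1: the full adapter occurs inside the read -> cut from there.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter is present at the read's 3' end.
    # Try the longest adapter prefix first, down to min_overlap bases.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def trim_and_filter(reads, adapter, min_length=20):
    """Trim every read, then drop reads shorter than min_length."""
    kept = []
    for read in reads:
        trimmed = trim_adapter(read, adapter)
        if len(trimmed) >= min_length:
            kept.append(trimmed)
    return kept


# Hypothetical example data: the adapter string is only for illustration.
ADAPTER = "AGATCGGAAGAGC"
clean_read = "ACGT" * 25                          # 100 bp, no adapter
partial_read = "ACGT" * 22 + "AC" + ADAPTER[:10]  # 90 bp insert + 10 bp adapter
dimer_like = "ACGTAC" + ADAPTER                   # almost all adapter

kept = trim_and_filter([clean_read, partial_read, dimer_like], ADAPTER)
print([len(r) for r in kept])
```

In this toy input, the read with 10 bp of adapter is kept after trimming, and only the read that is almost entirely adapter is dropped by the minimum-length filter.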
Where does it say in this paper that they removed all reads that aligned to adaptors? I briefly read through the methods section and there is nothing about this there. Edit: This information was further down in a different section, as pointed out by OP. I pasted part of the method below.
"Deep sequencing and quality control (QC). The libraries were sequenced on the Illumina HiSeq 2000 platform using the 100-bp single-end sequencing strategy.
In total, we generated 438 Gb (raw data) for 124 single-cell cDNA samples. The original image data generated by the sequencing machine were converted into sequence data via base calling (Illumina pipeline CASAVA v1.8.0) and then subjected to standard QC criteria to remove all of the reads that fit any of the following parameters:
1 The reads that aligned to adaptors or primers with no more than two mismatches.
2 The reads with more than 10% unknown bases (N bases).
3 The reads with more than 50% of low-quality bases (quality value ≤ 5) in one read.
... ... " Hope this helps.
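As a rough illustration of criteria 2 and 3 in the quoted list, the per-read checks could look like the Python sketch below. This is not the authors' pipeline: the function names are made up, and Phred+33 quality encoding is an assumption. Criterion 1 (adapter/primer matching) is not covered here.

```python
def fraction_n(seq):
    """Fraction of unknown (N) bases in a read."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def fraction_low_quality(quals, threshold=5):
    """Fraction of bases with quality value <= threshold."""
    return sum(q <= threshold for q in quals) / len(quals) if quals else 0.0


def phred33_to_scores(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def passes_qc(seq, qual_string, max_n=0.10, max_low_q=0.50, low_q=5):
    """Return False if the read has more than 10% N bases or more than
    50% low-quality bases (thresholds taken from the quoted methods)."""
    if fraction_n(seq) > max_n:
        return False
    quals = phred33_to_scores(qual_string)
    if fraction_low_quality(quals, low_q) > max_low_q:
        return False
    return True


# Toy 10 bp reads; 'I' decodes to Q40 and '#' to Q2 under Phred+33.
print(passes_qc("ACGTACGTAC", "IIIIIIIIII"))  # good read
print(passes_qc("ACGTNNACGT", "IIIIIIIIII"))  # 20% N bases
print(passes_qc("ACGTACGTAC", "######IIII"))  # 60% low-quality bases
```

Note that both thresholds are strict ("more than"), so a read with exactly 10% N bases or exactly 50% low-quality bases is kept in this sketch.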
An entire read matching adaptor/primer sequence with no more than 2 mismatches indicates that the read likely came from a primer dimer or from a very short insert with read-through into the adaptor. Removing such reads is what a scan/trim program would normally do anyway.
That is not a low number of reads left after QC.
Thanks. I got you. I misunderstood the paper. I thought it meant that the alignment has no more than two mismatches.
You understood it right. An alignment with "no more than two mismatches" means that the read matched the primer/adapter sequence perfectly at the remaining N-2 (or more) bases, where N is the aligned length.
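The arithmetic can be sketched in a few lines of Python. This is a simplification that assumes an ungapped, full-length comparison between the read and the adapter/primer; the function names and the reference string are hypothetical, and real aligners also handle gaps and partial overlaps.

```python
def count_mismatches(read, reference):
    """Number of mismatching positions between two equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(read, reference))


def is_adapter_read(read, reference, max_mismatches=2):
    """True if the read matches the reference at >= N - max_mismatches bases."""
    return count_mismatches(read, reference) <= max_mismatches


# Hypothetical 13 bp reference, so N = 13 and N - 2 = 11.
reference = "AGATCGGAAGAGC"
two_off = "AGTTCGGAAGAGA"    # 2 substitutions -> 11 matching bases
three_off = "TGTTCGGAAGAGA"  # 3 substitutions -> 10 matching bases

print(count_mismatches(two_off, reference), is_adapter_read(two_off, reference))
print(count_mismatches(three_off, reference), is_adapter_read(three_off, reference))
```

With two substitutions the read still counts as an adapter read and would be removed; with three it would not.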