Hello
I have Illumina-sequenced FASTQ files. Virtually every read (although not all) starts with the triplet "TAA". I assumed these were adapters. However, when I use Trimmomatic with:
ILLUMINACLIP:Trimmomatic-0.39/adapters/TruSeq2-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 MINLEN:36
these triplets still remain. Can someone please advise what they are and how they should be dealt with?
Thanks
C
EDIT:
I see no mention of what adapters were used but the report doc states: "As for the sequencing of GBS library, the sequenced reads of 144 bp at either end are adapter-free, which could be directly subjected to quality control for low quality reads filtration. The retaining sequences in 144 bp length (namely clean data) are qualified for mapping with the reference genome". So I am puzzled why these motifs are so prevalent.
By 'start', do you mean that it is present at the 5' end of the reads?
If so, it would be really strange for these to be adapters, as adapters typically do not occur at the 5' end of a read (given the sequencing protocol, it is practically impossible to see them there).
So I suspect something else might be going on. Can you provide numbers on this? How many reads have this ...
Also, are you sure you need the TruSeq2 adapter set? If I'm not mistaken, that was the adapter set for Illumina GAII sequencers, so unless your data come from (the old) GAII machines, you more likely need the TruSeq3 set (though that makes no difference for the 5' end issue).
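To put a number on how many reads carry the triplet, something along these lines could be used. It is only a minimal sketch: it assumes standard 4-line FASTQ records, and the names `open_fastq`, `prefix_fraction` and the file name in the usage comment are made up for illustration.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def prefix_fraction(lines, prefix="TAA"):
    """Return (reads_with_prefix, total_reads).

    Assumes 4-line FASTQ records, so the sequence is every 4th line
    starting from the 2nd line of the input.
    """
    hits = total = 0
    for seq in islice(lines, 1, None, 4):
        total += 1
        if seq.startswith(prefix):
            hits += 1
    return hits, total


# Usage (hypothetical file name):
# with open_fastq("reads_R1.fastq.gz") as fh:
#     hits, total = prefix_fraction(fh, "TAA")
#     print(f"{hits} of {total} reads start with TAA")
```

Running it on both the raw and the trimmed files would also show directly whether trimming changed anything at the 5' end.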
Hi, I calculate they're present in around 97% of reads. Yes, at the 5' end. I tried with TruSeq3 and they're still there. Thanks
Yes, because adapter trimming will (normally) not remove anything from the 5' end of a read.
Do you have any idea how the fragmentation was done? Was a random protocol used, or was there some other manipulation involved?
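While waiting for the protocol details, a quick look at which bases the reads start with may give a hint. With random fragmentation I would expect the read starts to be spread over many different triplets, whereas a single dominant one points to something systematic in the library prep. A rough sketch, again assuming 4-line FASTQ records; `five_prime_kmers` is just an illustrative name:

```python
from collections import Counter
from itertools import islice


def five_prime_kmers(lines, k=3):
    """Tally the first k bases of every read in 4-line FASTQ input."""
    counts = Counter()
    # The sequence is the 2nd line of each 4-line record.
    for seq in islice(lines, 1, None, 4):
        seq = seq.strip()
        if len(seq) >= k:  # skip reads shorter than k
            counts[seq[:k].upper()] += 1
    return counts


# Usage (hypothetical file name):
# with open("reads_R1.fastq") as fh:
#     for kmer, n in five_prime_kmers(fh, k=3).most_common(5):
#         print(kmer, n)
```

Increasing `k` a little (say to 6 or 8) would show whether the shared motif is really only three bases long or part of something longer.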
I will try to find out - thanks for your input