Excuse me for this long post:
I am performing a de novo genome assembly using Illumina paired-end short reads. I am currently at the adapter-trimming stage. Below are the basic statistics and adapter content from the FastQC report for R1.
Raw Reads
Basic statistics of the raw reads
Adapter content of the raw reads
I used Trimmomatic to trim the adapters, with the following settings:
ILLUMINACLIP:~/adapters/TruSeq3-PE.fa:2:30:10 MINLEN:36
Below are the basic statistics and adapter content of the trimmed reads.
The Trimmomatic output was:
Both surviving: 566832403 Forward only surviving: 39244376 Reverse only surviving: 0.00 Dropped reads: <1%
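As a rough check on the read loss, the counts above can be turned into a fraction. This is only a minimal sketch: it assumes reverse-only survivors are zero and ignores the dropped reads, since neither is reported as an exact count, and the function name is hypothetical.

```python
# Counts copied from the Trimmomatic output above.
BOTH_SURVIVING = 566_832_403
FORWARD_ONLY = 39_244_376


def forward_only_fraction(both: int, forward_only: int) -> float:
    """Fraction of surviving read pairs that kept only the forward read.

    Assumption: reverse-only survivors are zero and dropped reads are
    left out, so the denominator counts surviving pairs only.
    """
    surviving = both + forward_only
    return forward_only / surviving


if __name__ == "__main__":
    frac = forward_only_fraction(BOTH_SURVIVING, FORWARD_ONLY)
    print(f"Forward-only share of surviving pairs: {frac:.2%}")
```

The forward-only reads are not lost outright, but they no longer carry pairing information, which is what the question about read loss comes down to.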
My questions are as follows:
Question 1
Can I go ahead with the assembly, given that there is zero adapter content in the trimmed reads? Should I be concerned about the loss of reads?
Question 2
I see over-represented sequences in both read 1 and read 2. I am not sure whether I can leave them in or whether I should trim them too. Can these over-represented sequences be trimmed using Trimmomatic? Any suggestions would be appreciated.
The following are the over-represented sequences for R1
The following are the over-represented sequences for R2
@swbarnes2, thanks for the response. By the way, can you tell me why and how this happens during sequencing, and whether it could negatively affect the assembly?
You should filter out the poly-G reads. As @swbarnes2 said, these represent no signal, i.e. no sequence data. While the assembler may be able to ignore them, there is no point in leaving them in the input data.
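One way to do that filtering is sketched below in Python. This is a minimal illustration, not a recommended pipeline: the function names, the file paths passed in, and the 90% G threshold are all assumptions chosen for the example, and a read is treated as poly-G simply when G makes up at least that fraction of its bases. A pair is dropped if either mate is poly-G, so R1 and R2 stay in sync.

```python
import gzip
from itertools import islice


def open_fastq(path, mode="rt"):
    """Open a FASTQ file, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield FASTQ records as lists of 4 lines (header, seq, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def is_poly_g(seq, min_fraction=0.9):
    """True if at least min_fraction of the bases are G (threshold is an assumption)."""
    seq = seq.strip().upper()
    return len(seq) > 0 and seq.count("G") / len(seq) >= min_fraction


def filter_poly_g_pairs(r1_in, r2_in, r1_out, r2_out, min_fraction=0.9):
    """Write only pairs in which neither mate is poly-G; return (kept, dropped)."""
    kept = dropped = 0
    with open_fastq(r1_in) as f1, open_fastq(r2_in) as f2, \
            open_fastq(r1_out, "wt") as o1, open_fastq(r2_out, "wt") as o2:
        for rec1, rec2 in zip(read_fastq(f1), read_fastq(f2)):
            if is_poly_g(rec1[1], min_fraction) or is_poly_g(rec2[1], min_fraction):
                dropped += 1
                continue
            o1.writelines(rec1)
            o2.writelines(rec2)
            kept += 1
    return kept, dropped
```

It would be called on the paired Trimmomatic outputs, for example `filter_poly_g_pairs("R1_paired.fq.gz", "R2_paired.fq.gz", "R1_filtered.fq.gz", "R2_filtered.fq.gz")`, where those file names are placeholders.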