I manually checked the Trimmomatic output and was confused to find that a read from a proper pair had been written to an unpaired output file. Here is my Trimmomatic command line:
java -Xms8g -jar Trimmomatic.jar PE -threads 6 -phred33 sample1_R1.fastq.gz sample1_R2.fastq.gz sample1_forward_paired.fastq.gz sample1_forward_unpaired.fastq.gz sample1_reverse_paired.fastq.gz sample1_reverse_unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 MINLEN:50
Then I manually inspected one read in one of the output files, sample1_forward_unpaired.fastq.gz:
@NB501800:50:H3NW5BGX3:1:11101:4253:1049 1:N:0:TTAGGC
CTCTTNATGACGCTTGTGGAATGTGTCGTTCACATTGTAAGTGATGTCATCAACAATGCACTGATCTCGAAGCTGCGAGTAGGCAATGCATGTCCATTCC
+
AAAAA#AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAAAEEEEEEEEEEEEAEEEEEAEEEEEAEEEEA
Evidently an adapter sequence was trimmed from the original read. However, this read ID is present in both raw input files, sample1_R1.fastq.gz and sample1_R2.fastq.gz.
sample1_R1.fastq.gz:
@NB501800:50:H3NW5BGX3:1:11101:4253:1049 1:N:0:TTAGGC
CTCTTNATGACGCTTGTGGAATGTGTCGTTCACATTGTAAGTGATGTCATCAACAATGCACTGATCTCGAAGCTGCGAGTAGGCAATGCATGTCCATTCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC
+
AAAAA#AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAAAEEEEEEEEEEEEAEEEEEAEEEEEAEEEEAEE/EAEEEEE/AEEEAAEEEEE/AAAAAEAEAEEEEAEAEE/<<<<EEEAA
sample1_R2.fastq.gz:
@NB501800:50:H3NW5BGX3:1:11101:4253:1049 2:N:0:TTAGGC
GGAATGGACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGACATCACTTACAATGTGAACGACACATTCCACAAGCGTCATGAAGAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
+
AAAAAEEEEE#########################################EE<EEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEEAEEEA/AEEEEAEEE/AEEEEEEEEAEAEEEEEEEEEEEE<EAAAAAAAAAEAEEA
With MINLEN:50, I expected any read that is still at least 50 bp long after trimming to be retained. My question is why the trimmed read was assigned to sample1_forward_unpaired.fastq.gz rather than sample1_forward_paired.fastq.gz.
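As a sanity check on the length argument, here is a minimal Python sketch that compares the raw and trimmed R1 reads quoted in this post. The helper name clipped_suffix is made up for illustration, and the sketch assumes MINLEN keeps reads whose length is at least the threshold. It only inspects the strings; it does not reproduce Trimmomatic's clipping logic.

```python
# Sequences copied verbatim from the records quoted in this post.
RAW_R1 = "CTCTTNATGACGCTTGTGGAATGTGTCGTTCACATTGTAAGTGATGTCATCAACAATGCACTGATCTCGAAGCTGCGAGTAGGCAATGCATGTCCATTCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC"
TRIMMED_R1 = "CTCTTNATGACGCTTGTGGAATGTGTCGTTCACATTGTAAGTGATGTCATCAACAATGCACTGATCTCGAAGCTGCGAGTAGGCAATGCATGTCCATTCC"

MINLEN = 50  # threshold from the command line above
ADAPTER_PREFIX = "AGATCGGAAGAGC"  # common start of Illumina TruSeq adapter read-through


def clipped_suffix(raw, trimmed):
    """Return the bases removed from the 3' end (hypothetical helper).

    Assumes the trimmed read is an exact prefix of the raw read.
    """
    if not raw.startswith(trimmed):
        raise ValueError("trimmed read is not a prefix of the raw read")
    return raw[len(trimmed):]


suffix = clipped_suffix(RAW_R1, TRIMMED_R1)

print("raw length:    ", len(RAW_R1))
print("trimmed length:", len(TRIMMED_R1))
print("bases clipped: ", len(suffix))
print("clipped part starts with adapter prefix:", suffix.startswith(ADAPTER_PREFIX))
print("trimmed read passes MINLEN:", len(TRIMMED_R1) >= MINLEN)
```

If the last line prints True, the forward read was not placed in the unpaired file because of its own length, so the cause must lie with its mate.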
The mate is indeed absent from sample1_reverse_paired.fastq.gz, but why was the read from sample1_R2.fastq.gz eliminated entirely? R2 was also contaminated with adapter, so why was it removed rather than trimmed and kept like R1? If it was because of the poor-quality "NNNN..." stretch, I still see many reads containing this kind of stretch in the output.
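For anyone who wants to repeat the lookup without grepping by hand, the following is a rough Python sketch that reports which gzipped FASTQ files contain a given read ID. The function names read_ids and locate_read are made up for illustration, and it assumes plain four-line FASTQ records.

```python
import gzip


def read_ids(path):
    """Yield read IDs (header text up to the first space, without '@').

    Assumes a gzipped FASTQ file with exactly four lines per record.
    """
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            # Header lines are lines 0, 4, 8, ... of the file.
            if i % 4 == 0 and line.strip():
                yield line[1:].split()[0]


def locate_read(read_id, paths):
    """Return the subset of paths whose FASTQ file contains read_id."""
    return [p for p in paths if any(rid == read_id for rid in read_ids(p))]


# Example call with the four output files from the command line above:
# outputs = [
#     "sample1_forward_paired.fastq.gz",
#     "sample1_forward_unpaired.fastq.gz",
#     "sample1_reverse_paired.fastq.gz",
#     "sample1_reverse_unpaired.fastq.gz",
# ]
# print(locate_read("NB501800:50:H3NW5BGX3:1:11101:4253:1049", outputs))
```

This scans each file from the start, so it is slow on large files, but it is enough for spot-checking a handful of reads.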
Thanks for your reply, @Brian Bushnell. My reads are 151 bp long, and I think it is unlikely that this one fell below 50 bp even after adapter trimming. What do you think caused the read to be eliminated?