Hi colleagues,
Question 1. Trimming adapters when they are known.
Suppose the adapters for my DNA sequencing are as follows:
P5 adaptor: 5' ACACTCTTTCCCTACAC***GACGCTCTTCCGATCT***
P7 adaptor: 5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
We can trim the adapters from the raw FASTQ files with trim_galore
or cutadapt as follows:
cutadapt -a AGATCGGAAGAGCA -g GCTCTTCCGATCT -o sample.trim.fastq sample.raw.fastq
trim_galore --paired -a GATCGGAAGAGCA -a2 GCTCTTCCGATCT --retain_unpaired --trim1 S1.read1.fq S1.read2.fq
However, I am a little confused about why 610,514 reads containing "GATCGGAAGAGCA" can still be found in read2.fastq.
BTW: GATCGGAAGAGCA is the reverse complement of GACGCTCTTCCGATCT.
Any suggestions?
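For reference, a count like the one above can be reproduced with a short shell function. This is only a sketch: the function name `count_reads_with` is made up for illustration, and it assumes an uncompressed FASTQ file with exactly four lines per record. The file name in the example comment is taken from the trim_galore command above.

```shell
# Count reads whose sequence line contains a given substring.
# Assumption: plain-text FASTQ, four lines per record (sequence on line 2 of each record).
count_reads_with() {
  # $1 = sequence to look for, $2 = FASTQ file
  awk -v pat="$1" 'NR % 4 == 2 && index($0, pat) > 0 { n++ } END { print n + 0 }' "$2"
}

# Example (hypothetical file name):
#   count_reads_with GATCGGAAGAGCA S1.read2.fq
```

Because `index` does a literal substring match, this counts exact occurrences only; reads with a partial or mismatched adapter are not counted.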
Question 2. Trimming adapters when they are unknown.
Is there a brute-force method to remove reads that contain adapters, for example by checking every known adapter (Illumina has hundreds of them)? I think these adapter sequences should not occur in the human genome, isn't that right? If so, any read containing such a sequence should be filtered out. beat me beat me beat me!!
No, it's not. That's part of your problem: you're not passing the right sequences as arguments.
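A quick way to check a reverse-complement claim before passing a sequence to a trimmer is to compute it directly. The function below is a minimal sketch using only `tr` and `awk`; the name `revcomp` is made up for illustration, and it handles only A, C, G and T (upper or lower case).

```shell
# Print the reverse complement of a DNA sequence given as $1.
# Assumption: the input contains only A/C/G/T (any case); other symbols pass through unchanged.
revcomp() {
  printf '%s\n' "$1" \
    | tr 'ACGTacgt' 'TGCAtgca' \
    | awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

revcomp GACGCTCTTCCGATCT   # prints AGATCGGAAGAGCGTC
revcomp GATCGGAAGAGCA      # prints TGCTCTTCCGATC
```

The first result is AGATCGGAAGAGCGTC, which is not GATCGGAAGAGCA, so the two sequences in the question are not reverse complements of each other.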