Hello everyone,
I have approx. 30X whole-genome data from 5 human samples (Illumina HiSeq 2500, PE sequencing) and want to run Trimmomatic just to remove poor-quality bases/reads. According to the FastQC output the data look good, with warnings only in 'Per Base Sequence Content' and 'Per Sequence GC Content' and a failure in 'K-mer Content'. I am not worried about the K-mer content; my aim is simply to remove the poor-quality reads. For that purpose, I ran a command similar to this one:
java -jar trimmomatic-0.30.jar PE \
-phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:36
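For reference, my understanding of the two steps I am unsure about, based on the Trimmomatic manual (please correct me if I'm wrong):

ILLUMINACLIP:TruSeq3-PE.fa:2:30:10
# TruSeq3-PE.fa = file of adapter sequences to clip
# 2  = maximum seed mismatches allowed
# 30 = palindrome clip threshold (adapter read-through in a pair)
# 10 = simple clip threshold
SLIDINGWINDOW:4:15
# scan in 4-base windows, cut once the average window quality drops below 15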
After running the command, each sample's overall folder size dropped by approx. 10-13 GB, and the quality improved as per the FastQC plots. I want to be sure this is the standard way of using Trimmomatic to improve reads from human WGS data. Down the line, my aim is to call sequence variants/SNPs in these samples and study their correlation with human disease progression. In particular, I want to be sure about the parameters of ILLUMINACLIP and SLIDINGWINDOW. Please give me your suggestions. Thank you.
What's the number of reads before and after trimming? Similarly, what's the number of bp before and after?
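If you don't have the trim log, you can get these numbers straight from the FASTQ files; a quick sketch for one file (swap in each of your inputs and outputs):

zcat input_forward.fq.gz | awk 'NR % 4 == 2 {reads++; bases += length($0)} END {print reads" reads, "bases" bases"}'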
Hi, I don't have the log file from that run; I'll rerun it and then update with more precise numbers. Thank you.
Are these the optimal parameters for a WGS run of human data at approx. 30X coverage when the goal is to look for genetic variants/SNPs? Should I move ahead with the 'both surviving' reads only, or are the 'forward only surviving'/'reverse only surviving' reads also of some use? Thank you, Regards, Ravi.
You used the default parameters and I think they should be good enough. Frankly speaking, although trimming is important to increase the percentage of mapped reads and to reduce false positives in variant calling, I would not be overly concerned about the exact trimming parameters. You can also control the quality of the bases used for variant calling in samtools mpileup or GATK HaplotypeCaller.
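For example, a minimal sketch of enforcing quality cutoffs at the calling step with the samtools/bcftools pipeline (ref.fa and sample.bam are placeholders, and the thresholds of 20 are illustrative, not a recommendation):

# -q = minimum mapping quality, -Q = minimum base quality
samtools mpileup -q 20 -Q 20 -uf ref.fa sample.bam | bcftools call -mv > sample_variants.vcf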
94% of your read pairs have both reads surviving. Because you are not losing much by ignoring the singleton reads, I would suggest you go with the 'both surviving' reads to keep it simple. FYI, Bowtie2 can take both a set of paired reads and single-end reads for alignment in one go; I don't think BWA can. I would still recommend using the 'both surviving' reads.
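That said, if you ever want to use the singletons, a single Bowtie2 run over both the paired and unpaired Trimmomatic outputs would look roughly like this (hg_index is a placeholder for your index prefix):

bowtie2 -x hg_index \
-1 output_forward_paired.fq.gz -2 output_reverse_paired.fq.gz \
-U output_forward_unpaired.fq.gz,output_reverse_unpaired.fq.gz \
-S sample.sam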
Thank you Ashutosh Pandey. Is it normal to lose 10-12 GB of FASTQ data during trimming of a human WGS sample at about 30X coverage? My initial forward and reverse FASTQ files were 38 GB each, but after running Trimmomatic they reduced to 26 GB each (paired reads only). Although the 'both surviving' rate was ~94%, I don't want to lose important data during trimming; that's why I asked this question, to be sure about the Trimmomatic parameters.
OK, got it. It is not normal to lose one-third of the data. Since ~94% of pairs survived, you are losing data at the nucleotide level rather than at the read level: a lot of individual bases were trimmed away, but the trimmed reads were still long enough to pass the MINLEN filter. As Noolean said, you should check the average and median read lengths before and after trimming; you can do this analysis on a subset of the reads. You should also tell us about the platform used and the read length.
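For instance, the mean read length over the first 100,000 reads (400,000 FASTQ lines) of a trimmed file can be estimated with something like:

zcat output_forward_paired.fq.gz | head -400000 | awk 'NR % 4 == 2 {n++; s += length($0)} END {print "mean read length: " s/n}'

Run the same command on the untrimmed input and compare the two numbers.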