I would like to know how to get the best quality reads and map them to a reference sequence. I have read this Fastq Quality Control Shootout among many other post. I have a pair ended sample and I want to get the high quality reads, what should I do ?
Also I would appreciate if you could comment on how to set the parameters for example for trimming look at this one mentioned in above post
perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq
Is it good to do this ?
I use FastQC on fastq files then I run the data through Trimmomatic, then run FastQC again, and then I compare to the original report to see how clipping modified the data.
what is the best way to do this? Trimmomatic or something else ?
Why
is a good question and what is your definition of high quality? If you are aligning to a reference genome then you could potentially use reads with qualities down to Q15 or so. If you are interested in looking for SNP's then you may want to up that threshold to a higher level.You can use current tools like bbduk, cutadapt, trimmomatic which will allow you to do quality based trimming effectively. In general I have not found a need to do this with sequence data that is from well made libraries and was recently sequenced.
@genomax so you would not do any QC on exome sequencing? I have read many papers and all do the QC . so why would they do that ? which steps do you think are necessary for me to take in QC and go forward especially when I have a paired ended sample . I added more info in my question
I did not say that one should do not do QC. You should absolutely do QC for all NGS data before deciding if you need to take corrective action for a specific issue(s) you found in QC.
If you use the command line quoted above you are doing a blanket removal of 10 bases from beginning and end of reads without any consideration of quality. Does FastQC report suggest that you need to do this removal?
Having paired-end data has no special bearing on quality. You do QC on paired-end reads independently in FastQC. FastQC results have to be considered in context of the experiment. Because you see some
fails
(red X's) does not automatically flag your data as bad. Take a look at a collection of informative blog posts about FastQC that FastQC authors have put together here: https://sequencing.qcfail.com/software/fastqc/ One or more may be applicable for your data.@genomax I liked your answer, actually I am looking at the report. I see that the N content in my data is a bit high. the GC distribution is ok, ACGT is equally seen in the data (at least In one pari) .I didn't not have any adaptor in my forward read. I just trimmed my data and I got this. Do you know what does that mean ? Input Read Pairs: 23267875 Both Surviving: 22889093 (98.37%) Forward Only Surviving: 301123 (1.29%) Reverse Only Surviving: 61783 (0.27%) Dropped: 15876 (0.07%) TrimmomaticPE: Completed successfully. Note that it gives me 4 outputs, which package now I should use for mapping or other stuff?
Depends on where those N's are and how many of them there are. I am surprised to see N's. That may be an indication that your library was not great (assuming there was no issue with the run). Looks like most of your data is passing the settings you used for trimmomatic. You can use any NGS aligner for the mapping. You will find this tutorial useful.
@genomax I liked your answer, you are amazing ! I have just a stupid question , when I use trimmomatic it gives me 4 outputs , when I want to use any aligner algorithm , they only need 2 input (forward and backward) which one of those 4 files I should use for aligner ? Thanks a lot
You will likely want to use reads that survived trimming and are still paired. Those should have file names that look like
*_1P.fq.gz
and*_2P.fq.gz
. Or if you gave explicit file names (instead of using-baseout
) then use those. Did you address the N's?@genomax it gives me
output_forward_paired.fq.gz, output_forward_unpaired.fq.gz , output_reverse_paired.fq.gz and output_reverse_unpaired.fq.gz
so should I use those ones that are paired in their name? I ran it like thisjava -jar ~/Desktop/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 ~/Desktop/Trimmomatic-0.38/forward_001.fastq.gz ~/Desktop/Trimmomatic-0.38/reverse_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:~/Desktop/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
. Related to N, I read other posts and they wrote trimming will help to get rid of it. It is at the end of base thoughCorrect, use
output_forward_paired.fq.gz
andoutput_reverse_paired.fq.gz
.Your N's are not at the end? If it is a small fraction then the aligner hopefully will take care of not aligning those reads.
@genomax , I liked your answer , thanks for your help my N is at the end , I am wondering when one does the aligner, then what should we do? I see that people use samtools to sort, reheader, removing PCR duplicates, indexing etc etc. what would be the best to do after aligning ? my main question that I want to understand is the mutation in whole exome sequencing