exome sequencing for pair sample
0
0
Entering edit mode
5.8 years ago
Learner ▴ 280

I would like to know how to get the best quality reads and map them to a reference sequence. I have read this Fastq Quality Control Shootout among many other post. I have a pair ended sample and I want to get the high quality reads, what should I do ?

Also I would appreciate if you could comment on how to set the parameters for example for trimming look at this one mentioned in above post

perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq

Is it good to do this ?

I use FastQC on fastq files then I run the data through Trimmomatic, then run FastQC again, and then I compare to the original report to see how clipping modified the data.

what is the best way to do this? Trimmomatic or something else ?

genome • 1.3k views
ADD COMMENT
1
Entering edit mode

I want to get the high quality reads

Why is a good question and what is your definition of high quality? If you are aligning to a reference genome then you could potentially use reads with qualities down to Q15 or so. If you are interested in looking for SNP's then you may want to up that threshold to a higher level.

You can use current tools like bbduk, cutadapt, trimmomatic which will allow you to do quality based trimming effectively. In general I have not found a need to do this with sequence data that is from well made libraries and was recently sequenced.

ADD REPLY
0
Entering edit mode

@genomax so you would not do any QC on exome sequencing? I have read many papers and all do the QC . so why would they do that ? which steps do you think are necessary for me to take in QC and go forward especially when I have a paired ended sample . I added more info in my question

ADD REPLY
1
Entering edit mode

so you would not do any QC on exome sequencing?

I did not say that one should do not do QC. You should absolutely do QC for all NGS data before deciding if you need to take corrective action for a specific issue(s) you found in QC.

If you use the command line quoted above you are doing a blanket removal of 10 bases from beginning and end of reads without any consideration of quality. Does FastQC report suggest that you need to do this removal?

Having paired-end data has no special bearing on quality. You do QC on paired-end reads independently in FastQC. FastQC results have to be considered in context of the experiment. Because you see some fails (red X's) does not automatically flag your data as bad. Take a look at a collection of informative blog posts about FastQC that FastQC authors have put together here: https://sequencing.qcfail.com/software/fastqc/ One or more may be applicable for your data.

ADD REPLY
0
Entering edit mode

@genomax I liked your answer, actually I am looking at the report. I see that the N content in my data is a bit high. the GC distribution is ok, ACGT is equally seen in the data (at least In one pari) .I didn't not have any adaptor in my forward read. I just trimmed my data and I got this. Do you know what does that mean ? Input Read Pairs: 23267875 Both Surviving: 22889093 (98.37%) Forward Only Surviving: 301123 (1.29%) Reverse Only Surviving: 61783 (0.27%) Dropped: 15876 (0.07%) TrimmomaticPE: Completed successfully. Note that it gives me 4 outputs, which package now I should use for mapping or other stuff?

ADD REPLY
1
Entering edit mode

Depends on where those N's are and how many of them there are. I am surprised to see N's. That may be an indication that your library was not great (assuming there was no issue with the run). Looks like most of your data is passing the settings you used for trimmomatic. You can use any NGS aligner for the mapping. You will find this tutorial useful.

ADD REPLY
0
Entering edit mode

@genomax I liked your answer, you are amazing ! I have just a stupid question , when I use trimmomatic it gives me 4 outputs , when I want to use any aligner algorithm , they only need 2 input (forward and backward) which one of those 4 files I should use for aligner ? Thanks a lot

ADD REPLY
0
Entering edit mode

You will likely want to use reads that survived trimming and are still paired. Those should have file names that look like *_1P.fq.gz and *_2P.fq.gz. Or if you gave explicit file names (instead of using -baseout) then use those. Did you address the N's?

ADD REPLY
0
Entering edit mode

@genomax it gives me output_forward_paired.fq.gz, output_forward_unpaired.fq.gz , output_reverse_paired.fq.gz and output_reverse_unpaired.fq.gz so should I use those ones that are paired in their name? I ran it like this java -jar ~/Desktop/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 ~/Desktop/Trimmomatic-0.38/forward_001.fastq.gz ~/Desktop/Trimmomatic-0.38/reverse_001.fastq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:~/Desktop/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 . Related to N, I read other posts and they wrote trimming will help to get rid of it. It is at the end of base though

ADD REPLY
1
Entering edit mode

Correct, use output_forward_paired.fq.gz and output_reverse_paired.fq.gz.

Your N's are not at the end? If it is a small fraction then the aligner hopefully will take care of not aligning those reads.

ADD REPLY
0
Entering edit mode

@genomax , I liked your answer , thanks for your help my N is at the end , I am wondering when one does the aligner, then what should we do? I see that people use samtools to sort, reheader, removing PCR duplicates, indexing etc etc. what would be the best to do after aligning ? my main question that I want to understand is the mutation in whole exome sequencing

ADD REPLY

Login before adding your answer.

Traffic: 1652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6