Hi all
I downloaded some paired-end sequencing files in FASTQ format.
I would like to make sure there are no hidden problems that might be harder to detect at a later stage, so I ran FastQC to check sequence quality before analyzing them.
Unfortunately, some samples fail several modules. The ones I run into most often are per base sequence content, sequence duplication levels, and adapter content.
I heard that there are tools that remove bad or biased reads, so I searched and found Trimmomatic, which seemed to be what I was looking for. However, after reading its manual in detail, I don't think it fits my case, because my problems are mainly with sequence duplication levels, per base sequence content, and k-mer content rather than base quality or related quality metrics.
So I need another tool that suits my case, but I haven't found one.
Can you suggest or recommend any preprocessing tools for me?
I am especially interested in resolving sequence duplication levels or per base sequence content.
Regarding the duplications, I guess you are getting too many overrepresented sequences? If you want to remove them, you can extract these sequences from the FastQC report and then filter them with Trimmomatic. But I'm not sure you want to do this; normally I only remove adapter sequences, not all overrepresented sequences (polyA...). Regarding per base sequence content, it is normal not to have a perfect distribution. If you're working with RNA-seq data, it's common to have a bias at the beginning of the read; you can remove the first 10 bp of the reads with Trimmomatic to fix it. I usually don't care too much about these parameters; adapter content and per base quality are the most critical ones.
Thank you for your comments.
I am using whole-exome sequencing for my study, and I have some questions about your reply.
Judging by the sequence duplication levels, I am getting too many overrepresented sequences. However, this module only shows a distribution, so I have no way of getting the sequences themselves.
Anyway, as you mentioned, the most critical modules are adapter content and per base quality, and in my case only adapter content is a problem. So can I resolve the adapter content problem with Trimmomatic?
I am not actually using Trimmomatic yet, so I am not sure whether this tool will solve my case.
You can find the sequences in the "Overrepresented sequences" table. It shows each sequence, how many times it appears, its percentage, and its possible source. In this last field you can see which adapters were used in your library preparation. If the overrepresented sequence is an adapter, you'll most probably see a tag indicating it (TruSeq, Illumina...). So yes, you can use Trimmomatic to remove them. When you download Trimmomatic, there is a file called adapters.fa (or something similar), a FASTA file containing most of the adapters used in sequencing (most probably your adapters are included here). You can give this file to Trimmomatic using the ILLUMINACLIP argument, and it will look for each of the adapters in the adapters.fa file in your fq files and remove them. Another possibility is to create your own adapters.fa file: extract your specific adapter sequences from the FastQC overrepresented sequences, create a FASTA file with them, and give it to Trimmomatic.
Thanks.
I am curious because adapter content is flagged as bad while overrepresented sequences is fine, so I didn't get any adapter-related sequences :(
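For anyone following along, here is a minimal sketch of the Trimmomatic approach suggested in the reply above (adapter FASTA plus ILLUMINACLIP, with an optional head crop). It is written in Python and only builds the command, it does not run anything. The function names, the jar name, the output file naming, the example adapter sequence, and the 2:30:10 ILLUMINACLIP thresholds (the values used in the Trimmomatic manual's example) are all assumptions, so adjust them to your installation and library kit.

```python
import shlex


def write_adapter_fasta(adapters, path):
    """Write a {name: sequence} mapping as a FASTA file usable by ILLUMINACLIP."""
    with open(path, "w") as handle:
        for name, seq in adapters.items():
            handle.write(f">{name}\n{seq.upper()}\n")


def trimmomatic_pe_command(r1, r2, adapter_fasta, out_prefix,
                           jar="trimmomatic.jar", headcrop=0):
    """Build the argument list for a paired-end Trimmomatic run.

    jar and the output naming scheme are hypothetical; point them at
    whatever your own setup uses.
    """
    cmd = [
        "java", "-jar", jar, "PE",
        r1, r2,
        # Paired-end mode expects four outputs: paired/unpaired for each mate.
        f"{out_prefix}_R1.paired.fastq.gz",
        f"{out_prefix}_R1.unpaired.fastq.gz",
        f"{out_prefix}_R2.paired.fastq.gz",
        f"{out_prefix}_R2.unpaired.fastq.gz",
        # ILLUMINACLIP:<fasta>:<seed mismatches>:<palindrome threshold>:<simple threshold>
        f"ILLUMINACLIP:{adapter_fasta}:2:30:10",
    ]
    if headcrop > 0:
        # Optionally drop the first N bases of every read (e.g. 10).
        cmd.append(f"HEADCROP:{headcrop}")
    return cmd


if __name__ == "__main__":
    # Example values only: replace the file names and adapter with your own.
    cmd = trimmomatic_pe_command(
        "sample_1.fastq.gz", "sample_2.fastq.gz",
        "my_adapters.fa", "sample_trimmed",
    )
    print(shlex.join(cmd))
```

If FastQC lists an adapter in the overrepresented sequences table, `write_adapter_fasta({"my_adapter": "AGATCGGAAGAGC"}, "my_adapters.fa")` would create the custom file (the sequence here is just an example). If nothing is listed there, as in the last post, `adapter_fasta` can point at the adapter file that ships with Trimmomatic instead. Once the printed command looks right, it can be run as is or passed to `subprocess.run(cmd, check=True)`.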