I am trying to do variant calling on exome-sequencing data produced on a HiSeq 2000. I think I need to trim the adapters before doing the BWA alignment. I found the cutadapt program and it looks suitable. However, before I use cutadapt to process a large amount of data, I would like to confirm the settings with this community:
I am not exactly sure which adapters were used, but in an Illumina technical document I found the following sequences, which are common to all their adapters:
BBDuk is also very good at trimming adapters. It is part of BBMap. I have found that it works very well for paired-end Illumina data, and it even has the common adapters built in.
Could you please provide the code you are using? I'm trying to do that, but it seems that I have to trim both right and left adaptors (ktrim=r ktrim=l). Otherwise, adaptors are still present in the second read.
Hi, I have questions about adaptor trimming.
1. I ran FastQC on my sample, and the results showed Illumina Universal Adaptor contamination. Should I trim the Illumina Universal Adaptor, or find the adaptor sequences based on the authors' library preparation method?
2. The Illumina Universal Adaptor sequence is AGATCGGAAGAG. Why does the tutorial suggest trimming AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC instead?
The full adapter sequence is much longer than AGATCGGAAGAG, and even longer than the second example. Adapter recognition works by matching from the start of the adapter sequence, so both specifications will behave almost the same way.
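To illustrate why a short prefix and a longer adapter usually trim at the same position, here is a minimal Python sketch. It is a toy exact-match trimmer, not how cutadapt is implemented (cutadapt also tolerates mismatches), and the function name and the example insert sequence are made up for illustration.

```python
# Toy 3' adapter trimmer: exact matching only (an assumption made for brevity;
# real trimmers such as cutadapt also allow a small error rate).

LONG_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
SHORT_ADAPTER = "AGATCGGAAGAG"  # the first 12 bases of LONG_ADAPTER


def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the whole adapter occurs somewhere in the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only the beginning of the adapter fits before the read ends.
    # Try the longest possible partial match first.
    for overlap in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


# Hypothetical insert followed by the first 20 bases of the adapter,
# as happens when the fragment is shorter than the read length.
insert = "ACGTACGTTTGACCA"
read = insert + LONG_ADAPTER[:20]

for name, adapter in (("short", SHORT_ADAPTER), ("long", LONG_ADAPTER)):
    print(name, trim_3prime_adapter(read, adapter))
```

Both specifications cut at the start of the adapter, because the match is anchored on the adapter's first bases. The practical difference is that a very short sequence is somewhat more likely to match genuine insert sequence by chance.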
Cutadapt is great, and it's what most people use (with or without TrimGalore). However, not all Illumina adapters are necessarily the same - e.g. for an sRNA-seq experiment, the sequences would be different from those. They should work in most cases though. What I ended up doing is running a script on my raw sequence files to make sure that the adapters I'm trimming are actually there, and then using cutadapt to trim them.
I also wrote a blog post about it, in case it's of interest.
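For anyone who wants to run a similar sanity check, here is a rough Python sketch of the idea. It is not the script mentioned above; the function name, the file name in the comment, and the cap on the number of reads are my own assumptions. It reports the fraction of reads that contain a given adapter sequence exactly.

```python
from itertools import islice


def adapter_fraction(lines, adapter, max_reads=100_000):
    """Fraction of reads whose sequence contains `adapter` (exact match).

    `lines` is any iterable of FASTQ lines, e.g. an open file handle.
    Assumes plain 4-line FASTQ records (no wrapped sequences).
    """
    total = hits = 0
    # The sequence is the 2nd line of every 4-line record: indices 1, 5, 9, ...
    for seq in islice(lines, 1, max_reads * 4, 4):
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


# Example usage (hypothetical file name):
#
#   import gzip
#   with gzip.open("sample_R1.fastq.gz", "rt") as handle:
#       print(adapter_fraction(handle, "AGATCGGAAGAG"))
```

If the fraction is close to zero for the sequence you planned to trim, the library probably carries a different adapter, and it is worth checking the library preparation protocol before trimming.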
Thanks for the link. These problems often bite unexpectedly and are very annoying to track down - seemingly no one knows which adapters were put on, and everyone keeps punting the question around.
Can I try to remove every Illumina adapter (maybe 100 adapters in total) when the adapter is unknown? I mean in theory. Maybe it would not be realistic to do in practice.
cutadapt is fine. I have recently moved to Trim Galore, which is really just a wrapper around cutadapt which simplifies handling of paired-end reads and some other things. By default, Trim Galore looks for a 13-mer from the Illumina standard: AGATCGGAAGAGC, which is found in your adapter 1 sequence (starting from position 2 in that sequence; I am not sure why the G is not included).
Yup, I have also found Trim Galore easy to use, and it also takes care of orphan reads (read pairs where one read gets discarded because it can't pass the QC step) in the case of paired-end data. Aligners like BWA require your forward and reverse reads to be in the same order in the two FASTQ files.
Thanks for the reply. So you use the "--paired" option for your trim_galore run? Do you recommend it if I use BWA later? PS: I have paired-end reads.
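If you want to verify that two FASTQ files are still in the same order after trimming, something like the following Python sketch can help. The function names are hypothetical, and it assumes 4-line FASTQ records whose read names either end in /1 and /2 or carry the mate number after a space.

```python
from itertools import islice, zip_longest


def read_id(header: str) -> str:
    """Read name without the leading '@' and without any mate suffix."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.startswith("@"):
        name = name[1:]
    # Older headers mark the mate with a trailing /1 or /2.
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_unpaired_record(r1_lines, r2_lines):
    """Index of the first record whose names differ, or None if all pair up.

    A missing record in either file (different read counts) also counts
    as a mismatch.
    """
    headers1 = islice(r1_lines, 0, None, 4)  # header is line 1 of each record
    headers2 = islice(r2_lines, 0, None, 4)
    for i, (h1, h2) in enumerate(zip_longest(headers1, headers2)):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return i
    return None


# Example usage (hypothetical file names):
#
#   with open("trimmed_R1.fastq") as f1, open("trimmed_R2.fastq") as f2:
#       print(first_unpaired_record(f1, f2))
```

A return value of None means every record in the first file has its mate at the same position in the second file, which is what BWA expects for paired-end input.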
Cutadapt is fine. You can also try the Fastx or NGSQC toolkits. Fastx allows you to handle paired-end data as well. As Jelena rightly pointed out, the type and length of the adapters depend on the kind of work you are performing. All of these tools can help you trim off the adapter sequences.
You are quite right, I have added some more technical info to the bottom of the readme file. Are there any other particular things that you would like to know?
Other observations: I would move the licensing to the end, since it is really not that important, and put what the tool does first. That is what people look for; when I go to a tool's page I want to know what the tool does right away:
We developed a tool to automatically detect which adaptors and primers are present in a FASTQ file and remove those sequences from the file, as well as detecting the quality score encoding type used and removing low quality sequences.