Dear all, I'm analyzing data from TruSeq Custom Amplicon 1.5 panel.
I've read a lot about handling the sequences attached to the original region of interest and ended up confused, so any help would be highly appreciated. Maybe I'm missing something really basic here. For example, I'm not sure which sequences make up a read apart from the insert (= targeted region) itself and the flanking target-specific primers.
Illumina says that "Adapter trimming is not required for TruSeq Targeted RNA Expression, TruSeq Custom Amplicon, and TruSeq Cancer Panel when using Illumina analysis pipelines". Is that the case? Here they claim that "Each probe contains unique, target-specific sequence as well as a universal adapter sequence that is used in a subsequent amplification reaction". The aforementioned target-specific sequence is an upstream or downstream locus-specific oligo (ULSO/DLSO). Here they give only the index sequences for TSCA, no universal adapter sequence.
Please have a look at the image attached.
[Figure: MultiQC plot for adapter content]
This is an example MultiQC plot for adapter content. FastQC reports it as the Illumina universal adapter (based on finding the AGATCGGAAGAG sequence, as far as I understand from FastQC's documentation). I ran
zcat LK_S2_L001_R2_001.fastq.gz | grep AGATCGGAAGAG
and noticed that this sequence is part of a longer pattern. The command
zcat LK_S2_L001_R2_001.fastq.gz | grep -c AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAAGG
returned 8863, while
zcat LK_S2_L001_R2_001.fastq.gz | grep -c AGATCGGAAGAG
returned 11079.
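For anyone wanting to reproduce these counts more strictly, here is a minimal Python sketch that looks only at the sequence line of each FASTQ record (grep -c counts any matching line, including headers and quality strings). It assumes standard 4-line FASTQ records with unwrapped sequences; the function names are made up for illustration.

```python
import gzip

ADAPTER_PREFIX = "AGATCGGAAGAG"  # the 12-mer used in the grep commands above


def count_adapter_reads(lines, pattern=ADAPTER_PREFIX):
    """Return (total_reads, reads_containing_pattern) for FASTQ lines.

    Assumes 4 lines per record, so the sequence is the 2nd line of each record.
    """
    total = 0
    hits = 0
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:  # sequence line only
            total += 1
            if pattern in line:
                hits += 1
    return total, hits


def count_adapter_reads_in_file(fastq_gz_path, pattern=ADAPTER_PREFIX):
    """Same count, reading a gzipped FASTQ file from disk."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        return count_adapter_reads(handle, pattern)
```

Calling count_adapter_reads_in_file("LK_S2_L001_R2_001.fastq.gz") would give the number of reads and the number containing the 12-mer; passing the longer pattern as the second argument gives the count for that one.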
How should I handle it? They say that "if you use BWA-MEM, the trailing (5’) bases of a read that do not match the reference are soft-clipped, which covers those cases in which an adapter does occur". Yet Heng Li says: "...Bwa-mem will just soft clip them [adapter sequences]... However, it is still recommended to trim adapter sequences. After all, adapters are not part of the samples you are sequencing. They might affect variant calling in corner cases".
Also, I've read that you should get rid of ULSO and DLSO sequences. What is the typical approach here? I know that BAMclipper can do this after alignment.
To sum up, I'm trying to figure out (for TruSeq Custom Amplicon 1.5):
- what sequences to trim,
- when to trim them,
- which tool to use.
How do you clean TruSeq Custom Amplicon data in terms of getting rid of extraneous sequences? I'm stuck. Please be merciful and sorry for the chaotic question!
Yet another reason to not trust what Illumina says. They create good sequencing machines (though recently, each iteration has been worse) but their software is terrible, and in general, given that their latest sequencers no longer report quality scores correctly, I'm not entirely sure they have anyone in charge of their products that understands their users.
Mapping will always be more accurate if adapter sequences are removed first.
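To make concrete what "removing adapter sequences" means for a single read, here is a minimal Python sketch of 3' adapter trimming. It is only an illustration under simplifying assumptions: exact matching only, the 12-mer from the question as the adapter, and a hypothetical min_overlap parameter for partial adapters at the read end. Dedicated trimmers also tolerate mismatches and handle read pairs, so this is not a substitute for them.

```python
def trim_3prime_adapter(sequence, quality, adapter="AGATCGGAAGAG", min_overlap=5):
    """Cut the read at the first exact adapter match, or at a partial
    adapter prefix (at least min_overlap bases) hanging off the read end.

    Returns the trimmed (sequence, quality) pair; unchanged if nothing matches.
    """
    cut = sequence.find(adapter)
    if cut == -1:
        # No full adapter: look for the longest adapter prefix at the read's 3' end.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality
    # Trim bases and their quality scores together so the record stays consistent.
    return sequence[:cut], quality[:cut]
```

Everything from the adapter start onward is dropped, since bases after the adapter are not part of the insert either.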
Yes, you should trim primer sequences. They are not true "observations" of your individuals, and leaving them in may introduce artefacts in rare cases.
I haven't tried it myself, but I think https://github.com/tommyau/bamclipper does the job.
If your amplicon panel does not have any overlapping amplicons, whether to remove primers depends on whether you expect, and can tolerate missing, "rare cases" such as an SNV or INDEL near the gene-specific primers. It depends on the genes of interest and the panel design in the context of variant types and locations. In our case, mutations could be anywhere and of any type: we came across a clinical case of a BRCA1 deletion close to the gene-specific primer. That INDEL could be missed if the gene-specific primer sequence is removed from the FASTQ reads before BWA-MEM alignment (https://www.nature.com/articles/s41598-017-01703-6/figures/2).
If there are overlapping amplicons in the panel, you should consider removing the gene-specific primers for the sake of accurate sequencing depth calculation and variant calling, regardless of whether they are removed at the FASTQ level, the BAM level or the variant calling algorithm level.
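As a rough illustration of BAM-level primer removal with overlapping amplicons, here is a Python sketch of the coordinate logic only. It is not BAMClipper's actual algorithm: the function name, the half-open 0-based intervals and the tolerance parameter are all assumptions, and it ignores indels, existing soft clips and CIGAR handling. The idea is that a read end is clipped only when it starts (or ends) at a primer boundary, so reads from a neighbouring amplicon that merely pass over the primer site keep their genuine bases.

```python
def primer_clip_lengths(read_start, read_end, left_primer, right_primer, tolerance=1):
    """Return (left_clip, right_clip): reference bases to soft-clip at each read end.

    read_start/read_end and the primer tuples are half-open 0-based
    reference intervals, e.g. left_primer=(100, 125).
    """
    lp_start, lp_end = left_primer
    rp_start, rp_end = right_primer

    left_clip = 0
    # Clip only if the read begins where the left primer begins.
    if abs(read_start - lp_start) <= tolerance:
        left_clip = max(0, min(lp_end, read_end) - read_start)

    right_clip = 0
    # Clip only if the read ends where the right primer ends.
    if abs(read_end - rp_end) <= tolerance:
        right_clip = max(0, read_end - max(rp_start, read_start))

    return left_clip, right_clip
```

With primers at (100, 125) and (250, 275), a read spanning 100 to 275 gets 25 bases clipped at each end, while a read from an overlapping amplicon spanning 60 to 240 is left untouched and still contributes real coverage over the primer site.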
Yes, the amplicons are overlapping so I'll try to remove primers with BAMClipper. Thank you.
If ULSO and DLSO are upstream and downstream of the targeted sequence and I'm going to restrict variant calling using an intervals list, then do I need to trim the primers?
Restricting with an interval list will still cause you to miss some real variants that you could detect if you trimmed ULSO/DLSO. See my IGV screenshots in this post.