Dear all,
I have a question about screening my Illumina HiSeq reads for contaminants. The data are RNA-seq, and the libraries were prepared with the NEBNext kit.
Initially I cleaned and trimmed the data with Trimmomatic, using a custom adapter file. This included trimming the reads for the NEBNext adapters, as provided in this input file:
>Prefix_AdapterPE1/1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Prefix_AdapterPE1/2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAATCCGTCATCTCGTATGCCGTCTTCTGCTTG
Afterwards I ran a local BLAST search using the UniVec contaminant database as the BLAST database. For just a single sample I got 108182 hits (of 44 million in total). Strangely, the hits were to the following sequences (duplicates removed):
GU593054.1
GU593054.1
J02459.1
JN581377.1
JX069762.1
JX069762.1
JX069764.1
KF680545.1
KF853601.1
L07041.1
U03498.1
Z22761.1
So this includes cloning vectors etc. Moreover, I still found remaining NEBNext adapters, but also TruSeq DNA adapters, cloning vectors (as above) and Rubicon Genomics Thruplex PCRtags (e.g. NGB01061.1, NGB00761.1, NGB01061.1). See also here: https://www.ncbi.nlm.nih.gov/tools/vecscreen/uvcurrent/
Can somebody help me understand how this is possible, given that NEBNext was used for the library prep? Is it best to remove the potentially problematic reads (100000 out of 44 million is, after all, very little)?
Thanks very much in advance!
Since you are going to align the raw reads to a reference genome, all the adapter and vector sequences will go away after alignment.
Thanks for your answers,
I do not have a reference genome, so this will be a de novo assembly. I therefore think it is beneficial to remove (potentially) dodgy reads, and I think removing 200k reads out of 44 million in total will not influence my data set much. But I would like to hear your opinion on this.
@Friederike, thanks. Indeed, I noticed a quite close resemblance between e.g. cloning vectors and Thruplex index adapters (e.g. both are hit with BLAST). As for the cloning vectors etc. that I find: could these be the result of cross-contamination (foreign DNA prepared with different adapters)?
I am still in doubt whether my Trimmomatic run has actually done a proper job. In the adapter file I specified for trimming, I included the following adapter pair:
But when I run my sequences against the UniVec database, I still get BLAST hits to NEBNext adapters, like the following sequence:
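For removing the flagged reads, a minimal Python sketch could look like the one below. It assumes the BLAST run was saved in tabular format (`-outfmt 6`), where the first column is the query (read) ID, and that the reads are in plain 4-line FASTQ. The function names and the toy records are hypothetical, not taken from any tool mentioned in this thread.

```python
import io


def blast_query_ids(blast_tab_lines):
    """Collect query IDs (first column) from BLAST tabular output (-outfmt 6)."""
    ids = set()
    for line in blast_tab_lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def filter_fastq(fastq_in, fastq_out, bad_ids):
    """Copy 4-line FASTQ records whose ID is not in bad_ids.

    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        # Read ID = header without the leading '@', up to the first whitespace.
        read_id = record[0][1:].split()[0]
        if read_id in bad_ids:
            dropped += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, dropped


# Toy data (hypothetical read names; the subject ID is one from the hit list).
blast_hits = ["read2\tJ02459.1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-20\t93.5\n"]
fastq = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGGGCCCC\n+\nIIIIIIII\n"
)
cleaned = io.StringIO()
kept, dropped = filter_fastq(fastq, cleaned, blast_query_ids(blast_hits))
print(f"kept={kept} dropped={dropped}")
```

For paired-end data the same ID set would have to be applied to both files so that mates stay in sync, and IDs carrying `/1` or `/2` suffixes would need to be normalised first.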
(See the bold overlap)
Thanks again.
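One quick way to check whether adapter sequence survived trimming, independent of BLAST, is to count reads that still contain a short adapter "seed". The sketch below is only an illustration: the two adapter sequences are the ones from the adapter file in the question, but the seed length, the choice of seeds (the start of the second adapter and the start of the reverse complement of the first) and the toy reads are my own assumptions. It uses exact matching only, so adapters with sequencing errors would be missed.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_reads_with_seed(reads, seed):
    """Count reads that contain the seed as an exact substring."""
    return sum(1 for read in reads if seed in read)


# Adapter sequences as listed in the adapter file in the question.
PE1_1 = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
PE1_2 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACAATCCGTCATCTCGTATGCCGTCTTCTGCTTG"

SEED_LEN = 12  # assumption: arbitrary seed length chosen for illustration
seeds = {
    "start of PE1/2": PE1_2[:SEED_LEN],
    "start of revcomp(PE1/1)": revcomp(PE1_1)[:SEED_LEN],
}

# Hypothetical trimmed reads; in practice these would come from the FASTQ file.
reads = [
    "ACGTACGTTTGACCAGATCGGAAGAGCACACG",  # still carries adapter sequence
    "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA",  # clean
]

for name, seed in seeds.items():
    print(name, seed, count_reads_with_seed(reads, seed))
```

If a noticeable number of reads still match after trimming, that would suggest the trimming settings, rather than the adapter file itself, are worth revisiting.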
If you are willing, I suggest that you try bbduk.sh from the BBMap suite. Here is a guide on how to use it. Put the NEBNext adapter sequence in a separate file (or add it to the adapters.fa file in the resources directory in the software).
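As a rough sketch of what such a run might look like, the Python snippet below only assembles and prints a bbduk.sh command line for paired-end adapter trimming. The file names are hypothetical, and the parameter values (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) are the ones I recall from the adapter-trimming example in the BBDuk guide, so please check them against the guide before using them.

```python
import shlex


def bbduk_command(in1, in2, out1, out2, adapters_fa):
    """Build a bbduk.sh argument list for paired-end adapter trimming."""
    return [
        "bbduk.sh",
        f"in1={in1}",
        f"in2={in2}",
        f"out1={out1}",
        f"out2={out2}",
        f"ref={adapters_fa}",  # FASTA file holding the NEBNext adapter sequences
        "ktrim=r",             # trim the matching k-mer and everything to its right
        "k=23",                # k-mer length used for adapter matching
        "mink=11",             # allow shorter k-mers at the read ends
        "hdist=1",             # tolerate one mismatch per k-mer
        "tpe",                 # trim both reads of a pair to the same length
        "tbo",                 # also trim adapters detected via pair overlap
    ]


# Hypothetical file names, for illustration only.
cmd = bbduk_command(
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "clean_R1.fastq.gz",
    "clean_R2.fastq.gz",
    "nebnext_adapters.fa",
)
print(shlex.join(cmd))
# To actually run it (requires BBMap on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if the file paths contain spaces.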