Question

Screening Illumina reads for contamination

6.5 years ago

T_18 ▴ 50

Dear all,

I have a question about screening my Illumina HiSeq reads for contaminants. The data are RNA-seq, and the libraries were prepared with the NEBNext kit.

I first cleaned and trimmed the data with Trimmomatic, using a custom adapter file. This included trimming the reads for the specific NEBNext adapters listed in that file:

>Prefix_AdapterPE1/1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Prefix_AdapterPE1/2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACAATCCGTCATCTCGTATGCCGTCTTCTGCTTG

Afterwards I ran a local BLAST search using the UniVec contaminant database as the BLAST database. For a single sample I got 108182 hits (out of 44 million reads in total). Strangely, the hits were to the following sequences (duplicates removed):

GU593054.1
GU593054.1
J02459.1
JN581377.1
JX069762.1
JX069762.1
JX069764.1
KF680545.1
KF853601.1
L07041.1
U03498.1
Z22761.1

So this includes cloning vectors and the like. Moreover, I still found remaining NEBNext adapters, as well as TruSeq DNA adapters, cloning vectors (as above) and Rubicon Genomics Thruplex PCR tags (e.g. NGB01061.1, NGB00761.1). See also: https://www.ncbi.nlm.nih.gov/tools/vecscreen/uvcurrent/
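A tally like the one above can be produced directly from the BLAST results. The sketch below assumes the search was run with tabular output (`-outfmt 6`, where the first two tab-separated columns are the query ID and the subject ID); the thread does not say which output format was used, and the example rows are made up for illustration.

```python
from collections import Counter


def summarize_hits(lines):
    """Return (number of distinct reads with a hit, hits per subject ID).

    Assumes BLAST tabular output: column 1 = query (read) ID,
    column 2 = subject (UniVec entry) ID.
    """
    reads_with_hit = set()
    hits_per_subject = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        reads_with_hit.add(query_id)
        hits_per_subject[subject_id] += 1
    return len(reads_with_hit), hits_per_subject


# Made-up rows (only the first columns are shown; real rows have more).
example_rows = [
    "read_1\tJ02459.1\t100.0\n",
    "read_1\tL07041.1\t98.5\n",
    "read_2\tJ02459.1\t99.1\n",
]
n_reads, per_subject = summarize_hits(example_rows)
print(n_reads, "reads with at least one hit")
for subject, count in per_subject.most_common():
    print(subject, count)
```

Counting distinct reads separately from hits matters here, because one read can hit several UniVec entries and would otherwise be counted more than once.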

Can somebody explain how this is possible, given that NEBNext was used for the library prep? Is it best to remove the potentially problematic reads (after all, 100000 out of 44 million is very few)?

Thanks very much in advance!

UniVec rna-seq Illumina • 2.0k views

updated 6.5 years ago by WouterDeCoster 47k • written 6.5 years ago by T_18 ▴ 50

Since you are going to align the raw reads to a reference genome, the adapter and vector sequences will drop out at the alignment step.

6.5 years ago by Arup Ghosh 3.2k

Thanks for your answers,

I do not have a reference genome, so this will be a de novo assembly. I therefore think it is beneficial to remove (potentially) dodgy reads, and I think removing 200k reads out of 44 million in total will not affect my data set much. But I would like to hear your opinion on this.
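For scale, 108182 flagged reads out of 44 million is roughly 0.25% of the sample. A minimal sketch of dropping flagged reads from a FASTQ file is shown below. It assumes four-line FASTQ records and Illumina-style headers where the read ID is the part before the first space; the function names are hypothetical and the records are made up.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as lists of four lines (without newlines)."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def read_id(header):
    """'@ID 1:N:0:INDEX' -> 'ID' (text between '@' and the first space)."""
    return header[1:].split()[0]


def drop_flagged(handle_in, handle_out, flagged_ids):
    """Copy records whose ID is not in flagged_ids; return (kept, dropped)."""
    kept = dropped = 0
    for record in read_fastq(handle_in):
        if read_id(record[0]) in flagged_ids:
            dropped += 1
        else:
            handle_out.write("\n".join(record) + "\n")
            kept += 1
    return kept, dropped


# Made-up two-read example; in practice the handles would be open files.
fastq_in = io.StringIO(
    "@read_1 1:N:0:AACTCACC\nACGTACGT\n+\nIIIIIIII\n"
    "@read_2 1:N:0:AACTCACC\nTTTTGGGG\n+\nIIIIIIII\n"
)
fastq_out = io.StringIO()
print(drop_flagged(fastq_in, fastq_out, {"read_2"}))
```

For paired-end data, applying the same set of IDs to both the R1 and R2 files keeps the mates in sync, since both mates share the ID before the space.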

@Friederike, thanks. Indeed, I noticed quite a close resemblance between e.g. cloning vectors and Thruplex index adapters (both are hit by the same reads in BLAST). As for the cloning vectors and the like: could these be the result of cross-contamination (foreign DNA prepared with different adapters)?

I still doubt whether my Trimmomatic run did a proper job. The adapter file I supplied for trimming includes the following adapter pair:

>Prefix_AdapterPE3/1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Prefix_AdapterPE3/2
**GATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCACCATCTCGTATGC**CGTCTTCTGCTTG

But when I run my trimmed sequences against the UniVec database, I still get BLAST hits to NEBNext adapters, as in the following read:

@ST-E00126:678:HL552CCXY:1:1101:20232:3841 1:N:0:AACTCACC
CATCGACCTCCTGCTTGAAGTCAGAACGTAAGATCTGGGGCACAGGGTCAGAGGGGGCGGCCACGGCGATGGCCACAAGCGCGAGGGCAACTAAGATA***GATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCACCATCTCGTATGC***

(See the bold overlap)
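The overlap can also be checked programmatically. The sketch below looks for the longest stretch at the 3' end of a read that exactly matches the start of an adapter. The read and adapter are copied from the post with the bold markers removed; the function name is hypothetical, and exact matching is a simplification, since real trimmers tolerate mismatches.

```python
def adapter_overlap(read, adapter, min_overlap=1):
    """Length of the longest read suffix equal to an adapter prefix.

    Exact matches only; returns 0 if no overlap of at least
    min_overlap bases is found.
    """
    longest_possible = min(len(read), len(adapter))
    for k in range(longest_possible, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return k
    return 0


# Sequences from the post (Prefix_AdapterPE3/2 and the example read).
adapter = (
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCACCATCTCGTATGC"
    "CGTCTTCTGCTTG"
)
read = (
    "CATCGACCTCCTGCTTGAAGTCAGAACGTAAGATCTGGGGCACAGGGTCAGAGGGGGCGGCC"
    "ACGGCGATGGCCACAAGCGCGAGGGCAACTAAGATA"
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCACCATCTCGTATGC"
)

k = adapter_overlap(read, adapter)
print("adapter bases at the 3' end of the read:", k)
print("read with that overlap removed:", read[: len(read) - k])
```

Running a check like this over a sample of the trimmed reads gives a quick estimate of how many still carry adapter sequence at the 3' end.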

To sum up my main questions: should I be worried about the quality of my data, given that I find hits to different adapters, to cloning vectors and even (still) to my own NEBNext adapters? Or can I simply solve the issue by removing these suspicious reads?

Thanks again.

updated 6.5 years ago by WouterDeCoster 47k • written 6.5 years ago by T_18 ▴ 50

If you are willing, I suggest that you try bbduk.sh from the BBMap suite. Here is a guide on how to use it. Put the NEBNext adapter sequences in a separate file (or add them to the adapters.fa file in the resources directory of the software).

6.5 years ago by GenoMax 147k

score 0 · Answer 1 · 2018-06-07

> Can somebody help/explain me how this is possible given the fact that NEBnext was used for the lib prep?

How similar are the different adapters? In other words, could there simply be some overlap between the sequences of the TruSeq and NEBNext kits?
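One rough way to answer this is to count the k-mers two adapter sequences have in common. The sketch below is an illustration only: the TruSeq sequences are not given in this thread, so it compares the two NEBNext index adapters posted above (Prefix_AdapterPE1/2 and Prefix_AdapterPE3/2), and the choice of k=11 is arbitrary. Any other pair of adapter sequences can be substituted.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(seq_a, seq_b, k=11):
    """Fraction of k-mers shared, relative to the smaller k-mer set."""
    kmers_a, kmers_b = kmers(seq_a, k), kmers(seq_b, k)
    if not kmers_a or not kmers_b:
        return 0.0  # at least one sequence is shorter than k
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


# The two adapters posted in the thread; they differ in the index region.
adapter_a = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACAATCCGTCATCTCGTATGCCGTCTTCTGCTTG"
adapter_b = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACAACTCACCATCTCGTATGCCGTCTTCTGCTTG"

print(f"shared 11-mers: {shared_kmer_fraction(adapter_a, adapter_b):.0%}")
```

If adapters from different kits share a large fraction of k-mers, a read carrying one of them would be expected to hit the UniVec entries for the others as well.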

> Is it best to remove the potential problematic reads

Usually, the alignment step takes care of those reads: they should not map to any region of your genome and will therefore not be considered in downstream analyses. Since you don't seem to be suffering from massive contamination, you probably don't need to take additional action.