Question

Identify sequencing adaptor contamination programatically

0

Entering edit mode

8.9 years ago

John 13k

Hello :)

Is there a tool that just looks at over-represented sequences? The only one i know of is FastQC, but this does a lot of other things and the output is a little unwieldy. Ideally I just want to supply 1 or more BAM files and get out a over-represented sequence profile (bonus points for annotating them as XYZ adaptors) for each sample.

FastQC Illumina sequencing • 2.8k views

ADD COMMENT • link updated 8.9 years ago by harold.smith.tarheel ★ 5.0k • written 8.9 years ago by John 13k

Ram · Answer 1 · 2016-01-20

2

Entering edit mode

8.9 years ago

harold.smith.tarheel ★ 5.0k

If you're only interested in adapter sequences, then try BBDuk (part of BBMap) with the stats flag.

ADD COMMENT • link 8.9 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

Additionally, if you have paired reads, BBMerge can print out the adapter sequence for you based on the overlaps, even if you have adapters not in the list of adapter sequences provided with the BBMap package (in adapters.fa). For example:

bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

I think this is exactly what I am looking for Brian. Thank you both so much! :)

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by John 13k

0

Entering edit mode

Eek, wait, i got a weird result. Maybe im not doing it right?

Params:

java -Djava.library.path=/ex/bbmap/jni/ -ea -Xmx1000m -cp /ex/bbmap/current/ jgi.BBMerge in1=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz in2=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R2_001.fastq.gz outa=./adapters.fa
Executing jgi.BBMerge [in1=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R1_001.fastq.gz, in2=/mnt/Alpha/RAW_DATA/ChIP/44_Mm01_WEAd_C2/replicate1-H3K27ac/paired/run130812_SN7001180_0069_AD265MACXX/sequence/44_Mm01_WEAd_C2_H3K27ac_F_1_CAGATC_L003_R2_001.fastq.gz, outa=./adapters.fa]

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by John 13k

0

Entering edit mode

Could it be because my reads don't overlap? I have 50nt P.E. sequencing of 700bp fragments.

ADD REPLY • link 8.9 years ago by John 13k

1

Entering edit mode

Short answer, probably. If your reads don't fully overlap, you won't have adapter contamination.

Looking at the insert size distribution, you should have maybe 1% of the overlapped reads shorter than 50bp, which would mean 1%*6.5%*17591731=11000 reads with insert size shorter than 50bp, which thus should contain adapter sequence. But, a 0.00065 fraction is close enough to BBMerge's false-positive rate that the signal might be too noisy to get a confident consensus (in which case it just gives you a 1bp truncated adapter sequence, as in this case, so that the fasta file is valid). You could retry adding the flags vstrict mininsert=20, which would reduce the noise (it will still probably fail, but that's only a wasted 12 seconds); but, your 5' adapter rate in this case is so low you don't need to worry about it anyway. If you have odd FastQC results for this data, something else is the culprit.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Brian Bushnell 20k

0

Entering edit mode

I really apprechiate your quick replies (particularly over the weekend!) Brian - I really think I should be using BBMap more if this is the sort of end-user help people can expect :]

Well, FastQC's position-dependant-kmer-analysis detects stuff, and correctly identifies the kmers contributing to an Illumina TruSeq Adapter, even though it detects that in over-represented sequences and not in Adapter contamination (???)

I just find it super weird that there isnt a tool specifically for beginning-of-read over-represented-sequence analysis. Something like FastQC's kmer analysis tool, but peices together kmers to form the full string - perhaps even going back to the data with this hint and testing the full kmer more throughly.

I tried with vstrict and mininsert but alas I got the same result. I will now proceed to open 155 fastqc reports whilst scratching my fingernails across a chalkboard to lighten the mood.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by John 13k