Question

Removal of Adapter Sequences

10.1 years ago

Saulius Lukauskas ▴ 540

I am looking at remapping the short reads from one of the earliest histone modification ChIP-seq datasets (i.e. Barski et al., 2007, Cell) to the genome. From the rather brief methods section of the paper, it seems they used the Solexa 1G Genome Analyzer for the sequencing, which is now almost 10 years old.

Does anyone know which adapter sequences this analyser uses? Is there a database or something?

EDIT:

To clarify, I am confused because different sources list different sequences.

For instance, the scythe adapter sequence set (https://github.com/vsbuffalo/scythe/blob/master/illumina_adapters.fa) lists AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT as the sequence, whereas a post on SEQanswers (http://seqanswers.com/forums/showthread.php?t=198) gives a different set of sequences.

ChIP-Seq Solexa adapter-sequences • 3.1k views

Updated 3.0 years ago by Ram 45k • written 10.1 years ago by Saulius Lukauskas ▴ 540

Ram · Answer 1 · 2015-03-14

10.1 years ago

Adrian Pelin ★ 2.7k

You could try the FastQC utility on a sample of 1M reads, or even on the whole dataset. The tool is usually good at finding commonly used adapters and over-represented sequences. Once you have an idea of what the adapter sequence is, you can try trimming it with SeqPrep, which will tell you how many reads contained an adapter and were thus trimmed. Another way to validate the adapter sequence is to map your reads to a reference genome and look for mismatches at the 3' and 5' ends of the reads; the first nucleotide of the adapter should show up as a mismatch to the reference.
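Before running a trimmer, a quick sanity check is to count how many reads contain a candidate adapter at all. Below is a minimal Python sketch of that idea, not what SeqPrep or FastQC do internally. It assumes plain four-line FASTQ records and exact matching only (no mismatches allowed). The helper names `adapter_start`, `iter_fastq_sequences` and `count_adapter_reads` are made up for illustration, and the candidate sequence is simply the one quoted from the scythe set in the question, which may or may not apply to Solexa 1G libraries.

```python
# Candidate adapter quoted in the question (scythe adapter set). Whether it
# is the right one for this dataset is an assumption to be tested.
CANDIDATE_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def adapter_start(read, adapter, min_overlap=5):
    """Return the index where the adapter begins in `read`, or -1.

    Exact matching only: the stretch of the read starting at that index
    must equal the beginning of the adapter, either in full or truncated
    by the end of the read (at least `min_overlap` bases).
    """
    for i in range(len(read) - min_overlap + 1):
        segment = read[i:i + len(adapter)]
        if adapter.startswith(segment):
            return i
    return -1


def iter_fastq_sequences(lines):
    """Yield the sequence line of each record, assuming 4-line FASTQ."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_adapter_reads(sequences, adapter, min_overlap=5):
    """Return (reads_with_adapter, total_reads)."""
    with_adapter = total = 0
    for seq in sequences:
        total += 1
        if adapter_start(seq, adapter, min_overlap) >= 0:
            with_adapter += 1
    return with_adapter, total
```

To use it on a real file, pass an open file handle (for example from `open("reads.fastq")`, where the file name is a placeholder) to `iter_fastq_sequences` and feed the result to `count_adapter_reads`. A fraction near zero for every candidate would suggest the reads are too short to run into the adapter.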

Written 10.1 years ago by Adrian Pelin ★ 2.7k

Yeah, FastQC is what prompted me to look at adapters in the first place. I get a big over-representation of 7-mers at the 5' end of the reads, namely:

Sequence     Count     PValue     Obs/Exp Max     Max Obs/Exp Position
TGATCGG      4760      0.0        14.944738       1
TAGGGCA      10770     0.0        13.59925        1
TAGGGCG      5845      0.0        13.444971       1
TAGGGAC      7710      0.0        13.393038       1
TAGGGGA      7290      0.0        13.347217       1
TAGGGAG      10435     0.0        13.295226       1
TAGCGAG      3495      0.0        13.293988       1
TAGCGCG      3710      0.0        13.226308       1
TAGCGGC      4930      0.0        13.14511        1
TAGGGGC      9440      0.0        12.822517       1
TAGCGAC      3280      0.0        12.802791       1
TAGCGAA      2580      0.0        12.775723       1
TAGGGGG      7060      0.0        12.687389       1
TAGGGAA      8150      0.0        12.532882       1
TAGGCGA      2915      0.0        12.521299       1
TACGGCG      2205      0.0        12.457054       1
TAGGGCC      7555      0.0        12.361421       1
TAGCGGA      3060      0.0        12.35397        1
TTGGGCG      6700      0.0        12.340709       1
TAGGCAG      10400     0.0        12.229798       1

It seems that "TAGG" is quite a problem here. Not sure how to proceed from here, though.

Updated 3.0 years ago by Ram 45k • written 10.1 years ago by Saulius Lukauskas ▴ 540

Hmmm, you are looking at the 7-mers, which may not be that informative about adapters; what you certainly have here is BIAS in what is being sequenced. It seems to me that fragments starting with a TAG[GC] motif, as you mentioned, are more likely to get sequenced than others. The reason I think this is not just adapters is that these k-mers differ from each other by a couple of nucleotides, whereas you would expect adapters to be pretty consistent. Do you find anything in the over-represented sequences? If not, the adapter portion of your reads might be too short, so the adapters are not really being sequenced.
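To make the "these k-mers vary" point concrete, here is a rough Python sketch that tabulates, position by position, which bases occur across the 7-mers listed above, weighted by their counts. It only uses the 20 rows shown in the FastQC table, so it says nothing about the rest of the reads, and the function names `position_profile` and `consensus` are just illustrative.

```python
from collections import Counter

# 7-mers and counts copied from the FastQC k-mer table in the comment above.
KMER_COUNTS = {
    "TGATCGG": 4760, "TAGGGCA": 10770, "TAGGGCG": 5845, "TAGGGAC": 7710,
    "TAGGGGA": 7290, "TAGGGAG": 10435, "TAGCGAG": 3495, "TAGCGCG": 3710,
    "TAGCGGC": 4930, "TAGGGGC": 9440, "TAGCGAC": 3280, "TAGCGAA": 2580,
    "TAGGGGG": 7060, "TAGGGAA": 8150, "TAGGCGA": 2915, "TACGGCG": 2205,
    "TAGGGCC": 7555, "TAGCGGA": 3060, "TTGGGCG": 6700, "TAGGCAG": 10400,
}


def position_profile(kmer_counts):
    """Per-position base tallies, weighted by each k-mer's count."""
    k = len(next(iter(kmer_counts)))
    profile = [Counter() for _ in range(k)]
    for kmer, count in kmer_counts.items():
        for pos, base in enumerate(kmer):
            profile[pos][base] += count
    return profile


def consensus(profile):
    """Most frequent base at each position."""
    return "".join(column.most_common(1)[0][0] for column in profile)


if __name__ == "__main__":
    profile = position_profile(KMER_COUNTS)
    for pos, column in enumerate(profile, start=1):
        total = sum(column.values())
        shares = ", ".join(
            f"{base}={count / total:.2f}" for base, count in column.most_common()
        )
        print(f"position {pos}: {shares}")
    print("consensus:", consensus(profile))
```

A true adapter would be expected to give essentially one base per position along its whole length. A profile that is fixed at the first few positions and then spreads over several bases looks more like a sequence preference at fragment starts.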

Written 10.1 years ago by Adrian Pelin ★ 2.7k

Nope, no over-represented sequences. Adapter content is also green, though only the Illumina Universal Adapter, Illumina Small RNA Adapter and Nextera Transposase Sequence are checked, which, I believe, belong to the newer models.

Written 10.1 years ago by Saulius Lukauskas ▴ 540

Alright, I would say you are ready to try mapping these guys and visually inspecting the 5' ends to see if you get many mismatches. Let me know how it goes!
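If eyeballing alignments gets tedious, the same check can be roughly automated by tallying mismatches by distance from the 5' end of each read. The Python sketch below does this from the `MD:Z` tag that most aligners write to SAM output. It assumes alignments without insertions or soft clipping, so that offsets in the MD string coincide with offsets in the read, and it expects you to extract the MD value, read length and strand flag (SAM flag bit 0x10) from each record yourself. The function names are hypothetical.

```python
import re
from collections import Counter

# An MD value is a series of match-run lengths, mismatched reference bases,
# and deletions (a caret followed by the deleted reference bases).
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def mismatch_positions(md):
    """Return 0-based offsets of mismatches encoded in an MD:Z value.

    Assumes no insertions or soft clipping in the alignment, so the
    running offset is also the offset within the stored read sequence.
    """
    positions = []
    offset = 0
    for run, deletion, mismatch in MD_TOKEN.findall(md):
        if run:
            offset += int(run)
        elif mismatch:
            positions.append(offset)
            offset += 1
        # Deleted reference bases do not consume any read bases.
    return positions


def five_prime_histogram(records):
    """Tally mismatches by distance from the read's 5' end.

    `records` yields (md, read_length, is_reverse) tuples. Reverse-strand
    reads are stored reverse-complemented in SAM, so offsets are flipped.
    """
    hist = Counter()
    for md, read_len, is_reverse in records:
        for pos in mismatch_positions(md):
            hist[read_len - 1 - pos if is_reverse else pos] += 1
    return hist
```

A sharp excess of mismatches at the first one or two positions, compared with the middle of the reads, would point to leftover adapter or linker bases at the 5' end. A flat profile would argue against it.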

Written 10.1 years ago by Adrian Pelin ★ 2.7k