I am looking at remapping the short reads from one of the earliest histone modification ChIP-seq datasets to the genome (i.e. Barski et al., 2007, Cell). From the rather brief methods section, it seems they used the Solexa 1G Genome Analyzer for the sequencing, which is now almost 10 years old.
Does anyone know which adapter sequences this analyser uses? Is there a database or something?
EDIT:
To clarify, I am getting confused about different sources using different sequences.
For instance, the scythe adapter sequence set (https://github.com/vsbuffalo/scythe/blob/master/illumina_adapters.fa) contains AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT as the sequence, whereas a post on SEQanswers (http://seqanswers.com/forums/showthread.php?t=198) lists a different set of sequences.
Yeah, FastQC is what prompted me to look at adapters in the first place. I get a big over-representation of 7-mers at the 5' end of the reads, namely:
It seems that "TAGG" is quite a problem here. Not sure how to go from here though.
Hmmm, you are looking at the 7-mers, which may not be very informative about adapters, because what you almost certainly have here is BIAS in what is being sequenced. It seems to me that fragments starting with the TAG[GC] motif you mentioned are more likely to get sequenced than others. The reason I think this is not just adapters is that these k-mers differ from each other by a couple of nucleotides, whereas you would expect adapter sequence to be pretty consistent. Do you find anything in the over-represented sequences? If not, the adapter portion of your reads might be too short, meaning the adapters are not really being sequenced.
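If you want to check this outside FastQC, here is a rough sketch of the idea, not a definitive method: count the k-mer at the start of every read, then see whether the most frequent ones occur verbatim in a candidate adapter (or its reverse complement). The function names and the toy reads are made up for illustration; the adapter is the scythe sequence quoted earlier in the thread. In practice you would feed in the sequence lines of your FASTQ file.

```python
from collections import Counter


def five_prime_kmers(reads, k=7):
    """Count the k-mer found at the very start (5' end) of each read."""
    counts = Counter()
    for read in reads:
        read = read.strip().upper()
        if len(read) >= k:
            counts[read[:k]] += 1
    return counts


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def adapter_hits(kmers, adapter):
    """Return the k-mers that occur verbatim in the adapter or its reverse complement."""
    adapter = adapter.upper()
    targets = (adapter, reverse_complement(adapter))
    return [kmer for kmer in kmers if any(kmer in target for target in targets)]


# Toy reads, invented purely for illustration (not from the Barski data)
toy_reads = ["TAGGCATACGT", "TAGGCATTTGA", "TAGCCATACGA", "ACGTACGTACG"]
# Candidate adapter: the scythe sequence quoted in the question
adapter = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

counts = five_prime_kmers(toy_reads, k=7)
top = [kmer for kmer, _ in counts.most_common(3)]
print("most common 5' 7-mers:", counts.most_common(3))
print("of those, found in adapter:", adapter_hits(top, adapter))
```

If the top k-mers are a family of near-identical variants and none of them matches the adapter exactly, that points towards sequence bias rather than adapter read-through. Exact substring matching is a deliberately crude choice here; it will miss adapter fragments that carry sequencing errors.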
Nope, no over-represented sequences. The adapter content check is also green, though it only looks for the Illumina Universal Adapter, Illumina Small RNA Adapter and Nextera Transposase Sequence, which, I believe, are the newer ones.
Alright, I would say you are ready to try mapping these guys and visually inspecting the 5' ends to see if you get many mismatches. Let me know how it goes!