I have an interesting (to me) problem that I'm not sure how to approach. I have a library of randomized short inserts (21 nt) that has been sequenced on the SOLiD platform with 25 nt reads. The insert is at the very start of each read. I want to count the distinct insert sequences. The straightforward way appears to be to convert the reads to fastq, filter on quality, and count in base space. I'm worried about sequencing errors, because the only check I can see is the last 4 nt (positions 22-25), which should be identical in all reads. Any suggestions or interesting approaches to accomplish this? I am not experienced with NGS projects, so sorry if this is a dumb question.
@Farhat Thanks. I had only ever considered the difficulties of colorspace, and forgot that it could be useful to stay in it as long as possible. I need to count the frequency of each distinct insert sequence (which I guess I wasn't so clear about in my question), rather than the total number of distinct sequences. But your answer was helpful in steering me away from premature conversion to basespace. And I can easily change the set-based approach to a hash-based one to get what I want.
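For anyone finding this later, here is a minimal sketch of the hash-based counting I have in mind. It is only an illustration under some assumptions: the reads are already loaded as plain strings (with any leading primer base stripped), the function name `count_inserts` is made up, and the `TAIL` value is a placeholder rather than my real constant sequence. Reads whose tail does not match are dropped and tallied, which gives a rough idea of the error rate.

```python
from collections import Counter

INSERT_LEN = 21   # insert length from the question
TAIL = "ACGT"     # placeholder constant tail (positions 22-25); not the real sequence


def count_inserts(reads, insert_len=INSERT_LEN, tail=TAIL):
    """Count how often each insert occurs, keeping only reads whose tail matches.

    Returns (counts, rejected): a Counter mapping insert -> frequency, and the
    number of reads dropped because they were too short or had a wrong tail.
    """
    counts = Counter()
    rejected = 0
    for read in reads:
        insert = read[:insert_len]
        observed_tail = read[insert_len:insert_len + len(tail)]
        if len(insert) < insert_len or observed_tail != tail:
            rejected += 1
            continue
        counts[insert] += 1
    return counts, rejected


if __name__ == "__main__":
    demo_reads = [
        "A" * 21 + "ACGT",
        "A" * 21 + "ACGT",
        "C" * 21 + "ACGT",
        "G" * 21 + "ACGA",  # tail mismatch, likely a read error
    ]
    counts, rejected = count_inserts(demo_reads)
    for insert, n in counts.most_common():
        print(insert, n)
    print("rejected:", rejected)
```

The same function should work on colorspace strings if `tail` is given in colorspace. One caveat: each color encodes a pair of adjacent bases, so the color spanning the insert/tail junction depends on the last insert base, and only the colors after it are truly constant. In that case the tail check would need to start one position later than in base space.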