The Sequence Read Archive (SRA) is NCBI's repository for published NGS data, and hence a great place to look for test datasets for trying out your algorithms of interest.
We are currently evaluating the mapping results of several tools that handle color space SOLiD data, and we would like to use reads available from the SRA, but all we find there are fastq files. If each run were a single fastq file containing all reads, we should be able to use it straight away (shouldn't we?), but we are getting triplets of files that we are not sure how to process.
One example is SRX004555 (AB SOLiD sequencing of Human HapMap individual NA18507 genomic paired-end library). When you download the data available for this experiment, you get 4 fastq file triplets, 1 triplet per run, and this is where we are not sure how to proceed: should we map each file independently and merge the results? Or should we concatenate the fastq files into a single massive one and then map that?
PS: does anybody know whether csfasta and qual files are present in the SRA? Where could one obtain such data? The only site we have found is the SOLiD website itself, but it does not offer many datasets.
Why are you concatenating the paired-end read files instead of mapping them in paired-end mode?
We are looking for features directly in sequences that cannot be inferred from non-overlapping paired-end reads.
"The files with no suffix from a triplet are typically smaller in size, have sequences with high error rates, and my understanding is that they are generated from calibration stages of the sequencing run." this was in fact what was confusing us, and unfortunatelly we haven't find any documentation on this matter. we will try to work then with _1 and _2 files only in paired-end mode for the mappers we are evaluating, leaving that smaller size file apart. we hope not to experience the mentioned "gotcha" since we won't concatenate the files, but thanks a lot for mentioning it.