Question

How To Process Data From The Sequence Read Archive?

1

14.7 years ago

Jorge Amigo 14k

The Sequence Read Archive (SRA) is NCBI's repository for publishing NGS data, and hence a great place to look for test datasets when trying out algorithms of interest.

We are currently trying to compare the mapping results of several tools that handle color-space SOLiD data, and we would like to use reads available from the SRA, but all we find there are fastq files. If each run were just a single fastq file containing all the reads, we should be able to use it straight away (shouldn't we?), but instead we are getting triplets of files that we are not sure how to process.

An example case is SRX004555 (AB SOLiD sequencing of Human HapMap individual NA18507 genomic paired-end library). When you try to download the data available for this experiment, you will find 4 fastq file triplets, 1 triplet per experiment run, and this is where we are not sure how to proceed: should we map each file independently and join the results? Or should we join the fastq files into a single massive one and then map that?

PS: Does anybody know if csfasta and qual files are present in the SRA? Where could one obtain such data? The only site we have found is the SOLiD website itself, but it does not offer many datasets.

sra solid mapping fastq • 7.9k views

ADD COMMENT • link updated 11.7 years ago by Biostar 20 • written 14.7 years ago by Jorge Amigo 14k

score 3 · Answer 1 · 2010-10-26

3

14.7 years ago

Casey Bergman 18k

For a multi-run accession, we typically concatenate all the fastq files from an SRA accession before mapping. I'm not 100% sure about SOLiD data, but for Illumina data in the SRA, the _1 and _2 suffixes refer to the two files of paired reads sequenced from the same set of fragments. The files with no suffix from a triplet are typically smaller in size, have sequences with high error rates, and my understanding is that they are generated from calibration stages of the sequencing run.
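The concatenation step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `concatenate_fastq` and the file names in the comment are hypothetical, and the fastq files are assumed to be uncompressed. A plain shell `cat` would do the same job.

```python
def concatenate_fastq(inputs, output, chunk_size=1 << 20):
    """Append each input fastq file to `output`, in the order given.

    Assumes uncompressed fastq files. A newline is added after any
    non-empty input that lacks a trailing one, so that the last record
    of one file does not run into the first header of the next.
    """
    with open(output, "wb") as out:
        for path in inputs:
            last_byte = b"\n"  # an empty input needs no extra newline
            with open(path, "rb") as src:
                # Copy in chunks so large files are never held in memory.
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")


# Hypothetical usage: merge the _1 files of two runs of one accession.
# concatenate_fastq(["runA_1.fastq", "runB_1.fastq"], "all_runs_1.fastq")
```

Concatenating the _1 files and the _2 files separately, with the runs in the same order, keeps the two outputs usable as a pair.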

One important gotcha we've experienced with Illumina paired-end data from SRA is that the two reads from the same fragment in the _1 and _2 files have the same name in the fastq header. So if you simply concatenate and map, you will have multiple reads with the same name in your output. It's not clear how different mapping software copes with this, so we typically rewrite all fastq headers before mapping to include "_1" and "_2" if names from these files are not unique. I'd recommend checking whether this is true in your case and reporting back to Biostar to see if this is a general issue with SRA.
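A header rewrite of this kind might look like the Python sketch below. It is not the exact script referred to above: the function name `add_mate_suffix` is hypothetical, and the code assumes uncompressed fastq files with strict four-line records (no wrapped sequence or quality lines). It also reduces the "+" separator line to a bare "+", a choice made here so that a repeated read name on that line cannot disagree with the new header.

```python
def add_mate_suffix(in_path, out_path, suffix):
    """Copy a fastq file, appending `suffix` to every read name.

    Assumes four lines per record: header, sequence, '+' line, qualities.
    Only the first whitespace-delimited token of the header (the read
    name) is changed; any description after it is kept as-is.
    """
    with open(in_path) as src, open(out_path, "w") as out:
        for i, line in enumerate(src):
            position = i % 4
            if position == 0:
                # Header line, e.g. "@name description"
                name, sep, rest = line.rstrip("\n").partition(" ")
                line = f"{name}{suffix}{sep}{rest}\n"
            elif position == 2:
                # Separator line: drop any repeated read name.
                line = "+\n"
            out.write(line)


# Hypothetical usage on the two mate files of one run:
# add_mate_suffix("run_1.fastq", "run_1.renamed.fastq", "_1")
# add_mate_suffix("run_2.fastq", "run_2.renamed.fastq", "_2")
```

After renaming, the two files can be concatenated without producing duplicate read names.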

ADD COMMENT • link 14.7 years ago by Casey Bergman 18k

0

Why are you concatenating the paired-end read files instead of mapping them in paired-end mode?

ADD REPLY • link 14.7 years ago by Aaron Statham ★ 1.1k

0

We are looking for features directly in sequences that cannot be inferred from non-overlapping paired-end reads.

ADD REPLY • link 14.7 years ago by Casey Bergman 18k

0

"The files with no suffix from a triplet are typically smaller in size, have sequences with high error rates, and my understanding is that they are generated from calibration stages of the sequencing run." This was in fact what was confusing us, and unfortunately we haven't found any documentation on the matter. We will then try to work with the _1 and _2 files only, in paired-end mode, for the mappers we are evaluating, leaving the smaller file aside. We hope not to run into the mentioned "gotcha" since we won't concatenate the files, but thanks a lot for mentioning it.

ADD REPLY • link 14.7 years ago by Jorge Amigo 14k