Question

What Is a "Spot" in SRA Format

24


13.6 years ago

Daniel Standage 4.1k

I'm using the SRA toolkit to convert some SRA files to FASTQ format. I've been looking at the documentation to make sure I'm doing things right, and the word "spot" keeps coming up. My question is twofold.

1. What is a spot and how does it differ from a read?
2. Where is this (officially) documented (or is it)?

The reason I've separated these two questions is that I think I know the answer to the first one, but I'm not sure and I can't find the answer in any of the documentation or online. Also, I expect more people will know the answer to question #1 than question #2.

sra fastq format • 34k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 13.6 years ago by Daniel Standage 4.1k

7


Hi, I agree with Stefano: a spot does contain more than a read. I didn't find any official documentation to confirm this, but when we use fastq-dump to convert an SRA file into a FASTQ file, it reports on completion "Written 38424688 spots for SRR032.sra". Now if we look at the FASTQ file, each read has three more lines attached to it, and the record starts with @. Something like this:

@SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50
GGCCAGCTCTACACCTTCAAGGCCGAGACGGAGGAGCTGAAGGGANGCTG
+SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50
BBB@=@BBBABBBBBBBB>0>BBB@6@A?446/8+;AAA@=9(7-!817&

In total, each record has 4 lines. Now count the number of lines in your FASTQ file and divide it by 4. That gives the same number as I mentioned above, i.e. 38424688 (it will of course be different for different files). So a spot corresponds to 4 lines in the FASTQ file, of which the read is one part.
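The line-count check described above can be sketched in Python. This is only an illustration, assuming an uncompressed FASTQ file with exactly four lines per record (no wrapped sequence lines); the function name `count_fastq_records` is made up for this example. The result should match the "Written N spots" message only when each spot was written as a single record.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ text handle, assuming 4 lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# The single record quoted above, used as a tiny in-memory example.
example = io.StringIO(
    "@SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50\n"
    "GGCCAGCTCTACACCTTCAAGGCCGAGACGGAGGAGCTGAAGGGANGCTG\n"
    "+SRR032238.12186 HWI-EAS6:3:1:246:1981 length=50\n"
    "BBB@=@BBBABBBBBBBB>0>BBB@6@A?446/8+;AAA@=9(7-!817&\n"
)
print(count_fastq_records(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` object.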

Hope this helps

ADD REPLY • link 13.2 years ago by Varun Gupta ★ 1.3k

1


Isn't your explanation a roundabout way of saying that the number of spots is exactly the same as the number of reads? That is not always true.

ADD REPLY • link 9.4 years ago by rdbcasillas11 ▴ 10

10


13.6 years ago

Stefano Berri 4.4k

Hi.

I think a "spot" is where the read comes from. The spot might contain more than the read. The difference is that the "spot" could include all the "technical" information (adapters, tags, barcoding sequences), whereas the read is the actual biological sequence you are after. In many cases, however, spot and read coincide.

I don't know of any official documentation: the closest I could get is the description of how to make the XML files associated with the submission.

Good luck! If you discover anything in this regard, post it!


7


8.7 years ago

rrr ▴ 100

Official (but not as helpful as the above) explanations are here:

http://www.ncbi.nlm.nih.gov/books/NBK54984/

http://www.ncbi.nlm.nih.gov/books/NBK47533/

So a spot is all the info you got from one "spot" on the flow cell. This is not the same as reads. You get 4 reads per spot with today's Illumina sequencing: forward barcode, forward read, reverse barcode, reverse read. Straight out of the instrument, these are 4 different files. Straight out of the SRA database, you specify which parts you want. SRA links all those reads together with a "spot" identifier, and you can use that to match up paired reads later. At least that is my interpretation of their and your descriptions.
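Matching up paired reads by spot identifier could look roughly like the sketch below. It assumes two FASTQ files (one per mate) in which both mates of a spot share the first whitespace-delimited token of the header line, as in `@SRR032238.12186 ...`; that layout and the helper names `read_fastq` and `pair_by_spot` are assumptions made for illustration, not documented SRA behaviour.

```python
import io


def read_fastq(handle):
    """Yield (spot_id, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of input (or a trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        # Assumed spot identifier: first token after the leading '@'.
        spot_id = header[1:].split()[0]
        yield spot_id, seq, qual


def pair_by_spot(forward_handle, reverse_handle):
    """Return {spot_id: ((fwd_seq, fwd_qual), (rev_seq, rev_qual))}."""
    reverse = {sid: (seq, qual) for sid, seq, qual in read_fastq(reverse_handle)}
    pairs = {}
    for sid, seq, qual in read_fastq(forward_handle):
        if sid in reverse:  # keep only spots present in both files
            pairs[sid] = ((seq, qual), reverse[sid])
    return pairs
```

This holds the whole reverse file in memory, which is fine for a small test but not for a full run; for large files, reading both files in lockstep would be the more usual choice.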


Ram · Accepted Answer · 2015-05-18

This is the description I received from the SRA staff (Adam Stine).

The spot model is Illumina GA centric. The flowcells have the locations where the adapters have stuck them to the glass of the lane. There are X and Y coordinates that identify these 'spots'. As the camera reads the fluorescent flashes during sequencing, the coordinates indicate which spot the new base is added to. All of the bases for a single location constitute the spot.

There may be one or more divisions of those bases for technical reads (adapters, primers, barcodes, etc.) and there will always be at least one biological read (forward, reverse). I usually think of the technical reads as the "known" sequence and the biological as the "unknown".

When we store the data, the bases for a single spot are all stored as one string with the description of where the breaks occur as well as the type of read each segment represents. The spot length is the expected total length for all reads (used as a check to make sure we have all the data). As an example, a 2x150 run with a 6bp barcode and 12bp primer on the forward read would have 4 reads.

0 - barcode basecoord 1

1 - primer basecoord 7

2 - forward basecoord 19

3 - reverse basecoord 151
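The base coordinates listed above follow from the segment lengths, which a short Python sketch can make explicit. The lengths are read off the example (6 bp barcode, 12 bp primer, then a forward segment of 150 - 18 = 132 bp so that the reverse read starts at 151); the forward length and the function names `base_coords` and `split_spot` are assumptions for illustration, not SRA code.

```python
def base_coords(read_lengths):
    """Return the 1-based start coordinate of each consecutive read in a spot."""
    starts = []
    position = 1
    for length in read_lengths:
        starts.append(position)
        position += length
    return starts


def split_spot(bases, read_lengths):
    """Cut one spot string into its reads, checking the expected spot length."""
    if len(bases) != sum(read_lengths):
        raise ValueError("spot length does not match the sum of read lengths")
    reads = []
    offset = 0
    for length in read_lengths:
        reads.append(bases[offset:offset + length])
        offset += length
    return reads


# Layout assumed from the 2x150 example above.
layout = [("barcode", 6), ("primer", 12), ("forward", 132), ("reverse", 150)]
lengths = [length for _, length in layout]
for index, ((name, _), start) in enumerate(zip(layout, base_coords(lengths))):
    print(f"{index} - {name} basecoord {start}")
```

The length check in `split_spot` mirrors the role of the spot length described above: a mismatch means some of the data is missing or the layout is wrong.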

But you only need to tell SRA about the barcode and primer if you submit sequences that contain them. In my case, a third party provided me with the BAM files and I do not have the untrimmed sequences.

So the spot data model is useful for supplying untrimmed BAM files, while still enabling you to specify where the biological reads begin.

In my case, I have 2x100 bp without an index, and I am only supplying the application reads with the adapters trimmed, so I simply submit:

0 - forward basecoord 1 (Application read)

1 - reverse basecoord 101 (Application read)