Question

Understanding ABI Sanger File Format


3.0 years ago

oludhe ▴ 90

Hi there,

I am trying to better understand the ABI/AB1 file format used for Sanger sequencing. As I understand it, after reading a raw AB1/ABI file into Python, I can access the channel-corrected values for the different bases in ['DATA9'] to ['DATA12'], and work out which base belongs to which channel through ['FWO_1'], i.e. GATC or ATGC.

However, the file I am looking at has 412 bases in the output from SnapGene, yet has 4950 data records for each channel in the raw ['DATA9'] to ['DATA12']. How do you convert those 4950 data records into bases correctly? Is there some form of normalisation, e.g. every 10 raw records give the data for 1 base position? If so, do you take the average over the 10 records or the highest peak within them, with the channel that has the highest peak/average giving the correct base? Do you start at the beginning? And does it make sense to convert this figure into a quality score similar to NGS, and if so, would you base it on the highest peak in the 10-record interval or on the average?

I hope this question makes sense.

Thanks

Sequencing Sanger Python ABI • 2.3k views

updated 3.0 years ago by trausch ★ 1.9k • written 3.0 years ago by oludhe ▴ 90

score 0 · Answer 1 · 2022-06-29

The raw signal for each base is sampled at 4950 points, and the base caller essentially looks for peaks in these signals to call the 412 bases. Tracy can dump the raw signal and basecalls into a simple tab-delimited text file or JSON, which is probably easier to parse in Python.

tracy basecall -f tsv -o out.tsv input.ab1
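
To make the "looking for peaks" idea concrete, here is a minimal Python sketch, not how any real base caller works internally. It assumes you already have the peak positions (ABIF files normally store the instrument's own peak locations in PLOC2 and its basecalls in PBAS2) and simply reports which channel is strongest at each peak. The function names call_bases_at_peaks and load_abif_fields are hypothetical, and the Biopython part assumes the "abi" parser exposes the raw fields under annotations["abif_raw"]. Note that 4950 / 412 is roughly 12 samples per base on average, but the peaks are generally not evenly spaced, so fixed 10-record windows are unlikely to line up with the bases.

```python
def call_bases_at_peaks(traces, channel_order, peak_positions):
    """Return the base whose channel is strongest at each peak position.

    traces: four equal-length intensity sequences in file order
            (e.g. DATA9, DATA10, DATA11, DATA12).
    channel_order: base letter for each trace, e.g. "GATC" (from FWO_1).
    peak_positions: sample indices of the called peaks (e.g. from PLOC2).
    """
    called = []
    for pos in peak_positions:
        heights = [trace[pos] for trace in traces]
        strongest = max(range(len(traces)), key=lambda i: heights[i])
        called.append(channel_order[strongest])
    return "".join(called)


def load_abif_fields(path):
    """Hypothetical helper: read the relevant ABIF fields with Biopython."""
    from Bio import SeqIO  # imported here so the rest runs without Biopython

    raw = SeqIO.read(path, "abi").annotations["abif_raw"]
    traces = [raw[f"DATA{n}"] for n in (9, 10, 11, 12)]
    order = raw["FWO_1"]
    if isinstance(order, bytes):  # newer Biopython versions return bytes
        order = order.decode()
    return traces, order, raw["PLOC2"], raw["PBAS2"]


# Tiny synthetic example: three peaks at samples 1, 4 and 7.
traces = [
    [0, 9, 0, 0, 0, 0, 0, 0, 0],  # G channel
    [0, 0, 0, 0, 8, 1, 0, 0, 0],  # A channel
    [0, 1, 0, 0, 0, 0, 0, 7, 0],  # T channel
    [0, 0, 0, 0, 1, 0, 0, 0, 0],  # C channel
]
print(call_bases_at_peaks(traces, "GATC", [1, 4, 7]))
```

On a real file you could compare the result against the stored PBAS2 string; disagreements are expected wherever the stored call is an N or an ambiguity code, or where the base caller weighed more than the raw peak height. Per-base quality values, if the file contains them, are normally in PCON2, so there is usually no need to derive your own from the peak heights.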