Hi there,
I am trying to understand the Sanger sequencing ABI/AB1 file format better. As I understand it, after reading a raw AB1/ABI file into Python, I can access the channel-corrected values for the different bases in ['DATA9'] to ['DATA12'], and work out which base belongs to which channel through ['FWO_1'], i.e. GATC or ATGC.
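For context, this is roughly how I pair each trace with its base. It is only a minimal sketch: it assumes the file is read with Biopython (`SeqIO.read(path, "abi")`, raw tags under `record.annotations["abif_raw"]`), that ['DATA9'] to ['DATA12'] follow the letter order given in ['FWO_1'], and the helper names `traces_by_base` and `load_abif_raw` are just mine for illustration.

```python
def traces_by_base(abif_raw):
    """Map each base letter to its trace, using the letter order in FWO_1.

    Assumption: DATA9..DATA12 are stored in the same order as the
    letters in FWO_1 (e.g. "GATC" -> DATA9 is G, DATA10 is A, ...).
    """
    order = abif_raw["FWO_1"]
    # Depending on the reader/version, the tag may come back as bytes.
    if isinstance(order, bytes):
        order = order.decode("ascii")
    channels = ["DATA9", "DATA10", "DATA11", "DATA12"]
    return {base: list(abif_raw[ch]) for base, ch in zip(order, channels)}


def load_abif_raw(path):
    """Read an AB1 file and return its raw tag dictionary.

    Assumption: Biopython is installed; imported lazily so the helper
    above can be used without it.
    """
    from Bio import SeqIO

    record = SeqIO.read(path, "abi")
    return record.annotations["abif_raw"]
```

With my file, `traces_by_base(load_abif_raw("my_read.ab1"))` (the file name is a placeholder) gives four lists of 4950 values each, one per base.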
However, the file I am looking at has 412 bases in the SnapGene output, yet each channel in the raw ['DATA9'] to ['DATA12'] has 4950 data records. How do you convert those 4950 data records into bases correctly? Is there some form of normalisation, e.g. every 10 raw records give the data for 1 base position? If so, do you take the average over the 10 records or the highest peak within them, with the channel that has the highest peak/average being the correct base? Do you start at the beginning or do you ... Does it also make sense to convert this figure into a quality score similar to NGS, and would you use the highest peak in the 10-record interval or the average?
I hope this question makes sense.
Thanks