If I have multiple genomic DNA sequences for a small protein, and want to represent this variation in a file, are there any existing formats to do this?
I could save all the sequences one per line and re-calculate the variation information each time, but this is a waste of computational resources.
Before I create YASF (yet-another-sequence-format), I was wondering whether anyone knows of an existing one?
It should ideally be able to represent A|C|G|T at each position, with 3 or 4 optional floats giving the relative abundance of each base. I wouldn't want to store the floats where there is no variation at a given position in the sequence.
If it could handle gaps/insertions that would be useful too!
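To make the requirement concrete, here is a minimal Python sketch of the kind of per-position record described above. It assumes the sequences have already been aligned to equal length, with `-` marking a gap; the function name `column_profile` and the output layout are purely illustrative and not taken from any existing format.

```python
from collections import Counter


def column_profile(aligned_seqs):
    """Summarise an alignment column by column.

    Invariant columns are stored as a single character. Variable columns
    are stored as a dict mapping each symbol (A, C, G, T or '-') to its
    relative abundance, so floats are only kept where variation exists.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    total = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        if len(counts) == 1:
            # No variation at this position: store the symbol only.
            profile.append(column[0])
        else:
            # Variation: store the relative abundance of each symbol seen.
            profile.append(
                {symbol: count / total for symbol, count in sorted(counts.items())}
            )
    return profile


# Hypothetical example data: four aligned reads, one with a gap.
example = ["ACGT", "ACGA", "A-GT", "ACGT"]
print(column_profile(example))
```

Only the second and fourth columns of the example carry floats; the other two collapse to a single character.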
[Edit] I should have mentioned that the proteins in question are going to be antibodies, and so the data will consist of large numbers of different sequences based on similar VDJ recombination building-blocks with somatic hypermutation providing a vast number of similar, yet different, DNA sequences.
I think VCF is going to be what you want to use here.
I think MAF is going to be more appropriate than VCF, as the data I will be using will have ambiguous or unknown positional information in the chromosome, because random nucleotides are inserted or deleted during B-cell development.