Hi everyone
Hi!
I am trying to understand pangenomic data/file formats. In vg's
description of its file formats, it says GAM is similar to SAM/BAM and
is usually in binary (compressed) form. Is there a difference between
SAM and GAM in terms of format?
yes. a .sam or .bam is a linear alignment map. a .gam is a graph-based alignment map. this accounts for the differences in the structure/organization of the file; that is, .gam files contain additional fields that describe the path a read takes through the pangenome graph, including the nodes visited and the edges traversed. these extra fields allow the GAM to store complex paths that are non-linear in structure.
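to make the linear vs. graph distinction concrete, here is a minimal C++ sketch. all of the struct and field names below are hypothetical and chosen only for illustration; they are not vg's actual schema, and a real GAM record carries more than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified record types for illustration only.
// These are NOT the real SAM or GAM layouts.

// Linear alignment (the SAM/BAM idea): one reference sequence, one coordinate.
struct LinearAlignment {
    std::string read_name;
    std::string reference_name;  // e.g. "chr1"
    std::int64_t position;       // leftmost position on that reference
    std::string cigar;           // match/insert/delete operations
};

// One step of a read's walk through a graph.
struct NodeVisit {
    std::int64_t node_id;  // which graph node the read touches
    std::int64_t offset;   // where on that node this step starts
    bool is_reverse;       // true if the node is traversed in reverse
};

// Graph alignment (the GAM idea): an ordered list of node visits.
struct GraphAlignment {
    std::string read_name;
    std::vector<NodeVisit> path;
};

// Each pair of consecutive node visits implies one edge traversal.
std::size_t edges_traversed(const GraphAlignment& aln) {
    return aln.path.empty() ? 0 : aln.path.size() - 1;
}
```

the point is just that a linear record needs a single (reference, position) pair, while a graph record needs a whole list of steps, which is why the two formats cannot share one layout.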
Are gam files always compressed?
a .bam is a binary sam file; a .sam file should be plain text. by contrast, a .gam is usually binary, but can be converted to a readable format (like json). there are also further compressions of .bam files. suppose you are running a clinical sequencing operation, and you need to keep hundreds of (very large) .bam files in storage long-term. in this case, you may want to try to compress them further. there are CRAM files and other compression methods that are used for this.
If not, how can we know if it is binary or not?
there are lots of ways. probably the simplest thing is something like cat myfile.gam. if that outputs nonsense, it's binary, but this isn't the most exact method. it's better, if you're in a linux environment, to use something like:
file myfile.gam
which will report something like "data" or "gzip compressed data" for a binary file, or "ASCII text" for plain text. alternatively, you can use vg itself; something like vg view -a myfile.gam will output the file as a json serialization.
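if you want the same kind of check from inside a program, here is a rough C++ sketch of what the file command does in spirit: read the first few bytes and look at them. the function and enum names are made up for this example. it assumes that a compressed file starts with the gzip magic bytes 0x1f 0x8b (true for gzip and for the BGZF blocks that .bam uses; check your own .gam files before relying on it), and it treats anything outside plain ASCII as binary, so UTF-8 text would be misreported.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical names, for illustration only.
enum class FileKind { Unreadable, Gzip, Text, OtherBinary };

// Classify the leading bytes of a file.
// An empty buffer is reported as Text, since nothing contradicts it.
FileKind classify_bytes(const std::vector<unsigned char>& head) {
    // gzip (and gzip-compatible BGZF) streams start with 0x1f 0x8b.
    if (head.size() >= 2 && head[0] == 0x1f && head[1] == 0x8b) {
        return FileKind::Gzip;
    }
    for (unsigned char c : head) {
        const bool printable = (c >= 0x20 && c < 0x7f);
        const bool whitespace = (c == '\n' || c == '\r' || c == '\t');
        if (!printable && !whitespace) {
            return FileKind::OtherBinary;  // control or non-ASCII byte
        }
    }
    return FileKind::Text;
}

// Read up to max_bytes from the start of the file and classify them.
FileKind classify_file(const std::string& path, std::size_t max_bytes = 512) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return FileKind::Unreadable;
    }
    std::vector<unsigned char> head(max_bytes);
    in.read(reinterpret_cast<char*>(head.data()),
            static_cast<std::streamsize>(head.size()));
    head.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return classify_bytes(head);
}
```

you would call classify_file("myfile.gam") and branch on the result. note this only tells you whether the bytes look compressed, binary, or like text; it does not parse the contents.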
hope it helps!
Thanks a lot, it is really helpful. Actually, I am trying to detect whether the file is binary from inside C++ code, so which method would be better to use there? Can I use the same parsing in the code as for the sam format?