I love .gfa files, but sometimes I have trouble understanding them.
I ran Flye on PacBio reads with default options as a first attempt at assembling a linear bacterial genome. The genome probably contains some plasmid or phage sequences. Flye gave me the following .gfa:
The green, yellow and red edges have been merged into one scaffold of around 9 Mb (as expected), and the little blue edge is its own contig of 42 kb.
In its assembly_info.txt file, Flye reports that the big scaffold is indeed not circular, and that the tiny one is circular:
#seq_name length cov. circ. repeat mult. alt_group graph_path
scaffold_2 8926751 84 N N 1 * *,1,2,4,??,4,-3,-1,*
contig_4 42310 940 N Y 12 * 4
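To make the columns easier to read, here is a minimal sketch of how these rows could be parsed. It assumes whitespace-separated columns exactly as pasted above (the real file may be tab-separated, which `split()` also handles); the names `SeqInfo` and `parse_assembly_info` are my own and not part of Flye.

```python
from typing import NamedTuple


class SeqInfo(NamedTuple):
    """One row of the table (field names are hypothetical, chosen for readability)."""
    name: str
    length: int
    coverage: int
    circular: bool
    repeat: bool
    multiplicity: int
    alt_group: str
    graph_path: list[str]


def parse_assembly_info(text: str) -> list[SeqInfo]:
    """Parse the rows of the table, skipping the '#' header line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, length, cov, circ, repeat, mult, alt, path = line.split()
        records.append(
            SeqInfo(
                name=name,
                length=int(length),
                coverage=int(cov),
                circular=(circ == "Y"),
                repeat=(repeat == "Y"),
                multiplicity=int(mult),
                alt_group=alt,
                graph_path=path.split(","),  # signed edge ids, '*' and '??' kept as-is
            )
        )
    return records


# The two rows pasted above, used as example input.
EXAMPLE = """\
#seq_name length cov. circ. repeat mult. alt_group graph_path
scaffold_2 8926751 84 N N 1 * *,1,2,4,??,4,-3,-1,*
contig_4 42310 940 N Y 12 * 4
"""

records = parse_assembly_info(EXAMPLE)
for rec in records:
    print(
        rec.name,
        "circ=Y" if rec.circular else "circ=N",
        "repeat=Y" if rec.repeat else "repeat=N",
        f"mult={rec.multiplicity}",
        rec.graph_path,
    )
```

Parsed this way, it is easy to see that edge 4 appears twice in the graph path of scaffold_2, on either side of the `??` entry.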
I have a few questions about this graph that puzzle me:
- Why does the .gfa connect all the edges into a circle if Flye reports that only one piece is circular and not the other?
- The yellow edge is connected by the same end to both the green and the red edges. And while the mean coverage of green and red is 70X, it is only 17X for the yellow edge. What could this mean? I am very puzzled that it is connected by the same end. Could it be Flye trying to circularize it? Or some sort of structural variant (SV)? I think the DNA provided comes from a single colony, so I don't see how that could be an SV.
- The blue edge has a connection to itself, which I imagine is because of repeats. But in the graph, this repeat is somehow connected to the other edges. So why was it split off as a separate contig in the .fasta file in the end?
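To check the "same end" observation outside of a graph viewer, here is a rough sketch of how the link (`L`) lines of a GFA 1 file could be turned into a list of which segment ends touch which. The toy graph below is hypothetical and is not my actual Flye output; the `edge_N` names and the helper `link_ends` are assumptions for illustration. It relies only on the GFA 1 convention that a link leaves a `+` segment from its end and enters a `+` segment at its start (reversed for `-`).

```python
from collections import defaultdict


def link_ends(gfa_text):
    """Map (segment, 'start'|'end') to the list of (segment, end) pairs linked to it."""
    ends = defaultdict(list)
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "L":
            continue  # only link lines describe connections
        src, src_orient, dst, dst_orient = fields[1:5]
        # A link leaves a '+' source from its end, a '-' source from its start.
        src_end = "end" if src_orient == "+" else "start"
        # A link enters a '+' target at its start, a '-' target at its end.
        dst_end = "start" if dst_orient == "+" else "end"
        ends[(src, src_end)].append((dst, dst_end))
        ends[(dst, dst_end)].append((src, src_end))
    return ends


# Hypothetical toy graph: two segments both attach to the start of edge_2.
TOY_GFA = "\n".join(
    "\t".join(row)
    for row in [
        ("S", "edge_1", "*"),
        ("S", "edge_2", "*"),
        ("S", "edge_3", "*"),
        ("L", "edge_1", "+", "edge_2", "+", "0M"),
        ("L", "edge_3", "-", "edge_2", "+", "0M"),
    ]
)

ends = link_ends(TOY_GFA)
# Segment ends with more than one link are the branching points of the graph.
branching = {end: links for end, links in ends.items() if len(links) > 1}
for (segment, side), links in branching.items():
    print(f"{segment} {side} is linked to: {links}")
```

Running the same function on the real assembly graph would list every segment end that has more than one neighbour, which is where the yellow edge should show up if it really is attached by a single end.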