Does anyone have a script (or an awk one-liner) to extract FASTA sequences from a GFA file?
I've run miniasm on some PacBio reads, and I would like to extract the final unitigs from its output (a GFA file) in FASTA format!
updated 5.0 years ago by Ram • written 9.0 years ago by lh3
First, I'm sorry for posting a comment here after so many years!
I just started playing with PacBio reads though, and it's the first time I've come across the GFA format... So it appears that all you need are the "S" lines, but I was wondering whether this is entirely true. I mean, there are those "L" lines as well, linking certain segments together. Shouldn't we take these links into account and produce a FASTA file of linked segments?
Usually the links describe ambiguous points in the graph where the assembler has not been able to decide on the correct path between 2 or more segments.
If you want to join contigs/segments together before exporting to fasta you should use a scaffolding tool such as:
GraphUnzip (Detangles graph, can be used before scaffolding)
DENTIST
SAMBA
ntLINK
SLR
SGTK
Those use long reads (some also use Hi-C). Other good scaffolders are available that only use Hi-C data.
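Before reaching for a scaffolder, it can help to see how many segments are actually involved in links. Here is a minimal Python sketch, assuming a GFA1 file where "S" lines are `S <name> <sequence>` and "L" lines are `L <from> <orient> <to> <orient> <overlap>`, all tab-separated. The function names are made up for illustration and are not part of any tool mentioned above.

```python
from collections import Counter


def link_degrees(gfa_lines):
    """Return (segment names, Counter of how many L lines touch each segment)."""
    segments = []
    degree = Counter()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # S <name> <sequence> [optional tags]
            segments.append(fields[1])
        elif fields[0] == "L" and len(fields) >= 5:
            # L <from> <from_orient> <to> <to_orient> <overlap>
            degree[fields[1]] += 1
            degree[fields[3]] += 1
    return segments, degree


def unlinked_segments(gfa_lines):
    """Names of segments that no L line refers to."""
    segments, degree = link_degrees(gfa_lines)
    return [name for name in segments if degree[name] == 0]
```

Segments with no links can be exported as they are; segments that share links are the places where the assembler left the path undecided.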
Slightly modifying Heng's answer: if your sequence name is longer than 80 characters, piping the whole record through fold will wrap the header onto a new line, causing part of the header to be incorrectly read as sequence.
To avoid this, we can first print the FASTA header line and then separately wrap only the sequence lines with fold.
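For anyone who prefers a script to a one-liner, here is a rough Python equivalent of the same idea (not the awk/fold command itself): the header is written whole and only the sequence is wrapped. It assumes tab-separated GFA1 "S" lines of the form `S <name> <sequence>`, skips segments whose sequence is `*`, and uses 80 columns to mirror fold's default. The function names and file names are placeholders.

```python
def gfa_to_fasta(gfa_lines, width=80):
    """Yield FASTA lines for every S (segment) record in a GFA file."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # S lines are: S <name> <sequence> [optional tags]
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        if seq == "*":
            # "*" means the sequence is not stored in the GFA itself
            continue
        # The header is emitted whole, whatever its length
        yield ">" + name
        # Only the sequence is wrapped to the requested width
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


def write_fasta(gfa_path, fasta_path, width=80):
    """Convert the S lines of gfa_path into a FASTA file at fasta_path."""
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for out_line in gfa_to_fasta(gfa, width):
            fasta.write(out_line + "\n")
```

Usage would be something like `write_fasta("assembly.gfa", "unitigs.fa")`, with your own file names.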
Awesome!!
Thank you so much!
Hi,
could you explain what GFA stands for? I'm not familiar with this format.
GFA is the Graphical Fragment Assembly format. Here is a post by Heng Li (@lh3) on this: https://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format.