Question

GFA to Fasta file

6

Entering edit mode

9.7 years ago

marcela.uliano ▴ 90

Hey guys,

Does anyone have a script (or a awk line) to extract fasta reads from a GFA file?

I've run this miniasm in some Pacbio reads I have, and I would like to extract the final unitigs from the final file I have (which is a GFA file) in the fasta format!

Thanks a lot guys!

sequence • 21k views

ADD COMMENT • link updated 20 months ago by Adam Taranto ▴ 40 • written 9.7 years ago by marcela.uliano ▴ 90

1

Entering edit mode

Awesome!!

Thank you so much!

ADD REPLY • link updated 6.2 years ago by h.mon 35k • written 9.7 years ago by marcela.uliano ▴ 90

0

Entering edit mode

Hi,

could you describe what does GFA stand for? I don't know about this format.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by Atu • 0

0

Entering edit mode

GFA is the Graphical Fragment Assembly format. Here is a post by Heng Li (@lh3) on this: https://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format.

ADD REPLY • link 7.1 years ago by ropolocan ▴ 830

Ram · Answer 1 · 2015-12-14

14

Entering edit mode

9.7 years ago

lh3 33k

awk '/^S/{print ">"$2"\n"$3}' in.gfa | fold > out.fa

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by lh3 33k

1

Entering edit mode

First, I'm sorry for posting a comment here after so many years!

I just started playing with PacBio reads though, and it's the first time I came across the gfa format... So it appears that all you need is the "S" lines, but I was wondering whether this is entirely true. I mean there are those "L" lines as well, linking certain segments together. Shouldn't we take into account these links and produce a fasta file of linked segments?

ADD REPLY • link 3.2 years ago by Panos ★ 1.8k

0

Entering edit mode

Usually the links describe ambiguous points in the graph where the assembler has not been able to decide on the correct path between 2 or more segments.

If you want to join contigs/segments together before exporting to fasta you should use a scaffolding tool such as:

GraphUnzip (Detangles graph, can be used before scaffolding)
DENTIST
SAMBA
ntLINK
SLR
SGTK

Those use long reads (some also use Hi-C). Other good scaffolders are available that only use Hi-C data.

ADD REPLY • link 20 months ago by Adam Taranto ▴ 40

score 0 · Answer 2 · 2024-01-14

Slightly modifying Heng's answer: If your sequence name is longer than 80 chars using fold will wrap the header onto a new line causing part of the header to be incorrectly read as sequence.

To avoid this we can first print the fasta header line and then separately wrap the sequence lines with fold.

Support long sequence names

awk '/^S/{print ">"$2; printf "%s", $3 | "fold -w 80"; close("fold -w 80"); print ""}' in.gfa > out.fa

Include GFA tags in fasta header line

If you also want to preserve the gfa segment tags (i.e. segment length LN, or depth DP) in the fasta description you can append them to the headers:

awk '/^S/{header=">"$2; for(i=4; i<=NF; i++) {header=header" "$i}; print header; printf "%s", $3 | "fold -w 80"; close("fold -w 80"); print ""}' in.gfa > out.fa

Note: close() is required to prevent the headers and sequences from getting out of sync or being skipped when sequences are very long.