Question

Extract Sequences From Fasta Using Awk One-Liner

4

13.1 years ago

Newvin ▴ 360

Hi all. Really basic question here. I'd like to grab the sequences from a FASTA file with an AWK one-liner. To grab the headers, I can do:

awk < seq.fasta '/^>/ { print $0 }'

How do I negate this, so that it grabs the lines that do NOT begin with the '>' character. Feel free to chime in with other methods to solve the problem, but I'd like to learn an AWK-specific solution as I am trying to level up my AWK.

Thanks!

parsing fasta awk sequence • 18k views

ADD COMMENT • link updated 3.5 years ago by jena ▴ 320 • written 13.1 years ago by Newvin ▴ 360

0

IMHO, you really want to "level up" in regular expressions, not awk specifically. The more experience you develop with regex, the better you'll be able to apply it to awk, sed, and grep (as well as most programming languages).

ADD REPLY • link 13.1 years ago by Andrew Su 4.9k

score 8 · Answer 1 · 2011-11-09

8

13.1 years ago

Aaronquinlan 12k

awk < seq.fasta '!/^>/ { print $0 }'

or (preferred for clarity):

awk < seq.fasta '$0 !~ /^>/ { print $0 }'

or merely:

awk < seq.fasta '$0 !~ /^>/'

or grep

grep -v ^\> seq.fasta

or some people prefer "perl one liners" for this sort of thing because you can conceivably use Perl for awk-ish filters and for your day to day scripting.

perl -lne 'print if !($_ =~ /^\>/)' seq.fasta
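As a quick check of the negated pattern, here is a minimal sketch run on a made-up two-record file. The temporary file and its contents are assumptions for illustration, not data from the question:

```shell
# Build a tiny FASTA file; the records are invented for this demo.
demo=$(mktemp)
printf '>seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n' > "$demo"

# Keep only the lines that do NOT start with '>' (the sequence lines).
seqs=$(awk '!/^>/' "$demo")
echo "$seqs"

rm -f "$demo"
```

This should print the three sequence lines (ACGT, TTGA, GGCC) and none of the headers.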

ADD COMMENT • link 13.1 years ago by Aaronquinlan 12k

3

Thanks. I must have missed that in the documentation. How AWK-ward.

ADD REPLY • link 13.1 years ago by Newvin ▴ 360

1

Perl line is not quite right. Your command will print all lines that don't have '>' anywhere. To print just those lines that don't start with '>':

perl -lne 'print if !($_ =~ /^>/)' seq.fasta

ADD REPLY • link 13.1 years ago by Chris Maloney ▴ 360

0

right, thanks Chris. updated the perl example to match the awk regex.

ADD REPLY • link 13.1 years ago by Aaronquinlan 12k

0

Btw the shortest awk syntax is actually:

awk '!/^>/' seq.fasta

Note that awk can take a filename as an argument, but you could also use your < syntax.
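A small sketch of the two invocation styles, assuming two throwaway files created just for the demo. Passing filenames as arguments also lets a single awk call read several files in order:

```shell
# Two throwaway FASTA files; names and contents are invented for this demo.
a=$(mktemp)
b=$(mktemp)
printf '>seq1\nACGT\n' > "$a"
printf '>seq2\nGGCC\n' > "$b"

# Same program, file given as an argument vs. redirected to stdin.
from_arg=$(awk '!/^>/' "$a")
from_stdin=$(awk '!/^>/' < "$a")

# With arguments, several files can be processed in one call.
both=$(awk '!/^>/' "$a" "$b")

echo "$from_arg"
echo "$from_stdin"
echo "$both"

rm -f "$a" "$b"
```

Both single-file forms should give identical output; the last call prints the sequence lines of both files.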

ADD REPLY • link 3.5 years ago by jena ▴ 320

score 2 · Answer 2 · 2011-11-09

2

13.1 years ago

Pierre Lindenbaum 164k

awk '($0 ~ /^[^>]/)' < file.fasta
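One subtle difference worth knowing: /^[^>]/ requires at least one character on the line, so it also drops empty lines, whereas !/^>/ keeps them. A minimal sketch, assuming a demo file with a blank line in the middle of a record (the file is invented for illustration):

```shell
# Demo file with an empty line inside the sequence (invented data).
demo=$(mktemp)
printf '>seq1\nACGT\n\nTTGA\n' > "$demo"

# Count lines kept by each pattern; n+0 forces a numeric result.
with_neg=$(awk '!/^>/ { n++ } END { print n+0 }' "$demo")
with_class=$(awk '/^[^>]/ { n++ } END { print n+0 }' "$demo")

echo "negated pattern keeps:  $with_neg lines"
echo "character class keeps:  $with_class lines"

rm -f "$demo"
```

Here the negated pattern keeps 3 lines (including the empty one) and the character-class version keeps 2, which may or may not be what you want.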

ADD COMMENT • link 13.1 years ago by Pierre Lindenbaum 164k

score 1 · Answer 3 · 2011-11-09

1

13.1 years ago

User 4133 ▴ 150

You can also use this:

grep -v '>' file.fasta

On my blog you can find a comprehensive post about formatting and splitting FASTA files using Python scripts:

http://basicbioinformatics.blogspot.com/2011/10/split-fasta-file.html

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.1 years ago by User 4133 ▴ 150

1

A minor point, but you really want to ensure that the > starts at the beginning of the line, per the FASTA spec.
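To make the difference concrete, here is a rough sketch on an invented file where one header contains a second '>' and one sequence line has a stray '>' (which would normally mean the file is damaged). The last command is just one possible way to flag such lines:

```shell
# Invented demo file: '>' inside a header and inside a sequence line.
demo=$(mktemp)
printf '>seq1 sample>A\nACGT\nAC>GT\n>seq2\nGGCC\n' > "$demo"

# Unanchored: drops every line containing '>' anywhere.
unanchored=$(grep -v '>' "$demo")

# Anchored: drops only lines that start with '>'.
anchored=$(grep -v '^>' "$demo")

# Flag non-header lines that still contain '>' (line number and content).
stray=$(awk '!/^>/ && />/ { print NR ": " $0 }' "$demo")

echo "$unanchored"
echo "$anchored"
echo "$stray"

rm -f "$demo"
```

The unanchored grep silently removes the line AC>GT, the anchored one keeps it, and the awk check reports it as line 3.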

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.1 years ago by Aaronquinlan 12k

0

Is there any valid fasta where this is a problem? I mean, if the > is in the middle of the header line, then the header line still gets captured properly. If it's in the middle of your sequence, then you have a bigger problem (i.e. file corruption) on your hands.

I have the same question about # in the header of VCF files. I always filter on just # and so far have not been burned, but is there any realistic scenario where # can appear below the header in a valid VCF?

ADD REPLY • link 3.5 years ago by jena ▴ 320

0

I guess I can imagine somebody putting # into the CHROM, ID, or FILTER columns of a VCF, so maybe I should start doing the proper ^# thing. I still think this is a non-issue for FASTA, though. You either capture the header or you have a corrupted sequence, and neither is solved by searching for ^>. On the contrary, in fact, since you will not notice immediately that your sequence is corrupted. My 2 cents.
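A sketch of the VCF case, assuming a hypothetical record whose ID column contains a '#'. The file contents are made up purely to illustrate the scenario:

```shell
# Invented VCF with a '#' inside the ID column of the first record.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$vcf"
printf 'chr1\t100\trs#1\tA\tG\t50\tPASS\t.\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t60\tPASS\t.\n' >> "$vcf"

# Count data lines left after each filter.
kept_unanchored=$(grep -c -v '#' "$vcf")
kept_anchored=$(grep -c -v '^#' "$vcf")

echo "unanchored filter keeps: $kept_unanchored record(s)"
echo "anchored filter keeps:   $kept_anchored record(s)"

rm -f "$vcf"
```

With this input the unanchored filter keeps only 1 of the 2 records, while the anchored one keeps both.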

ADD REPLY • link 3.5 years ago by jena ▴ 320

0

i.e. grep -v "^>"

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.1 years ago by Neilfws 49k