Extract Sequences From Fasta Using Awk One-Liner
13.1 years ago
Newvin ▴ 360

Hi all. Really basic question here. I'd like to grab the sequences from a FASTA file with an AWK one-liner. To grab the headers, I can do:

awk < seq.fasta '/^>/ { print $0 }'

How do I negate this so that it grabs the lines that do NOT begin with the '>' character? Feel free to chime in with other methods to solve the problem, but I'd like to learn an AWK-specific solution, as I am trying to level up my AWK.

Thanks!

parsing fasta awk sequence • 18k views

IMHO, you really want to "level up" in regular expressions, not awk specifically. The more experience you develop with regex, the better you'll be able to apply it to awk, sed, and grep (as well as most programming languages).

13.1 years ago

awk < seq.fasta '!/^>/ { print $0 }'

or (preferred for clarity):

awk < seq.fasta '$0 !~ /^>/ { print $0 }'

or merely:

awk < seq.fasta '$0 !~ /^>/'

or grep

grep -v '^>' seq.fasta

or some people prefer "perl one liners" for this sort of thing because you can conceivably use Perl for awk-ish filters and for your day to day scripting.

perl -lne 'print if !($_ =~ /^\>/)' seq.fasta
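
To check that the awk forms above really are interchangeable, here is a minimal sketch that runs them against a tiny FASTA file. The file contents and the variable names are made up for illustration; only the awk programs come from the answer.

```shell
# Build a small, made-up FASTA file in a temporary directory (illustrative data only).
tmpdir=$(mktemp -d)
fasta="$tmpdir/seq.fasta"
printf '>seq1 first record\nACGT\nTTGA\n>seq2 second record\nGGCC\n' > "$fasta"

# Three ways to keep only the lines that do not start with '>'.
out_bang=$(awk '!/^>/ { print $0 }' "$fasta")          # negated pattern
out_nomatch=$(awk '$0 !~ /^>/ { print $0 }' "$fasta")  # explicit "does not match" operator
out_default=$(awk '$0 !~ /^>/' "$fasta")               # no action block: awk prints the line by default

printf '%s\n' "$out_bang"
```

All three variables should hold the same sequence lines, since an awk pattern without an action prints the matching record.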

Thanks. I must have missed that in that documentation. How AWK-ward.


Perl line is not quite right. Your command will print all lines that don't have '>' anywhere. To print just those lines that don't start with '>':

perl -lne 'print if !($_ =~ /^>/)' seq.fasta


right, thanks Chris. updated the perl example to match the awk regex.


Btw the shortest awk syntax is actually:

awk '!/^>/' seq.fasta

Note that awk can take a filename as an argument, but you could also use your < syntax.

13.1 years ago

awk '($0 ~ /^[^>]/)' < file.fasta
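
One thing worth knowing about this positive-match form: it requires a first character that is not '>', so it also drops empty lines, whereas the negated form keeps them. A small sketch of the difference, assuming a made-up file with a blank line between records (file name and contents are hypothetical):

```shell
# Made-up FASTA with a blank line between the two records (illustrative data only).
tmpdir=$(mktemp -d)
fasta="$tmpdir/file.fasta"
printf '>seq1\nACGT\n\n>seq2\nGGCC\n' > "$fasta"

# Negated match: counts every line not starting with '>', including the blank one.
negated=$(awk '!/^>/ { n++ } END { print n + 0 }' "$fasta")

# Positive match on "first character is not '>'": a blank line has no first character, so it is not counted.
positive=$(awk '$0 ~ /^[^>]/ { n++ } END { print n + 0 }' "$fasta")

echo "negated form keeps $negated lines, positive form keeps $positive lines"
```

Whether dropping blank lines is a feature or a surprise depends on what consumes the output downstream.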

13.1 years ago
User 4133 ▴ 150

You can also use this:

grep -v '>' file.fasta

On my blog you can find a comprehensive post about formatting and splitting FASTA files using Python scripts:

http://basicbioinformatics.blogspot.com/2011/10/split-fasta-file.html


A minor point, but you really want to ensure that the > starts at the beginning of the line, per the FASTA spec.


Is there any valid fasta where this is a problem? I mean, if the > is in the middle of the header line, then the header line still gets captured properly. If it's in the middle of your sequence, then you have a bigger problem (i.e. file corruption) on your hands.

I have the same question about # in header of vcf files. I always filter just for # and so far no burns, but is there any realistic scenario where # can appear below header in a valid vcf?


I guess I can imagine somebody putting # into the CHROM, ID, or FILTER columns of a VCF, so maybe I should start doing the proper ^# thing. I still think this is a non-issue for FASTA though. You either capture the header or you have a corrupted sequence, and neither is solved by searching for ^>. On the contrary, in fact, since you will not notice immediately that your sequence is corrupted. My 2 cents.
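
The VCF scenario can be sketched as follows. The records are invented for illustration, and the '#' in the ID column is a hypothetical value chosen to trigger the problem, not something taken from a real file.

```shell
# Made-up VCF: two header lines, then two records, one with '#' in its ID column (hypothetical).
tmpdir=$(mktemp -d)
vcf="$tmpdir/sample.vcf"
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\trs#1\tA\tG\t.\tPASS\t.\nchr1\t200\t.\tC\tT\t.\tPASS\t.\n' > "$vcf"

# Unanchored filter: drops every line containing '#', including the first record.
unanchored=$(grep -c -v '#' "$vcf")

# Anchored filter: drops only lines that start with '#', so both records survive.
anchored=$(grep -c -v '^#' "$vcf")

echo "unanchored keeps $unanchored records, anchored keeps $anchored records"
```

With this input the unanchored filter silently loses a record, which is the kind of burn the anchored pattern avoids.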


i.e. grep -v "^>"
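
For what it's worth, both sides of this exchange can be seen in a small sketch. The file below is made up: the header contains a second '>' and one sequence line is deliberately corrupted with a stray '>'.

```shell
# Made-up FASTA: header with an extra '>' and a deliberately corrupted sequence line (illustrative only).
tmpdir=$(mktemp -d)
fasta="$tmpdir/file.fasta"
printf '>seq1 A>G variant\nACGT\nTT>GA\nGGCC\n' > "$fasta"

# Unanchored: removes the header, but also silently removes the corrupted sequence line.
unanchored=$(grep -v '>' "$fasta")

# Anchored: removes only the header; the corrupted line passes through to the output.
anchored=$(grep -v '^>' "$fasta")

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
```

The header is handled identically either way. The two only differ on the corrupted line: the unanchored filter hides it, the anchored one passes it along, and neither repairs the file.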

