Question

How to extract a sequence below a fasta header (">") from a fasta file ?

1

Entering edit mode

3.9 years ago

v.berriosfarias ▴ 140

Hello I have a lot of sequences in a FASTA file, and I want to extarct a specific sequence knowing the header ID. for example the header of a sequence is:

NODE_19_length_5758_cluster_19_candidate_1

I know that with grep I can extract the header, but i want the below sequences to appear on stdout.

How can I do this on bash?

fasta bash • 6.9k views

ADD COMMENT • link updated 3.9 years ago by ATpoint 85k • written 3.9 years ago by v.berriosfarias ▴ 140

0

Entering edit mode

$ seqkit grep -w 0 -ip NODE_19_length_5758_cluster_19_candidate_1 input.fa

for printing sequence only:

$ seqkit grep -ip NODE_19_length_5758_cluster_19_candidate_1 test.fa  | seqkit seq -s -w 0

ADD REPLY • link 3.9 years ago by cpad0112 21k

3

Entering edit mode

3.9 years ago

i.sudbery 20k

I'm not sure you can in pure bash, but you can with awk (I think).

If you make the record separator \n> rather than \n, and the field separator \n, I think you should be able to do:

awk -v RS="\n>" -v FS="\n" '$1=="NODE_19_length_5758_cluster_19_candidate_1" {print ">"$0}' my_fasta.fasta

ADD COMMENT • link 3.9 years ago by i.sudbery 20k

0

Entering edit mode

yes , this worked for me thanks so much :)

ADD REPLY • link 3.9 years ago by v.berriosfarias ▴ 140

0

Entering edit mode

can you explain the $0 and the $1 on this situation ? I'll be so grateful if you explain this

ADD REPLY • link 3.9 years ago by v.berriosfarias ▴ 140

1

Entering edit mode

In awk $X references "record X", this usually means "column X" when the record separator is something like space, tab or comma. So $1 is record/column 1. As we are using \n as a field separator, in this case it means "line 1" of the record.

$0 means the whole record.

ADD REPLY • link 3.9 years ago by i.sudbery 20k

0

Entering edit mode

Very interesting. I had never thought about using awk's record and separators like this to split files having grouped lines.

ADD REPLY • link 3.9 years ago by Jorge Amigo 14k

0

Entering edit mode

Just leaving a note about a problem with this:

Originally I set RS="\n>" this doesn't for the very first entry, because it starts with >, not \n. This means the record name of the first record will include >, while all the others exclude it.

You could set RS=">", but then if you have the symbol > anywhere else other than the start of a new record (e.g. if the title of a record has a mutaiton in it written as A>T) you'll get the records split in the wrong place.

As it is, I would just go with @ATPoints answer, which I had completely forgotten about when I wrote this.

ADD REPLY • link 3.9 years ago by i.sudbery 20k

0

Entering edit mode

another awk solution for flattend fasta file:

$ awk -v OFS="\n" '/^>/ {getline seq} {if ($0==">NODE_19_length_5758_cluster_19_candidate_1") print $0,seq}' test.fa

for printing sequence only:

$ awk -v OFS="\n" '/^>/ {getline seq} {if ($0==">NODE_19_length_5758_cluster_19_candidate_1") print seq}' test.fa

ADD REPLY • link 3.9 years ago by cpad0112 21k

1

Entering edit mode

3.9 years ago

jenn.drummond ▴ 100

The awk solution is more general, but for simplicity, if I have single-line rather than hard-wrapped FASTAs, I prefer to do this with grep -A and tail.

grep -wA 1 '>NODE_19_length_5758_cluster_19_candidate_1' example.fasta | tail -n 1

The -A 1 tells grep to return both the matched line and the line immediately after it, and then tail takes the last (second) line of that result. (The w is just for full-word matching in case you have similar sequence names that are subsets of each other.)

If you're sure that the entire sequence is on the next line, I find grep is easier to use in a parameterized loop than the awk version. I always get tangled up with the quoting. In fact, with the grep approach, you can even use a file to hold your list of desired headers. If you want both the headers and their sequences, just drop the second grep.

grep -wf list-of-headers.txt -A 1 example.fasta | grep -v '^>'

If your sequences are spread across multiple lines, then use one of the awk solutions, or "unwrap" your FASTA first with a tool like fastx_toolkit.

ADD COMMENT • link 3.9 years ago by jenn.drummond ▴ 100

1

Entering edit mode

3.9 years ago

Jorge Amigo 14k

I would definitely go for @ATpoint's samtools solution in general, as it should work always and it should be really fast.

If programming alternatives like @i.sudbery's awk solution in general or @jenn.drummond's grep if your fasta file is linearized would be considered, here is a perl alternative just for having where to choose:

perl -ne 'if (/^>NODE_19_length_5758_cluster_19_candidate_1/) { $o = 1; print; next } if ($o) { last if /^>/; print }' my_fasta.fasta

It should also work always, it allows you to use a sequence pattern rather than the entire sequence name (this could be convenient in certain cases), and it woud be faster than the first of samtools faidx. If any other sequences would be queried afterwards, having your fasta file already indexed would definitely be faster.

ADD COMMENT • link 3.9 years ago by Jorge Amigo 14k

score 11 · Accepted Answer · 2021-01-10

The proposed solutions are probably all fine but have the limitation that they first have to iteratively find the correct sequence which can take time if the file is (very) large. The by far simplest solution would be to use samtools faidx which in a first step indexes your fasta file and then can use this index to retrieve any sequence in basically no time. The index generation takes like 10 seconds for the entire mouse genome and is therefore not really a limitation, and it only has to be done once per fasta file:

samtools faidx your.fasta NODE_19_length_5758_cluster_19_candidate_1

For many sequences manually:

samtools faidx your.fasta seq1 seq2 seq(...N)

or if you want it automated

(upstream cmd creating "\n"-sep list of seq.names) | xargs samtools faidx your.fasta

Please be sure to browse existing threads, this has been asked many times before.

How To Extract A Sequence From A Big (6Gb) Multifasta File ?

How do I extract Fasta Sequences based on a list of IDs?