Question

Extract FASTA sequences using sequence headers

8.2 years ago

a.rex ▴ 350

I usually extract FASTA sequences with samtools.

For example, to extract the sequence for gene 000001:

samtools faidx /path/to/transcriptome 000001

However, I was wondering whether there is a better method for extracting isoform sequences. To extract the sequences for gene isoforms 000001.1, 000001.2 and 000001.3, I tried the following command, but to no avail:

samtools faidx /path/to/transcriptome 000001.*
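A likely reason this fails: samtools faidx treats each region argument as a literal sequence name rather than a pattern, and an unquoted `000001.*` may also be expanded by the shell if matching files exist in the working directory. A minimal sketch of one workaround, assuming the FASTA has already been indexed so that a `.fai` file sits next to it: select the matching names from the first column of the index and pass them to samtools faidx as separate regions. The function names `names_with_prefix` and `extract_by_prefix` are hypothetical helpers, not part of samtools.

```shell
# Hypothetical helper: list the sequence names in a .fai index that start
# with a literal prefix. Column 1 of a .fai file is the sequence name.
names_with_prefix() {
    # $1 = path to the .fai index, $2 = literal prefix such as "000001."
    awk -F'\t' -v p="$2" 'index($1, p) == 1 { print $1 }' "$1"
}

# Hypothetical wrapper: pass every matching name to samtools faidx as a region.
# Assumes "$1.fai" exists (created by a previous "samtools faidx $1").
# If nothing matches, xargs may still run samtools once with no regions.
extract_by_prefix() {
    # $1 = indexed FASTA file, $2 = literal prefix
    names_with_prefix "$1.fai" "$2" | xargs samtools faidx "$1"
}

# Example call (the path and output name are placeholders):
# extract_by_prefix /path/to/transcriptome 000001. > isoforms.fa
```

Matching on the literal prefix `000001.` (including the dot) avoids picking up names such as `0000010.1`.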

Does anyone have any tips on how to do this effectively?

gene • 1.8k views

updated 8.2 years ago by Jake Warner ▴ 840 • written 8.2 years ago by a.rex ▴ 350

BBMap's filterbyname tool will work like this:

filterbyname.sh in=transcriptome.fa out=filtered.fa include names=000001. substring=name

8.2 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-05-09

8.2 years ago

Jake Warner ▴ 840

You can do this with awk:

awk '/^>/{flag = ($0 ~ /^>000001\./)} flag' file.fasta > outfile.fasta
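As a self-contained illustration of the flag approach, here is a small sketch run against a made-up FASTA file. The file `example.fasta`, its contents, and the helper name `fasta_by_prefix` are assumptions for demonstration only. The helper sets a flag on each header line depending on whether the header starts with the given literal prefix, then prints every line while the flag is set, so multi-line sequences are kept whole.

```shell
# Made-up example FASTA (not from the original post).
cat > example.fasta <<'EOF'
>000001.1 isoform 1
ACGTACGT
ACGT
>000001.2 isoform 2
GGGTTT
>0000010.1 different gene
TTTT
>000002.1 other gene
CCCC
EOF

# Hypothetical helper: print every record whose header starts with ">" + prefix.
fasta_by_prefix() {
    # $1 = FASTA file, $2 = literal prefix such as "000001."
    # On each header line, decide whether to keep the record; then print
    # every line (header and sequence) while keep is true.
    awk -v p=">$2" '/^>/ { keep = (index($0, p) == 1) } keep' "$1"
}

# Prints the two 000001.* records and skips 0000010.1 and 000002.1.
fasta_by_prefix example.fasta 000001.
```

Using a literal prefix comparison rather than a regular expression means the dot in `000001.` is not treated as a wildcard.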
