Question

Extract N amino acids from fasta file

0

5.7 years ago

martha.chapa.mc18 • 0

Hi, I want to extract the first N amino acids from each sequence in a FASTA file. I have sequences like this:

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVK
QGVVHEPIDAGVMAGNVPAEQRNAASIVQSIDGSKLSQISDRLPKFIKQGSDEVVYGKHV
VVSKLGPEVIGLILDLIKAQPANRALLLAKLQAISNDGNPEASNFMGFVFEYGLFGAVKN

For example, I want only the first 30 aa of this sequence, like this:

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQ

Is there a program that can do this for all sequences from the Linux terminal? I hope you can help me. Thank you.

fasta sequence • 2.7k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 5.7 years ago by martha.chapa.mc18 • 0

0

You could convert to tabular format with seqkit and use the substring function from awk:

seqkit fx2tab file.fasta  | awk -v FS="\t" '{print ">"$1"\n"substr($2,1,30)}'

ADD REPLY • link 5.7 years ago by alex.zaccaron ▴ 480

1

seqkit subseq -r 1:30 is enough.

ADD REPLY • link 5.7 years ago by shenwei356 8.7k

0

Be careful! This approach makes a lot of assumptions about the structure of the FASTA file.

ADD REPLY • link 5.7 years ago by Alex Reynolds 36k

0

Yes, it does. Sorry, I thought the input file was tabular format. I updated the comment.

ADD REPLY • link 5.7 years ago by alex.zaccaron ▴ 480

score 1 · Answer 1 · 2019-12-09

1

5.7 years ago

finswimmer 16k

Using seqkit:

$ seqkit subseq -r 1:30 input.fasta

ADD COMMENT • link 5.7 years ago by finswimmer 16k

score 0 · Answer 2 · 2019-12-09

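This works when each sequence is on a single line (as in test.fa below); on wrapped FASTA it would truncate every sequence line separately:

awk '/^>/ {print; next} {print substr($0, 1, 30)}' test.fa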

test.fa

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVKQGVVHEPIDAGVMAGNVPAEQRNAASIVQSIDGSKLSQISDRLPKFIKQGSDEVVYGKHVVVSKLGPEVIGLILDLIKAQPANRALLLAKLQAISNDGNPEASNFMGFVFEYGLFGAVKN

output

>a47619p2-
MVKIALFGRNITLPILIFIGFVFLHDASAQ
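
If the input is wrapped over several lines, as in the question, the sequence lines need to be joined before truncating. Below is a minimal dependency-free sketch of that idea in Python. The helper names `read_fasta` and `first_n` are made up for illustration, and the parser assumes a plain FASTA file where every record starts with a `>` line.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_n(handle, n=30):
    """Return (header, first n residues) for every record in the handle."""
    return [(header, seq[:n]) for header, seq in read_fasta(handle)]


# Demo on the wrapped record from the question (first two sequence lines only).
wrapped = io.StringIO(
    ">a47619p2-\n"
    "MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVK\n"
    "QGVVHEPIDAGVMAGNVPAEQRNAASIVQSIDGSKLSQISDRLPKFIKQGSDEVVYGKHV\n"
)
for header, seq in first_n(wrapped, n=70):
    print(f">{header}\n{seq}")
```

The demo asks for 70 residues so that the result spans two input lines; for a real file, pass an open file handle instead of the `io.StringIO` object.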

score 0 · Answer 3 · 2019-12-10

With biopython:

# Usage: python3 scriptname.py file.fasta
import sys
from Bio import SeqIO

# Print each record's full header line followed by its first 30 residues.
for record in SeqIO.parse(sys.argv[1], "fasta"):
    print(f">{record.description}\n{record.seq[0:30]}")

Or as a one-liner:

$ python3 -c 'import sys; from Bio import SeqIO; [print(f">{i.description}\n{i.seq[0:30]}") for i in SeqIO.parse(sys.argv[1], "fasta")];' file.fasta

Replace [0:30] with whatever range you like (it doesn't have to start at zero either).
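
One thing to watch when changing the range: Python slices are 0-based and exclude the end index, whereas the `seqkit subseq -r 1:30` style used elsewhere in this thread is 1-based and includes both ends. A small sketch of the conversion follows; `region_to_slice` is a hypothetical helper, not part of Biopython or seqkit, and it only handles positive `start:end` pairs.

```python
def region_to_slice(region):
    """Convert a 1-based, inclusive 'start:end' string to a Python slice."""
    start_text, end_text = region.split(":")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"unsupported region: {region!r}")
    # Shift the start down by one; the inclusive end equals the exclusive stop.
    return slice(start - 1, end)


seq = "MVKIALFGRNITLPILIFIGFVFLHDASAQTATVIDWDQIREASQTQRRQAAAIANAPVK"
print(seq[region_to_slice("1:30")])   # same residues as seq[0:30]
print(seq[region_to_slice("11:20")])  # residues 11 through 20, i.e. seq[10:20]
```

The resulting slice should work equally on a plain string or on `record.seq` from the script above.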