Question

how to extract multiple fasta sequences from a file

0

7.5 years ago

HZZ0036 ▴ 30

Hi, I have a file that contains both FASTA sequences and non-FASTA lines, like this:

 454 -      PolA   2436284                1.88
 454 -    1 CDSl   2436471 -   2436637   17.09   2436471 -   2436635    165
 454 -    2 CDSf   2436688 -   2436928   18.36   2436689 -   2436928    240
 454 -      TSS    2437349               -1.10
enter code here
 455 +      TSS    2439215                5.09
 455 +    1 CDSf   2439438 -   2439570   13.30   2439438 -   2439569    132

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVVDPIIRDQIAPQCLRKFAEMTEQCVNEVGTTGGASALRAPG
AGPEAEAREKMCLADYAIICHREGTLHEAVDPIIRDQTRRNASGNSLRRQSNLINGHEAY
TTTARTHETRVVEETGDELANSAAFSQLVRPIGR
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPCRFDGCVKWPDFKRYLVHYYHKNADKKVGELVGMRKPYPVE
QPDGATDDSLHAIVNQCIEAEYRFIRTCREKFTIDDFLLSRDITDRAKQLLQSGCESSIA
TVALLCITKEDELLCELFACQDISKALAFANVIRRSASNLMLFKGSESDAAGGGIMLGLA
REAEVALLAMHSGDEYAIANYITAVDARMRVPWCRCPVAMTTVSEVAAM

How can I extract all the FASTA sequences? I want a file that contains only this:

>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVVDPIIRDQIAPQCLRKFAEMTEQCVNEVGTTGGASALRAPG
AGPEAEAREKMCLADYAIICHREGTLHEAVDPIIRDQTRRNASGNSLRRQSNLINGHEAY
TTTARTHETRVVEETGDELANSAAFSQLVRPIGR
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGWERLVPCRFDGCVKWPDFKRYLVHYYHKNADKKVGELVGMRKPYPVE
QPDGATDDSLHAIVNQCIEAEYRFIRTCREKFTIDDFLLSRDITDRAKQLLQSGCESSIA
TVALLCITKEDELLCELFACQDISKALAFANVIRRSASNLMLFKGSESDAAGGGIMLGLA
REAEVALLAMHSGDEYAIANYITAVDARMRVPWCRCPVAMTTVSEVAAM

Thanks in advance.

sequence • 1.4k views

ADD COMMENT • link updated 7.5 years ago by Joe 22k • written 7.5 years ago by HZZ0036 ▴ 30

1

Can you reformat your post? The site interprets > as the beginning of a quotation, so format your sequence data as code using the button with 101010 on it.

ADD REPLY • link 7.5 years ago by Joe 22k

0

Is there always the same number of header lines before the sequences start?

ADD REPLY • link 7.5 years ago by Joe 22k

score 1 · Answer 1 · 2017-07-06

1

7.5 years ago

Pierre Lindenbaum 164k

Start printing at the first line that begins with '>':

awk '/^>/ {f=1;} {if(f==1) print;}'  file.fa
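
A minimal, self-contained check of the flag idiom above, written in its shorter form. The file names `mixed.txt` and `only_fasta.fa` and the shortened sample are made up for illustration. The sketch assumes, as in the question, that nothing but FASTA records follows the first header.

```shell
# Build a small sample (hypothetical, shortened from the question).
cat > mixed.txt <<'EOF'
 455 +      TSS    2439215                5.09

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVV
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGW
EOF

# f is unset (false) until the first header line is seen; the bare
# pattern `f` then prints every line for which f is true.
awk '/^>/ {f=1} f' mixed.txt > only_fasta.fa
cat only_fasta.fa
```

Because the flag is never reset, any non-FASTA line that appears after the first header would also be printed.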

score 0 · Answer 2 · 2017-07-06

0

7.5 years ago

venu 7.1k

It looks like the lines that start with digits have a leading space. The following should work:

grep -v '^\s' file.fa | sed -e '/enter/d' -e '/^$/d' -e '/Predicted/d'

If those lines do not start with a space, just replace '^\s' with '^[0-9]'.
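
An alternative sketch, not part of the answer above: rather than listing the lines to delete, keep only lines that look like FASTA, meaning headers starting with `>` and lines made up solely of uppercase letters (plus `*` for a stop codon). The file names `sample.txt` and `proteins.fa` and the shortened sample are made up for illustration, and the approach assumes the sequences are uppercase with no internal whitespace.

```shell
# Build a small sample (hypothetical, modelled on the question).
cat > sample.txt <<'EOF'
 454 -      TSS    2437349               -1.10
enter code here

Predicted protein(s):
>FGENESH:   1   3 exon (s)     37  -   4224   154 aa, chain +
MCLADYAIICHREGTLHEVV
TTTARTHETRVVEE
//
>FGENESH:   2   5 exon (s)   5130  -   6247   229 aa, chain -
MAPCQDIVDEGWGW
EOF

# Keep header lines and lines consisting only of uppercase letters or '*'.
# LC_ALL=C makes [A-Z] mean exactly the 26 ASCII capitals.
LC_ALL=C grep -E '^(>|[A-Z*]+$)' sample.txt > proteins.fa
cat proteins.fa
```

A whitelist like this also drops separator lines such as `//` and stray text without needing a separate rule for each.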

0

It worked, but some lines contain "//", like:

>FGENESH: 199   5 exon (s) 1515013  - 1519188   619 aa, chain -
LRGSLGLRARDWPARSDPCSAWTGVTCRAGRVVALTVAGLRRTRRASLAPRLALDGLRNL
TALERFNASGFPLPGEIPAWFASGSGLPPPLAVLDLTSAGVNGTLPAGLGAASGNLTTLL
//
>FGENESH:   1   1 exon (s)   4483  -   4881   132 aa, chain +
MEEQHGGGRASNKIRDIVRLQQLLKKWKKLATVAPSSSSGKSSSVPRGSFAVYVGDEMRR
FVIPTEYLGHWAFAELLREAEEEFGFRHEGALRIPCDVEVFEGILRVVQGRKKDATDMCR
HSCSSETEILCR
......

How can I delete the '//' lines and rename the FASTA headers with sequential numbers? Like:

    >1
    LRGSLGLRARDWPARSDPCSAWTGVTCRAGRVVALTVAGLRRTRRASLAPRLALDGLRNL
    TALERFNASGFPLPGEIPAWFASGSGLPPPLAVLDLTSAGVNGTLPAGLGAASGNLTTLL
    >2
    MEEQHGGGRASNKIRDIVRLQQLLKKWKKLATVAPSSSSGKSSSVPRGSFAVYVGDEMRR
    FVIPTEYLGHWAFAELLREAEEEFGFRHEGALRIPCDVEVFEGILRVVQGRKKDATDMCR
    HSCSSETEILCR
    .....

ADD REPLY • link 7.5 years ago by HZZ0036 ▴ 30

0

Renaming FASTA headers is probably the most-asked question on the forum, and that is really what the second part of your comment is, so search around a bit before asking a new question. (The best way to learn is by doing it yourself anyway!)
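
For reference, one possible awk sketch of the follow-up request: drop the `//` separator lines and replace each header with a running number. The file names `extracted.fa` and `renamed.fa` are placeholders, and the sample is shortened from the comment above.

```shell
# Build a small sample (hypothetical, shortened from the comment above).
cat > extracted.fa <<'EOF'
>FGENESH: 199   5 exon (s) 1515013  - 1519188   619 aa, chain -
LRGSLGLRARDWPARSDPCSAWTGVTCRAG
TALERFNASGFPLPGEIPAWFASGSGLPPP
//
>FGENESH:   1   1 exon (s)   4483  -   4881   132 aa, chain +
MEEQHGGGRASNKIRDIVRLQQLLKKWKKL
HSCSSETEILCR
EOF

# Skip lines that are exactly '//'; replace each header with '>' plus a
# running counter; print all other lines unchanged.
awk '/^\/\/$/ {next} /^>/ {n++; print ">" n; next} {print}' extracted.fa > renamed.fa
cat renamed.fa
```

The original headers are discarded here. If they are needed later, one option is to write a mapping of number to original header to a separate file in the same pass.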

ADD REPLY • link 7.5 years ago by Joe 22k

score 0 · Answer 3 · 2017-07-06

0

7.5 years ago

Joe 22k

This works on the test data; I'm not sure whether other files would catch it out:

sed -n '/^>/,$p' testfile.txt

This prints every line from the first > header through the end of the file, so the last FASTA record is included.
