Question

How to extract FASTA sequences, or only their IDs, by record number from a multi-FASTA file?

4.4 years ago

sunnykevin97 ▴ 1000

Hi,

I can extract only the first sequence from a multi-line FASTA file:

perl -02 -ne '/(>[^>]+)/ && print $1' OG0012884.fa

>TRINITY_DN9_c0_g1_i1.p1
MPMKGRFPIRRTLEFLRSGTVVFKDSVKILTVNYNTHGERSDGARKFVFFNIPQIQYQNP
WIQILMFKNMTPSPFLRFYLDDGEQVLVDVEGKNHKQIVEHVKTILGKNDVLLEADKQVQ
KEHSHPAHFGPKTYCLRECMCEVGGQVPCPGVVPLPKEMTGKYWTALRAGSAI*

How do I extract only the 2nd and 9th sequences?

Using Bash, I can print only the 1st ID in a FASTA file:

head -1 OG0012884.fa
>TRINITY_DN9_c0_g1_i1.p1

How do I extract only the 2nd ID?

Help is needed; suggestions please.

fasta RNA DNA alignment • 1.3k views


4.4 years ago

cpad0112 21k

Please rephrase your questions. You have two questions in the OP:

How do I extract only the 2nd and 9th sequences?
How do I extract only the 2nd ID?

Solution for the first query (prints each header followed by the 2nd and 9th sequence lines of that record):

$ awk -v RS=">" -v OFS="\n" 'NR>1{print ">"$1,$3,$10}' test.fa
>seq1
line2
line9
>seq2
sline2
sline9

Solution for the second query (the whole 2nd record first, then only its ID):

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$0}' test.fa

>seq2
sline1
sline2
sline3
sline4
sline5
sline6
sline7
sline8
sline9

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$1}' test.fa

>seq2
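
If only the IDs are needed, another common route is to filter the header lines first and then pick one by position. The following is a minimal sketch, assuming every header starts with ">" at the beginning of a line; `nth_id` is a hypothetical helper name used for illustration, not something from this thread.

```shell
# Hypothetical helper: print the N-th header (ID) line of a FASTA file.
# Usage: nth_id FILE N      e.g.  nth_id OG0012884.fa 2
nth_id() {
    # grep keeps only the header lines; sed prints the N-th of them.
    grep '^>' "$1" | sed -n "${2}p"
}

# For several positions at once, the same pipeline can be written directly:
#   grep '^>' OG0012884.fa | sed -n -e '2p' -e '9p'
```

Because the sequence lines are dropped before counting, the position passed to sed is the record number itself, with no offset to remember.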

Bonus: to print lines 2 and 9 of the 2nd record:

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$1,$3,$10}' test.fa

>seq2
sline2
sline9
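
The same RS=">" idea can also return a whole record (header plus every sequence line) by its number, which may be closer to what the OP meant by "2nd and 9th sequence". This is only a sketch: `nth_record` is a hypothetical helper name, and it assumes ">" never appears anywhere except at the start of a header.

```shell
# Hypothetical helper: print the whole N-th FASTA record.
# Usage: nth_record FILE N      e.g.  nth_record OG0012884.fa 9
nth_record() {
    # With RS=">", awk record 1 is the empty text before the first ">",
    # so FASTA record n is awk record n + 1 (hence NR==3 for the 2nd ID above).
    # printf is used instead of print because $0 already ends with a newline.
    awk -v RS=">" -v n="$2" 'NR == n + 1 { printf ">%s", $0 }' "$1"
}
```

Calling it twice, for example `nth_record OG0012884.fa 2; nth_record OG0012884.fa 9`, would give both records in order.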

Contents of test.fa:

$ cat test.fa
>seq1
line1
line2
line3
line4
line5
line6
line7
line8
line9
>seq2
sline1
sline2
sline3
sline4
sline5
sline6
sline7
sline8
sline9


score 4 · Accepted Answer · 2021-04-09

Here's an awk-based solution which counts the headers as they come and prints the 2nd and 9th, each along with its sequence on the next line. It assumes every sequence sits on the single line directly after its header.

awk '/^>/ {count++; if (count == 2 || count == 9) {print; t = NR + 1}; next} NR == t' OG0012884.fa

If the sequence is on the same line as the header, or is wrapped over several lines as in the question, print every line while the count is 2 or 9:

awk '{if ($1 ~ "^>") count+=1} { if ( count == 2 || count == 9 ) print $0}' OG0012884.fa

Here's the pseudocode for the first command, if you care to follow along.

for line in file:
    if line starts with ">":
        count += 1
        if count == 2 or count == 9:
            print line
            t = current line number + 1
    else if current line number == t:
        print line
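
The header-counting approach extends naturally to any list of record numbers instead of a hard-coded 2 and 9. A rough sketch, where `pick_records` is a hypothetical helper name and the record numbers are passed as one space-separated string:

```shell
# Hypothetical helper: print whole FASTA records by their 1-based numbers.
# Usage: pick_records "NUMBERS" FILE      e.g.  pick_records "2 9" OG0012884.fa
pick_records() {
    awk -v wanted="$1" '
        # Turn the space-separated list into a lookup table.
        BEGIN { n = split(wanted, nums, " "); for (i = 1; i <= n; i++) keep[nums[i]] = 1 }
        # Each header line starts a new record.
        /^>/  { count++ }
        # Print every line (header and sequence) of a wanted record.
        (count in keep)
    ' "$2"
}
```

Records come out in file order regardless of the order in which the numbers are listed, and wrapped sequences are printed in full.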