How to extract FASTA sequences, or only their IDs, by record number from a main FASTA file?
3.7 years ago
sunnykevin97 ▴ 990

Hi

I can extract only the first sequence from a multiline FASTA file:

perl -02 -ne '/(>[^>]+)/ && print $1' OG0012884.fa

>TRINITY_DN9_c0_g1_i1.p1
MPMKGRFPIRRTLEFLRSGTVVFKDSVKILTVNYNTHGERSDGARKFVFFNIPQIQYQNP
WIQILMFKNMTPSPFLRFYLDDGEQVLVDVEGKNHKQIVEHVKTILGKNDVLLEADKQVQ
KEHSHPAHFGPKTYCLRECMCEVGGQVPCPGVVPLPKEMTGKYWTALRAGSAI*

How do I extract only the 2nd and 9th sequences?

Using Bash, I can print only the 1st ID in the FASTA file:

head -1 OG0012884.fa
>TRINITY_DN9_c0_g1_i1.p1

How do I extract only the 2nd ID?

Help is needed; suggestions please.

fasta RNA DNA alignment • 1.1k views
3.7 years ago
geneticatt ▴ 140

Here's an awk-based solution that counts the headers as they come and prints the 2nd and 9th headers along with every sequence line that follows each one, up to the next header.

awk '{if ($1 ~ "^>") count+=1} { if ( count == 2 || count == 9 ) print $0 }' OG0012884.fa

Because count only changes on header lines, the same command works whether each sequence sits on a single line or is wrapped across several.

Here's the pseudocode, if you care to follow along.

for line in file:
    if line starts with ">":
        count += 1
    if count == 2 or count == 9:
        print line
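If the record numbers change from file to file, the counting idea above can be wrapped in a small shell function. This is only a sketch: the names fasta_records and fasta_ids, and the comma-separated list argument, are my own assumptions and not part of any existing tool.

```shell
# Hypothetical helper: print whole FASTA records selected by position.
# Usage: fasta_records "2,9" file.fa
fasta_records() {
    awk -v want="$1" '
        BEGIN {
            # Turn the comma-separated list into a lookup table.
            n = split(want, idx, ",")
            for (i = 1; i <= n; i++) keep[idx[i]] = 1
        }
        /^>/ { count++ }        # each header line starts a new record
        (count in keep)         # print every line of a wanted record
    ' "$2"
}

# Hypothetical helper: print only the header (ID) lines of the selected records.
# Usage: fasta_ids "2,9" file.fa
fasta_ids() {
    awk -v want="$1" '
        BEGIN {
            n = split(want, idx, ",")
            for (i = 1; i <= n; i++) keep[idx[i]] = 1
        }
        /^>/ { count++; if (count in keep) print }
    ' "$2"
}
```

For example, fasta_records 2,9 OG0012884.fa should print the 2nd and 9th records with all of their sequence lines, and fasta_ids 2 OG0012884.fa only the 2nd header line.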
3.7 years ago

Please rephrase your question. There are two questions in the OP (copied and pasted):

  1. How do I extract only the 2nd and 9th sequence, ?
  2. How do i extract only the 2nd ID ?

Solution for the first query, reading it as the 2nd and 9th sequence lines of every record:

$ awk -v RS=">" -v OFS="\n" 'NR>1{print ">"$1,$3,$10}' test.fa
>seq1
line2
line9
>seq2
sline2
sline9

Solution for the second query (the whole 2nd record first, then only its ID):

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$0}' test.fa

>seq2
sline1
sline2
sline3
sline4
sline5
sline6
sline7
sline8
sline9

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$1}' test.fa

>seq2

Bonus: to print the 2nd and 9th sequence lines of the 2nd ID:

$ awk -v RS=">" -v OFS="\n" 'NR==3{print ">"$1,$3,$10}' test.fa

>seq2
sline2
sline9

test.fa:

$ cat test.fa
>seq1
line1
line2
line3
line4
line5
line6
line7
line8
line9
>seq2
sline1
sline2
sline3
sline4
sline5
sline6
sline7
sline8
sline9
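
One caveat with the RS=">" commands above: with the default field separator, awk splits each record on spaces as well as newlines, so a header that carries a description (for example ">seq2 some text") shifts $3 and $10 off the intended lines. A possible variant is sketched below, assuming newline as the field separator so that $1 is the whole header line and $2 onward are the sequence lines. The function name fasta_record_oneline is hypothetical. It prints the Nth record with its wrapped sequence joined onto one line.

```shell
# Hypothetical helper: print record N with its wrapped sequence on one line.
# Usage: fasta_record_oneline 2 file.fa
fasta_record_oneline() {
    awk -v RS=">" -v FS="\n" -v n="$1" '
        NR == n + 1 {               # awk record 1 is the empty text before the first ">"
            seq = ""
            for (i = 2; i <= NF; i++) seq = seq $i   # join the sequence lines
            print ">" $1            # full header line, description included
            print seq
        }
    ' "$2"
}
```

This still assumes that ">" appears only at the start of header lines; a ">" inside a description would split the record early.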