sed command to extract sequence
5.1 years ago

Hi,

I have a FASTA file like this:

>TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039::m.3039 TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039  ORF type:complete len:100 (-) TRINITY_DN100000_c1_g1_i3:1027-1326(-)
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
FSEEWSLLWLQYSRHLIQKPKPWQIKIELLHLCCCNRLC*
>TRINITY_DN100000_c1_g6::TRINITY_DN100000_c1_g6_i2::g.84365::m.84365 TRINITY_DN100000_c1_g6::TRINITY_DN100000_c1_g6_i2::g.84365  ORF type:complete len:112 (-) TRINITY_DN100000_c1_g6_i2:379-714(-)
MEMMQEIIPFAREMLSARPSKGTMKVYLVGGTFAVLGIVSGMVEAACSLFPEQEESTLTK
LMEDCLTVTAQNQEPQTFIPEDDEQDAEMEAKAKDLPMFRQRRMSFRAHAS*

I want to keep only the second field of each header and leave the amino acid sequences unchanged. I tried sed 's/::.*//' input > output; how should I correct it to get this:

>TRINITY_DN100000_c1_g1_i3
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
FSEEWSLLWLQYSRHLIQKPKPWQIKIELLHLCCCNRLC*
>TRINITY_DN100000_c1_g6_i2
MEMMQEIIPFAREMLSARPSKGTMKVYLVGGTFAVLGIVSGMVEAACSLFPEQEESTLTK
LMEDCLTVTAQNQEPQTFIPEDDEQDAEMEAKAKDLPMFRQRRMSFRAHAS*

My command keeps only the first field, >TRINITY_DN100000_c1_g1. To keep the second field, which carries the isoform information (TRINITY_DN100000_c1_g1_i3), how should I correct the command?

RNA-Seq • 1.1k views
5.1 years ago
patelk26 ▴ 320

Try this: sed 's/::g.*(-)//' input.fasta | sed 's/>.*::/>/' > output.fasta


Hi, thanks! This command gives this:

    >g.122404  ORF type:internal len:253 (+) TRINITY_DN100000_c0_g1_i1:3-758(+)
    GIELKRFDMSEYMERHAVSRLIGAPPGYVGYEQGGLLTEAISKKPHCVLLLDEIEKAHPD
    IYNVLLQVMDHGTLTDNNGRKADFRNVIIIMTTNAGAETMNKATIGFTNPRQAGDEMGDI
    KRLFTPEFRNRLDAIVSFKPLDEQIILRVVDKFLLQLETQLAEKKVEVTFTDALRKHLAK
    KGFDPLMGARPMQRLIQEMIRKALADELLFGRLTEGGRLNVDLDDKGEVQLDIQPLPKKE
    ARSGKSDEPLLS
    >TRINITY_DN100000_c1_g1_i1
    LLTEAVTKKPHCVLLLDEIEKAHPDIFNVLLQVMDHGTLTDNNGRKADFRNVIIIMTTNA
    GAETMNKSTIGFTTQRQAGDEMADIKRLFTPEFRNRLDAI

Some of the headers look good, but some are still messed up. Do you know how to modify the command further?

Thank you
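
One possible fix, as a sketch rather than a tested answer for the full file: the broken headers appear to be the plus-strand ones. The pattern `::g.*(-)` only matches headers ending in `(-)`, so on `(+)` headers the first sed does nothing and the greedy `>.*::` in the second sed then removes everything up to the last `::`. Capturing the second `::`-separated field directly avoids depending on the strand. The file names `demo_input.fasta` and `demo_output.fasta` are hypothetical, and the full plus-strand header below is an assumption reconstructed from the output above, following the same layout as the minus-strand headers. The command also assumes every header has at least two `::`-separated fields and that neither of the first two fields contains a colon.

```shell
# Small test file: one minus-strand and one plus-strand header.
# The plus-strand header is assumed, rebuilt from the thread's output.
cat > demo_input.fasta <<'EOF'
>TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039::m.3039 TRINITY_DN100000_c1_g1::TRINITY_DN100000_c1_g1_i3::g.3039  ORF type:complete len:100 (-) TRINITY_DN100000_c1_g1_i3:1027-1326(-)
MVWIKFRGLHRVLTSTPLVKSGKTPSQTWAFLDISVELIVFLFLNVHKSPMPHFKIYSEA
FSEEWSLLWLQYSRHLIQKPKPWQIKIELLHLCCCNRLC*
>TRINITY_DN100000_c0_g1::TRINITY_DN100000_c0_g1_i1::g.122404::m.122404 TRINITY_DN100000_c0_g1::TRINITY_DN100000_c0_g1_i1::g.122404  ORF type:internal len:253 (+) TRINITY_DN100000_c0_g1_i1:3-758(+)
GIELKRFDMSEYMERHAVSRLIGAPPGYVGYEQGGLLTEAISKKPHCVLLLDEIEKAHPD
EOF

# On header lines only (/^>/), match: first field, "::", second field
# (captured), "::", and the rest of the line; keep ">" plus the capture.
# -E enables extended regular expressions (GNU and BSD sed).
sed -E '/^>/ s/^>[^:]+::([^:]+)::.*/>\1/' demo_input.fasta > demo_output.fasta

cat demo_output.fasta
```

Because `[^:]+` cannot cross a colon, the capture stops at the end of the second field no matter what follows, so sequence lines and the `(+)`/`(-)` suffix never affect the result. An awk equivalent under the same assumptions would be `awk -F'::' '/^>/ {print ">" $2; next} {print}' demo_input.fasta`.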
