Question

Script for collecting prediction results for selected IDs from a TransDecoder .pep file

0

8.5 years ago

Farbod ★ 3.4k

Dear friends, hi (I am not a native English speaker, so be ready for some . . . )

I have a TransDecoder output .pep file; the head of it looks like this:

>TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i1::g.52916::m.52916 TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i1::g.52916  ORF type:3prime_partial len:142 (+) TRINITY_DN100001_c0_g2_i1:180-602(+)
MEEGEQLQLNRGVRHSQDRCSGEQIKTRAVRATPSTLSSTSRGINLKTFWHKGATGTTVK
IVLQEKHRRACVYSGKTYSHGEVWHPVLRPHRLLECILCTCKDGKQECRKITCPSEYPCQ
YPEKPEGKCCKTCPETKEETN
>TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i2::g.52918::m.52918 TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i2::g.52918  ORF type:complete len:324 (+) TRINITY_DN100001_c0_g2_i2:252-1223(+)
MKHLLFFFSFFLYFTSEAEAPRPRKTLETFCTFKEKRYNPGDSWHPYLEPHGFMFCIRCT
CAETGHVNCNSIKCPVLQCENPVIDSQQCCPRCAAEPKSPVGLRAPLKSCQYNGTIYQAG
EMFTSDELFPSRQPNQCVLCSCSNGNIFCGLRTCLKLTCSTPVSVPDTCCQLCKDHSDSP
ANPKYASMEEGEQLQLNRGVRHSQDRCSGEQIKTRAVRATPSTLSSTSRGINLKTFWHKG
ATGTTVKIVLQEKHRRACVYSGKTYSHGEVWHPVLRPHRLLECILCTCKDGKQECRKITC
PSEYPCQYPEKPEGKCCKTCPGM*

I also have a txt file containing a list of Trinity IDs whose results I want to collect from the .pep file mentioned above.

I need to collect all the lines related to each ID in my list. As you can see, each ID is repeated several times in its header line, the number of sequence lines differs between IDs, and the records end differently (sometimes with a * and sometimes not).

Would you please help me with a command line or script that can accept a list.txt file for this purpose?

~ Best

sequence awk bash script python • 2.0k views

1

You should also give us an example of the IDs file so we can be more confident in the response. You can whip up a bioawk script that does this. Or modify this so the comparison is a regex match instead of an exists check.

8.5 years ago by Ram 45k

0

Dear Ram, hi and thanks. It would look like this:

TRINITY_DN113064_c0_g1_i1
TRINITY_DN47896_c0_g1_i1
TRINITY_DN47896_c1_g1_i1
TRINITY_DN77862_c1_g2_i1
TRINITY_DN107683_c4_g2_i1

8.5 years ago by Farbod ★ 3.4k

0

Like I said in my previous comment, you're going to need to use bioawk or modify my script. Just make sure your regular expression accommodates a prefix, the :: separators and a suffix.
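
For illustration, here is a minimal Python sketch of such a regular expression. It is not the script referred to above; the helper name `build_id_pattern` is made up for this example, and it assumes the transcript ID always sits between two :: separators, as in the header posted in the question.

```python
import re


def build_id_pattern(transcript_id):
    """Compile a pattern that finds the ID only when it is delimited by '::'.

    Hypothetical helper. Requiring '::' on both sides means that an ID
    ending in '_i1' does not also match headers for '_i10', '_i11', etc.
    """
    return re.compile(r"::" + re.escape(transcript_id) + r"::")


# First header from the question, split over several string literals.
header = (
    ">TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i1::g.52916::m.52916 "
    "TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i1::g.52916  "
    "ORF type:3prime_partial len:142 (+) TRINITY_DN100001_c0_g2_i1:180-602(+)"
)

# The full transcript ID is found between the separators.
full_id_found = bool(build_id_pattern("TRINITY_DN100001_c0_g2_i1").search(header))

# A truncated ID is rejected because '::' does not follow it in the header.
truncated_id_found = bool(build_id_pattern("TRINITY_DN100001_c0_g2_i").search(header))

print(full_id_found, truncated_id_found)
```

A plain substring test would accept the truncated ID as well, which is why the separators are part of the pattern.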

ADD REPLY • link 8.5 years ago by Ram 45k

score 1 · Answer 1 · 2016-11-08

1

8.5 years ago

Prasad ★ 1.6k

You can convert your multi-line FASTA to single-line FASTA (see the code). Then use a simple grep command to get the result.
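
If you would rather do both steps in one go, something along these lines might work in Python. This is only a sketch, not the code referred to above: the function names are invented for the example, the demo records are toy data in the same header layout, and it assumes the transcript ID is always the second ::-separated field of the header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def transcript_id(header):
    """Return the transcript ID from a header.

    Assumed layout: >gene::transcript::g.N::m.N followed by a description.
    """
    fields = header[1:].split()[0].split("::")
    return fields[1] if len(fields) > 1 else fields[0]


def select_records(lines, wanted_ids):
    """Yield only the records whose transcript ID is in wanted_ids."""
    wanted = set(wanted_ids)
    for header, seq in read_fasta(lines):
        if transcript_id(header) in wanted:
            yield header, seq


# Toy input: two short records with headers shaped like the ones in the question.
demo_lines = [
    ">TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i1::g.52916::m.52916 ORF type:3prime_partial",
    "MEEGEQLQ",
    "YPEKPEGK",
    ">TRINITY_DN100001_c0_g2::TRINITY_DN100001_c0_g2_i2::g.52918::m.52918 ORF type:complete",
    "MKHLLFFF",
    "PSEYPCQ*",
]

selected = list(select_records(demo_lines, ["TRINITY_DN100001_c0_g2_i2"]))
for header, seq in selected:
    print(header)
    print(seq)
```

With real files you would pass open file handles instead of the lists, for example the lines of list.txt (stripped of whitespace) as `wanted_ids` and the opened .pep file as `lines`. A transcript with several predicted ORFs gives several records, and all of them are kept.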

As far as the * is concerned, it just means that the translation stopped at a stop codon (in other words, the protein is complete at its 3' end).