awk? how to retrieve info from massive Primer3 output file
8.8 years ago
sp ▴ 20

Hi All,

I would like to build a primer file in FASTA format by pulling a few fields (the primer ID plus the left and right primer sequences) out of Primer3 output like the sample below. What would be the most efficient way to do this? I figured awk might be a good tool for the job, but I have no clue where to start. I would much appreciate any expert help!

From the Primer3 output:

{'PRIMER_INTERNAL_EXPLAIN': 'considered 3622, unacceptable product size 3492, ok 130',
'PRIMER_INTERNAL_NUM_RETURNED': 0L,
'PRIMER_LEFT_0': (124L, 19L),
'PRIMER_LEFT_0_END_STABILITY': 3.58,
'PRIMER_LEFT_0_GC_PERCENT': 52.63157894736842,
'PRIMER_LEFT_0_HAIRPIN_TH': 44.35240092833908,
'PRIMER_LEFT_0_PENALTY': 80.38508071355189,
'PRIMER_LEFT_0_SELF_ANY_TH': 0.0,
'PRIMER_LEFT_0_SELF_END_TH': 0.0,
'PRIMER_LEFT_0_SEQUENCE': 'CCTCAAGGTCTTCCTGTCA',
'PRIMER_LEFT_0_TM': 55.953805840088194,
'PRIMER_LEFT_EXPLAIN': 'considered 598, overlap excluded region 64, GC content failed 278, low tm 14, high tm 155, high hairpin stability 33, long poly-x seq 16, ok 38',
'PRIMER_LEFT_NUM_RETURNED': 1L,
'PRIMER_PAIR_0_COMPL_ANY_TH': 0.0,
'PRIMER_PAIR_0_COMPL_END_TH': 0.0,
'PRIMER_PAIR_0_PENALTY': 789.1692592626196,
'PRIMER_PAIR_0_PRODUCT_SIZE': 76L,
'PRIMER_PAIR_NUM_RETURNED': 1L,
'PRIMER_RIGHT_0': (199L, 18L),
'PRIMER_RIGHT_0_END_STABILITY': 4.57,
'PRIMER_RIGHT_0_GC_PERCENT': 55.55555555555556,
'PRIMER_RIGHT_0_HAIRPIN_TH': 0.0,
'PRIMER_RIGHT_0_PENALTY': 108.78417854906772,
'PRIMER_RIGHT_0_SELF_ANY_TH': 10.083235050216501,
'PRIMER_RIGHT_0_SELF_END_TH': 0.0,
'PRIMER_RIGHT_0_SEQUENCE': 'GAAGTATACGCGGGCACA',
'PRIMER_RIGHT_0_TM': 57.180918004325235,
'PRIMER_RIGHT_EXPLAIN': 'considered 658, overlap excluded region 63, GC content failed 314, low tm 14, high tm 179, high hairpin stability 20, long poly-x seq 3, ok 65',
'PRIMER_RIGHT_NUM_RETURNED': 1L,
'SEQUENCE_ID': 'chr10:43102245-43102445',
'SEQUENCE_TEMPLATE': 'GGGTTTACACCAGCCCTGGAGCTCCTGCCTCCTCCCCATTCCCGACTGCCTGGCAGATGTGGCCGATGCCCCCACAGACCTGACTTCTCTCTGCAGACCGCGGCTTTCCCCTGCTCACCGTCTACCTCAAGGTCTTCCTGTCACCCACATCCCTTCGTGAGGGCGAGTGCCAGTGGCCAGGCTGTGCCCGCGTATACTTC'}
None

{'PRIMER_INTERNAL_EXPLAIN': 'considered 4710, unacceptable product size 2919, ok 1791',

'PRIMER_INTERNAL_NUM_RETURNED': 0L,
'PRIMER_LEFT_0': (52L, 18L),
'PRIMER_LEFT_0_END_STABILITY': 3.41,
'PRIMER_LEFT_0_GC_PERCENT': 61.111111111111114,
'PRIMER_LEFT_0_HAIRPIN_TH': 0.0,
'PRIMER_LEFT_0_PENALTY': 165.23988901926742,
'PRIMER_LEFT_0_SELF_ANY_TH': 0.0,
'PRIMER_LEFT_0_SELF_END_TH': 0.0,
'PRIMER_LEFT_0_SEQUENCE': 'CCATCTCGCCTGCACTGA',
'PRIMER_LEFT_0_TM': 59.41918527210419,
'PRIMER_LEFT_EXPLAIN': 'considered 509, overlap excluded region 64, GC content failed 174, low tm 20, high tm 132, long poly-x seq 30, ok 89',
'PRIMER_LEFT_NUM_RETURNED': 1L,
'PRIMER_PAIR_0_COMPL_ANY_TH': 9.428746428806335,
'PRIMER_PAIR_0_COMPL_END_TH': 6.6828990446157945,
'PRIMER_PAIR_0_PENALTY': 367.4410290971364,
'PRIMER_PAIR_0_PRODUCT_SIZE': 96L,
'PRIMER_PAIR_NUM_RETURNED': 1L,
'PRIMER_RIGHT_0': (147L, 20L),
'PRIMER_RIGHT_0_END_STABILITY': 4.18,
'PRIMER_RIGHT_0_GC_PERCENT': 55.0,
'PRIMER_RIGHT_0_HAIRPIN_TH': 0.0,
'PRIMER_RIGHT_0_PENALTY': 102.20114007786896,
'PRIMER_RIGHT_0_SELF_ANY_TH': 21.29971050510221,
'PRIMER_RIGHT_0_SELF_END_TH': 3.95513181391766,
'PRIMER_RIGHT_0_SEQUENCE': 'CTGATGCAGGTACCACGTCT',
'PRIMER_RIGHT_0_TM': 59.467426718579304,
'PRIMER_RIGHT_EXPLAIN': 'considered 562, overlap excluded region 64, GC content failed 239, high tm 175, high hairpin stability 12, long poly-x seq 24, ok 48',
'PRIMER_RIGHT_NUM_RETURNED': 1L,
'SEQUENCE_ID': 'chr10:43106282-43106482',
'SEQUENCE_TEMPLATE': 'GTGTGGGACGTGCAGCATTCTAAGGTCTCTGGTTTTGGGGGGTCTGAGGGGCCCATCTCGCCTGCACTGACCAACGCCCTCTGCATCCTGCAGGACACCGTGGTGGCCACGCTGCGTGTCTTCGATGCAGACGTGGTACCTGCATCAGGGGAGCTGGTGAGGCGGTACACAAGCACGCTGCTCCCCGGGGACACCTGGGC'}
None

and so on...

Desired primer format:

>chr10:43102245-43102445-left
CCTCAAGGTCTTCCTGTCA
>chr10:43102245-43102445-right
GAAGTATACGCGGGCACA
>chr10:43106282-43106482-left
CCATCTCGCCTGCACTGA
>chr10:43106282-43106482-right
CTGATGCAGGTACCACGTCT

etc...

sequence shell scripting • 2.5k views

I ran your script just now and got the error below.

sp@sp-VBox:~/shared$ grep -E "PRIMER_RIGHT_0_SEQUENCE|PRIMER_LEFT_0_SEQUENCE|SEQUENCE_ID" chr10_Primers_Full.out | paste - - - | awk '{ gsub("\047|,","",$0); print ">"$6"-left\n"$2"\n" ">"$6"-right\n"$4}'​ > output.fa

awk: cmd. line:1: { gsub("\047|,","",$0); print ">"$6"-left\n"$2"\n" ">"$6"-right\n"$4}​
awk: cmd. line:1:                                                                      ^ invalid char '�' in expression

Any suggestions?


Are you using Windows?


I copied your code on Windows and pasted it into a virtual Ubuntu machine.

8.8 years ago
grep -E "PRIMER_RIGHT_0_SEQUENCE|PRIMER_LEFT_0_SEQUENCE|SEQUENCE_ID" test.fasta |\
     paste - - - |\
     awk '{ gsub("\047|,","",$0); print ">"$6"-left\n"$2"\n" ">"$6"-right\n"$4}'

Assumptions:

  1. The left sequence, right sequence, and sequence ID appear in the same order in every record/primer.
  2. Each left and right sequence is always on a single line.

It could be more elegant.
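
One possible tidier variant, as a sketch only: split each line on the single quote and remember the two primer sequences until the SEQUENCE_ID line arrives. It still assumes that both sequence lines come before SEQUENCE_ID within a record, as in the sample, but it no longer needs exactly three matching lines per record, so a record with no primers returned is skipped instead of shifting the columns. The file names primers_sample.out and primers_sample.fa, the trimmed records, and the chrX record are made up for illustration.

```shell
# Hypothetical sample input: trimmed records in the same layout as the question.
cat > primers_sample.out <<'EOF'
{'PRIMER_LEFT_0_SEQUENCE': 'CCTCAAGGTCTTCCTGTCA',
'PRIMER_RIGHT_0_SEQUENCE': 'GAAGTATACGCGGGCACA',
'SEQUENCE_ID': 'chr10:43102245-43102445',
'SEQUENCE_TEMPLATE': 'GGGTTTACACC'}
None

{'PRIMER_PAIR_NUM_RETURNED': 0L,
'SEQUENCE_ID': 'chrX:1-200',
'SEQUENCE_TEMPLATE': 'ACGTACGTAC'}
None

{'PRIMER_LEFT_0_SEQUENCE': 'CCATCTCGCCTGCACTGA',
'PRIMER_RIGHT_0_SEQUENCE': 'CTGATGCAGGTACCACGTCT',
'SEQUENCE_ID': 'chr10:43106282-43106482',
'SEQUENCE_TEMPLATE': 'GTGTGGGACG'}
None
EOF

# With the single quote as field separator, $2 is the key and $4 the quoted value.
awk -F"'" '
  $2 == "PRIMER_LEFT_0_SEQUENCE"  { left  = $4 }
  $2 == "PRIMER_RIGHT_0_SEQUENCE" { right = $4 }
  $2 == "SEQUENCE_ID" {
      # Print only when both primers were seen for this record
      if (left != "" && right != "")
          printf ">%s-left\n%s\n>%s-right\n%s\n", $4, left, $4, right
      left = right = ""    # reset for the next record
  }
' primers_sample.out > primers_sample.fa

cat primers_sample.fa
```

Only the first primer pair (index 0) of each record is handled here; records that return several pairs would need the key names generalised.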


I typed your script directly into the terminal and it worked nicely.

Mucho gracias!
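
A likely reason retyping helped: the pasted command probably carried an invisible non-ASCII character (for example a zero-width space) right after the closing brace, which would match the invalid char that awk reported. A rough sketch for checking and cleaning a pasted command, assuming it was saved to a file first; pasted_cmd.txt and cleaned_cmd.txt are made-up names, and the zero-width space is planted on purpose for the demonstration:

```shell
# Build a hypothetical pasted command with a zero-width space (UTF-8 bytes 342 200 213 in octal)
# between the closing brace and the closing quote. \047 is the single quote.
printf 'awk \047{ print $1 }\342\200\213\047\n' > pasted_cmd.txt

# od -c shows every byte; the three octal values appear after the closing brace.
od -c pasted_cmd.txt

# In the C locale tr works byte by byte: keep tab, newline and printable ASCII, drop the rest.
LC_ALL=C tr -cd '\11\12\40-\176' < pasted_cmd.txt > cleaned_cmd.txt

# The cleaned file should be 3 bytes shorter than the pasted one.
wc -c pasted_cmd.txt cleaned_cmd.txt
```

Note that this also strips legitimate non-ASCII text, so it is only meant for commands that should be plain ASCII anyway.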

8.8 years ago
Charles Plessy ★ 2.9k

For a simpler output format, have a look at the eprimer32 wrapper in EMBOSS.
