Parse gene names and join multiple exons into a single exon sequence
6.9 years ago
1769mkc ★ 1.2k

I extracted exon sequences from a genome file using an awk script, and the output looks like the small subset below. Here I have something like 12 PDCD4 exons; in the full file each gene may have a single exon or multiple exons. The next step is to merge the multiple exons of a gene into a single sequence, e.g. a single PDCD4. Then I have to join the 5' end of the gene to the 3' end of the complete sequence, removing the middle part, so I join the first PDCD4 to the last PDCD4 where

How do I do that? First, find the exons that share a gene name; if a gene has only a single exon, there is no issue. If there are multiple exons, I have to join them into a single sequence. Next, I have to join the 5' end to the 3' end of that single sequence, removing the middle part.

I would be glad to get some help on how to proceed.

>PDCD4
CTTTTCCTCCTCAGCTCCGGCTCCGCCGCCACGATTGGCCAGCCGACCACCCGGCCTCGGCCAATAAGCGCCGCCCTCTCGCCCCCGTGTTACTGGGTAGAAGAAAACAAAAACAAACAGAGCGAGAAGGGCCAGAGACTCTCCGAGGCGGCGGCAGAGACAGAAGAGCGGGGTCGGGGCCGGCTGACCAGGAACCTGGGCGAGCAGCGGCGGGGGCCCGAGGG
>PDCD4
ATTCTGAAGGAAGATTTCCATTAGGTAATTTGTTTAATCAGTGCAAGCGAAATTAAGGGAAAATGGATGTAGAAAATGAGCAGATACTGAATGTAAACCCTGCAG
>PDCD4
GGTATTTTCCCTAATTCTCCATGGTGCTTCAATAGCATGTTATTATCATAAAAATGAACAGTTTTGTGGAATAGATGACCAAAT
>PDCD4
ATCCTGATAACTTAAGTGACTCTCTCTTTTCCGGTGATGAAGAAAATGCTGGGACTGAGGAAATAAAGAATGAAATAAATGGAAATTGGATTTCAGCATCCTCCATTAACGAAGCTAGAATTAATGCCAAGGCAAAAAGGCGACTAAGGAAAAACTCATCCCGGGACTCTGGCAGAGGCGATTCGGTCAGCGACAGTGGGAGTGACGCCCTTAGAAGTGGATTAACTGTGCCAACCAGTCCAAAGGGAAGGTTGCTGGATAGGCGATCCAGATCTGGGAAAGGAAGGGGACTACCAAAGAAAG
>PDCD4
GTGGTGCAGGAGGCAAAGGTGTCTGGGGTACACCTGGACAGGTGTATGATGTGGAGGAGGTGGATGTGAAAGATCCTAACTATGATGATGACCAG
>PDCD4
GAGAACTGTGTTTATGAAACTGTAGTTTTGCCTTTGGATGAAAGGGCATTTGAGAAGACTTTAACACCAATCATACAGGAATATTTTGAGCATGGAGATACTAATGAAGTTGCG
>PDCD4
GAAATGTTAAGAGATTTAAATCTTGGTGAAATGAAAAGTGGAGTACCAGTGTTGGCAGTATCCTTAGCATTGGAGGGGAAGGCTAGTCATAGAGAGATGACATCTAAGCTTCTTTCTGACCTTTGTGGGACAGTAATGAGCACAACTGATGTGGAAAAATCATTTGATAAATTGTTGAAAGATCTACCTGAATTAGCACTGGATACTCCTAGAGCACCACAG
>PDCD4
TTGGTGGGCCAGTTTATTGCTAGAGCTGTTGGAGATGGAATTTTATGTAATACCTATATTGATAGTTACAAAGGAACTGTAGATTGTGTGCAGGCTAG
>PDCD4
AGCTGCTCTGGATAAGGCTACCGTGCTTCTGAGTATGTCTAAAGGTGGAAAGCGTAAAGATAGTGTGTGGGGCTCTGGAGGTGGGCAGCAATCTGTCAATCACCTTGTTAAAGAG
>PDCD4
ATTGATATGCTGCTGAAAGAATATTTACTCTCTGGAGACATATCTGAAGCTGAACATTGCCTTAAGGAACTGGAAGTACCTCATTTTCACCATGAGCTTGTATATGAA
>PDCD4
GCTATTATAATGGTTTTAGAGTCAACTGGAGAAAGTACATTTAAGATGATTTTGGATTTATTAAAGTCCCTTTGGAAGTCTTCTACCATTACTGTAGACCAAATGAAAAGA
>PDCD4
GGTTATGAGAGAATTTACAATGAAATTCCGGACATTAATCTGGATGTCCCACATTCATACTCTGTGCTGGAGCGGTTTGTAGAAGAATGTTTTCAGGCTGGAATAATTTCCAAACAACTCAGAGATCTTTGTCCTTCAAG
>PDCD4

This will require a bit more than simple awk scripting, I'm afraid. I'm also kinda curious why you want to do this. What's the goal?


I want to do this for divergent primer design. The goal is to get the mature sequence of any transcript by joining its exons, then take 100 bp from the 5' end and join it to 100 bp from the 3' end. Can you help me with how to do that?


How do you want to see them joined? 5AAAAABBBBB3 => 5BB35AA3 (first circularize and then cut) or => 5AA35BB3 (simply cut out the middle part)?

Though I still don't get why you need to do this. Why not just take the first 100 bp and the last 100 bp? Why do you want to join them?
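To make the two options concrete, here is a minimal Python sketch (Python is used only for illustration). The function name `join_ends` and the `flank` parameter are hypothetical, and the toy input mirrors the AAAAABBBBB example with a flank of 2 instead of 100.

```python
def join_ends(seq: str, flank: int = 100) -> dict:
    """Return both candidate joins of the two ends of seq (hypothetical helper)."""
    if flank < 1 or len(seq) < 2 * flank:
        raise ValueError("sequence must be at least two flanks long")
    head = seq[:flank]    # 5' end of the joined sequence
    tail = seq[-flank:]   # 3' end of the joined sequence
    return {
        # 5BB3 followed by 5AA3: the order seen across a circular junction
        "circularize_then_cut": tail + head,
        # 5AA3 followed by 5BB3: the linear order with the middle removed
        "cut_out_middle": head + tail,
    }

print(join_ends("AAAAABBBBB", flank=2))
# {'circularize_then_cut': 'BBAA', 'cut_out_middle': 'AABB'}
```

Only the first ordering places the 3' end directly in front of the 5' end, which is the arrangement a circularized molecule would show.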


I want to get the junction sequence for divergent primer design for circular RNA detection.

So it would be the full mature sequence first, then 100 bp from the 5' end joined to 100 bp from the 3' end. That would be my junction sequence, which I can give as input to any primer design tool.


Ah, it's for circular RNA detection; that was my guess as well.

Anyway ... do you want the output in FASTA format?

6.8 years ago

This little perl script should do it:

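#!/usr/bin/env perl
# Concatenate all exons that share a FASTA header, then build a
# junction sequence from the two ends of each concatenated sequence.
use strict;
use warnings;

my %seq;      # gene name => concatenated exon sequence
my @order;    # gene names in order of first appearance
my $id;

while (<STDIN>) {
    chomp;
    next if /^\s*$/;                          # skip blank lines
    if (/^>(.+)/) {                           # header line: set the current gene
        $id = $1;
        push @order, $id unless exists $seq{$id};
        $seq{$id} //= '';
    }
    elsif (defined $id) { $seq{$id} .= $_; }  # sequence line: append to current gene
}

foreach my $g (@order) {
    print ">${g}_full\n$seq{$g}\n";
    if (length($seq{$g}) < 200) { print STDERR "$g: seq too short\n"; next; }
    my $f = substr($seq{$g}, 0, 100);         # first 100 bp (5' end)
    my $l = substr($seq{$g}, -100);           # last 100 bp (3' end)
    # Join the 3' end to the 5' end (the order after circularization).
    # The pieces themselves are not reversed.
    print ">${g}_circCut\n" . $l . "n" . $f . "\n";
}

exit;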

It reads from STDIN (the file with the exons) and writes to STDOUT.

It also puts an 'n' between the two pieces of sequence to indicate the junction.
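For readers more comfortable with Python, the same idea can be sketched as below. This is an illustration under stated assumptions, not a replacement for the script: it assumes the exons of each gene appear in the file in 5' to 3' order on the transcribed strand, and that the junction wanted is the 3' end followed by the 5' end. The function names `read_exons` and `junction` are hypothetical, and the demo uses a flank of 3 instead of 100 to stay readable.

```python
def read_exons(lines):
    """Concatenate sequence lines per FASTA header, keeping file order."""
    seqs = {}
    gene = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                   # skip blank lines
        if line.startswith(">"):
            gene = line[1:]            # header: switch to this gene
            seqs.setdefault(gene, "")
        elif gene is not None:
            seqs[gene] += line         # sequence: append to current gene
    return seqs


def junction(seq, flank=100, spacer="n"):
    """Return 3' flank + spacer + 5' flank, or None if seq is too short."""
    if len(seq) < 2 * flank:
        return None
    return seq[-flank:] + spacer + seq[:flank]


# Toy input: G1 has two exons, G2 is too short for a junction.
demo = [">G1", "AAAAA", ">G1", "CCCCC", ">G2", "GG"]
for name, seq in read_exons(demo).items():
    print(f">{name}_full\n{seq}")
    j = junction(seq, flank=3)
    if j is not None:
        print(f">{name}_circCut\n{j}")
```

To run it on a real file, the list `demo` could be replaced by an open file handle or by `sys.stdin`, since `read_exons` only needs an iterable of lines.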


I will let you know what kind of difficulty I'm facing, but I'm really glad you wrote the script. The one I got from the person who left this work behind was written in C, and I'm having a really hard time going through the code and the logic.


Sorry for responding after such a long time. I will post the original, complicated C code, but before that: I am running your code and I can't give it input. Just to make sure, I am running it as perl yourcode.pl and I get a blank screen. Am I doing something wrong?


I see, you need to run it as follows:

perl yourcode.pl < yourExonFile > outputFile