First Intron extraction from ensembl
2
0
Entering edit mode
10.4 years ago
ravihansa82 ▴ 130

Friends, I need to extract only the First intron sequences from one transcript of each gene, can anybody help me to do this

sequence gene • 5.0k views
ADD COMMENT
3
Entering edit mode

What have you tried? Also, "each gene" might be quite a lot..

ADD REPLY
0
Entering edit mode

I used BioPerl script ....but since Ensembl having separate table for intron and exon ...so I think there is a way to extract only the colum(database field) which contain the first intron sequences

ADD REPLY
2
Entering edit mode

Which organism?

ADD REPLY
0
Entering edit mode

specially I need to download first intron of human of TSP and HK genes and later I need to download first intron of set of functionally related genes.

ADD REPLY
1
Entering edit mode
10.4 years ago
Emily 24k

You'll need to use the Ensembl Perl API for this. Check out this online course to learn more.

ADD COMMENT
1
Entering edit mode

If you're going to use the Ensembl Perl API, you first want to retrieve the transcripts you're interested in and subsequently use

my @introns = @{$transcript->get_all_Introns()};

This should give all introns for the transcript in question in 5'>3' order.

ADD REPLY
0
Entering edit mode

Thank you..I tried with that .... but can I extract only the field of first intron sequences...and basically I have set of genes then I need only extract first intron of such set of of genes....I think I have to use such set of genes as input...then have to connect to the ensemble and extarct only first intron of such set of genes...but i am bit confuse with that...can you help me...

ADD REPLY
1
Entering edit mode

Here is an example script, that you can amend according to your exact needs:

#!/usr/bin/perl

use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

## Load the databases into the registry
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

## Get the gene adaptor for human
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

## Fetch your gene of interest 
my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000156515' );

## Get all transcripts for the gene
my $transcripts = $gene->get_all_Transcripts;

## For each transcript get all introns and print the sequence of the first intron of each
foreach my $transcript(@$transcripts){
  my $introns = $transcript->get_all_Introns;
  my $first_intron = @$introns[0];
  print
    "\>", $gene->stable_id,";",$transcript->stable_id,";",$gene->external_name, "\n",
    $first_intron->seq, "\n";
}

How to install the Ensembl API and more documentation you can find here.

ADD REPLY
0
Entering edit mode

Thanks a lot.....no words to explain your help...by the way...I think I can use my individual gene IDs in my notepad and running a for loop can I use one gene at a time instead of one fixed gene ID as you have shown above...

ADD REPLY
1
Entering edit mode

Sure, I just hardcoded one gene ID here as an example, but of course you can loop through as many Ensembl gene IDs as you want.

ADD REPLY
0
Entering edit mode

By the way when I install Ensemblr API in to Window I got following error when running script

Can't locate Bio/EnsEMBL/Registry.pm in @INC (@INC contains: C:/Users/dell/workspace/.metadata/.plugins/org.epic.debug C:/Users/dell/workspace/first_intron src\ensembl-funcgen\modules C:/Perl/site/lib C:/Perl/lib .)

Any suggesions....

ADD REPLY
1
Entering edit mode

Please have a look at this Ensembl blog post. If that doesn't solve your problem, you can mail the Ensembl Helpdesk (helpdesk@ensembl.org) for assistance. Installing these kind of things on Windows is often a bit more tricky than on Unix/Linux/Mac OS X.

An alternative is using the Ensembl virtual machine. This is really very straightforward. Just download and install VirtualBox and then download the Ensembl virtual machine. Just did it today myself and it was extremely easy.

ADD REPLY
0
Entering edit mode

I downloaded and installed the Ensemble virtual machine...but once I click start button unexpectedly I got the following error message...

Can't access kernel driver!
make sure the kernel module has been loaded successfully...

what a headache this....plz help me...

ADD REPLY
0
Entering edit mode

OMG...Bert Overduin ......Thank you very much...Finally I got the sequences....Thank you very much for your kind guidance......please need your help in future as well....

ADD REPLY
0
Entering edit mode

Hi Bert...I think I need your help....I run the script you provided me other day....I have some problems on that

  1. I have set of gene IDs and I need to take each gene ID as input for the Ensembl
  2. I need only one first transcript (not all transcripts) and first intron of such first transcript
  3. I need to store resulting sequences in a separate folder

can you guide me Bert....

ADD REPLY
1
Entering edit mode

If you only want one transcript per gene, you can use instead:

## Get canonical transcript for the gene
my $transcript = $gene->canonical_transcript;

## For the canonical transcript get all introns and print the sequence of the first intron of each
  my $introns = $transcript->get_all_Introns;
  my $first_intron = @$introns[0];
  print
    "\>", $gene->stable_id,";",$transcript->stable_id,";",$gene->external_name, "\n",
    $first_intron->seq, "\n";

The canonical transcript is usually the longest transcript of the gene (From the Ensembl documentation "For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript.").

Your questions 1 and 3 are really general Perl questions. This is not the place to answer these, but you should be able to find the answer by doing some Googling.

ADD REPLY
0
Entering edit mode

HI Bert...I have modified the script and start to run....it seems it gives the first intron sequences at the same time it gives some error massage as well..I used following ensembl gene IDs and run the cord...it gives out file as well but some error msg as;

Can't call method "canonical_transcript" on an undefined value at ./sample.pl line 40, <FILE> line 14.

if I used more sequences for example i used 261 ensemble gene IDs , that also give few sequences as out put and unable to perform some function and gives following type of errors;

Can't call method "length" on an undefined value at ./sample.pl line 49, <FILE> line 264.
Can't call method "seq" on an undefined value at ./sample.pl line 49, <FILE> line 264

Can just go through the script...and give advises on that

Here is the script..

#!/usr/bin/perl

use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::SeqIO;
use Getopt::Long;

my $registry = 'Bio::EnsEMBL::Registry';

## Load the databases into the registry
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous'
);

## Get the gene adaptor for human
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

# declearing global variable 
my @arra = @_;
my $line = $_;

   #opening the output file and write the sequences for abc file
   open (FILE_OUTPUT, ">my_sequence") or die "can't write";

   #opening the input file which is having ensemble gene IDS
   open (FILE,"my_ID") or die "Error";

    my @array = <FILE>;#assigning input file content to @array 
        #run foreach loop and real each line by line and each line assigning to $line var
    foreach $line(@array){
      # this removes new line charactor
      chomp $line;
        # Fetch my gene of interest usning ensemble ID
        my $gene = $gene_adaptor->fetch_by_stable_id( $line );

        ## Get all transcripts for the gene
        ## Get canonical transcript for the gene which is the longest transcript
        my $transcript = $gene->canonical_transcript;

        ## For the canonical transcript get all introns and print the sequence of the first intron of each
          my $introns = $transcript->get_all_Introns;
        #convert $intron to an array and refer its first element and assing it to
        #first_intron var
          my $first_intron = @$introns[0];
        # print the gene stable ID, transcript stable ID,gene external name ,
        #seq length and firstintron sequence to the output fiel
          FILE_OUTPUT ->print
          ("\>", $gene->stable_id,";",$transcript->stable_id,";",$gene->external_name,";","sequence length: ",$first_intron->length,
        "\n",$first_intron->seq,"\n");

}
#close input file
close (FILE);
#close output file
close (FILE_OUTPUT);

Here are the sample ensemble gene IDs

ENSG00000127837
ENSG00000206324
ENSG00000226467
ENSG00000228892
ENSG00000236873
ENSG00000227642
ENSG00000204310
ENSG00000235758
ENSG00000169692
ENSG00000160216
ADD REPLY
0
Entering edit mode

Your script works perfectly fine for me with the sample Ensembl (not Ensemble) IDs you provided. I am not getting any error messages. Can you provide me with the longer list of IDs, to see if I get any error messages with those? Cheers, Bert.

ADD REPLY
1
Entering edit mode

Looking again at your error messages, I think I know what is happening .... :) The error message <FILE> line 264 means that the error occurs in the 264th line of your input file, so after the actual list of IDs. I strongly suspect you have some empty trailing lines (with return) after the list of IDs, which cause the error messages. Just delete them and see what happens. Your script is perfectly fine.

ADD REPLY
0
Entering edit mode

Dear Bert

Yes as you mentioned I also suspected that may due to the empty rows at the end of the input file then I removed them and run the script again but every time my script read up to 68 input sequences and give the error as previously. following I have added set of my sequences(261 house keeping Ensembl gene IDs)..can you just try on that please.

ENSG00000127837
ENSG00000206324
ENSG00000226467
ENSG00000228892
ENSG00000236873
ENSG00000227642
ENSG00000204310
ENSG00000235758
ENSG00000169692
ENSG00000160216
ENSG00000026652
ENSG00000155189
ENSG00000158669
ENSG00000143761
ENSG00000134287
ENSG00000168374
ENSG00000004059
ENSG00000165527
ENSG00000163466
ENSG00000111229
ENSG00000241553
ENSG00000162704
ENSG00000138363
ENSG00000154518
ENSG00000236227
ENSG00000215077
ENSG00000230678
ENSG00000235307
ENSG00000234507
ENSG00000234704
ENSG00000204256
ENSG00000169925
ENSG00000141867
ENSG00000166164
ENSG00000145741
ENSG00000159388
ENSG00000154640
ENSG00000137707
ENSG00000116161
ENSG00000162909
ENSG00000092529
ENSG00000262875
ENSG00000149260
ENSG00000077274
ENSG00000131375
ENSG00000223639
ENSG00000213719
ENSG00000230685
ENSG00000206394
ENSG00000226651
ENSG00000226248
ENSG00000226417
ENSG00000155962
ENSG00000268943
ENSG00000169583
ENSG00000169504
ENSG00000112782
ENSG00000159212
ENSG00000122218
ENSG00000112695
ENSG00000127184
ENSG00000224774
ENSG00000228875
ENSG00000206406
ENSG00000230700
ENSG00000232960
ENSG00000224398
ENSG00000204435
ENSG00000213088
ENSG00000161202
ENSG00000156508
ENSG00000101210
ENSG00000114942
ENSG00000156976
ENSG00000141543
ENSG00000122176
ENSG00000179163
ENSG00000001036
ENSG00000135821
ENSG00000078369
ENSG00000172354
ENSG00000111664
ENSG00000268195
ENSG00000114450
ENSG00000069966
ENSG00000163041
ENSG00000206505
ENSG00000229215
ENSG00000235657
ENSG00000231834
ENSG00000206503
ENSG00000223980
ENSG00000227715
ENSG00000224320
ENSG00000223532
ENSG00000206450
ENSG00000224608
ENSG00000228964
ENSG00000232126
ENSG00000234745
ENSG00000233841
ENSG00000228299
ENSG00000206452
ENSG00000237022
ENSG00000206435
ENSG00000225691
ENSG00000204525
ENSG00000230141
ENSG00000206292
ENSG00000235744
ENSG00000232957
ENSG00000232962
ENSG00000231558
ENSG00000204252
ENSG00000206305
ENSG00000232062
ENSG00000225890
ENSG00000196735
ENSG00000236418
ENSG00000228284
ENSG00000225103
ENSG00000223793
ENSG00000233192
ENSG00000231526
ENSG00000204276
ENSG00000231823
ENSG00000237541
ENSG00000206301
ENSG00000206302
ENSG00000233209
ENSG00000231939
ENSG00000206237
ENSG00000225824
ENSG00000179344
ENSG00000231286
ENSG00000196610
ENSG00000228813
ENSG00000230675
ENSG00000226165
ENSG00000224305
ENSG00000229493
ENSG00000232629
ENSG00000228254
ENSG00000234968
ENSG00000234510
ENSG00000228684
ENSG00000224197
ENSG00000226030
ENSG00000226343
ENSG00000224793
ENSG00000231596
ENSG00000206308
ENSG00000234794
ENSG00000227993
ENSG00000226260
ENSG00000204287
ENSG00000228987
ENSG00000230726
ENSG00000229252
ENSG00000206493
ENSG00000236632
ENSG00000230254
ENSG00000225201
ENSG00000204592
ENSG00000233904
ENSG00000198830
ENSG00000118418
ENSG00000182952
ENSG00000198157
ENSG00000153187
ENSG00000144381
ENSG00000186432
ENSG00000196911
ENSG00000025800
ENSG00000185467
ENSG00000115053
ENSG00000162736
ENSG00000143799
ENSG00000129484
ENSG00000041880
ENSG00000102699
ENSG00000137817
ENSG00000117592
ENSG00000205220
ENSG00000159377
ENSG00000100804
ENSG00000142507
ENSG00000136930
ENSG00000206298
ENSG00000230034
ENSG00000230669
ENSG00000231631
ENSG00000235715
ENSG00000236443
ENSG00000204264
ENSG00000226201
ENSG00000243594
ENSG00000240118
ENSG00000242711
ENSG00000243958
ENSG00000243067
ENSG00000239836
ENSG00000240065
ENSG00000240508
ENSG00000175166
ENSG00000108344
ENSG00000159352
ENSG00000095261
ENSG00000163636
ENSG00000103035
ENSG00000187514
ENSG00000153250
ENSG00000076067
ENSG00000144642
ENSG00000157916
ENSG00000117748
ENSG00000106399
ENSG00000204086
ENSG00000142676
ENSG00000197958
ENSG00000167526
ENSG00000188846
ENSG00000174748
ENSG00000145592
ENSG00000197756
ENSG00000172809
ENSG00000198918
ENSG00000229117
ENSG00000186468
ENSG00000138326
ENSG00000262088
ENSG00000118181
ENSG00000197728
ENSG00000177954
ENSG00000233927
ENSG00000213741
ENSG00000197747
ENSG00000163191
ENSG00000163221
ENSG00000189171
ENSG00000189334
ENSG00000188643
ENSG00000138385
ENSG00000163479
ENSG00000114850
ENSG00000180879
ENSG00000269520
ENSG00000117632
ENSG00000104435
ENSG00000197457
ENSG00000015592
ENSG00000158710
ENSG00000144834
ENSG00000118194
ENSG00000130595
ENSG00000143549
ENSG00000167460
ENSG00000047410
ENSG00000117289
ENSG00000163159
ENSG00000134308
ADD REPLY
1
Entering edit mode

The script crashes at ENSG00000213088. The canonical transcript of this gene ENST00000368121 does not have a first intron as it is a one exon transcript. So, you have to include a check for this in your script.

ADD REPLY
0
Entering edit mode

Dear Bert..thank you very much for your help for me..I modified the perl script and I downloaded the all sequences of first intron with out any trouble....

Bert..as my second task, I wanna find already known pattern(cis-elements) occurrence in above set of sequences. I am referring set of pattern matching algorithm these days. Bert we can also use regular expression to search pattern and get the exact position of them isn't it? what is you idea...

ADD REPLY
1
Entering edit mode

This is not the right place to discuss this as it's a new question. Also, I am pretty sure this has been asked before, so please have a search on the site first, before posting it.

ADD REPLY
0
Entering edit mode

thank you friend...I am really sorry if I am make you in trouble....

ADD REPLY
0
Entering edit mode

Dear Bert,

I need a help from you. I have written a perl program for KMP algorithm. program runs perfectly but it gives some worning like,

Replacement list is longer than search list at C:/Perl/site/lib/Bio/Range.pm line 251.
Use of uninitialized value in string ne at C:/Users/dell/workspace/subroutine/sub.pl line 99, <GEN0> line 2.
Use of uninitialized value in string ne at C:/Users/dell/workspace/subroutine/sub.pl line 99, <GEN0> line 2
...................................................................................................................................................
.................................................................................................................................................

Here is the program

#!/usr/bin/perl -w

#use strict;
use Bio::SeqIO;
use Bio::Seq;

#my $seq_obj;
my $text;
my $pattern ="TATAAAT";
my $seq_obj;

open ("OUT_PUT",">C:\\Users\\dell\\Desktop\\GGGCGG.txt");

# defining the subroutine for prefix table for my pattern
sub PreprocessTable{

        # initializing first two valaue of the prefixtable array as -1 and 0
        my ($i,$j) = (0,-1);

        # converstion of pattern string into an array
        my @PtrnArray = split //,$pattern;

        # get the length/size of pattern array length
        my $PtrnArrayLength = scalar @PtrnArray;

        # initialize the Table array with known size(here the size of the array is 8) 
        # which is one index adding to the pattern array
        my @TableArray;$TableArray[$PtrnArrayLength]=undef;

        # assigning the  value of $j which is -1 as first index value of TableArray
        $TableArray[$i]= $j;

            # start to generate the TableArray which give out put as -1,0,0,1,2,0,0,1
            while ($i < $PtrnArrayLength){ 
                # check whether the adjecent characters of pattern
                # matching or not, if firsr condition and the adjecent 
                # two characters arenot match inner while loop is running
                while ($j >=0 && $PtrnArray[$i] ne $PtrnArray[$j]){

                    # if inner loop running, assigne first index value of
                    # Tabale Array as -1
                    $j= $TableArray[$j];
                    }
                        # increment both $i,$j
                        $i++;
                        $j++;
                        # output gives as -1,0,0,1,2,0,0,1
                        $TableArray[$i]=$j;
                }

        # return the TabaleArray for the further manupulation with KMPalgoSearch subroutine
        return (@TableArray)
} 
# pass the argument, in my case I am passing $pattern in order to
# generate the TableArray i.e. prefixtable
PreprocessTable($pattern);

# defining the pattern matching subroutine 

sub KMPalgoSearch{
    # open thw sequence files which is in fasta format
    my $seqio_obj = Bio::SeqIO->new(-file => "C:\\Users\\dell\\Desktop\\first.fasta", -format => "fasta",-alphabet => "dna");

    # read one sequence at a time                                                                       
    while ($seq_obj = $seqio_obj->next_seq){   
        # print the sequence  ID,sequence description,sequence
        # this helps to upload all the sequences to Exel ot Access database 
           my $count=0; 
         my $text = $seq_obj->seq;
        my $len =length($text);    

        # initialize the two scalar variable $i, $j with 0
         my ($i, $j) =(0,0);

     # call the PreprocessTable subroutine and pass $pattern as its argument
     # and this is assign to the TableArray as return from PreprocessTable subroutine
     my (@TableArray)= PreprocessTable($pattern);

     # both pattern and text coming to this subroutine as string
     # now convert both pattern and text into array for the further manipulation
     # within this subroutine
     my @pattern=split //,$pattern;
     my @text = split // ,$text;

     # obtain and assigne the length/size of both pattern and text arrays
     my $ptrnLen = scalar @pattern;
     my $textLen = scalar @text;
     # this regular expression used to print the sequences that do not having pattern

     if( $text !~ m/TATAAAT/g && $count==0){
                         OUT_PUT -> print  ($seq_obj->display_id);

                }
     # start to match the pattern charactoers with text characters
     while ( $i < $textLen){

         while ( $j >= 0 && $text[$i] ne $pattern[$j]){

             # if first condition and the pattern and text characters are not match
             # have to refer the prefix table which is generated by first subroutine
             # since it has called within KMPalgoSearch subroutine now refering fisrt 
             # value of TableArray which is -1 and continue untill mach encount

             $j= $TableArray[$j];
         }
             $i++;
             $j++;

                     # return the match position of text once the pattern length and $j equals to each other 
                     # $count helpls to get the perticular sequence only

                     if ($j==$ptrnLen && $count==0){

                     # here return the match position.I added +1 to get the actual position
                     # since ($i-$ptrnLen) return the index
                         OUT_PUT -> print  ( $seq_obj->display_id,";",$seq_obj->desc,";",($i-$ptrnLen+1));

                         $count = $count+1;
                     } 
                     # if #count >0 print only location ,here I omit the douplicate printing of seq ID
                     elsif( $j==$ptrnLen && $count>0){
                         OUT_PUT -> print  (";",($i-$ptrnLen+1));

                }


             }
             # print a new line after the whlie loop before take second text string
             OUT_PUT -> print ("\n");

     }

     print "done","\n"    ;
}
# pass the arguments, here pass both $text and $pattern
KMPalgoSearch($pattern);

If you don't mind may I have a explanation on this?

Thank you

ADD REPLY
1
Entering edit mode

Hello,

As this question has nothing to do with your original question, you shouldn't put it here, but you should post it as a new question.

Second, I am afraid I'm not a Perl expert, so I am afraid I cannot help you with this.

Cheers.

ADD REPLY
1
Entering edit mode

It actually got posted as a new question yesterday.

ADD REPLY
1
Entering edit mode
10.4 years ago

You may also try JEnsembl if you prefer Java.

ADD COMMENT

Login before adding your answer.

Traffic: 1357 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6