Resolving multiple entries to just the first one in a Perl program
6.1 years ago

Hi, I am reading a GenBank file, and that part works. The problem is that the program prints the gene name every time it finds a /gene qualifier, and I want the gene printed only once per locus. I tried using a counter and resetting it to 0, but I could not get it to work properly. The code is posted below. Thanks in advance.

SAMPLE FILE

 LOCUS       NR_046018               1652 bp    RNA     linear   PRI 12-MAY-2017
DEFINITION  Homo sapiens DEAD/H-box helicase 11 like 1 (DDX11L1), non-coding RNA.
ACCESSION   NR_046018 XM_003403543
VERSION     NR_046018.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
ORGANISM    Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1652)
AUTHORS     Costa V, Casamassimi A, Roberto R, Gianfrancesco F, Matarazzo MR,
            D'Urso M, D'Esposito M, Rocchi M and Ciccodicola A.
TITLE       DDX11L: a novel transcript family emerging from human subtelomeric regions
JOURNAL     BMC Genomics 10, 250 (2009)
PUBMED      19476624
REMARK      Publication Status: Online-Only
COMMENT     VALIDATED REFSEQ: This record has undergone validation or
            preliminary review. The reference sequence was derived from
            AM992871.1.
            On Jun 5, 2012 this sequence version replaced NR_046018.1.

            ##Evidence-Data-START##
           Transcript exon combination :: AM992871.1, BM920886.1 [ECO:0000332]
           RNAseq introns              :: single sample supports all introns
                                       SAMEA1968968, SAMEA2148874
                                       [ECO:0000348]
           ##Evidence-Data-END##
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
        1-1652              AM992871.1         1-1652
FEATURES             Location/Qualifiers
 source          1..1652
                 /organism="Homo sapiens"
                 /mol_type="transcribed RNA"
                 /db_xref="taxon:9606"
                 /chromosome="1"
                 /map="1p36.33"
 gene            1..1652
                 /gene="DDX11L1"
                 /note="DEAD/H-box helicase 11 like 1"
                 /pseudo
                 /db_xref="GeneID:100287102"
                 /db_xref="HGNC:HGNC:37102"
 misc_RNA        1..1652
                 /gene="DDX11L1"
                 /product="DEAD/H-box helicase 11 like 1"
                 /pseudo
                 /db_xref="GeneID:100287102"
                 /db_xref="HGNC:HGNC:37102"

CODE:

    open (INFILE,"rna.txt");
    while ($line = <INFILE>)
    {
        chomp($line);
        if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
        {
            print "\n";
            print "Locus: $2\t";
        }
        elsif ($line =~ /^\s*\/gene\=\"(.+)\"/)
        {
            print "Gene: $1\n";
        }
    }

Running this script gives the following output:

    Locus: NR_046018       Gene: DDX11L1
    Gene: DDX11L1

Hello,

If you only have one locus in the file, you can simply leave the loop with the last statement once you have found the gene line.

fin swimmer


Hello, I have a long file; I only posted a short excerpt here. I tried the last statement but could not get the desired result. It would be great if you could explain with a small example.


I'm not familiar with Perl. Try this:

open (INFILE,"rna.txt");

while ($line= <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print  "\n";
        print "Locus: $2\t";
        $gene = 0;
    }
    elsif($line =~ /^\s*\/gene\=\"(.+)\"/ && !$gene)
    {
        print "Gene: $1\n";
        $gene = 1;
    }
}

fin swimmer


Looks good, just remember to declare and initialize variables so it works with strict:

my $gene = 1; # don't print anything before we have seen a LOCUS tag
open (INFILE,"rna.txt");
....
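
For reference, a complete version of the flag approach that runs under strict and warnings might look like the sketch below. The helper name first_gene_per_locus, the lexical filehandle, the anchored LOCUS pattern and the "NA" placeholder are my own assumptions and not part of the posts above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [locus, gene] pair per record,
# keeping only the first /gene= qualifier seen after each LOCUS line.
sub first_gene_per_locus {
    my ($fh) = @_;
    my @pairs;
    my $seen_gene = 1;    # ignore /gene= lines before the first LOCUS
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\s*LOCUS\s+(\S+)/) {
            push @pairs, [$1, undef];
            $seen_gene = 0;    # new record: allow one gene to be captured
        }
        elsif (!$seen_gene && $line =~ /^\s*\/gene="(.+)"/) {
            $pairs[-1][1] = $1;
            $seen_gene = 1;    # skip further /gene= lines in this record
        }
    }
    return @pairs;
}

# Only read a file when a name is given on the command line
my $file = shift @ARGV;
if (defined $file) {
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    for my $pair (first_gene_per_locus($in)) {
        my ($locus, $gene) = @$pair;
        print "Locus: $locus\tGene: ", (defined $gene ? $gene : 'NA'), "\n";
    }
    close $in;
}
```

Saved under a name of your choice (say first_gene.pl), it would be run as: perl first_gene.pl rna.txt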

Hi Kriti,

This should work, as mentioned by finswimmer.

Note the last statement:

open (INFILE,"GB.txt");
while ($line = <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print "\n";
        print "Locus: $2\t";
    }
    elsif ($line =~ /^\s*\/gene\=\"(.+)\"/)
    {
        print "Gene: $1\n";
        last;
    }
}

Output

Locus: NR_046018    Gene: DDX11L1

Hello sir, thanks for your reply. I have tried this, but the last statement is not useful here because the GenBank file contains other loci and genes too. I hope I have explained this properly. Please suggest a method that first matches this line:

  gene            1..1652

and then matches this line:

  /gene="DDX11L1"

Got the point. Shall get back to this. Also, please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
6.1 years ago
Michael 55k

While this script might work in your special case, I highly recommend using the BioPerl GenBank parser instead. There are scenarios where this parsing approach could fail, e.g. "false positives" (where the /gene= string appears outside of a gene feature), multiple genes per locus, or tags appearing in a different order than expected. The documentation/tutorial at https://bioperl.org/howtos/Features_and_Annotations_HOWTO.html#item12 shows specifically how to extract the values of primary_tags, of which "gene" is one.
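
As a rough illustration of that approach, here is a minimal sketch that assumes BioPerl is installed. The helper name first_gene_name and the "NA" placeholder are made up for this example; the BioPerl calls used are Bio::SeqIO->new, next_seq, get_SeqFeatures, primary_tag, has_tag and get_tag_values.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: name from the first "gene" feature of a sequence
# object, or undef if the record has no gene feature with a /gene tag.
sub first_gene_name {
    my ($seq) = @_;
    for my $feat ($seq->get_SeqFeatures) {
        next unless $feat->primary_tag eq 'gene';   # skip source, misc_RNA, ...
        next unless $feat->has_tag('gene');
        my ($name) = $feat->get_tag_values('gene');
        return $name;
    }
    return;
}

# Only parse a file when a name is given on the command line
my $file = shift @ARGV;
if (defined $file) {
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my $name = first_gene_name($seq);
        # display_id is assumed here to carry the LOCUS name
        printf "Locus: %s\tGene: %s\n",
            $seq->display_id, (defined $name ? $name : 'NA');
    }
}
```

Because only features whose primary tag is "gene" are inspected, the /gene qualifier repeated under misc_RNA is not reported a second time.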

6.1 years ago

Hi, this is a possible answer I came up with.

open (INFILE,"rna.txt");
$gene = 0;
while ($line = <INFILE>)
{
    chomp($line);
    if ($line =~ /(LOCUS\s*)(\w*)(.*)/)
    {
        print "\n";
        print "Locus: $2\t";
    }
    elsif ($line =~ /(\s*gene\s*)(\d*)(\.\.)(\d*)/)
    {
        $begin = $2;
        $end = $4;
        print "Gene_length: $begin..$end\t";
        $gene = 1;
    }
    elsif ($gene == 1 && $line =~ m/\s+\/gene\=\"(.+)\"/)
    {
        print " Gene $1\t";
        $gene = 0;
    }
}