Question

Perl script on rna G quadruplex

0

Entering edit mode

3.9 years ago

isha.lily20 ▴ 10

Hello researchers ,

I am stuck in perl script on RNA g quadruplex to find/count the total number of specific unique sequences in which G should be 3 runs and loops should be 7 only. I used grep function with regular expression and array in which I gave input as a fasta file and in the last I counted regular expression in which the code is running means the perl script, but answers are not correct means answer is coming same for all different types of regular expression, can any one please help me out with the same?

Can any one share perl script for unique / specific sequences total counts of G 3 and L7 ONLY?

I used the most common regular expression : ([gG]{3,}\w{1,7}){3,}[gG]{3,}

I tried simple code syntax=grep_function(regular_expression,@array)

This is the full script:

#!/usr/bin/perl-w

#To count total transcripts containing G-Quadruplexes
#Input filename

print "Please enter file name: ";
$name =<>;
chomp $name;

open OUT ,">.$name.OUTPUT";
open(FASTA,$name) or die;

@data =<FASTA>;
$data = join('',@data); #Convert to string

@data2 = split('\n',$data); #Explode on newline into array elements
@unique = grep(!$seen{$_}++,@data2); #Extract unique elements from @data2

$unique = join('',@unique); #Convert to string
@uniqueid = split('',$unique); #Explode string back into individual array elements.

#Intialize count
$countid=0;

foreach $id(@uniqueid)
{
    if($id eq "N")
    {
    ++$countid;
    }
}

#Print
print "\n\nNumber of transcripts is : $countid";
print OUT "Number of unique transcripts is : $countid";

#Exit
exit;

Thank you

perl G-quadruplex rna • 2.3k views

ADD COMMENT • link updated 9 months ago by Ram 45k • written 3.9 years ago by isha.lily20 ▴ 10

1

Entering edit mode

Hi Isha,

could you please revise the question to include more information about the background of the problem. It might for example not be clear to all readers what a G4 (a non-canonical nucleotide secondary structure) is, why and where (RNA, DNA) you are looking for them. Also it is unclear which approach you have tried and specifically which regular expression was used. You have to post the code you used for us to be able to spot any errors.

Also, see here: Quadruplex sequence batch prediction

ADD REPLY • link updated 3.9 years ago by Ram 45k • written 3.9 years ago by Michael 55k

0

Entering edit mode

Hello Michael Dondrup, I constructed transcriptome sequences through reference based RNA-seq data , G4 as u said was correct its a non canonical nucleotide secondary structure,

syntax code = grep_function(regular_expression,@array)

I have uploaded my full script , can u please help me out?

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.9 years ago by isha.lily20 ▴ 10

1

Entering edit mode

Hi, it is still not clear. Your script doesn't contain a pattern search, it is not even doing anything sensible now, except trying to make lines unique which is likely not what you want and does not work in the way you expect either. First, make your script into a proper strict syntax perl script. A proper perl script should start like this (works anywhere, except maybe for old SunOS/Solaris versions that have /bin/env instead of /usr/bin/env):

#!/usr/bin/env perl
use strict; 
use warnings;
## your code here

__END__

__END__ is optional, but you don't need to call exit at the end of the script.

Then properly define all package variables using my and go from there. I also recommend to use the BioPerl Fasta parser in your code, because your code as is doesn't parse the format properly.

ADD REPLY • link 3.8 years ago by Michael 55k

0

Entering edit mode

Hi Michael,

I have written half only, will execute other half part by one or 2 day:

Are you saying like this:

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

## #Intialize the variables

my $gpat = '[G]{3}';
my $npat = '[A-Z]{1,7}';

my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

`sed -i '1s/^/>/' genome.fa`;
my $seqin  = Bio::SeqIO->new(-file => "genomeome.fa",      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => ">genome_100.fa", -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->length <= 100) {
    $seqout->write_seq($seq);
  }
}


#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.8 years ago by isha.lily20 ▴ 10

1

Entering edit mode

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

## #Intialize the variables

my $gpat = '[G]{3,}'; # I think more than 3 Gs could also yield a G4
my $npat = '[A-Z]{1,7}';

my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i; # you let perl compile the pattern here, ok, this might have mild speed advantage

# I don't understand why you would call sed for inplace editing here, don't conflate the tasks
# if you run the script twice, the sed command will be called again and again 
# make a proper input file in the first go

my $seqin  = Bio::SeqIO->new(-file => $ARGV[0],      -format => "fasta");
my $seqout = Bio::SeqIO->new(-file => '>'.$ARGV[0].'_G4.fa', -format => "fasta");
while(my $seq = $seqin->next_seq) {
  if($seq->seq =~ /$regex/) { 
    # write the sequences that match the G4 pattern only to the output file
    # add the length filter too, if you need to
    $seqout->write_seq($seq);
  }
}
__END__

This should be pretty straight forward to filter sequences which contain the G4 motif. (I didn't understand why you opened more files at the end.)

Call the script with the input file as parameter and see if that works.

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.8 years ago by Michael 55k

0

Entering edit mode

Hi Michael

I think script can be made directly through perl also without using bioperl package. What you say, actually because I am getting confused. Other half part will take time, loops I'll use next in the script.

#!/usr/bin/perl

use strict;
use warnings;

#Intialize the variables
my $gpat = '[G]{4}';
my $npat = '[A-Z]{1,7}';

my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;


#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

my ($id, $seq);
my @totals = (0); #intialize the @counts arrays

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.8 years ago by isha.lily20 ▴ 10

2

Entering edit mode

I am nor sure what you mean here due to language problems. Sure, you can parse FASTA files without BioPerl, but why if it works, and it works fine for me. The most difficult and lengthy part is to install BioPerl, and with conda that is not even a big thing any more. I never had problems with speed when processing a genome or transcriptome. You will find plenty of examples for parsing a FASTA file directly in perl though.

ADD REPLY • link 3.8 years ago by Michael 55k

0

Entering edit mode

Hi Michael, I tried this script, still it needs improvement , what are your suggestions?

#!/usr/bin/perl

use strict;
use warnings;

#Intialize the variables
my $gpat = '[G]{4}';
my $npat = '[A-Z]{1,7}';

my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;


#Open output file and input file respectively
open OUT, ">Quadwhatposn.refMrna.fa" or die;
open my $fh, '<', 'transcriptome.fa' or die $!;

my ($id, $seq);
my @totals = (0); #intialize the @counts arrays

#Loop
while (<$fh>) {
chomp;
if (/^>(\w+)/) {
process_seq($seq) if $id;

$id = $1;
$seq = '';
print "$id\n";
print OUT "$id\n";
}
elsif ($id) {
$seq .= $_;
process_seq($seq) if eof;
}
}
#Print
print "@totals\n";
print OUT "Total: \n@totals\n";

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.8 years ago by isha.lily20 ▴ 10

1

Entering edit mode

I suggest you use BioPerl an the script I posted :)

ADD REPLY • link 3.8 years ago by Michael 55k

0

Entering edit mode

Hello Michael, can you show me your complete / full script, so that I can co -relate with my script?

or

if you can't, than can you tell me what algorithm you are using for making scripts?

ADD REPLY • link updated 9 months ago by Ram 45k • written 3.8 years ago by isha.lily20 ▴ 10