Hello researchers,
I am stuck on a Perl script for RNA G-quadruplexes. I want to find and count the unique sequences in which the G-runs have length 3 and the loops have length 7 only. I used the grep function with a regular expression on an array, with a FASTA file as input, and counted the matches at the end. The Perl script runs, but the answers are not correct: the count comes out the same for every regular expression I try. Can anyone please help me with this?
Can anyone share a Perl script that gives the total count of unique/specific sequences with G3 and L7 only?
I used the most common regular expression: ([gG]{3,}\w{1,7}){3,}[gG]{3,}
I tried the simple syntax: grep_function(regular_expression, @array)
This is the full script:
#!/usr/bin/perl -w
#To count total transcripts containing G-Quadruplexes
#Input filename
print "Please enter file name: ";
$name =<>;
chomp $name;
open OUT ,">.$name.OUTPUT";
open(FASTA,$name) or die;
@data =<FASTA>;
$data = join('',@data); #Convert to string
@data2 = split('\n',$data); #Explode on newline into array elements
@unique = grep(!$seen{$_}++,@data2); #Extract unique elements from @data2
$unique = join('',@unique); #Convert to string
@uniqueid = split('',$unique); #Explode string back into individual array elements.
#Initialize count
$countid=0;
foreach $id(@uniqueid)
{
if($id eq "N")
{
++$countid;
}
}
#Print
print "\n\nNumber of transcripts is : $countid";
print OUT "Number of unique transcripts is : $countid";
#Exit
exit;
Thank you
Hi Isha,
Could you please revise the question to include more information about the background of the problem? It might, for example, not be clear to all readers what a G4 (a non-canonical nucleotide secondary structure) is, or why and where (RNA, DNA) you are looking for them. It is also unclear which approach you have tried and, specifically, which regular expression you used. You have to post the code you used for us to be able to spot any errors.
Also, see here: Quadruplex sequence batch prediction
Hello Michael Dondrup, I constructed the transcriptome sequences from reference-based RNA-seq data. What you said about G4 is correct: it is a non-canonical nucleotide secondary structure.
The syntax I used is: grep_function(regular_expression, @array)
I have uploaded my full script. Can you please help me out?
Hi, it is still not clear. Your script doesn't contain a pattern search. Right now it does nothing sensible except try to make lines unique, which is likely not what you want and does not work the way you expect either. First, turn your script into a proper Perl script that runs under strict. A proper Perl script should start like this (works anywhere, except maybe on old SunOS/Solaris versions that have /bin/env instead of /usr/bin/env):
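A minimal sketch of such a header, assuming Perl 5.10 or later (for the // operator); the default file name is purely illustrative and not taken from the original post:

```perl
#!/usr/bin/env perl
use strict;      # every variable must be declared before use
use warnings;    # report common mistakes such as uninitialized values

# Take the input file from the command line instead of prompting for it.
# 'input.fasta' is a hypothetical fallback used only for this sketch.
my $name = shift @ARGV // 'input.fasta';
print "Input file: $name\n";

# An optional __END__ token may be placed after the last statement.
```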
__END__ is optional, but you don't need to call exit at the end of the script.
Then properly declare all variables using my and go from there. I also recommend using the BioPerl FASTA parser in your code, because your code as it stands doesn't parse the format properly.
Hi Michael,
I have written only half so far; I will do the other half in a day or two.
Do you mean like this:
It should be pretty straightforward to filter sequences which contain the G4 motif. (I didn't understand why you opened more files at the end.)
Call the script with the input file as parameter and see if that works.
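For the narrower "G3, L7 only" case raised in the question, here is a hedged sketch of how the two patterns could be compared with grep in scalar context. The stricter pattern is only one possible reading of the request (four runs of exactly three Gs, three loops of exactly seven nucleotides, loops allowed to contain G), and the example sequences are made up:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Pattern quoted in the question: G-runs of 3 or more, loops of 1 to 7 characters.
my $g4_general = qr/([gG]{3,}\w{1,7}){3,}[gG]{3,}/;

# Assumed stricter reading (not from the thread): runs of exactly three Gs,
# loops of exactly seven nucleotides, and no extra G flanking the motif.
my $g4_g3l7 = qr/(?<![gG])(?:[gG]{3}[acgtunACGTUN]{7}){3}[gG]{3}(?![gG])/;

# Hypothetical sequences, one string per transcript.
my @seqs = (
    'AAGGGACUACUAGGGACUACUAGGGACUACUAGGGAA',   # G3 runs, loops of 7
    'AAGGGAUGGGCUGGGAAGGGUU',                  # G3 runs, loops of 2
    'ACGUACGUACGUACGU',                        # no motif
);

# grep in scalar context returns the number of matching elements.
my $n_general = grep { /$g4_general/ } @seqs;
my $n_g3l7    = grep { /$g4_g3l7/ } @seqs;

print "general pattern: $n_general\n";
print "G3 L7 pattern:   $n_g3l7\n";
```

If both counts come out identical on real data no matter which pattern is used, the pattern is probably never applied to the sequences at all.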
Hi Michael
I think the script can also be written directly in Perl, without using the BioPerl package. What do you say? I am actually getting confused. The other half will take time; I will use loops next in the script.
I am not sure what you mean here, due to language problems. Sure, you can parse FASTA files without BioPerl, but why, if BioPerl works? It works fine for me. The most difficult and lengthy part is installing BioPerl, and with conda that is not even a big thing any more. I never had problems with speed when processing a genome or transcriptome. You will find plenty of examples for parsing a FASTA file directly in Perl, though.
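As one such example, here is a minimal sketch of reading FASTA in plain Perl and counting the records whose full sequence matches the pattern quoted in the question. The sub name read_fasta and the in-memory records are made up for illustration and are not the script discussed above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read FASTA records from a filehandle.
# Returns a list of [header, sequence] pairs, with multi-line
# sequences joined into a single string per record.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $header, $seq);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                 # tolerate Windows line endings
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, [$header, $seq] if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= $line;
        }
    }
    push @records, [$header, $seq] if defined $header;
    return @records;
}

# Pattern quoted in the question.
my $g4 = qr/([gG]{3,}\w{1,7}){3,}[gG]{3,}/;

# Hypothetical in-memory FASTA data, so the sketch runs without a file.
my $fasta = <<'FASTA';
>tx1 has a motif
ACGGGAUGGGCU
GGGAAGGGUU
>tx2 no motif
ACGUACGUACGU
>tx3 has a motif
gggaagggaagggaaggg
FASTA

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @records = read_fasta($fh);
close $fh;

# Keep only the records whose whole sequence contains the motif.
my @hits = grep { $_->[1] =~ $g4 } @records;
printf "%d of %d sequences contain the motif\n", scalar @hits, scalar @records;
```

To run this on a real file, the in-memory handle could be replaced by something like open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!". Joining the sequence lines before matching matters, because a motif can span a line break in the file.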
Hi Michael, I tried this script, but it still needs improvement. What are your suggestions?
I suggest you use BioPerl and the script I posted :)
Hello Michael, can you show me your complete script, so that I can compare it with mine?
Or, if you can't, then can you tell me what algorithm you are using to write the scripts?