Perl is difficult to learn for beginners without any programming experience.
Maybe we can paste code for some basic tasks, like reading FASTA sequences,
to give some examples and encourage beginners to take the first step.
Reading other people's code is also an important way to learn a programming language.
It's really difficult to debug others' Perl code.
Here's a rewritten version with detailed comments.
Seqs:
$ cat dataset1.txt
>1
AAAAAAATGC
ATCGATCGAC
>2
AAAAAAGTCG
ATCGATCAGC
Motifs:
$ cat Motif6.txt
AAAAAA
AAAAAT
AAAAAG
AAAAAC
Code:
#!/usr/bin/env perl
use strict;
use warnings;

# read motifs
my $file_motif = "Motif6.txt";
my @motifs = ();    # an array to store motifs
open my $fh, "<", $file_motif or die "fail to open file: $file_motif";
while (my $line = <$fh>) {
    $line =~ s/\r?\n//g;          # chomp "\r" and "\n"
    next if $line =~ m/^\s*$/;    # skip blank line
    push @motifs, $line;          # append to array
}
close $fh;

# read sequences and count motifs
my $file_seq = "dataset1.txt";
my %counter = ();    # a hash to store counts
open $fh, "<", $file_seq or die "fail to open file: $file_seq";
my ($header, $seq) = ("", "");
while (my $line = <$fh>) {
    $line =~ s/\r?\n//g;    # chomp "\r" and "\n"
    if (substr($line, 0, 1) eq '>') {    # FASTA header line
        unless ($header eq "" and $seq eq "") {    # previous sequence
            count_motifs($header, $seq);
        }
        $header = substr($line, 1);    # update header
        $seq = "";                     # reset seq
    } else {    # sequence line
        $seq = $seq . $line;    # concatenate sequence
    }
}
close $fh;
unless ($header eq "" and $seq eq "") {    # do not forget the last sequence
    count_motifs($header, $seq);
}

# print result, in descending order of count (ties sorted by motif)
for my $motif (sort { $counter{$b} <=> $counter{$a} or $a cmp $b } keys %counter) {
    print "$motif\t$counter{$motif}\n";
}

sub count_motifs {
    my ($header, $seq) = @_;
    for my $motif (@motifs) {
        my $n = 0;
        my $len_motif = length $motif;
        my ($begin, $end) = (-1, -1);
        # if you just want to test whether the seq contains a motif, use:
        $n++ if $seq =~ m/$motif/i;
        # if you want to find all matches in sequences, use:
        # while ($seq =~ m/$motif/gi) {    # use regular expression to locate motifs
        #     my $pos = pos $seq;    # http://perldoc.perl.org/functions/pos.html
        #     ($begin, $end) = ($pos - $len_motif + 1, $pos);
        #     print STDOUT "seq: $header\tmotif: $motif\tlocation: $begin-$end\n";
        #     pos $seq = $pos - $len_motif + 1;    # step back so overlapping matches are found
        #     $n++;
        # }
        $counter{$motif} += $n;
    }
}
Result:
$ perl count_motifs_in_seqs.pl
AAAAAA 2
AAAAAG 1
AAAAAT 1
AAAAAC 0
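The commented-out branch in count_motifs counts every occurrence of a motif, including overlapping ones, by resetting pos. A minimal standalone sketch of the same idea is below. Note that it swaps in a different technique (a zero-width lookahead) instead of resetting pos, and that the helper name count_overlapping is hypothetical, not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: count all occurrences of $motif in $seq,
# including overlapping ones, case-insensitively.
sub count_overlapping {
    my ($seq, $motif) = @_;
    my $n = 0;
    # (?=...) matches without consuming characters, so the regex engine
    # tries every start position; \Q...\E treats the motif as literal text.
    $n++ while $seq =~ m/(?=\Q$motif\E)/gi;
    return $n;
}

# Sequence 1 from dataset1.txt starts with seven A's,
# so AAAAAA occurs at two overlapping positions.
my $count = count_overlapping("AAAAAAATGCATCGATCGAC", "AAAAAA");
print "$count\n";    # prints 2
```

With this kind of counting the totals can be larger than in the result above, which only records whether each sequence contains the motif at least once.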
Start with this:
Edit: And don't forget that when reading lines from a file, you're also reading the end of line character(s).
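To illustrate the point about end-of-line characters, here is a small sketch. It reads from an in-memory string (an assumption made only so the example runs without an input file) that mixes Windows "\r\n" and Unix "\n" line endings plus one blank line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: three motifs, mixed line endings, one blank line.
my $text = "AAAAAA\r\nAAAAAT\n\nAAAAAG\n";
open my $fh, "<", \$text or die "fail to open in-memory file: $!";

my @motifs;
while (my $line = <$fh>) {
    # Each line still carries its line ending; strip "\n" or "\r\n".
    $line =~ s/\r?\n\z//;
    next if $line =~ m/^\s*$/;    # skip blank lines
    push @motifs, $line;
}
close $fh;

print join(",", @motifs), "\n";    # prints AAAAAA,AAAAAT,AAAAAG
```

Without the substitution, each motif would still end in "\n" (or "\r\n") and would never match inside a sequence.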
I've done what I can to fix the formatting.
How can I see the formatting?
Do you want to do it in Perl only? You can get this done using a simple grep command.
Sir, I have to do it in Perl only!!!
I see, it's homework :)
An entire separate discussion, but is it still 'valid' to 'force' students to use perl? Okay the language was hugely important during the human genome project, but aren't those times changing?
I am not sure we should start with this discussion again. Perl is an extremely stable, high performance and well debugged programming language with the most comprehensive set of Bio::* libraries. Large pipelines such as miRdeep and Trinity are implemented in Perl.
Absolutely. For that change to go to completion, younger folk like you need to get their PhDs quickly and replace old Perl users.
On a more serious note if you are learning on your own then you are free to choose the latest and greatest but if you are taking a class then you don't have much choice.
Fast, cheap hardware/memory has made the programmer's job easy. See what Margaret Hamilton had to work with.
Sir, it's not homework; it's part of some research work that I was stuck on.
Then why did you absolutely have to do it in Perl?
Maybe Perl is his main programming language.
Absolutely, because the thing I'm working on was started in Perl and this was a part of it, so I had to continue in Perl only! Anyway, thank you all for helping me through it!
You should have made your question clearer so you could get help more quickly. If you had said this at the very beginning, I could have told you the simplest way, which would have saved your time and others':
Using my lovely SeqKit and csvtk:
Step 1. convert the list file containing motif sequences to FASTA format
Step 2. locate motifs and count
You can do this in Windows/Linux/Mac, because both seqkit and csvtk support them.
Oh, I'm sorry. Next time I get stuck I'll try to make my questions clearer! :)