Question

perl programming query

0

Entering edit mode

8.0 years ago

kriti.awasthi23 ▴ 10

I want to match my motif file to the set of sequences and get the desired output like this:

AAAAAA     2  
AAAAAT     1  
AAAAAG    1  
AAAAAC     0

the perl code I'm using is

open(FILE,"dataset1.txt");
%jobs_list = <FILE>;
@indexes=keys%jobs_list;

@seq=values%jobs_list;

open(IN,"Motif6.txt");
@cis = <IN>;
$count=0;
for ($i=0;$i<=$#seq;$i++) {
    foreach $index (@cis) {
    if($seq[$i] =~/$index/) {
            $count++;
        }
    }
    print "$index\t$count \n";
}

the data set file is -

>1
AAAAAAATGCATCGATCGAC
>2
AAAAAAGTCGATCGATCAGC

and the Motif6 file is-

AAAAAA

AAAAAT

AAAAAG

AAAAAC

Please suggest me what can I do to make it correct and get the desired output.

perl motif sequence • 4.0k views

ADD COMMENT • link updated 8.0 years ago by zjhzwang ▴ 180 • written 8.0 years ago by kriti.awasthi23 ▴ 10

2

Entering edit mode

Start with this:

  use strict;
  use warnings;

Edit: And don't forget that when reading lines from a file, you're also reading the end of line character(s).

ADD REPLY • link 8.0 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

I've done what I can to fix the formatting.

ADD REPLY • link 8.0 years ago by Devon Ryan 105k

0

Entering edit mode

how i can see the formatting?

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

Do you want to do it in PERL only? you can get this done using simple grep command

ADD REPLY • link 8.0 years ago by Prasad ★ 1.6k

0

Entering edit mode

sir i have to do it in perl only!!!

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

i see, it's a homework:)

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

1

Entering edit mode

An entire separate discussion, but is it still 'valid' to 'force' students to use perl? Okay the language was hugely important during the human genome project, but aren't those times changing?

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

2

Entering edit mode

I am not sure we should start with this discussion again. Perl is an extremely stable, high performance and well debugged programming language with the most comprehensive set of Bio::* libraries. Large pipelines such as miRdeep and Trinity are implemented in Perl.

ADD REPLY • link 8.0 years ago by Michael 55k

1

Entering edit mode

Absolutely. For that change to go to completion younger folk like you need to get their PhD's quickly and replace old perl users.

On a more serious note if you are learning on your own then you are free to choose the latest and greatest but if you are taking a class then you don't have much choice.

Fast cheap hardware/memory has made programmer's job easy. See what Margaret Hamilton had to work with.

ADD REPLY • link 8.0 years ago by GenoMax 148k

0

Entering edit mode

sir its not a homework its a part of some research work in which i was stuck upon.

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

Then why did you absolutely have to do it in perl?

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Entering edit mode

Maybe Perl is his main programming languages.

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

0

Entering edit mode

Absolutely,bcoz the thing m working on I started on with perl and dis was a part of it I had to continue in perl programming only! Anyways thankyou all for helping through it!

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

You should have made your question more clear, so you an get help more quickly. If you tell this at the very beginning, I can tell you this simplest way, which can save you and others' time:

Using my lovely SeqKit and csvtk:

Step 1. convert the list file containing motif sequences to FASTA format

$ csvtk -H -t mutate Motif6.txt | seqkit tab2fx > motifs.fa

Step 2. locate motifs and count

$ seqkit locate -f motifs.fa dataset1.txt  | csvtk -t uniq -f 1,2 | csvtk -t freq -f 2
patternName     frequency
AAAAAT  1
AAAAAA  2
AAAAAG  1

You can do this in Windows/Linux/Mac, because both seqkit and csvtk support them.

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

0

Entering edit mode

Oh m sorry next time if I get stuck ill try to make questions more clear!:)

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

score 2 · Answer 1 · 2016-12-09

Perl is difficult to learn for beginners without any programming experiment.

Maybe we can paste some code of some basic tasks like reading FASTA sequences to give some examples and encourage beginners to make the first step.

Reading code of others is also an important way of learning programming language.

It's really difficult to debug others' Perl code. Here's a rewritten version with detailed comments.

Seqs:

$ cat dataset1.txt
>1
AAAAAAATGC
ATCGATCGAC
>2
AAAAAAGTCG
ATCGATCAGC

Motifs:

$ cat Motif6.txt
AAAAAA
AAAAAT
AAAAAG
AAAAAC

Code:

#!/usr/bin/env perl
use strict;

# read motifs
my $file_motif = "Motif6.txt";
my @motifs = (); # an array to store motifs
open my $fh, "<", $file_motif or die "fail to open file: %file_motif";
for my $line (<$fh>) {
    $line =~ s/\r?\n//g;    # chomp "\r" and "\n"
    next if $line eq "" or $line =~ m/\s+/; # skip blank line
    push @motifs, $line; # append to array
}
close $fh;

# read sequences and count motifs
my $file_seq = "dataset1.txt";
my %counter = (); # a hashmap to store countings

open my $fh, "<", $file_seq or die "fail to open file: $file_seq";
my ($header, $seq) = ("", "");
for my $line (<$fh>) {
    $line =~ s/\r?\n//g; # chomp "\r" and "\n"
    if (substr( $line, 0, 1 ) eq '>') { # FASTA header line
        unless ($header eq "" and $seq eq "") { # previous sequence
            &count_motifs($header, $seq);
        }

        $header = substr ($line, 1); # update header
        $seq = ""; # reset seq
    } else { # sequence line
        $seq = $seq . $line # concatenate sequence
    }
}
close $fh;

unless ($header eq "" and $seq eq "") { # do not forget the last sequence
    &count_motifs($header, $seq);
}

# print result, in decending order of count
for my $motif (sort {$counter{$b} <=> $counter{$a}} keys %counter) {
    print "$motif\t$counter{$motif}\n";
}

sub count_motifs {
    my ($header, $seq) = @_;
    for my $motif (@motifs) {
        my $n = 0;
        my $len_motif = length $motif;
        my ($begin, $end) = (-1, -1);

        # if you just want to test whether the seq contains a motif, use:
        $n++ if $seq =~ m/$motif/i;

        # if you want to find all matches in sequences, use:
        # while ($seq =~ m/$motif/gi) { # use regular expression to locate motifs
        #     my $pos = pos $seq; # http://perldoc.perl.org/functions/pos.html
        #     ($begin, $end) = ($pos - $len_motif + 1, $pos);
        #     print STDOUT "seq: $header\tmotif: $motif\tlocation: $begin-$end\n";
        #     pos $seq = $pos - $len_motif + 1;
        #     $n++;
        # }

        $counter{$motif} += $n;
    }
}

Result:

$ perl count_motifs_in_seqs.pl
AAAAAA  2
AAAAAG  1
AAAAAT  1
AAAAAC  0

score 1 · Answer 2 · 2016-12-09

Some comments:

As others have pointed out: use strict; use warnings
Your program processes fasta files correctly, if and only if they contain
- exactly one line of sequence per 1 line of header
- no blank lines, while your fasta reading by reading into a hash looks like a neat hack, fasta files cannot be read correctly like this.
The minimum for any file parsing is at least to chomp, and to skip blank lines. That also applies to your Motif file where you have blank lines, and they will be searched in the sequence.
Do not use C style for loops with incremental counters to iterate, use an iterator instead: foreach or map
Pattern matching: what to count if a motif matches a pattern multiple times?

Logic of counting and iteration does not combine, you defined only a global counter, but your loop logic does not fit to this. You can easily see that the following code will output an ever increasing number for each combination of motif and target sequence:

# this is pseudocode of what you are trying:
0: counter := 0
1: forall target in target sequences 
2:   forall motif in motifs
       if (motif matches target): 
           increment (counter)
       print "motif occurs in" counter "sequences at least once (NOT)"

You need to change the sequence of loop control in line 0-2 to suit your problem (hint: 2,0,1 looks like a good sequence)

score 1 · Answer 3 · 2016-12-09

1

Entering edit mode

8.0 years ago

shenwei356 8.7k

Check this post counting motifs in dna sequence, no programming needed. You might skip the step 1.

ADD COMMENT • link 8.0 years ago by shenwei356 8.7k

0

Entering edit mode

I am guessing OP just wants or needs to learn perl, or it is some kind of homework.

ADD REPLY • link 8.0 years ago by Michael 55k

0

Entering edit mode

ok, let's leave him/her alone :)

ADD REPLY • link 8.0 years ago by shenwei356 8.7k

0

Entering edit mode

I think if someone know the answer then he should tell that even if he knows that someone is asking for homework, it can be beneficial in many ways

If someone gets an answer using these sites then it will build a confidence in him to use these sites in future for a bigger project
A solved problem can also become a problem to someone else who can then search it and easily find the answer

ADD REPLY • link 8.0 years ago by ammarsabir15 ▴ 70

2

Entering edit mode

I disagree. We're all volunteers here and we are never obliged to give an answer. Furthermore, homework is supposed to learn you something. In my opinion, the best way to learn how to program is to struggle and fight with the code until it does what you want to. Building confidence to use these sites by answering homework questions seems like a pointless way, besides, we don't need to market. If you like biostars, you are welcome. But you are not a customer.

Your last reason might be true occasionally, but definitely not for this question. This text and title are far too aspecific.

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

2

Entering edit mode

I also disagree. In addition to providing practical experience in coding, homework also provides the instructor feedback on your mastery of the material. This feedback allows him/her to provide additional assistance when needed. It also allows the assessment of a grade/score for the course, which is an (admittedly rough) indicator of your proficiency to third parties.

ADD REPLY • link 8.0 years ago by harold.smith.tarheel ★ 5.0k

score 0 · Answer 4 · 2016-12-11

0

Entering edit mode

8.0 years ago

zjhzwang ▴ 180

You can use hash to store your motif,and the values are the numbers of motif.If match one motif,Its value should +1.

#! /usr/bin/env perl
open IN,"PATH" || die "$!";
#AAAAAAATGCATCGATCGAC
#AAAAAAGTCGATCGATCAGC
open MOTIF,"PATH" || die "$!";
#AAAAAA
#AAAAAT
#AAAAAG
#AAAAAC
%motif=();
while(defined($t=<MOTIF>)){
    chomp($t);
    $motif{$t}=0;
}
while(defined($line=<IN>)){
    chomp($line);
    foreach (keys %motif){
        my $count=$line=~m/$_/;
        $motif{$_}+=$count;
    }
}
@motif=keys %motif;
foreach (@motif){
    print "$_\t$motif{$_}\n";
}
close IN;
close MOTIF;

And the result:

AAAAAG  1
AAAAAT  1
AAAAAA  2
AAAAAC  0

ADD COMMENT • link 8.0 years ago by zjhzwang ▴ 180

0

Entering edit mode

thankyou sir, well i did it another way using a reg exp!!

ADD REPLY • link 8.0 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

Congratulations! If you don't mind please show me the reg exp,I'm interested for the way you used.

ADD REPLY • link 8.0 years ago by zjhzwang ▴ 180