Question

Extracting Sequences After "Motif" & Between Motifs In Multifasta File

12.1 years ago

Raghul ▴ 200

Hi, I want to extract the sequence after a motif, say "TTTTTAAAAA", from a multifasta file. I do not want the nucleotides before the motif. Is it also possible to extract the nucleotides between two motifs with grep, e.g. the nucleotides between TTTTTAAAA and AAAATTTT? I tried grep, but I also need the FASTA headers. Can anybody suggest a solution in grep (if possible), Perl, or Python?

Thanks, Raghul

parsing • 4.5k views

ADD COMMENT • link updated 12.0 years ago by PoGibas 5.1k • written 12.1 years ago by Raghul ▴ 200

The motif may occur several times within the same sequence. How do you want to handle that case?
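
To make the choice concrete, here is a minimal Python sketch of three possible policies for a sequence that contains the motifs more than once: take the longest span, take the shortest span, or report every non-overlapping span. The function names and the test sequence are made up for illustration; only the two motifs come from the question.

```python
import re

START = "TTTTTAAAA"  # opening motif from the question
END = "AAAATTTT"     # closing motif from the question

def longest_span(seq):
    """Greedy: from the first opening motif to the last closing motif."""
    m = re.search(f"{START}(.*){END}", seq)
    return m.group(1) if m else None

def shortest_span(seq):
    """Non-greedy: from the first opening motif to the nearest closing motif."""
    m = re.search(f"{START}(.*?){END}", seq)
    return m.group(1) if m else None

def all_spans(seq):
    """Every non-overlapping opening/closing pair, scanning left to right."""
    return re.findall(f"{START}(.*?){END}", seq)

# Hypothetical sequence containing two opening/closing pairs
seq = "GG" + START + "CCC" + END + "GG" + START + "GTG" + END + "CC"
print(longest_span(seq))   # CCCAAAATTTTGGTTTTTAAAAGTG
print(shortest_span(seq))  # CCC
print(all_spans(seq))      # ['CCC', 'GTG']
```

Which policy is right depends on what the motifs mean biologically, so that decision has to come from the asker.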

ADD REPLY • link 12.1 years ago by Manu Prestat 4.1k

Hello! I would like to do something similar. Did you find a way to complete your task?

ADD REPLY • link 8.4 years ago by etarisal • 0

score 1 · Answer 1 · 2013-04-02

I don't think it would be possible with grep, but this can be done with a regex in Perl. Something along the lines of:

use strict;
use warnings;

# Reads FASTA from the files named on the command line, or from STDIN
my $line = "";
while (<>) { # for every line of the input
  chomp;
  if (/^>/) { # if line starts with >, it is a header, so process the previous sequence
    if ($line =~ /TTTTTAAAAA([ACGTN]+)AAAATTTT/) { # regex to match the motifs (greedy: runs to the last closing motif)
      print "$1\n"; # print sequence in between the motifs
    }
    $line = "";
    print "$_\n"; # print header
  }
  else {
    $line .= $_; # append sequence, joining wrapped lines
  }
}
if ($line =~ /TTTTTAAAAA([ACGTN]+)AAAATTTT/) { # process the last sequence
  print "$1\n";
}

Or something like that. Warning: the code above is untested, so treat it as a sketch.
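
Since the question also mentions Python, the same idea can be sketched there. This is an illustration under assumptions, not a tested pipeline: the function names `read_fasta` and `between_motifs` and the demo records are made up, the motifs are the ones used in the Perl regex, and the non-greedy `+?` stops at the first closing motif, which may or may not be the behaviour you want. Unlike the Perl version, it prints a header only when both motifs are found.

```python
import re

# Motifs as used in the Perl regex above; +? stops at the first closing motif
BETWEEN = re.compile(r"TTTTTAAAAA([ACGTN]+?)AAAATTTT", re.IGNORECASE)

def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def between_motifs(lines):
    """Yield (header, fragment) for records where both motifs are found."""
    for header, seq in read_fasta(lines):
        match = BETWEEN.search(seq)
        if match:
            yield header, match.group(1)

# Hypothetical records; seq1 is wrapped over two lines, seq2 has no motif
demo = [
    ">seq1",
    "GGGTTTTTAAAAACCG",
    "TACAAAATTTTGGG",
    ">seq2",
    "ACGTACGT",
]
for header, fragment in between_motifs(demo):
    print(header)
    print(fragment)
```

For a real file, an open file handle can be passed in place of `demo`, since `between_motifs` only needs an iterable of lines.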

score 1 · Answer 2 · 2013-05-10

12.0 years ago

PoGibas 5.1k

The grep way:

  echo NNNTTTTTAAAACCCAAAATTTTNNN > sequence
  grep -o 'TTTTTAAAA[A-Z]*AAAATTTT' sequence
  TTTTTAAAACCCAAAATTTT
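
Note that the grep output includes both motifs. If only the nucleotides in between are wanted, or everything downstream of a single motif as in the first part of the question, a small Python sketch along these lines may help. The helper name `after_motif` is made up for illustration, and the test sequence is the one from the grep example.

```python
import re

seq = "NNNTTTTTAAAACCCAAAATTTTNNN"  # same test sequence as the grep example

# Lookbehind/lookahead assert the motifs without including them in the match
inner = re.search(r"(?<=TTTTTAAAA)[A-Z]*?(?=AAAATTTT)", seq)
print(inner.group(0) if inner else "no match")  # CCC

def after_motif(seq, motif):
    """Return everything downstream of the first occurrence of motif, or None."""
    _, found, tail = seq.partition(motif)
    return tail if found else None

print(after_motif(seq, "TTTTTAAAA"))  # CCCAAAATTTTNNN
```

Neither snippet handles FASTA headers on its own; each works on a single sequence string that has already been joined from its wrapped lines.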

ADD COMMENT • link 12.0 years ago by PoGibas 5.1k