Question: Grep the complete sequences containing a specific motif in a fasta file

7.2 years ago

taojincs ▴ 50

How can I grep the complete sequences that contain a specific motif in a FASTA file? I also want to include the header line (the one beginning with ">") that precedes each matching sequence.

An example is shown in the image. The image is not displayed, so here is a link to the example, because typing > in Biostar is somewhat misleading: https://drive.google.com/file/d/0B1pci7ps8bLganZXWFNFcWZGd1k/view?usp=sharing

sequence grep fasta linux • 3.0k views

Test file:

$ cat test.fa 
>name1
AEDIA
>name2
ALKME
>name3
AAIII
>name4
kmetq

To extract all sequences that contain KME, ignoring case:

$ seqkit grep -s -i -r -p KME test.fa
>name2
ALKME
>name4
kmetq

seqkit has to be downloaded and installed separately. Options used: -s = match against the sequence instead of the name; -r = treat the pattern as a regular expression; -i = ignore case; -p = search pattern.

If the FASTA sequences are linearized (i.e., each sequence is on a single line), plain grep is enough:

$ grep -i -B 1 --no-group-separator kme test.fa 
>name2
ALKME
>name4
kmetq
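
The --no-group-separator option is specific to GNU grep. Where it is not available, a minimal sketch of a workaround is to filter out the "--" lines that -B 1 prints between non-adjacent hits. This assumes that no sequence line consists of exactly "--" (which could occur in gapped alignments); the variable name hits is only illustrative.

```shell
# Recreate the linearized test file shown above.
printf '>name1\nAEDIA\n>name2\nALKME\n>name3\nAAIII\n>name4\nkmetq\n' > test.fa

# -B 1 prints a "--" line between non-adjacent match groups; the second grep
# drops those separator lines, leaving only headers and sequences.
# Assumption: no real sequence line is exactly "--".
hits=$(grep -i -B 1 kme test.fa | grep -v '^--$')
printf '%s\n' "$hits"
```

On the test file this should print the same four lines as the command above.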

7.2 years ago by cpad0112 21k

score 2 · Accepted Answer · 2017-09-27

First, you have to linearize your sequences so that each sequence is on one line. Without this step you would miss motif hits that span a line break.

From Pierre Lindenbaum: A: Multiline Fasta To Single Line Fasta

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > one_line.fa
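
To see why this step matters, here is a small sketch on a made-up wrapped file in which the motif KME is split across two lines. The file names and the record contents are hypothetical; the awk command is the one quoted above.

```shell
# Work in a scratch directory so no existing files are overwritten.
tmp=$(mktemp -d)

# Hypothetical wrapped FASTA: in seq1 the motif KME is split as "ALK" / "MEQ".
printf '>seq1\nALK\nMEQ\n>seq2\nAAIII\n' > "$tmp/wrapped.fa"

# Before linearization no single line contains KME, so the count is 0.
# "|| true" keeps the script going, because grep exits non-zero on no match.
before=$(grep -c KME "$tmp/wrapped.fa" || true)

# Linearize with the awk command quoted above, then count again.
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' \
    < "$tmp/wrapped.fa" > "$tmp/one_line.fa"
after=$(grep -c KME "$tmp/one_line.fa" || true)

echo "hits before: $before, after: $after"   # hits before: 0, after: 1
```

Note that this awk command writes an empty first line to the output, which does not affect the grep step.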

Then use grep -B 1 to print each hit together with its preceding (header) line. Setting LC_ALL=C can also speed things up:

LC_ALL=C grep -B 1 KME one_line.fa

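That should print the names and sequences of all records in which 'KME' is present. The match is case-sensitive, so add -i to ignore case. grep also prints a "--" line between non-adjacent hits, which GNU grep's --no-group-separator option suppresses.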
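
If an intermediate file is not wanted, the two steps can also be combined in a single awk pass. The following is only a sketch under some assumptions: fasta_grep and emit are hypothetical names, the motif is treated as a literal string rather than a regular expression, and matching ignores case.

```shell
# Recreate the test file from the first answer.
printf '>name1\nAEDIA\n>name2\nALKME\n>name3\nAAIII\n>name4\nkmetq\n' > test.fa

# fasta_grep MOTIF FILE (hypothetical helper): print every record, as a header
# line plus the joined sequence, whose sequence contains MOTIF.
fasta_grep() {
    awk -v motif="$1" '
        # Print the buffered record if its sequence contains the motif
        # (literal substring, case-insensitive).
        function emit() {
            if (header != "" && index(toupper(seq), toupper(motif)) > 0)
                printf("%s\n%s\n", header, seq)
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped lines
        END { emit() }                                 # last record
    ' "$2"
}

fasta_grep KME test.fa
```

Because the sequence lines are joined before matching, this also finds motifs that span a line break in a wrapped file, and the output is already linearized.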