grep -v remove fasta sequence

2

Entering edit mode

4.4 years ago

Morris_Chair ▴ 370

Hello everyone, I created a fasta file from a ChIP-seq_summit.bed where each sequence is about 500bp. I want to remove sequences containing a specific motif, so I thought to use this code:

grep -E ".[GC][AT]T[GC]AAA[GA]" ChIP-seq_summit.fa -v >ChIP-seq_summit_no_cons.fa

it's not clear to me wether this remove only the motif in the sequences or the entire line that belongs to.

I'm asking because by using the first file for finding motifs, MEME works fine but using this modified file it says that there are sequences shorter then the minimum allowed

Thank you :)

ChIP-Seq • 2.6k views

ADD COMMENT • link 4.4 years ago by Morris_Chair ▴ 370

1

Entering edit mode

grep is not right function to manipulate fasta esp multiline fasta. Use dedicated tools such as seqkit and grep function is offered by seqkit.

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

Hi cpad0112,

What is the seqkit way to extract or eliminate sequences containing a motif like in my case?

thanks

ADD REPLY • link 4.4 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Input:

$ cat test.fa                                      

> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG
GGATGAAAG
>seq2
CCGGCG

output with grep:

$ grep -E ".[GC][AT]T[GC]AAA[GA]" test.fa          

GGATGAAAG

output with seqkit to extract sequences with pattern:

$ seqkit grep -srp '.[GC][AT]T[GC]AAA[GA]' test.fa 
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output with seqkit to extract sequences without user pattern:

$ seqkit grep -vsrp '.[GC][AT]T[GC]AAA[GA]' test.fa 

>seq2
CCGGCG

Replace AAA with A{3}

s - search only sequences (not headers)

r- use regex

p - regex pattern

v - inverse

ADD REPLY • link 4.4 years ago by cpad0112 21k

0

Entering edit mode

grep by default removes the line, but if your sequence is in Fasta, you are removing only the lines where it matches, example:

> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG

will be filtered as:

> seq1
CCCCCCCCCC
GGGGGGGGG

ADD REPLY • link 4.4 years ago by JC 13k

0

Entering edit mode

Hi JC, thank you for the answer, do you know how can I remove the entire sequence?

thanks

ADD REPLY • link 4.4 years ago by Morris_Chair ▴ 370

1

Entering edit mode

Linearize the fasta sequence then use grep to remove the entire sequence and then reformat remaining lines back.

Code courtesy of @Pierre:

Linearize a fasta sequence

awk -f linearizefasta.awk < input.fa

or

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

Format back to fasta

tr "\t" "\n" < linearized.tsv

if you know your fasta header have a length < 60

tr "\t" "\n" < linearized.tsv | fold -w 60

view raw README.md hosted with ❤ by GitHub

	/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;}
	{printf("%s",$0);}
	END {printf("\n");}

view raw linearizefasta.awk hosted with ❤ by GitHub

ADD REPLY • link 4.4 years ago by GenoMax 150k

0

Entering edit mode

Hi Genomax

Do I have to do this sequentially? like

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

grep -E ".[GC][AT]T[GC]AAA[GA]" input.fa > input.fa

the format back to fasta is not clear to me :/

thanks for help

ADD REPLY • link 4.4 years ago by Morris_Chair ▴ 370

1

Entering edit mode

You can use a unix pipe to do the following operations

Linearize | grep -v | reformat back to fasta

If you use seqkit grep then you would not need to linearize/reformat back.

ADD REPLY • link 4.4 years ago by GenoMax 150k

1

Entering edit mode

4.4 years ago

cpad0112 21k

If you use seqkit, you can also use IUPAC compliant motifs in sequence search. Thanks to shenwei356 for pointing this on github. Following is the example code with OP motif:

input:

$ cat test.fa                           
>seq1
CCC
>seq2
G
> seq1
CCCCCCCCCC
CCGAAGAAAG
GGGGGGGGG
GGATGAAAG
>seq2
CCGGCG

output sequences with OP input motif:

$ seqkit grep -rsp  '.[GC][AT]T[GC]AAA[GA]' test.fa
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output sequences with OP input motif using IUPAC code:

$ seqkit grep -dsp '.SWTSA{3}R' test.fa           
> seq1
CCCCCCCCCCCCGAAGAAAGGGGGGGGGGGGATGAAAG

output sequences without OP input motif using IUPAC code:

$ seqkit grep -vdsp '.SWTSA{3}R' test.fa          
>seq1
CCC
>seq2
G
>seq2
CCGGCG

ADD COMMENT • link 4.4 years ago by cpad0112 21k

Login before adding your answer.