Hello everyone, I created a fasta file from a ChIP-seq_summit.bed where each sequence is about 500bp. I want to remove sequences containing a specific motif, so I thought to use this code:
grep -E ".[GC][AT]T[GC]AAA[GA]" ChIP-seq_summit.fa -v >ChIP-seq_summit_no_cons.fa
it's not clear to me wether this remove only the motif in the sequences or the entire line that belongs to.
I'm asking because by using the first file for finding motifs, MEME works fine but using this modified file it says that there are sequences shorter then the minimum allowed
Thank you :)
grep
is not right function to manipulate fasta esp multiline fasta. Use dedicated tools such asseqkit
andgrep
function is offered by seqkit.Hi cpad0112,
What is the seqkit way to extract or eliminate sequences containing a motif like in my case?
thanks
Input:
output with grep:
output with seqkit to extract sequences with pattern:
output with seqkit to extract sequences without user pattern:
Replace
AAA
withA{3}
s - search only sequences (not headers)
r- use regex
p - regex pattern
v - inverse
grep by default removes the line, but if your sequence is in Fasta, you are removing only the lines where it matches, example:
will be filtered as:
Hi JC, thank you for the answer, do you know how can I remove the entire sequence?
thanks
Linearize the fasta sequence then use
grep
to remove the entire sequence and then reformat remaining lines back.Code courtesy of @Pierre:
Hi Genomax
Do I have to do this sequentially? like
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa
grep -E ".[GC][AT]T[GC]AAA[GA]" input.fa > input.fa
the format back to fasta is not clear to me :/
thanks for help
You can use a unix pipe to do the following operations
If you use
seqkit grep
then you would not need to linearize/reformat back.