Question

How to print 2 lines if first has "*" in it

0

8.3 years ago

bastianfromm • 0

I have a long file (fasta) like this:

>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-38_3p* 
TTTCCCAGACCCAAATGGGTGC
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-46_5p* 
TGGGGTATTAAGCCGCGATTGT
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC
>Eha-Novel-44_5p* 
TCCCGTATTTGACTGTGACTGAG

I want to print only lines without the "*" and its following line.

Desired output:

>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC

I tried using grep "*" -v -A 1 FILE, but that did not work.

Thanks for your help.

fasta grep • 3.5k views

ADD COMMENT • link updated 8.3 years ago by Daniel ★ 4.0k • written 8.3 years ago by bastianfromm • 0

0

Maybe you can try:

cat FILE | grep "\*" >OUTPUT

ADD REPLY • link 8.3 years ago by zjhzwang ▴ 180

0

This way you wouldn't have the sequence, just the identifier.

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

You are right. Maybe sed works:

sed -n -e "/p$/ {p;n;p}" FILE >OUTPUT

ADD REPLY • link 8.3 years ago by zjhzwang ▴ 180

1

Everyone needs to learn the before and after flags for grep! Just do -A 1 on the grep:

grep -A 1 "\*" file.fasta >OUTPUT.fasta

-A for lines After
-B for lines Before
-C for lines before and after

ADD REPLY • link 8.3 years ago by Daniel ★ 4.0k

0

I tried using grep "*" -v -A 1 FILE, but that did not work.

The OP knew about -A, it just doesn't work well with -v.

ADD REPLY • link 8.3 years ago by Carlo Yague 9.0k

2

There's no need to do an inverse search, just search for lines ending 'p'.

ADD REPLY • link 8.3 years ago by Daniel ★ 4.0k

score 3 · Answer 1 · 2016-12-20

3

8.3 years ago

Daniel ★ 4.0k

I think everyone's overthinking this... to select only the non '*' ending headers (i.e. the lines ending in p) just do:

grep -A 1 ">.*p$" file.fasta >output.fasta

Explanation: 
- line starts as a fasta (>)
- has any amount of characters (.*)
- ends with a p (p$)
Then also print the line after it (-A 1).

Edit: To be honest, there's no need for the complicated regex, as we know there's not going to be a 'p' in any sequence lines. So this would work too:

grep -A 1 "p$" file.fasta >output.fasta

ADD COMMENT • link 8.3 years ago by Daniel ★ 4.0k

0

Better to add --no-group-separator, so that grep does not print a "--" line between groups of matches.

ADD REPLY • link 8.3 years ago by shenwei356 8.7k

score 2 · Answer 2 · 2016-12-20

I tried plain shell grep, but this is hard to do with it.

Try the grep subcommand of SeqKit: just download the executable binary and run:

./seqkit grep -r -p "\*" -v FILE
>Eha-Novel-38_5p 
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-46_3p 
AATGGCGGCCTGATATCCCGGA
>Eha-Novel-44_3p 
TATCACAGTCATTTACGGGTAC

Long-option version:

./seqkit grep --use-regexp --pattern "\*" --invert-match FILE

score 1 · Answer 3 · 2016-12-20

Hi,

It looks like grep's inverted search (-v) and context (-A) options do not work well together. I found one-liner solutions with sed and awk, but here is a less elegant but probably simpler solution using only grep:

grep ">" FILE | grep -v "*" | grep -f - FILE -A 1

This first looks for headers in your FILE, then selects those without * and uses them to search the file again. If speed is not an issue, I guess this is an OK solution.

PS: If you want to remove the "--" separator lines from the output, you can do it with one more grep (or awk, sed or whatever you like).

grep ">" FILE | grep -v "*" | grep -f - FILE -A 1 | grep -v "\-\-"
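One caveat with -f is that each header is read as a regular expression and matched anywhere in a line, so a "." in an ID matches any character and a short ID can also hit a longer one. A hedged variant, assuming GNU grep (for --no-group-separator) and two-line records, adds -F and -x so headers are compared as fixed, whole-line strings. The function name keep_unstarred is made up for this sketch:

```shell
# Keep records whose header has no "*", using only grep.
# Assumes GNU grep and two-line records (one header, one sequence line).
keep_unstarred() {
    # $1: path to the FASTA file
    grep '^>' "$1" |
        grep -v '\*' |
        grep -F -x -A 1 --no-group-separator -f - "$1"
}
```

It would be called as keep_unstarred FILE > OUTPUT.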

score 1 · Answer 4 · 2016-12-20

1

8.3 years ago

5heikki 11k

awk '{if(/^>/ && ! /\*$/){getline var; print $0"\n"var}}' FILE

If line starts with ">" and doesn't end in "*", get the next line into var, print the current line, linebreak, and var.

Also, OP please be more meticulous, the title, what you want, and desired output all differ. The above produces desired output (as long as there are no linebreaks in sequences).
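If sequences may wrap over several lines, a flag-based variant avoids getline altogether. This is only a sketch: the function name drop_starred and the variable keep are made up here, and trailing spaces, tabs or carriage returns after the "*" are tolerated as an assumption about how such files tend to look.

```shell
# Print every record whose header does not end in "*", however many
# lines its sequence spans.
drop_starred() {
    # $1: path to the FASTA file
    awk '
        /^>/ { keep = !/\*[ \t\r]*$/ }   # header line: decide for the whole record
        keep                             # print the line while the flag is set
    ' "$1"
}
```

Usage would be drop_starred FILE > OUTPUT.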

ADD COMMENT • link 8.3 years ago by 5heikki 11k

0

This solution is simple, efficient, readable,... Love awk's magic.

ADD REPLY • link 8.3 years ago by Carlo Yague 9.0k

0

I modified it a little bit while you commented. I think it's more clear like this since print $0 doesn't get repeated. If somebody cares, it was like this before:

awk '{if(/^>/ && ! /\*$/){print $0; getline; print $0}}' FILE

If line starts with ">" and doesn't end in "*", print the line, get the next line, print it.

edit. Maybe this is more clear still:

awk '{if(/^>/ && ! /\*$/){print $0; print $(getline)}}' FILE

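If line starts with ">" and doesn't end in "*", print the line, then print the next line. Strictly speaking, getline returns 1 on success, so $(getline) is field $1 of the next line, which equals the whole line only because sequence lines contain no whitespace. I don't know if there's any real difference in speed, but I imagine this one is the fastest.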

ADD REPLY • link 8.3 years ago by 5heikki 11k

score 0 · Answer 5 · 2016-12-20

0

8.3 years ago

michael.ante ★ 4.0k

Hi Bastianfromm,

As zjhzwang mentioned, you should escape the "*" so that grep treats it as a literal character rather than as a regex quantifier. Afterwards, you need to remove the separators that grep inserts:

grep -A 1 "\*" in.fa | sed '/--/d' > out.fa

Cheers, Michael

ADD COMMENT • link 8.3 years ago by michael.ante ★ 4.0k

0

I think you can avoid the separators using the --no-group-separator flag (see man page).

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Cheers, but this way I get the records with "*":

>Eha-Novel-44_5p*
TCCCGTATTTGACTGTGACTGAG
>Eha-Novel-46_5p*
TGGGGTATTAAGCCGCGATTGT

whereas I would like the lines WITHOUT "*" and the one line following each.

ADD REPLY • link 8.3 years ago by bastianfromm • 0

1

I guess the following would do the trick:

grep -v -A 1 --no-group-separator "\*" in.fa > out.fa

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

It failed. The combination of -A and -v does not work correctly here.
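A small sketch of why, assuming a throwaway sample file (demo.fa is a made-up name): with -v, every sequence line also counts as a selected line, so -A 1 pulls each "*" header back in as trailing context and the file comes out unchanged.

```shell
# Build a small sample in a temporary directory (demo.fa is a hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fa" <<'EOF'
>Eha-Novel-38_5p
ACCCATTTTCGTCTGAGGATAAT
>Eha-Novel-38_3p*
TTTCCCAGACCCAAATGGGTGC
EOF

# -v selects every line without "*": the 5p header AND both sequence lines.
# -A 1 then prints the line after each selected line, which brings the
# "*" header back as context.
inverted=$(grep -v -A 1 '\*' "$tmpdir/demo.fa")
original=$(cat "$tmpdir/demo.fa")

if [ "$inverted" = "$original" ]; then
    echo "grep -v -A 1 printed the whole file"
fi

rm -rf "$tmpdir"
```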

ADD REPLY • link 8.3 years ago by shenwei356 8.7k

0

Hmm, that makes sense. Linearizing each record onto one line, filtering, and converting back to two lines would probably solve this.
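A rough sketch of that idea, under two assumptions: the file contains no tab characters (a tab serves as the temporary delimiter between header and sequence) and no sequence contains "*". The function name filter_linearized is invented for illustration.

```shell
# Linearize each record to "header<TAB>sequence", drop lines with "*",
# then unfold back to two lines per record.
filter_linearized() {
    # $1: path to the FASTA file
    awk '
        /^>/ { if (seen) printf "\n"; printf "%s\t", $0; seen = 1; next }
             { printf "%s", $0 }
        END  { if (seen) printf "\n" }
    ' "$1" |
        grep -v '\*' |
        tr '\t' '\n'
}
```

Wrapped sequences come out joined on a single line, so the output always has exactly two lines per record.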

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

nope :-( although it has no "--" between the greps

ADD REPLY • link 8.3 years ago by bastianfromm • 0

score 0 · Answer 6 · 2016-12-20

0

8.3 years ago

bastianfromm • 0

I solved it in two steps:

grep --no-group-separator -e ">*\*" -v FILEA | grep ">" > IDS_forgrep.txt
grep --no-group-separator -f IDS_forgrep.txt FILEA -A 1

and in one step:

grep --no-group-separator -e "p$" FILEA -A 1

ADD COMMENT • link 8.3 years ago by bastianfromm • 0