Question

extract sequences from fasta starting with a specific nucleotide

0

8.7 years ago

Chris ▴ 30

Hi, I have a FASTA file of 500 sequences, and I want to extract the sequences that start with a specific nucleotide (A in the example below). Can anyone help?

E.g.

>seq_1
ACACACCGCTTCTAGCTG
>seq_2
ACAGGCAGAATTCTACAAGGA
>seq_3
CAAATATAATGACTATGGAATACC
>seq_4
CAATCGCCCGCTCACCTAGGTCT
>seq_5-493
TAACAGGCAGAATTCTACAAGGAC

Desired output:

>seq_1
ACACACCGCTTCTAGCTG
>seq_2
ACAGGCAGAATTCTACAAGGA

Thank you in advance for your help.

next-gen RNA-Seq sequence • 3.5k views

0

Thank you all for your answers. Both methods are working!

ADD REPLY • link 8.7 years ago by Chris ▴ 30

score 2 · Answer 1 · 2016-04-27

2

8.7 years ago

venu 7.1k

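Something like the following should work (the sed step deletes only lines that consist of the -- group separator, so headers that merely contain -- are kept):

grep -B 1 '^A' file.fa | sed '/^--$/d' > new_file.fa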

Update: (Credits - Pierre)

grep '^A' -B 1 --no-group-separator file.fa > new_file.fa

2

Ah yes, grep -B1! :-) You know, there is a secret option in grep to remove the double hyphen: --no-group-separator

ADD REPLY • link 8.7 years ago by Pierre Lindenbaum 164k

1

Aww..this is cool. Thank you Pierre.

ADD REPLY • link 8.7 years ago by venu 7.1k

0

--no-group-separator is not available in all implementations of grep, so piping through sed or grep -v to exclude the separators is a safe bet.

ADD REPLY • link 8.7 years ago by GenoMax 148k

0

It is not in the man page either (grep (GNU grep) 2.16).

ADD REPLY • link 8.7 years ago by venu 7.1k
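
The sed/grep -v fallback mentioned in this thread can be sketched as a small shell function. This is only a minimal sketch: the function name extract_by_first_base is hypothetical, and it assumes exactly two lines per record and a grep that supports -B (GNU and BSD grep do, though -B is not part of POSIX).

```shell
# Hypothetical helper, not from the thread: keep records whose sequence line
# starts with the given base. Assumes exactly two lines per record.
extract_by_first_base() {
    base=$1   # nucleotide to match, e.g. A (case-sensitive)
    file=$2   # input FASTA, e.g. file.fa

    # -B 1 prints the header line that precedes each matching sequence line.
    # grep inserts a "--" line between non-adjacent groups of matches;
    # the second grep drops lines that consist of exactly "--".
    grep -B 1 "^$base" "$file" | grep -v '^--$'
}
```

Usage would look like `extract_by_first_base A file.fa > new_file.fa`. One caveat of this approach: a sequence line that is literally `--` would also be dropped, which should not occur in ordinary nucleotide FASTA.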

score 0 · Answer 2 · 2016-04-27

0

8.7 years ago

Pierre Lindenbaum 164k

Assuming there are only 2 lines per record:

paste - - < input.fa | awk -F '\t' '$2 ~ /^A/' | tr "\t" "\n"
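
If the two-lines-per-record assumption does not hold (sequences wrapped over several lines), one possible variant is to let awk remember the header and decide on the first sequence line. This is a sketch only: the function name filter_fasta_by_first_base is hypothetical, and the match is case-sensitive, so a lowercase a would not count as A.

```shell
# Hypothetical helper, not from the thread: print FASTA records whose sequence
# starts with a given base, including records wrapped over several lines.
filter_fasta_by_first_base() {
    base=$1   # nucleotide to match, e.g. A (case-sensitive)
    file=$2   # input FASTA, e.g. input.fa

    awk -v base="$base" '
        /^>/ {                  # header: remember it, decide on the next line
            header = $0
            pending = 1
            keep = 0
            next
        }
        pending {               # first sequence line of the current record
            pending = 0
            keep = (substr($0, 1, 1) == base)
            if (keep) print header
        }
        keep { print }          # every sequence line of a kept record
    ' "$file"
}
```

Usage would look like `filter_fasta_by_first_base A input.fa > new_file.fa`. Because it relies only on standard awk features, it avoids both the grep group separator and the paste trick.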