Question

How to subset the fasta sequence starting with "G"?

9.0 years ago

bright602 ▴ 50

Hi there,

If I have FASTA sequences like the following:

chr3:181879479-181879497 CGTTCCTCCTGGCGAGAG chr3:181879488-181879506 TACTTATTTCGTTCCTCC chr3:181879507-181879525 GAGGAGTGGGCATGAGGA chr3:181879549-181879567 AACCCTAAATGTCAATTA

How do I extract only the sequence that starts with "G"?

chr3:181879507-181879525 GAGGAGTGGGCATGAGGA

Thanks a lot.

genome sequence • 1.5k views

ADD COMMENT • link updated 9.0 years ago by Daniel ★ 4.0k • written 9.0 years ago by bright602 ▴ 50

It seems to me it's better to work with proper FASTA sequences, so add a ">" sign before each 'chr'.

Start from the first ">" and read every character until the next ">". The spaces play the role of newline characters, don't they?

Create and open a new, empty file. Write everything that has been read to this new file.

Check the first letter after the space, " ". If it is 'G', save the file with the current "good" output, then continue.

If it is not 'G', don't save the file with the latest output.
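The steps above could be sketched with awk roughly as follows. This is only a sketch, assuming the input is whitespace-separated tokens that strictly alternate between a name and its sequence; the function name `starts_with_g` is hypothetical, and it prints the matching pairs directly instead of saving and discarding intermediate files:

```shell
# Hypothetical helper: print each name/sequence pair whose sequence starts with "G".
# Assumes tokens alternate name, sequence, name, sequence, ... separated by
# spaces and/or newlines. Reads the named files, or stdin if none are given.
starts_with_g() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n++                                   # running token count across all lines
            if (n % 2 == 1) {
                name = $i                         # odd tokens are names
            } else if (substr($i, 1, 1) == "G") {
                print name                        # even tokens are sequences
                print $i
            }
        }
    }' "$@"
}
```

For example, `starts_with_g file > outfile` would write the name and the sequence on separate lines; prepend ">" to the name in the first `print` if a proper FASTA header is wanted.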

ADD REPLY • link 9.0 years ago by natasha.sernova ★ 4.0k

Check this already answered thread: extract sequences from fasta starting with a specific nucleotide

ADD REPLY • link 9.0 years ago by venu 7.1k

score 1 · Answer 1 · 2016-06-15

9.0 years ago

Daniel ★ 4.0k

Also couldn't resist. Is this just poor formatting on upload, though, with each space actually being a newline? If so:

grep -B 1 '^G' file >outfile

Otherwise, turn spaces into newlines first (sed -i 's/ /\n/g' file), then do that.

ADD COMMENT • link 9.0 years ago by Daniel ★ 4.0k

True. A sed + grep combination seems even more straightforward than a Perl one-liner, and it will also work on a valid FASTA file, if that turns out to be the case:

sed 's/ /\n/g' inFile | grep -B1 '^G' >outFile
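One caveat worth illustrating: when several non-adjacent sequences start with "G", grep's context option prints a "--" line between the groups. The sketch below wraps the same idea in a hypothetical function, `extract_g`, and filters those separator lines out. It swaps in `tr` for the sed step, on the assumption that `\n` in a sed replacement may not be read as a newline by every sed implementation:

```shell
# Hypothetical wrapper around the pipeline above; reads from stdin.
# Assumes names and sequences are separated by single spaces or newlines,
# and that no name line itself starts with "G".
extract_g() {
    tr ' ' '\n' |          # put every name and sequence on its own line
        grep -B1 '^G' |    # keep each G-initial sequence plus the line before it
        grep -v '^--$'     # drop the "--" separators grep adds between groups
}
```

Usage would look like `extract_g < inFile > outFile`.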

ADD REPLY • link 9.0 years ago by Jorge Amigo 14k

score 0 · Answer 2 · 2016-06-14

9.0 years ago

Jorge Amigo 14k

That is not FASTA format.
This sounds like homework.
Couldn't resist solving it: perl -lne 'while (/(chr\S+\sG\S+)/g) { print $1 }' file

ADD COMMENT • link 9.0 years ago by Jorge Amigo 14k