Question

Extract sequence ID from Fasta for a given sequence

6.1 years ago

MAPK ★ 2.1k

I have a fasta file myfasta.fasta like this:

>aat.2.2344.a
ATTGCCGGTTTAATATTA
>aat.2.d2344.acc
ATTGCCGGTTTAATAAA
>aat.2.2bb344.a
ATTGCCGGTTTAATAGGAGAGAATT
>aat.2.2ccc344.a
ATTGCCGGTTTAATAGGGAG
>aat.2.2344.acc
ATTGCCGGTTTAATAAA

I also have a text file, my.txt, containing a sequence that matches some of the sequences in the FASTA file above:

ATTGCCGGTTTAATAAA

I want to extract the IDs of all records whose sequence matches this one. Can someone please help me with this? Thanks!

The result I want is:

>aat.2.2344.acc
>aat.2.d2344.acc

Fasta • 1.5k views

ADD COMMENT • link updated 6.1 years ago by michael.ante ★ 3.9k • written 6.1 years ago by MAPK ★ 2.1k

Are the sequences all one line? If so you can just use grep -B 1 ...

ADD REPLY • link 6.1 years ago by Joe 21k

Yes, they are 50 bp reads.

ADD REPLY • link 6.1 years ago by MAPK ★ 2.1k

Dear MAPK, if you regularly work with FASTA files, you may find SEDA (http://www.sing-group.org/seda/) a useful tool. It offers a wide variety of operations to manipulate, filter, and transform FASTA files (see the manual for the full list: https://www.sing-group.org/seda/manual/index.html). It also lets you explore a set of FASTA files and extract only the information you need, such as the sequence identifiers (see https://www.sing-group.org/seda/manual/graphical-user-interface.html#the-input-area).

With best regards, Hugo.

ADD REPLY • link 6.1 years ago by Hugo ▴ 380

score 2 · Accepted Answer · 2018-10-18

6.1 years ago

Joe 21k

This works:

grep --no-group-separator -B 1 -F -f my.txt myfasta.fasta | grep -v -f my.txt

The only downside is that my.txt is read twice. Non-grep approaches could avoid this, but this one is simple.
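
For illustration, one non-grep approach that reads my.txt only once could be sketched with awk. This is a minimal sketch, not a tested pipeline: it assumes every sequence sits on a single line (as confirmed in the comments above), and the function name ids_for_seqs is made up for this example. Unlike the grep version, it only reports records whose whole sequence equals a query line, not substring hits.

```shell
# Hypothetical helper (the name is an assumption, not part of any tool).
# Prints the header of every FASTA record whose sequence is exactly equal
# to one of the lines in the query file. Assumes single-line sequences
# and a non-empty query file.
ids_for_seqs() {
    # $1 = query file (one sequence per line), $2 = FASTA file
    awk '
        NR == FNR  { if ($0 != "") want[$0] = 1; next }  # first file: store queries
        /^>/       { id = $0; next }                     # header line: remember the ID
        $0 in want { print id }                          # sequence line: exact match
    ' "$1" "$2"
}
```

Called as ids_for_seqs my.txt myfasta.fasta, it prints the matching headers in the order they appear in the FASTA file. Because the query file is loaded into an array, my.txt may hold several sequences, one per line.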

ADD COMMENT • link 6.1 years ago by Joe 21k

score 2 · Accepted Answer · 2018-10-19

Hi MAPK,

I find fasgrep from the FAST suite quite handy:

fasgrep -s ATTGCCGGTTTAATAAA myfasta.fasta

It reports the full entry, so you can pipe the output through grep '^>' to keep only the header lines.

[EDIT] Since you have multiple sequences (as I have only just read), you can provide them as a regex: "ATTGCCGGTTTAATAAA|CCCCGCGC|ATATATATA"

Cheers,

Michael