Question

How to match FASTA headers against a list of names?

0

5.5 years ago

fec2 ▴ 50

Hi,

I have a multi-FASTA file containing 453 sequences that looks like this:

>M.Bce12308ORF4755P   GTWWAC 
ATGCGTGACCTGATCGAAGAGCCGGGCGGCGGCGCCGCGAGCGAGGCGGAGGCGGTTCAGCCCGCCGCTGCCGTGCCGCGCGCGCTGCCGTCCGGTATCG

>M.Bce1254ORF9725P   GTWWAC
ATGCGTGACCTGATCGAAGACCCGGGCGGCGGCGCCGCGAGCGAGGCGGAGGCGGTTCAGCCCGCCGCTGCCGTGCCGCGCGCGCTGCCGTCCGGTATCG

I also have a list of 461 sequence names, like this:

M.Bce12308ORF4755P
M.Bce122ORF1082P
M.Bce12308ORF4755P
M.Bce1254ORF9725P

How can I match the name list against the FASTA file, so that I can tell which sequences from the name list are missing from the FASTA file?

Thank you!

Felix

sequence • 3.3k views

ADD COMMENT • link updated 5.5 years ago by shenwei356 8.7k • written 5.5 years ago by fec2 ▴ 50

0

Did you try searching the forum at all? This is one of the most widely addressed problems here.

ADD REPLY • link 5.5 years ago by Ram 44k

0

Hi, I have tried searching, but I couldn't find exactly the same issue, since what I want is a list of the names that are missing from the FASTA file. Do you have any idea which keywords I should use to search for this issue? Thanks.

ADD REPLY • link 5.5 years ago by fec2 ▴ 50

4

ADD REPLY • link 5.5 years ago by Joe 21k

1

exactly same issue

Although I highly doubt you could not find the exact same case (I recall addressing multi-part sequence identifiers and how to deal with them a few years ago), looking for an exact match is not a helpful approach. What you need is not something that you can copy and paste and that "just works"; such solutions are rare and don't teach us anything. You need a "pointer": a hint that takes you one step closer to a solution than you are right now. That way, you get to solve the problem yourself while overcoming an obstacle that might have taken quite some time to get past on your own.

ADD REPLY • link 5.5 years ago by Ram 44k

4

5.5 years ago

Chirag Parsania ★ 2.0k

library(Biostrings)
library(tidyverse)


## get target seq names 
target_seq_names <- c("M.Bce12308ORF4755P", "M.Bce122ORF1082P", "M.Bce12308ORF4755P", "M.Bce1254ORF9725P")


## get seq names from fasta file 
fasta_seq_names <- Biostrings::readDNAStringSet("input.fasta") %>% 
        names() %>% ## get names 
        gsub(pattern = "\\s.*" ,replacement = "" ,x = .) ##  clean headers. remove stuff after first space

fasta_seq_names

[1] "M.Bce12308ORF4755P" "M.Bce1254ORF9725P"


## present in both 
intersect(fasta_seq_names , target_seq_names)

[1] "M.Bce12308ORF4755P" "M.Bce1254ORF9725P"
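
Since the question asks for the names that are missing rather than the ones present in both, a set difference is the other half of the comparison. Here is a minimal shell sketch of both halves using `comm`. It assumes the names are stored one per line in a file called `list.txt` (a hypothetical file name, the answer above keeps them in an R vector) and uses shortened sample sequences that stand in for the real data.

```shell
# Sample data mirroring the question; list.txt is an assumed file name
# and the sequences are shortened placeholders.
workdir=$(mktemp -d)
cat > "$workdir/input.fasta" <<'EOF'
>M.Bce12308ORF4755P   GTWWAC
ATGCGTGACCTGATCGAAGAG
>M.Bce1254ORF9725P   GTWWAC
ATGCGTGACCTGATCGAAGAC
EOF
cat > "$workdir/list.txt" <<'EOF'
M.Bce12308ORF4755P
M.Bce122ORF1082P
M.Bce12308ORF4755P
M.Bce1254ORF9725P
EOF

# Header IDs: first whitespace-delimited field, without the leading '>'.
awk '/^>/ {sub(/^>/, "", $1); print $1}' "$workdir/input.fasta" \
    | LC_ALL=C sort -u > "$workdir/fasta_ids.txt"

# comm needs sorted input; -u also collapses duplicate names in the list.
LC_ALL=C sort -u "$workdir/list.txt" > "$workdir/list_ids.txt"

# -12 keeps lines common to both files, -23 keeps lines only in the first.
in_both=$(LC_ALL=C comm -12 "$workdir/list_ids.txt" "$workdir/fasta_ids.txt")
missing=$(LC_ALL=C comm -23 "$workdir/list_ids.txt" "$workdir/fasta_ids.txt")

echo "present in both:"
echo "$in_both"
echo "in the list but missing from the FASTA file:"
echo "$missing"
```

Swapping the two file arguments in the `comm -23` call would instead report FASTA IDs that do not appear in the name list.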

ADD COMMENT • link 5.5 years ago by Chirag Parsania ★ 2.0k

score 5 · Accepted Answer · 2019-07-01

5

5.5 years ago

shenwei356 8.7k

# IDs in seqs.fa
$ grep '^>' seqs.fa | awk '{print $1}' | sed 's/^>//'
M.Bce12308ORF4755P
M.Bce1254ORF9725P

# IDs not in list.txt
$ grep -w -v -f <(grep '^>' seqs.fa | awk '{print $1}' | sed 's/^>//') list.txt
M.Bce122ORF1082P
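
One caveat with `grep -w -f`: each ID is treated as a regular expression, so the `.` in `M.Bce...` matches any character. A stricter variant, sketched below, adds `-F` (fixed strings) and `-x` (whole-line match) and pipes through `sort -u` so that duplicated names are reported once. The extra name `MxBce1254ORF9725P` is a made-up entry added only to show the difference, and the sequences are shortened placeholders.

```shell
# Sample data mirroring the question, plus one hypothetical name
# (MxBce1254ORF9725P) that differs from a real ID only where the '.' is.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > seqs.fa <<'EOF'
>M.Bce12308ORF4755P   GTWWAC
ATGCGTGACCTGATCGAAGAG
>M.Bce1254ORF9725P   GTWWAC
ATGCGTGACCTGATCGAAGAC
EOF
cat > list.txt <<'EOF'
M.Bce12308ORF4755P
M.Bce122ORF1082P
M.Bce12308ORF4755P
M.Bce1254ORF9725P
MxBce1254ORF9725P
EOF

# IDs in seqs.fa, written to a file so the script also runs in plain sh.
grep '^>' seqs.fa | awk '{print $1}' | sed 's/^>//' > ids.txt

# Regex matching: the '.' in each ID matches any character, so the
# hypothetical MxBce1254ORF9725P is wrongly treated as present.
missing_regex=$(grep -w -v -f ids.txt list.txt)

# Fixed-string, whole-line matching: only exact IDs count as present.
missing_exact=$(grep -v -x -F -f ids.txt list.txt | LC_ALL=C sort -u)

echo "regex match reports missing:"
echo "$missing_regex"
echo "exact match reports missing:"
echo "$missing_exact"
```

With real data the two variants will usually agree. The exact form mainly guards against IDs that differ only at a position where the pattern contains a regex metacharacter.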