Question

Find sequences in a FASTA file that are also present in another FASTA file

0

3.8 years ago

lorenzinip • 0

Hi all

I have a FASTA file, fileA.fasta, which contains 15 sequences, and another FASTA file, fileB.fasta, which contains 92 sequences. I would like to find out how many of the 15 sequences are also found among the 92.

The format (the same for both files) looks like this:

>ENST000006553_1908 
AGCGGGGCCCTT

>ENST000002542_1826 
GGGCCTAAAATT

...and so on

sequence • 778 views

updated 21 months ago by Ram 44k • written 3.8 years ago by lorenzinip • 0

1

Do the sequences match exactly? Also, is there any match in the names?

3.8 years ago by rpolicastro 13k

0

3.8 years ago

cpad0112 21k

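With seqkit (-s compares by sequence, -i ignores case):

$ seqkit common -si input1.fa input2.fa

The order of the files doesn't matter.

Check whether the following awk script works (it matches records whose ID and sequence are both identical):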

$ awk -v RS=">" -v OFS="\n"  'NF > 1 && NR==FNR {a[$1,$2];next} ($1,$2) in a {print ">"$1,$2}' input1.fa input2.fa

If the FASTA files are flattened (each sequence on a single line) and both the sequences and the IDs are identical in the two files, the following awk script should work:

$ awk  'FNR==NR {a[$1]=$0; next}; $1 in a  {print $1}' input1.fa input2.fa
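
The scripts above assume flattened FASTA. If the sequences are wrapped over several lines, something like the following could flatten them first. This is only a minimal sketch and not part of the original answer: flatten_fasta is a hypothetical helper name, and it assumes that blank lines and any whitespace inside sequence lines can be discarded.

```shell
# Hypothetical helper (not from the thread): rewrite a FASTA file so that
# every record is exactly two lines, a header and one sequence line.
# Assumption: blank lines and whitespace inside sequence lines carry no meaning.
flatten_fasta() {
    awk '
        /^>/ {                           # header line starts a new record
            if (seq != "") print seq     # emit the previous sequence first
            print                        # emit the header unchanged
            seq = ""
            next
        }
        NF {                             # non-blank sequence line
            gsub(/[ \t\r]/, "")          # drop spaces, tabs, carriage returns
            seq = seq $0                 # append to the current sequence
        }
        END { if (seq != "") print seq } # emit the last sequence
    ' "$1"
}
```

For example, flatten_fasta input1.fa > input1.flat.fa would produce a file that the flattened-FASTA one-liner can read.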

score 2 · Accepted Answer · 2021-01-26

2

3.8 years ago

5heikki 11k

Exact sequence match, headers don't need to match, no linebreaks in sequences:

join -1 2 -2 2 -t $'\t' \
    <(paste -d $'\t' - - < f1.fasta | sort -t $'\t' -k2,2) \
    <(paste -d $'\t' - - < f2.fasta | sort -t $'\t' -k2,2)
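
The join prints one line per matching pair of records, so the count asked for in the question still has to be derived from it. A possible way to do that is sketched below, under some assumptions: count_shared_sequences is a hypothetical helper name, records are two lines each (header, sequence), matching is exact and case-sensitive, and blank lines between records are dropped first. It uses temporary files rather than process substitution so that it also runs in a plain POSIX shell.

```shell
# Hypothetical helper (not from the thread): print the number of distinct
# sequences that occur in both FASTA files.
# Assumptions: two-line records, exact case-sensitive matching,
# blank lines between records are ignored.
count_shared_sequences() {
    tab=$(printf '\t')
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1

    # Put header and sequence on one tab-separated line, sorted by sequence.
    grep -v '^[[:space:]]*$' "$1" | paste - - | LC_ALL=C sort -t "$tab" -k2,2 > "$tmp_a"
    grep -v '^[[:space:]]*$' "$2" | paste - - | LC_ALL=C sort -t "$tab" -k2,2 > "$tmp_b"

    # Join on the sequence column, keep only the sequence, count distinct ones.
    LC_ALL=C join -1 2 -2 2 -t "$tab" "$tmp_a" "$tmp_b" |
        cut -f1 | LC_ALL=C sort -u | wc -l | tr -d ' '

    rm -f "$tmp_a" "$tmp_b"
}
```

Calling count_shared_sequences fileA.fasta fileB.fasta prints a single number. Because it counts distinct sequences, the result can be lower than the number of matching records if fileA.fasta contains the same sequence under several headers.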