Find sequences in a FASTA file that are present in another FASTA file
3.9 years ago
lorenzinip • 0

Hi all

I have a FASTA file, fileA.fasta, which contains 15 sequences, and another FASTA file, fileB.fasta, which contains 92 sequences. I would like to find how many of the 15 sequences are also found among the 92.

The format (the same for both files) looks like this:

>ENST000006553_1908 
AGCGGGGCCCTT

>ENST000002542_1826 
GGGCCTAAAATT

...and so on


Do the sequences match exactly? Also, is there any match in the names?

3.9 years ago
5heikki 11k

This assumes exact sequence matches and no line breaks within sequences; the headers don't need to match:

join -1 2 -2 2 -t $'\t' \
    <(paste -d $'\t' - - < f1.fasta | sort -t $'\t' -k2,2) \
    <(paste -d $'\t' - - < f2.fasta | sort -t $'\t' -k2,2)
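
The join command above relies on every record being exactly two lines (header, then sequence). If the files have wrapped sequences or blank lines between records, a flattening step could be run first. A minimal sketch, assuming plain-text FASTA; `flatten_fasta` is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: print each record as two lines,
# the header followed by the whole sequence on a single line.
flatten_fasta() {
    awk '
        # Header line: flush the previous sequence, then print the header as-is.
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
        # Non-blank sequence line: strip spaces, tabs and CRs, then append.
        NF   { gsub(/[ \t\r]/, ""); seq = seq $0 }
        # End of input: flush the last sequence.
        END  { if (seq != "") print seq }
    ' "$1"
}
```

For example, `flatten_fasta f1.fasta > f1.flat.fasta` (the output name is only illustrative), after which the flattened files can be fed to the join command.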
3.9 years ago

With seqkit:

   $ seqkit common -si input1.fa input2.fa

The order of the files doesn't matter.

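Check whether the following awk script works (it matches records on both the ID and the sequence, and expects each sequence on a single line):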

$ awk -v RS=">" -v OFS="\n"  'NF > 1 && NR==FNR {a[$1,$2];next} ($1,$2) in a {print ">"$1,$2}' input1.fa input2.fa

If the FASTA files are flattened and the sequences and IDs are identical in both files, the following awk script should work:

$ awk  'FNR==NR {a[$1]=$0; next}; $1 in a  {print $1}' input1.fa input2.fa
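
Since the question asks only for a count and the headers may differ between the files, a sequence-only comparison may be enough. A minimal sketch, assuming each sequence sits on a single line; `count_shared_sequences` is a hypothetical helper name, and making the comparison case-insensitive is a design choice here, not something the question requires:

```shell
# Hypothetical helper: count distinct sequences from the first file
# that also occur in the second file, ignoring headers entirely.
count_shared_sequences() {
    awk '
        # Skip header lines and blank lines in both files.
        /^>/ || !NF { next }
        # Normalise the sequence: first field only, upper case.
        { s = toupper($1) }
        # First file: remember every sequence.
        FNR == NR { seen[s] = 1; next }
        # Second file: count each shared sequence once.
        (s in seen) && !counted[s]++ { n++ }
        END { print n + 0 }
    ' "$1" "$2"
}
```

Usage would be `count_shared_sequences fileA.fasta fileB.fasta`. Note that the `FNR == NR` idiom misbehaves if the first file is empty, so this is only a sketch for well-formed input.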