Question

how to sort unique seq from fasta files

19 months ago

Sapphire ▴ 10

Hi, I have 300 FASTA files, each containing 3000-4000 amino acid sequences. I would like statistics for each file: how many common and how many unique sequences does each FASTA file contain?

Thank You!

fasta • 900 views

score 2 · Answer 1 · 2023-06-05

19 months ago

Mensur Dlakic ★ 28k

CD-HIT clusters sequences that share identity above a given threshold and keeps one representative per cluster. If you set the threshold to 1.0 (meaning 100% identity), it removes identical sequences and retains a single representative of each:

cd-hit -i my_file.fasta -o my_file_nonredundant.fasta -c 1.0
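
As a quick cross-check that does not need CD-HIT, exact duplicates within one file can be counted with plain awk. This is a minimal sketch, not part of the answer above: the function name `count_distinct` is hypothetical, and it assumes sequences should be compared as exact strings, ignoring case and line wrapping. CD-HIT's numbers may differ, since it clusters by alignment rather than by string equality.

```shell
# Print "total<TAB>distinct" sequence counts for one FASTA file.
# Sequences are compared as exact strings, ignoring case and line wrapping.
# Records with an empty sequence are not counted.
count_distinct() {
    awk '
        /^>/ {
            # A new header closes the previous record, if it had any sequence.
            if (seq != "") { total++; seen[toupper(seq)] = 1 }
            seq = ""
            next
        }
        {
            # Join wrapped sequence lines, dropping whitespace and carriage returns.
            gsub(/[ \t\r]/, "")
            seq = seq $0
        }
        END {
            if (seq != "") { total++; seen[toupper(seq)] = 1 }
            n = 0
            for (s in seen) n++
            printf "%d\t%d\n", total, n
        }
    ' "$1"
}
```

Calling `count_distinct my_file.fasta` prints the number of records followed by the number of distinct sequences; the difference is the number of exact duplicates in that file.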

score 2 · Answer 2 · 2023-06-05

seqkit common finds sequences shared by multiple files, matching by ID, name, or sequence:

seqkit common --by-seq  --ignore-case  --only-positive-strand \
    --infile-list <(find dir/ -name "*.fasta" ) -o common.fa

seqkit grep can exclude a list of records by ID/name/sequence; here the list is the common sequences.

# common sequences, one record per line
seqkit seq --seq --line-width 0 common.fa -o common.fa.txt

# output dir
mkdir -p uniq

for f in dir/*; do
    # quote paths so file names with spaces do not break the loop
    b=$(basename "$f")
    seqkit grep --by-seq --only-positive-strand --invert-match \
        --pattern-file common.fa.txt "$f" -o "uniq/$b"
done

The method above compares whole sequences (exact match over all bases/amino acids). You may also use clustering methods, which might be more reasonable for proteins.
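
To turn these outputs into the per-file statistics the question asks for, the records in each input file and in its filtered counterpart can simply be counted. The following is a rough sketch under a few assumptions: the function names `count_records` and `fasta_stats` are made up for illustration, the input and filtered files share the same base names (as in the loop above), and "common" means present in every input file, so the number of common records in a file is the total minus the remainder.

```shell
# Count FASTA records (header lines starting with ">") in one file.
count_records() {
    # grep -c exits with status 1 when the count is 0; "|| true" keeps that harmless
    grep -c '^>' "$1" || true
}

# Print a tab-separated table: file name, total records, and records left
# after the common sequences were removed.
# Arguments: input directory, directory holding the filtered files.
fasta_stats() {
    stats_in=$1
    stats_uniq=$2
    printf 'file\ttotal\tnot_common\n'
    for stats_f in "$stats_in"/*; do
        stats_b=$(basename "$stats_f")
        printf '%s\t%s\t%s\n' "$stats_b" \
            "$(count_records "$stats_f")" \
            "$(count_records "$stats_uniq/$stats_b")"
    done
}
```

With the directory names used above, `fasta_stats dir uniq > stats.tsv` would write one row per input file.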