How to sort unique sequences from FASTA files
19 months ago
Sapphire ▴ 10

Hi, I have 300 FASTA files, each containing 3000-4000 amino acid sequences. I would like statistics for each file: how many common and how many unique sequences does each FASTA file contain?

Thank You!

fasta • 898 views
19 months ago
Mensur Dlakic ★ 28k

CD-HIT removes sequences whose identity to another sequence is above a chosen threshold. If you set that threshold to 1.0 (meaning 100% identity), it will remove all identical sequences and retain only one representative of each:

cd-hit -i my_file.fasta -o my_file_nonredundant.fasta -c 1.0
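
For the per-file numbers the question asks about, a rough sketch in plain awk (not CD-HIT) could report how many records a file holds and how many of them are distinct. The function name fasta_dup_stats is made up for illustration, and sequences are assumed to count as identical only when they match exactly over their full length, ignoring case:

```shell
# Hypothetical helper: print "total<TAB>distinct<TAB>redundant" for one FASTA file.
# Assumption: two records are duplicates only if their sequences are
# identical strings (case-insensitive); no clustering is performed.
fasta_dup_stats() {
    awk '
        # Register the record collected so far (if any).
        function record() {
            if (have) {
                total++
                if (!(seq in seen)) { seen[seq] = 1; distinct++ }
            }
        }
        /^>/ { record(); seq = ""; have = 1; next }              # header starts a new record
             { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }      # join wrapped sequence lines
        END  { record(); printf "%d\t%d\t%d\n", total, distinct, total - distinct }
    ' "$1"
}

# Example use over a directory of files (path is an assumption):
#   for f in dir/*.fasta; do printf '%s\t' "$f"; fasta_dup_stats "$f"; done
```

The third column is the number of records that would disappear if only one copy of each sequence were kept.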
19 months ago

seqkit common finds sequences shared by multiple files, compared by ID, name, or sequence.

seqkit common --by-seq  --ignore-case  --only-positive-strand \
    --infile-list <(find dir/ -name "*.fasta" ) -o common.fa

seqkit grep can exclude a list of records by ID, name, or sequence; here, the list is the common sequences.

# common sequences, one record per line
seqkit seq --seq --line-width 0 common.fa -o common.fa.txt

# output dir
mkdir -p uniq

for f in dir/*; do
    # quote paths so file names with spaces do not break the loop
    b=$(basename "$f")
    seqkit grep --by-seq --only-positive-strand --invert-match \
        --pattern-file common.fa.txt "$f" -o "uniq/$b"
done
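
Once the loop has finished, the per-file statistics can be tabulated by counting header lines before and after filtering. This is only a sketch: count_records and summarize_unique are hypothetical helper names, and it assumes the input and output directories hold files with matching names, as in the loop above.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_records() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Print one line per input file:
#   name <TAB> total records <TAB> records removed as common <TAB> records kept
summarize_unique() {
    in_dir=$1
    out_dir=$2
    for f in "$in_dir"/*; do
        b=$(basename "$f")
        [ -f "$out_dir/$b" ] || continue      # skip files that were not filtered
        total=$(count_records "$f")
        kept=$(count_records "$out_dir/$b")
        printf '%s\t%d\t%d\t%d\n' "$b" "$total" "$((total - kept))" "$kept"
    done
}

# Example call, using the directory names from the commands above:
#   summarize_unique dir uniq
```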

The method above compares sequences over their full length (exact match of all bases/amino acids). You may also use clustering methods, which might be more reasonable for proteins.
