Hi, I have 300 FASTA files, each containing 3000-4000 amino acid sequences. I would like statistics for each file, such as how many common and unique sequences there are in each FASTA file.
Thank You!
CD-HIT clusters sequences that share identity above a given threshold and keeps one representative per cluster. If you set that threshold to 1.0
(meaning 100% identity), it removes identical sequences and retains only one representative of each:
cd-hit -i my_file.fasta -o my_file_nonredundant.fasta -c 1.0
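If you just want a quick per-file count of exact duplicates without running CD-HIT, something like the sketch below might help. It is plain awk, not part of CD-HIT, and the function name `distinct_seqs` is made up for illustration. It assumes standard FASTA (header lines start with ">"), joins wrapped sequence lines, ignores case, and skips records with an empty sequence. Counts may differ slightly from the CD-HIT output because, as far as I know, CD-HIT measures identity relative to the shorter sequence, so a fragment fully contained in a longer sequence can also be clustered away.

```shell
# Print "<total> <distinct>" for one FASTA file:
# total = number of records with a non-empty sequence,
# distinct = number of different sequences (exact, case-insensitive match).
distinct_seqs() {
    awk '
        # New header: store the previous record (if any) and reset the buffer.
        /^>/ { if (seq != "") { n++; seen[toupper(seq)] = 1 }; seq = ""; next }
        # Sequence line: strip whitespace and append to the current record.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END {
            # Store the last record, then count the distinct sequences seen.
            if (seq != "") { n++; seen[toupper(seq)] = 1 }
            d = 0
            for (s in seen) d++
            print n + 0, d
        }
    ' "$1"
}
```

For example, `distinct_seqs my_file.fasta` prints two numbers; the difference between them is the number of redundant records in that file.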
seqkit common finds sequences shared by multiple files, matching by ID, name, or sequence:
seqkit common --by-seq --ignore-case --only-positive-strand \
--infile-list <(find dir/ -name "*.fasta" ) -o common.fa
seqkit grep can exclude a list of records by ID, name, or sequence; here, the list is the common sequences.
# common sequences, one record per line
seqkit seq --seq --line-width 0 common.fa -o common.fa.txt
# output dir
mkdir -p uniq
# keep only records whose sequence is NOT among the common ones
for f in dir/*.fasta; do
b=$(basename "$f")
seqkit grep --by-seq --only-positive-strand --invert-match \
--pattern-file common.fa.txt "$f" -o "uniq/$b"
done
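To turn the results into per-file statistics, you could count the records before and after filtering. The sketch below is only an illustration: the function names `count_records` and `summarize` are made up, and it assumes the original files are in dir/ with a .fasta extension and that the filtered files in uniq/ keep the same basenames. Records are counted by counting header lines, so "common" here is the number of records in a file whose sequence appears in common.fa.

```shell
# Count FASTA records in a file by counting header lines (lines starting with ">").
count_records() {
    grep -c '^>' "$1" || true   # grep exits non-zero on 0 matches but still prints 0
}

# Print a tab-separated table: file, total records, unique records, common records.
# $1 = directory with the original files, $2 = directory with the filtered files.
summarize() {
    in_dir=$1
    uniq_dir=$2
    printf 'file\ttotal\tunique\tcommon\n'
    for f in "$in_dir"/*.fasta; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        b=$(basename "$f")
        total=$(count_records "$f")
        n_uniq=0
        if [ -e "$uniq_dir/$b" ]; then
            n_uniq=$(count_records "$uniq_dir/$b")
        fi
        # common = records removed by the filtering step
        printf '%s\t%d\t%d\t%d\n' "$b" "$total" "$n_uniq" "$((total - n_uniq))"
    done
}
```

Running `summarize dir uniq > stats.tsv` would then give one line per input file.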
The method above compares whole sequences (exact match over all bases/amino acids). You could also use clustering methods, which might be more reasonable for proteins.