Hello all!
I am very new to bioinformatics, so apologies if my question is basic. I assembled a metagenome and obtained 3,799,736 contigs. I want to dereplicate these contigs to eliminate redundant data. To do so, I built a nucleotide database with makeblastdb and ran an all-vs-all blastn of my contigs against themselves, producing a blast.tsv file containing the hits for each contig. This part ran successfully.
Now, I want to cluster my contigs at 95% contig ANI and 85% coverage. To do so, I am using anicalc.py and aniclust.py. I will then use these clusters to keep only the largest contig of each cluster. However, when I run my script, the output log tells me "324046 sequences retained from fna" when it should be retaining 3,799,736 sequences. Below is my script:
#!/bin/sh
#SBATCH ...stuff here
module load StdEnv/2023 python/3.12.4
module load StdEnv/2023 seqkit/2.5.1
module load samtools/1.20
module load StdEnv/2023 gcc/13.3 blast+/2.14.1
source derep_scripts/bin/activate
# Directories
working_dir=/home/bens/projects/ctb-shapiro/bens/prophage_induction/02_rerun_dereplicated_contigs_alinedata_11.10
contigs=/home/bens/projects/ctb-shapiro/bens/prophage_induction/01_rerun_megahit_output_alinedata_01.10/reran_concatenated_contigs_1000kb_11.10.fasta
python_dir=/home/bens/projects/ctb-shapiro/bens/scripts/python_scripts
# Make BLAST database
makeblastdb -in $contigs -dbtype nucl -out $working_dir/blast_db
blastn -query $contigs -db $working_dir/blast_db -outfmt '6 std qlen slen' -max_target_seqs 10000 -out $working_dir/blastn/blast.tsv -num_threads 40
echo "blastn done"
#Calculate pairwise ANI by combining local alignments between sequence pairs:
python $python_dir/anicalc.py -i $working_dir/blastn/blast.tsv -o $working_dir/ANI/ANI.tsv
#Cluster contigs based on 95% ANI + 85% coverage
python $python_dir/aniclust.py --fna $contigs --ani $working_dir/ANI/ANI.tsv --out $working_dir/clusters/clusters.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
# Dereplication
# aniclust.py writes one cluster per line in clusters.tsv, with the representative contig in column 1, so keep only column 1 in vOTUs_subset.txt
clusters=$working_dir/clusters/clusters.tsv
awk '{print $1}' $clusters > $working_dir/vOTUs_subset.txt
#Extract the dereplicated sequences.
samtools faidx $contigs -r $working_dir/vOTUs_subset.txt -o $working_dir/dereplicated_seqs/dereplicated_phages.fa
deactivate
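For context on the awk step: if I understand the aniclust.py output correctly, each line of clusters.tsv is one cluster, with the representative in column 1 and the comma-separated members in column 2, so printing column 1 keeps one representative per cluster. A tiny illustration with made-up contig IDs:

```shell
# Hypothetical clusters.tsv: column 1 = representative, column 2 = members.
printf 'contig_1\tcontig_1,contig_9\ncontig_4\tcontig_4\n' > sample_clusters.tsv

# Keep only the representative (first column) of each cluster.
awk '{print $1}' sample_clusters.tsv    # prints contig_1, then contig_4
```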
I have tried a few things that did not work:
- Using older versions of modules
- Double-checking my contigs file to confirm that it contains 3,799,736 contigs
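(For the second point, one quick way to count the records in a FASTA is to count the header lines; shown here on a made-up three-record file:)

```shell
# Hypothetical 3-record FASTA, just for illustration.
printf '>c1\nACGT\n>c2\nACGT\n>c3\nACGT\n' > sample_contigs.fasta

# Each FASTA record starts with '>', so counting those lines counts records.
grep -c '^>' sample_contigs.fasta    # prints 3
```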
I am honestly very new to this, and maybe I am missing something obvious. Maybe my blast.tsv file only contains hits for 324,046 sequences, but I am not sure how to check this, since the .tsv file contains one line per hit and therefore has 3,907,720,561 lines. I am also not completely sure which part of my script actually reads sequences from the fna, so I'm not sure where to look for the problem.
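For reference, one way to count the distinct query IDs in a BLAST tabular file without loading it all at once is to stream column 1 through awk (illustrated on a tiny made-up file, since the real one is huge):

```shell
# Hypothetical outfmt-6-style file: 4 hits, 2 distinct query IDs in column 1.
printf 'c1\tc1\t100\nc1\tc2\t96\nc2\tc2\t100\nc2\tc1\t96\n' > sample_blast.tsv

# awk keeps the first occurrence of each query ID, so memory grows with the
# number of unique IDs, not with the number of lines in the file.
cut -f1 sample_blast.tsv | awk '!seen[$0]++' | wc -l    # prints 2
```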
Any help is so very much appreciated!