Hello all!
I am very new to bioinformatics, so apologies if my question is basic. I assembled a metagenome and obtained 3,799,736 contigs. I want to dereplicate these contigs to eliminate redundant data. To do so, I built a blastn database with makeblastdb and queried my contigs against themselves, producing a blast.tsv file with the hits for each contig. This part ran successfully.
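As a quick sanity check on the input, the number of records in a FASTA file can be counted from its header lines (the toy file below is made up just to demonstrate; substitute your real contigs path):

```shell
# Write a tiny toy FASTA (3 records) purely for demonstration.
printf '>c1\nACGT\n>c2\nGGCC\n>c3\nTTAA\n' > toy_contigs.fasta

# Each FASTA record starts with '>', so counting those lines
# counts the sequences in the file:
grep -c '^>' toy_contigs.fasta
```

Running the same `grep -c '^>'` on the real assembly should print 3,799,736 if the file truly contains that many records.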
Now, I want to cluster my contigs based on 95% contig ANI and 85% coverage. To do so, I am using anicalc.py and aniclust.py, and I will then use these clusters to keep only the largest contig of each cluster. However, when I run my script, the output log tells me: 324046 sequences retained from fna, when it should be retaining 3,799,736 sequences. Below is my script:
#!/bin/sh
#SBATCH ...stuff here
module load StdEnv/2023 python/3.12.4
module load StdEnv/2023 seqkit/2.5.1
module load samtools/1.20
module load StdEnv/2023 gcc/13.3 blast+/2.14.1
source derep_scripts/bin/activate
# Directories
working_dir=/home/bens/projects/ctb-shapiro/bens/prophage_induction/02_rerun_dereplicated_contigs_alinedata_11.10
contigs=/home/bens/projects/ctb-shapiro/bens/prophage_induction/01_rerun_megahit_output_alinedata_01.10/reran_concatenated_contigs_1000kb_11.10.fasta
python_dir=/home/bens/projects/ctb-shapiro/bens/scripts/python_scripts
# Make BLAST database
working_dir=/home/bens/projects/ctb-shapiro/bens/prophage_induction/rerun_dereplicated_contigs_alinedata_11.10
contigs=/home/bens/projects/ctb-shapiro/bens/prophage_induction/rerun_megahit_output_alinedata_01.10/reran_concatenated_contigs_1000kb_11.10.fasta
makeblastdb -in $contigs -dbtype nucl -out $working_dir/blast_db
blastn -query $contigs -db $working_dir/blast_db -outfmt '6 std qlen slen' -max_target_seqs 10000 -out $working_dir/blastn/blast.tsv -num_threads 40
echo "blastn done"
#Calculate pairwise ANI by combining local alignments between sequence pairs:
python $python_dir/anicalc.py -i $working_dir/blastn/blast.tsv -o $working_dir/ANI/ANI.tsv
#Cluster contigs based on 95% ANI + 85% coverage
python $python_dir/aniclust.py --fna $contigs --ani $working_dir/ANI/ANI.tsv --out $working_dir/clusters/clusters.tsv --min_ani 95 --min_tcov 85 --min_qcov 0
# Dereplication
# Each line of clusters.tsv holds one cluster, so printing only the first
# column writes just the representative contig of each cluster to vOTUs_subset.txt
clusters=$working_dir/clusters/clusters.tsv
awk '{print $1}' $clusters > $working_dir/vOTUs_subset.txt
#Extract the dereplicated sequences.
samtools faidx $contigs -r $working_dir/vOTUs_subset.txt -o $working_dir/dereplicated_seqs/dereplicated_phages.fa
deactivate
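The dereplication step in the script can be sanity-checked on a toy clusters file (the IDs below are made up, and the two-column layout, representative in column 1 and comma-separated members in column 2, is my assumption about the aniclust.py output format):

```shell
# Toy clusters.tsv: column 1 = cluster representative,
# column 2 = comma-separated cluster members (format assumed).
printf 'c1\tc1,c3,c5\nc2\tc2\nc4\tc4,c6\n' > clusters.tsv

# Keep only the representative of each cluster, one ID per line:
awk '{print $1}' clusters.tsv > vOTUs_subset.txt
cat vOTUs_subset.txt
```

On this toy input the subset file contains c1, c2, and c4, i.e. one representative per cluster, which is what samtools faidx then extracts.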
I have tried a few things that did not work:
- Using older versions of modules
- Double-checked my contigs file to ensure that it contains 3,799,736 contigs
I am honestly very new to this, so maybe I am missing something obvious. Perhaps my blast.tsv file only covers 324,046 sequences, but I am not sure how to check this, since the .tsv file contains a line for every possible hit and therefore has 3,907,720,561 lines. I am also not completely sure which part of my script is reading sequences from the fna, so I'm not 100% sure where to look for the problem.
Any help is so very much appreciated!
Thank you so much for the reply! I'm definitely going to look into this, as it now seems like a very likely answer. However, when I compared my output log to my colleagues' logs, theirs state that the number of sequences retained from fna equals their number of contigs. I'll discuss this further with my colleagues. Thank you for your help!