Contig dereplication script not retrieving correct number of sequences
1
0
Entering edit mode
12 hours ago

Hello all!

I am very new to bioinformatics so apologies if my question is basic. I assembled a metagenome and I obtained 3,799,736 contigs from this assembly. I want to perform dereplication on my contigs to eliminate redundant data. To do so, I generated a blastn database with makeblastdb, and queried my contigs to themselves to generate a blast.tsv file containing the hits of each contig. This part ran successfully.

Now, I want to cluster my contigs together based on 95% contig ANI and 85% coverage. To do so, I am using anicalc.py and aniclust.py. Then I will use this clusters to keep only the largest contig of each cluster. However, when I attempt to run my script, the output log tells me: 324046 sequences retained from fna when it should be retrieving 3,799,736 sequences. Below is my script:

#!/bin/sh
#SBATCH ...stuff here


    module load StdEnv/2023 python/3.12.4
    module load StdEnv/2023 seqkit/2.5.1
    module load samtools/1.20
    module load StdEnv/2023 gcc/13.3 blast+/2.14.1

    source derep_scripts/bin/activate

# Directories
working_dir=/home/bens/projects/ctb-shapiro/bens/prophage_induction/02_rerun_dereplicated_contigs_alinedata_11.10
contigs=/home/bens/projects/ctb-shapiro/bens/prophage_induction/01_rerun_megahit_output_alinedata_01.10/reran_concatenated_contigs_1000kb_11.10.fasta
python_dir=/home/bens/projects/ctb-shapiro/bens/scripts/python_scripts


# Make BLAST database
working_dir=/home/bens/projects/ctb-shapiro/bens/prophage_induction/rerun_dereplicated_contigs_alinedata_11.10
contigs=/home/bens/projects/ctb-shapiro/bens/prophage_induction/rerun_megahit_output_alinedata_01.10/reran_concatenated_contigs_1000kb_11.10.fasta

makeblastdb -in $contigs -dbtype nucl -out $working_dir/blast_db
blastn -query $contigs -db $working_dir/blast_db -outfmt '6 std qlen slen' -max_target_seqs 10000 -out $working_dir/blastn/blast.tsv -num_threads 40

echo "blastn done"

#Calculate pairwise ANI by combining local alignments between sequence pairs:
python $python_dir/anicalc.py -i $working_dir/blastn/blast.tsv -o $working_dir/ANI/ANI.tsv


#Cluster contigs based on 95% ANI + 85% coverage
python $python_dir/aniclust.py --fna $contigs --ani $working_dir/ANI/ANI.tsv --out $working_dir/clusters/clusters.tsv --min_ani 95 --min_tcov 85 --min_qcov 0


# Dereplication
# Since clustered contigs are together on one line in clusters.tsv, the representative contig is the only contig >> to vOTU_subset.txt
clusters=$working_dir/clusters/clusters.tsv
awk '{print $1}' $clusters > $working_dir/vOTUs_subset.txt


#Extract the dereplicated sequences.
samtools faidx $contigs -r $working_dir/vOTUs_subset.txt -o $working_dir/dereplicated_seqs/dereplicated_phages.fa

deactivate 

I have tried a few things that did not work:

  • Using older versions of modules
  • Double checked my contigs directory to ensure that it contains 3,799,736 contigs

I am honestly very new to this and maybe I am missing something obvious. Maybe my blast.tsv file only contains 324,046 seqs, but I am not sure how to check this since the .tsv file contains a line for every possible hit, so the resulting file contains 3,907,720,561 lines. I am also not completely sure which part of my script is reading sequences from fna so Im not 100% sure where to look for my problem

Any help is so very much appreciated!

phage ANI metagenomics dereplication • 77 views
ADD COMMENT
0
Entering edit mode
6 hours ago
Mensur Dlakic ★ 28k

I think you are missing something obvious. As you highlighted above, your script says that 324046 sequences retained from fna. So it read in 3,799,736 contigs, performed the redundancy analysis, and retained ~10% non-redundant sequences (324,046).

ADD COMMENT

Login before adding your answer.

Traffic: 1244 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6