Metabat2 (binning package) Error: ReferenceFile: <filename>.fasta is not the same as in the bam headers! (targets: 10575 from the bam vs 67331 from the ref)
4 months ago

Hi,

I am currently trying to use metabat2 to bin shotgun reads to obtain MAGs. I have scaffolds.fasta files from assembly (metaSPAdes), and I made .bam files with bowtie2 and samtools from the same fastq files, since my understanding is that a binning package like metabat2 needs both the sorted bam file and the contig file.

However, when I run it on one sample (the scaffolds.fasta file and the sorted bam file) I get this error:

Error: referenceFile: scaffolds.fasta is not the same as in the bam headers! (targets: 10575 from the bam vs 67331 from the ref)

It is clear that they don't match and that the scaffolds.fasta has way more reads than the bam file, right? I tried using the contigs.fasta as well, to no avail.

Are there any hints as to why this could be? Is there a better way to get a contigs fasta file and a sorted bam file that match? Thanks! Here is the original command I ran:

runMetaBat.sh Torgos-tracheliotus_S_S_Temp_D703-AK1682/scaffolds.fasta Torgos-tracheliotus_S_S_Temp_D703-AK1682_sorted.bam

binning metagenomics metabat2 • 1.6k views
4 months ago
Mensur Dlakic ★ 29k

I am currently trying to use metabat2 to bin shotgun reads to obtain MAGs.

The goal should be to bin the assembled contigs, not shotgun reads. I think the error is somewhere upstream of the command you listed.
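A quick way to see where the two files diverge is to count the sequences on each side: FASTA header lines in the assembly versus @SQ lines in the BAM header. Presumably these are the two "targets" numbers the error message reports. Below is a minimal sketch; the function names are made up for illustration, and the samtools call in the usage comment assumes samtools is on your PATH.

```shell
# Hypothetical helpers (illustrative names, not part of metabat2 or samtools).

# Count sequences in a FASTA file: one header line (starting with ">") per sequence.
count_fasta_targets() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Count reference sequences in SAM/BAM header text read from stdin:
# each target is declared on one @SQ line.
count_header_targets() {
    awk '/^@SQ\t/ { n++ } END { print n + 0 }'
}

# Usage (assumes samtools is installed):
#   count_fasta_targets scaffolds.fasta
#   samtools view -H sorted.bam | count_header_targets
```

If the two counts differ, the bam file was most likely built against a different reference than the fasta file you are passing to the binner.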

Let's say that your reads file is reads.fastq and your contigs assembly file is scaffolds.fasta. First you map the reads onto the assembly using bowtie2, and get the sorted bam file sorted.bam using samtools.

Then your command might be:

runMetaBat.sh scaffolds.fasta sorted.bam

I say might be, because I don't know what exactly runMetaBat.sh does. Presumably that script will create a tab-delimited file of contig depths from the assembly and the bam file, followed by binning of the same assembly file.

You can run metabat2 without mapping the reads, just to make sure it works:

metabat2 -i Torgos-tracheliotus_S_S_Temp_D703-AK1682/scaffolds.fasta -o bins

That should give you a bunch of binned files starting with bins.N where N is the bin number. This won't use the depth of coverage.


Hi Mensur Dlakic, thank you for the very detailed reply. I tried a test run on the first file and got 7 bins, so it seems to work. Does this mean 7 "MAGs" were binned for that sample, but I won't necessarily know their coverage (i.e. how many hits per sample)?

As for getting coverage, how would you redo my starting steps to ensure that the sorted bam and scaffolds files have matching headers/number of reads?


Does this mean 7 "MAGs" were binned for that sample, but I won't necessarily know their coverage (i.e. how many hits per sample)?

You will never get the information about coverage from the binning alone. You only get sequences that belong to a given bin. However, if you get a properly sorted .bam file from the assembly, there is a little utility called jgi_summarize_bam_contig_depths that will calculate the coverage for each contig. It comes with the metabat2 distribution, and its use is described in its documentation.
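To go from per-contig depth to a rough per-bin coverage, one option is to average the contig depths within each bin, weighted by contig length. The sketch below assumes the depth table was written by something like jgi_summarize_bam_contig_depths --outputDepth depth.txt sorted.bam, that it is tab-delimited with a header row, and that its first three columns are contigName, contigLen and totalAvgDepth. The helper name bin_mean_depth is hypothetical, and it assumes the FASTA headers in each bin match the contig names in the table.

```shell
# Hypothetical helper: length-weighted mean depth of one bin.
#   $1 = depth table (tab-delimited; assumed columns: contigName, contigLen, totalAvgDepth, ...)
#   $2 = bin FASTA file
bin_mean_depth() {
    awk -F '\t' '
        # First file: the depth table. Skip the header row, store length and depth per contig.
        NR == FNR { if (FNR > 1) { len[$1] = $2; depth[$1] = $3 }; next }
        # Second file: the bin FASTA. Only header lines matter here.
        /^>/ {
            name = substr($0, 2)
            sub(/[ \t].*/, "", name)   # keep the name up to the first whitespace
            if (name in len) { bases += len[name]; weighted += len[name] * depth[name] }
        }
        # Print NA when none of the contigs in the bin appear in the table.
        END { if (bases > 0) printf "%.2f\n", weighted / bases; else print "NA" }
    ' "$1" "$2"
}

# Usage, one line per bin (adjust the path to wherever your bins were written):
#   for f in path/to/bins/*.fa; do
#       printf '%s\t%s\n' "$f" "$(bin_mean_depth depth.txt "$f")"
#   done
```

This is only a summary number per bin; the per-contig values in the depth table remain the primary output.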

This is in general how the mapping is done, and it must start from the same assembly file (scaffolds.fasta) that will later be used for binning. I am putting arbitrary numbers for total threads (20) and mapping with both paired and single reads, which may not be realistic. You will have to adjust the commands to your setup.

bowtie2-build scaffolds.fasta scaffolds.db
bowtie2 -x scaffolds.db -q --phred33 --very-sensitive --no-unal -p 20 -S file.sam -1 forward_reads.fastq -2 reverse_reads.fastq -U single_reads.fastq
samtools view -bS file.sam | samtools sort -o sorted.bam

Then presumably this command should work:

runMetaBat.sh scaffolds.fasta sorted.bam

Hi Mensur,

I tried your method and it seems to have worked. Thank you for the help. I got 11 bins.


Sorry @mensur, one last question: if I run checkM2 on the 11 bin files and get back a "0 bins found" output, does that just mean there aren't enough bins to get any genomes from? I am reading that a lot of people combine all the bins from all samples first and then start looking for MAGs, but I am unsure whether this is accurate.


I don't have much experience with checkM2, but it sounds like you may be giving a wrong directory location, or specifying a wrong file extension for the bins. Impossible to tell from the information you provided.

No, bins are not meant to be combined. If things have been done properly, bins in most cases are equivalent to MAGs.

For your future inquiries: at a minimum one needs the whole command and the whole error message. Without them, it becomes a guessing game.


Yeah, that's what I was thinking, but I triple-checked and it seems to be right. And yes, of course, here is the output for reference:

Running ls to show the bins in .fa format in the scaffolds.fasta.metabat-bins-20250120_181114 directory:

 (checkm2) [sdegregori@b2-008]/ddn_scratch/sdegregori/songfastq/Torgos-tracheliotus_S_S_Temp_D703-AK1682% ls scaffolds.fasta.metabat-bins-20250120_181114                                                             
    bin.10.fa  bin.11.fa  bin.1.fa  bin.2.fa  bin.3.fa  bin.4.fa  bin.5.fa  bin.6.fa  bin.7.fa  bin.8.fa  bin.9.fa

and then running checkm2 on that directory:

(checkm2) [sdegregori@b2-008]/ddn_scratch/sdegregori/songfastq/Torgos-tracheliotus_S_S_Temp_D703-AK1682% checkm2 predict --threads 1 --input scaffolds.fasta.metabat-bins-20250120_181114 --output-directory checkout
/home/sdegregori/miniconda3/envs/checkm2/bin/checkm2:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('CheckM2==1.0.2', 'checkm2')
[01/21/2025 01:40:15 PM] INFO: Running CheckM2 version 1.0.2
[01/21/2025 01:40:15 PM] INFO: Running quality prediction workflow with 1 threads.
[01/21/2025 01:40:15 PM] ERROR: No bins found. Check the extension (-x) used to identify bins.

I don't mean to sound harsh, but this is pretty basic stuff that is easily resolved if you read through the CheckM2 manual and follow the direction that is already given to you:

Check the extension (-x) used to identify bins.

If you run checkm2 predict -h to get help, it will tell you that the default file extension the program looks for is .fna. All your bins end in .fa, so the program "sees" nothing in that directory. If you add -x fa to your existing command, it will look for files that end in .fa.
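A simple sanity check before launching CheckM2 is to count how many files in the bin directory actually end with the extension you plan to pass to -x. The helper name below is hypothetical; the checkm2 predict line in the comment is the command from this thread with -x fa added.

```shell
# Hypothetical helper: count regular files in directory $1 whose names end in ".$2".
count_bins_with_ext() {
    n=0
    for f in "$1"/*."$2"; do
        # An unmatched glob stays literal, so confirm the path is a real file.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage with the directory from this thread:
#   count_bins_with_ext scaffolds.fasta.metabat-bins-20250120_181114 fna
#   count_bins_with_ext scaffolds.fasta.metabat-bins-20250120_181114 fa
#   checkm2 predict --threads 1 -x fa --input scaffolds.fasta.metabat-bins-20250120_181114 --output-directory checkout
```

Going by the ls listing above, the first call should report no files and the second should report all 11 bins.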

By the way, if you also set --threads N, where N is the number of threads available on your computer, the program will run faster. It helps to read the manual.


Mensur Dlakic, apologies for subjecting you to such a trivial troubleshooting error. I'm hitting myself on the head because I thought fa and fna were the same thing. The option worked, so I very much appreciate the help. Thank you!
