Hello,
I am currently working on a de novo assembly of a large genome, and I now want to assess the quality of the reconstructed genome. I saw that there are several suitable programs. I am trying to use ALE (a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies). I pre-aligned the reads to the scaffolds with Bowtie2 and with bwa. The alignments are in SAM format.
In both cases, I get an error with this command:
./ALE \
/data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Validation/ABYSS-scaffolds_bowtie2.sam \
/data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Assembly/ABySS_Assembly/ABYSS-scaffolds.fa \
/data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Validation/ABYSS_scaffolds_ALE.txt
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Checking if /data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Validation/ABYSS-scaffolds_bowtie2_step2.sam is a SAM formatted file, instead of BAM
[samopen] SAM header is present: 3320 sequences.
Reading in assembly...
Found 6 ambiguous bases (excluding N) in the assembly.
Reading in the map and computing statistics...
Insert length and std not given, will be calculated from input map.
Read 1000000 reads...
Setting library to be sorted by name (647052 new sequential names vs 1294104 reads)
Found FR sample avg insert length to be 173.244532 from 887964 mapped reads
Found FR sample insert length std to be 19.926324
There were 1294104 total reads, 1294104 paired (923164 properly mated), 41431 proper singles, 329509 improper reads (3592 chimeric). (324647 reads were unmapped)
Saved library parameters to /data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Validation/ABYSS_scaffolds_ALE.txt.param
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Checking if /data/DataSet/DeNovo/Softs/pipeline/temp/donneestest/Validation/ABYSS-scaffolds_bowtie2_step2.sam is a SAM formatted file, instead of BAM
[samopen] SAM header is present: 3320 sequences.
Computing read placements and depths
MD mismatch but it does not match! SRR022868.11196 89: refpos 442 MDpos 14: 'K' vs 'N'
Abandon
Does anyone have an idea of what causes this problem and how to solve it?
While searching, I saw that some people had a similar problem with samtools. It seems to be a problem with the SAM file, but I do not see why or how the SAM file could be incorrect.
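One thing that can be checked from the log itself: ALE reports 6 ambiguous bases (excluding N) in the assembly, and the failing line compares 'K' (an IUPAC ambiguity code) with 'N'. A possible explanation, which is only an assumption here, is that the aligner wrote N in the MD tag where the scaffold contains an ambiguity code. The Python sketch below lists where such codes occur in a FASTA file and shows one way to mask them with N. The function names find_ambiguous and mask_ambiguous are hypothetical helpers, not part of ALE, Bowtie2, or bwa, and masking is not a confirmed fix.

```python
# IUPAC ambiguity codes other than N (assumption: these are the bases
# that ALE counts as "ambiguous bases (excluding N)").
AMBIGUOUS = set("RYSWKMBDHV")


def find_ambiguous(fasta_lines):
    """Yield (sequence_name, 1-based position, base) for each ambiguity code."""
    name, pos = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # New record: keep the first word of the header, reset the position.
            parts = line[1:].split()
            name, pos = (parts[0] if parts else ""), 0
        else:
            for base in line:
                pos += 1
                if base.upper() in AMBIGUOUS:
                    yield name, pos, base


def mask_ambiguous(fasta_lines):
    """Return the FASTA lines with ambiguity codes replaced by N."""
    masked = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            line = "".join("N" if b.upper() in AMBIGUOUS else b for b in line)
        masked.append(line)
    return masked


if __name__ == "__main__":
    # Small made-up record, only to illustrate the two helpers.
    demo = [">scaffold1 demo", "ACGTK", "NNRA"]
    for hit in find_ambiguous(demo):
        print(hit)
    print("\n".join(mask_ambiguous(demo)))
```

If masking is tried, the reads would presumably need to be re-aligned against the masked scaffolds, so that the SAM file and the FASTA file given to ALE describe the same reference.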
Best regards,
Your solution works, thank you!
A. GUYOMARD