I am really impressed with the speed increase in the GPU-enabled read mapper, Arioc.
However, I am finding a discrepancy between the length (in nucleotides) of the input FASTA records (the reference genome, whether supplied as a multi-FASTA file or as single-FASTA files) and the length reported for the same records after Arioc encoding. This prevents the resulting SAM/BAM files from being used in downstream applications (e.g. GATK).
I can run the Scerevisiae example files as provided with the Arioc download, and the reported lengths are correct. I have used these example .cfg files as a strict template with my own FASTA files, but each of the FASTA records in the output shows the same (truncated) length of 10485759. I have also tried many other configurations, but all give the same LN=10485759.
Is 10485759 the maximum length of a FASTA record that can be read? Has anyone else encountered this problem?
My input fasta files seem pretty standard, and can be read correctly by many other programs.
Here are the lengths of the input records (in nucleotides):
Chr01 215687109
Chr02 188126098
Chr03 185291080
Chr04 165120918
Chr05 191020454
Chr06 195786439
Chr07 160739793
Chr08 226883875
Chr09 211202930
Chr10 184451305
Chr11 182988052
Chr12 176693890
Chr13 163306629
Chr14 158828433
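For anyone who wants to reproduce the input lengths listed above independently of any aligner, here is a minimal Python sketch. The function name `fasta_lengths` is hypothetical, and it assumes a plain-text FASTA file where the record name is the first whitespace-delimited token after `>`.

```python
def fasta_lengths(lines):
    """Return {record_name: nucleotide_count} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Record name = first token of the header line (assumption).
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: count every character as one nucleotide.
            lengths[name] += len(line)
    return lengths
```

It can be fed an open file handle directly, e.g. `fasta_lengths(open("reference.fa"))`, where `reference.fa` is a placeholder file name.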
And here is the output (.cfg) file:
<?xml version="1.0" encoding="UTF-8"?>
<SAM fn="hsi20_0_30">
<HD VN="1.6"/>
<SQ srcId="0" subId="001" rm="Chr01" UR="" LN="10485759" AS="S288C" M5="7ed4be27dbb7bf131f73730e8afe875f" SN="Chr01"/>
<SQ srcId="0" subId="002" rm="Chr02" UR="" LN="10485759" AS="S288C" M5="6c44c5d5c83d9678b3983047bdba5778" SN="Chr02"/>
<SQ srcId="0" subId="003" rm="Chr03" UR="" LN="10485759" AS="S288C" M5="8d1130af9c660807090cc2a07ce38dea" SN="Chr03"/>
<SQ srcId="0" subId="004" rm="Chr04" UR="" LN="10485759" AS="S288C" M5="851abd8f550924d33f914215c46c37fc" SN="Chr04"/>
<SQ srcId="0" subId="005" rm="Chr05" UR="" LN="10485759" AS="S288C" M5="f61292522bc376c2d306b14e11fc4bc1" SN="Chr05"/>
<SQ srcId="0" subId="006" rm="Chr06" UR="" LN="10485759" AS="S288C" M5="5b50426ce0a09437abbd424bc3ea08f9" SN="Chr06"/>
<SQ srcId="0" subId="007" rm="Chr07" UR="" LN="10485759" AS="S288C" M5="8fdbf362f722ef81e7c89c4d1a165474" SN="Chr07"/>
<SQ srcId="0" subId="008" rm="Chr08" UR="" LN="10485759" AS="S288C" M5="f95125c51c6f00ac4ac16215f6636fb8" SN="Chr08"/>
<SQ srcId="0" subId="009" rm="Chr09" UR="" LN="10485759" AS="S288C" M5="3733588cc77e79e2a73cd2af4c7b5059" SN="Chr09"/>
<SQ srcId="0" subId="010" rm="Chr10" UR="" LN="10485759" AS="S288C" M5="9500cde51e37d1e7c09a17403b38f9d4" SN="Chr10"/>
<SQ srcId="0" subId="011" rm="Chr11" UR="" LN="10485759" AS="S288C" M5="e4ac83591c85946aaa91fef9f5e78179" SN="Chr11"/>
<SQ srcId="0" subId="012" rm="Chr12" UR="" LN="10485759" AS="S288C" M5="c1abdb1d942a8deafb1eb04111ea28d3" SN="Chr12"/>
<SQ srcId="0" subId="013" rm="Chr13" UR="" LN="10485759" AS="S288C" M5="a213ea02435b2da8aec958f10324d86c" SN="Chr13"/>
<SQ srcId="0" subId="014" rm="Chr14" UR="" LN="10485759" AS="S288C" M5="d0e441107536881d402aae13edc47e30" SN="Chr14"/>
<PG ID="AriocE (hsi20_0_30)" PN="AriocE" VN="1.52.3149.25006" CL="/home/michdeyh/250324_Calaug/AriocE.gapped.cfg" dt="2025-03-23T19:52:02" ms="149637" mJ="*"/>
</SAM>
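To check an output .cfg like the one above against the known input lengths, something along these lines could be used. It is only a sketch: the helper names `sq_lengths` and `find_mismatches` are made up for illustration, and it assumes the file keeps the `<SQ SN="..." LN="..."/>` layout shown in this post.

```python
import xml.etree.ElementTree as ET

def sq_lengths(xml_text):
    """Map each SQ element's SN attribute to its LN attribute (as an int)."""
    root = ET.fromstring(xml_text)
    return {sq.get("SN"): int(sq.get("LN")) for sq in root.iter("SQ")}

def find_mismatches(expected, reported):
    """Return {name: (expected_length, reported_length)} for records that disagree."""
    return {
        name: (length, reported.get(name))
        for name, length in expected.items()
        if reported.get(name) != length
    }
```

With `expected` filled from the input lengths listed earlier, every chromosome would be flagged here. As a side observation, 10485759 equals 10 * 2**20 - 1, which might point to some fixed size limit, but that is only a guess.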
A quick look at the example configuration file for AriocE appears to show that you don't need to provide the sequence lengths (the aligner should figure those out on its own). Not sure what you are doing above.
Thanks for looking into that!
Yes, the aligner does indeed calculate the lengths. FYI, the .cfg file I posted above was produced by the aligner -- it makes its own .cfg files as part of the process -- so I am showing how it is somehow miscalculating the actual length of the contigs. Cheers!