Question

Should FASTA files be sorted before being indexed with SAMtools?


6.9 years ago

Makplus T ▴ 100

Hello dears,

We index files so that we can access arbitrary sequences quickly, and SAM/BAM files have to be sorted before they are indexed.

But when it comes to FASTA files, I do not understand why we can index them directly without sorting first:

 samtools faidx <ref.fa>

Is there an easy explanation? Thanks.

SAMtools FASTA sort index • 13k views

updated 6.9 years ago by finswimmer 16k • written 6.9 years ago by Makplus T ▴ 100

score 5 · Accepted Answer · 2018-09-25

Hello Timze W,

Indexing is a fascinating topic.

The index files produced by samtools faidx (.fai) and samtools index (.bai) have very different structures. I guess there are two main reasons for this:

A BAM file typically contains many more entries than a FASTA file.
The way we query the data differs. For a FASTA file we typically ask "Give me the sequence with the ID XY". For a BAM file we ask "Give me all reads that overlap a region".

The FASTA index is quite simple. For each sequence it just contains the name, the length in bases, the byte offset in the file where the first base is located, and the number of bases and bytes per line. See the specs for it. As the number of sequences in a FASTA file is quite small (compared to a BAM file), we can simply look up the sequence we want in the index and get its offset in a reasonable time, no matter in which order the sequences appear in the file.
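To make this more concrete, here is a rough Python sketch of the idea. It is not how samtools is implemented; build_fai and fetch are hypothetical helper names, and the sketch assumes every sequence line within a record (except the last) has the same length, which is also what samtools faidx requires. It builds a .fai-like table (length, offset, bases per line, bytes per line) and then jumps straight to a subsequence with seek arithmetic.

```python
import io

def build_fai(fasta: bytes):
    """Build a minimal .fai-like table: name -> (length, offset, linebases, linewidth)."""
    index = {}
    pos = 0          # byte position of the current line in the file
    name = None
    for line in io.BytesIO(fasta):
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # offset = first byte after the header line
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            entry = index[name]
            if entry[2] == 0:
                # first sequence line defines bases and bytes per line
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        pos += len(line)
    return {k: tuple(v) for k, v in index.items()}

def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) of `name` without scanning the file."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    chunks = []
    pos = start
    while pos < end:
        line_no, col = divmod(pos, linebases)
        handle.seek(offset + line_no * linewidth + col)
        n = min(linebases - col, end - pos)
        chunks.append(handle.read(n))
        pos += n
    return b"".join(chunks).decode()

# Records are deliberately NOT in sorted order: chr2 comes before chr1.
fasta = b">chr2 second\nACGTAC\nGTTT\n>chr1\nGGGCCC\nAA\n"
fai = build_fai(fasta)
handle = io.BytesIO(fasta)
sub_chr2 = fetch(handle, fai, "chr2", 4, 8)
full_chr1 = fetch(handle, fai, "chr1", 0, 8)
```

The lookup is by name and the position inside the file comes from the stored offset, so the order of the records in the FASTA file plays no role.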

In the case of BAM, the index file is organized in bins, each of which stores the file offsets of the reads that overlap a given region. See the SAM specs for it. To be able to say where a bin begins and ends, the BAM file must be sorted by coordinate.
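As a small illustration of the binning, this is a Python transcription of the reg2bin function given as C code in the SAM specification. It maps a 0-based, half-open interval [beg, end) to the smallest bin that fully contains it. Treat it as a sketch for understanding, not as a replacement for the spec.

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for the 0-based half-open interval [beg, end), per the SAM spec scheme."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                        # whole-reference bin

small_read_bin = reg2bin(0, 100)          # fits in the first 16 kbp bin
spanning_bin = reg2bin(0, (1 << 14) + 1)  # crosses a 16 kbp boundary
```

The index stores, per bin, the chunks of the file that hold its reads. Those chunks are only compact and contiguous when reads that are close on the genome are also close in the file, which is what coordinate sorting guarantees.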

fin swimmer