Is anyone able to explain the difference between using BWA to index a reference genome for aligning a FASTQ file and indexing the reference with the faidx command in Samtools after converting a SAM file to BAM format? I have used some Samtools commands successfully, but I am struggling to understand what the formatting steps actually do. Specifically, I don't understand what .fai files are, or what sorting and indexing the .bam file does.
If anyone is able to help me understand, I would be very grateful.
Many thanks
Thank you for directing me to the previous post.
I would still like to learn the purpose of this indexing. Why is it necessary to make a ref.fai file?
I have used BWA to make a SAM file then used this command in samtools to create a bam file:
I have then seen protocols which start from this point by making the ref.fai file then converting .sam to .bam as follows:
Is this normal? I thought the conversion had already been performed, so what are the steps involving .fai files for?
The protocol then sorts and indexes the BAM file. Can the sorting and indexing be done directly after the first SAM to BAM conversion?
Thank you
You only need to index a FASTA file if you need random access to the sequences in the file. Otherwise, it serves little purpose.
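To make "random access" concrete, here is a small sketch of what a .fai index holds and how it is used. It builds a toy FASTA, writes a five-column index (name, length, byte offset of the first base, bases per line, bytes per line) and then uses that index to jump straight to a region. The awk indexer and the fetch helper are illustrative stand-ins that assume Unix line endings and uniform line lengths; they are not how samtools is implemented. With samtools installed, the equivalent commands are shown in the comments.

```shell
# Toy FASTA in a temporary directory (file name is a placeholder).
workdir=$(mktemp -d)
fa="$workdir/toy.fa"
printf '>chr1\nACGTACGTAC\nGTACGTACGT\nACGT\n>chr2\nTTTTGGGGCC\nAA\n' > "$fa"

# Build a .fai-style index by hand.
# Real equivalent: samtools faidx toy.fa   (writes toy.fa.fai)
awk '
  function emit() { if (name != "") printf "%s\t%d\t%d\t%d\t%d\n", name, len, off, lb, lw }
  /^>/ { emit(); name = substr($1, 2); pos += length($0) + 1; off = pos; len = 0; lb = 0; next }
  { if (lb == 0) { lb = length($0); lw = lb + 1 }   # bases per line, bytes per line
    len += length($0); pos += length($0) + 1 }
  END { emit() }
' "$fa" > "$fa.fai"

cat "$fa.fai"
# chr1	24	6	10	11
# chr2	12	39	10	11

# fetch NAME START END (1-based, inclusive): compute byte positions from the
# index and read only those bytes; tail/head stand in for a file seek.
# Real equivalent: samtools faidx toy.fa chr1:9-12
fetch() {
  awk -F'\t' -v n="$1" -v s="$2" -v e="$3" '$1 == n {
    first = $3 + int((s - 1) / $4) * $5 + (s - 1) % $4
    last  = $3 + int((e - 1) / $4) * $5 + (e - 1) % $4
    print first, last - first + 1
  }' "$fa.fai" | {
    read -r first count
    tail -c +"$((first + 1))" "$fa" | head -c "$count" | tr -d '\n'
    echo
  }
}

fetch chr1 9 12   # ACGT
fetch chr2 5 8    # GGGG
```

The point is that the tool never has to scan the whole reference to find a region, which matters for a multi-gigabase genome and is irrelevant if you only ever read the file from start to finish.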
The instances you've seen with samtools import are incredibly old and should not be used. Ignore them. The purpose of the .fai file in those cases was to act as a substitute for a possibly missing header in the SAM file. Unless your file is missing a header, there's absolutely no need to include the FASTA index (btw, the samtools view version of that is the -t option).
BTW, you can just pipe everything together:
You can also pipe the output of your aligner to that to avoid the useless SAM file altogether.
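A minimal sketch of that kind of pipeline, assuming a recent samtools (1.3 or later, where sort takes -o) and bwa mem as the aligner. The function name and all file names are placeholders, not taken from this thread. It also shows where sorting and indexing fit: samtools index needs a coordinate-sorted BAM, and the resulting .bai file is what lets viewers and variant callers jump to a region of the alignments.

```shell
# align_to_sorted_bam REF READS OUT_PREFIX
# Assumes `bwa index REF` has already been run once for this reference.
align_to_sorted_bam() {
  ref=$1
  reads=$2
  out=$3

  # Align and stream the SAM output straight into the sorter, so no
  # intermediate SAM or unsorted BAM file is written to disk.
  bwa mem "$ref" "$reads" | samtools sort -o "$out.sorted.bam" -

  # Index the coordinate-sorted BAM; this writes $out.sorted.bam.bai.
  samtools index "$out.sorted.bam"
}

# Run only when both tools and the placeholder inputs are actually present.
if command -v bwa >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1 \
   && [ -f ref.fa ] && [ -f reads.fq ]; then
  align_to_sorted_bam ref.fa reads.fq aln
fi
```

If you already have a SAM file, the same idea applies with samtools view -b in.sam in place of the aligner at the front of the pipe.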