I'm trying to learn the theory behind the various steps in variant calling using GATK. Before alignment with BWA-MEM, we first index the reference genome, which generates a set of files with the following names:
chr13and17.fa.amb
chr13and17.fa.ann
chr13and17.fa.bwt
chr13and17.fa.pac
chr13and17.fa.sa
where chr13and17.fa is the FASTA file containing the reference genome.
The next step in the pipeline is generating a .fai using samtools with the command:
samtools faidx chr13and17.fa
Followed by generating a .dict file using Picard:
java -jar picard.jar CreateSequenceDictionary R=chr13and17.fa O=chr13and17.dict
I want to know WHY we generate a .fai file and a .dict file despite also indexing the genome. In the samtools manual, the reason for creating a .fai file is specified as:
Using an fai index file in conjunction with a FASTA/FASTQ file containing reference sequences enables efficient access to arbitrary regions within those reference sequences.
Isn't 'efficient access to arbitrary regions of the genome' also the aim of indexing? I understand the files themselves store different information in different, well, formats. But why all the different files though?
AFAIK only the GATK and Picard tools need the dict files. You're right, fai and dict are both index files in a manner of speaking, but they are optimized for different functions, and while fai is more prevalent, Picard tools are tooled to work better with dict files. Check out this thread on a similar topic: ".dict file created by picard and by samtools".
About all the different files that you encounter, here's my take: Welcome to Bioinformatics :-)
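To make the "different functions" point a bit more concrete, here is a rough Python sketch of what a .fai index is for. The BWA files answer "where in the genome does this read sequence occur?", while the .fai answers "give me the bases at chr:start-end" by storing, per sequence, its name, length, byte offset of the first base, bases per line and bytes per line. The toy FASTA content and the helper names (build_fai, fetch, sq_lines) are made up for illustration and are not samtools or Picard APIs. The sq_lines output is only a simplified imitation of a .dict file, which is a SAM-style header; real Picard output carries more fields.

```python
import io

# Toy reference, invented for this example (two short sequences, 8 bases per line).
FASTA = b">chr13 toy\nACGTACGT\nACGTAC\n>chr17\nGGGGCCCC\nTTTT\n"

def build_fai(fasta_bytes):
    """Scan the FASTA once and return {name: (length, offset, linebases, linewidth)}."""
    index = {}
    name = None
    pos = 0  # byte position of the start of the current line
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            # offset = byte position of the first base, just after the header line
            index[name] = [0, pos + len(line), 0, 0]
        elif name is not None:
            rec = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if rec[2] == 0:  # first sequence line defines the line layout
                rec[2], rec[3] = bases, len(line)
            rec[0] += bases
        pos += len(line)
    return {k: tuple(v) for k, v in index.items()}

def fetch(handle, fai, name, start, end):
    """Return bases [start, end) (0-based) using one seek and one read."""
    length, offset, linebases, linewidth = fai[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Translate base coordinates to byte positions, accounting for newlines.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    handle.seek(first)
    raw = handle.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

def sq_lines(fai):
    """Simplified stand-in for a .dict: one SAM-style @SQ line per sequence."""
    return [f"@SQ\tSN:{name}\tLN:{rec[0]}" for name, rec in fai.items()]

fai = build_fai(FASTA)
handle = io.BytesIO(FASTA)
region = fetch(handle, fai, "chr13", 6, 10)
header = sq_lines(fai)
```

The point of the sketch is that fetch never reads the sequence before the requested region, which is why region-based tools want a .fai, whereas the dict-style lines only record names and lengths so that tools can check that reads, variants and reference all agree on the same contigs in the same order.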