Question

Why do we need a .fai file and a .dict file of the reference during alignment and variant calling using GATK?

6.1 years ago

VBer ▴ 210

I'm trying to learn the theory behind the various steps in variant calling using GATK. Before alignment with BWA-MEM, we first index the reference genome, which generates the following set of files:

chr13and17.fa.amb chr13and17.fa.ann chr13and17.fa.bwt chr13and17.fa.pac chr13and17.fa.sa

where chr13and17.fa is the FASTA file containing the reference genome.

The next step in the pipeline is generating a .fai using samtools with the command:

samtools faidx chr13and17.fa

Followed by generating a .dict file using Picard:

java -jar picard.jar CreateSequenceDictionary R=chr13and17.fa O=chr13and17.dict

I want to know WHY we generate a .fai file and a .dict file despite also indexing the genome. In the samtools manual, the reason for creating a .fai file is specified as:

Using an fai index file in conjunction with a FASTA/FASTQ file containing reference sequences enables efficient access to arbitrary regions within those reference sequences.

Isn't 'efficient access to arbitrary regions of the genome' also the aim of indexing? I understand that the files store different information in different formats, but why are so many separate files needed?

next-gen sequencing GATK alignment file_formats • 7.1k views

updated 6.1 years ago by Pierre Lindenbaum 166k • written 6.1 years ago by VBer ▴ 210

AFAIK only the GATK and Picard tools need the dict files. You're right, fai and dict are both index files in a manner of speaking, but they are optimized for different functions, and while fai is more prevalent, Picard tools are tooled to work better with dict files. Check out this thread on a similar topic: .dict file created by picard and by samtools

About all the different files that you encounter, here's my take: Welcome to Bioinformatics :-)

6.1 years ago by Ram 45k

score 5 · Accepted Answer · 2019-04-19

index for bwa-mem: a Burrows-Wheeler transform index used to map the reads.

index fai: used by tools to list the chromosomes and to quickly fetch a sequence from the FASTA file.
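To make the "quickly fetch a sequence" part concrete, here is a minimal Python sketch of the idea behind a .fai index. It is not the samtools implementation: the function names (`write_demo_fasta`, `build_fai`, `fetch`) and the tiny demo FASTA are made up for illustration. The sketch records, per sequence, the same five values a .fai line holds (name, length, byte offset of the first base, bases per line, bytes per line), and then uses simple arithmetic to seek straight to a region instead of scanning the whole file. It assumes every sequence line except the last has the same width, which is also what `samtools faidx` requires.

```python
import os
import tempfile


def write_demo_fasta():
    """Write a tiny made-up two-sequence FASTA to a temp file and return its path."""
    text = ">chr13 demo\nACGTACGT\nTTGGCCAA\nGG\n>chr17\nAAAACCCC\nGGGG\n"
    fd, path = tempfile.mkstemp(suffix=".fa")
    with os.fdopen(fd, "wb") as fh:
        fh.write(text.encode("ascii"))
    return path


def build_fai(fasta_path):
    """Return {name: (length, offset, linebases, linewidth)}, mirroring the .fai columns."""
    index = {}
    name = None
    with open(fasta_path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # offset = byte position of the first base, right after the header line
                index[name] = [0, fh.tell(), 0, 0]
            elif name is not None and line.strip():
                bases = len(line.rstrip(b"\r\n"))
                entry = index[name]
                if entry[2] == 0:
                    entry[2] = bases      # bases per full line
                    entry[3] = len(line)  # bytes per full line, newline included
                entry[0] += bases
    return {k: tuple(v) for k, v in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Fetch bases [start, end) (0-based, half-open) using only seek arithmetic."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Each full line of `linebases` bases occupies `linewidth` bytes on disk.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + (end // linebases) * linewidth + end % linebases
    with open(fasta_path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    path = write_demo_fasta()
    try:
        fai = build_fai(path)
        for seq_name, fields in fai.items():
            print(seq_name, *fields, sep="\t")
        print(fetch(path, fai, "chr13", 6, 10))
    finally:
        os.remove(path)
```

The BWA index answers a different question ("where does this read occur in the genome?"), whereas this kind of index only answers "give me bases start..end of chromosome X", which is why both exist side by side.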

dict: lists the chromosomes but also provides information about the MD5 sum of the FASTA sequences (to be sure that you're using the same REF), the name of the organism(s), alias names, the URL where the sequences can be retrieved, etc. This dict file will be inserted into/compared with the BAM and VCF headers.
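As a rough illustration of what the dict contributes, the following Python sketch builds `@SQ`-style header lines (name, length, MD5) from FASTA text and compares two sets of them. This is not Picard's code: `read_fasta`, `sq_lines` and `same_reference` are hypothetical helper names, and the MD5 convention used here (checksum of the uppercased sequence with line breaks removed) follows the SAM specification's description of the `M5` tag. A real .dict file also carries an `@HD` line and optional tags such as the species, assembly and URI, which are omitted here.

```python
import hashlib


def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sq_lines(fasta_text):
    """Build one '@SQ' line per sequence with SN (name), LN (length) and M5 (MD5)."""
    lines = []
    for name, seq in read_fasta(fasta_text):
        md5 = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
        lines.append("\t".join(["@SQ", f"SN:{name}", f"LN:{len(seq)}", f"M5:{md5}"]))
    return lines


def same_reference(lines_a, lines_b):
    """True if both sets of @SQ lines describe identical sequences in the same order."""
    return lines_a == lines_b


if __name__ == "__main__":
    ref = ">chr13 demo\nACGTACGT\nTTGG\n>chr17\nAAAACCCC\n"
    other = ">chr13\nACGTACGT\nTTGA\n>chr17\nAAAACCCC\n"  # one base differs in chr13
    for line in sq_lines(ref):
        print(line)
    print(same_reference(sq_lines(ref), sq_lines(other)))
```

Because the checksum depends on the sequence content rather than on the file name, a check like this can catch a BAM or VCF that was produced against a different build of the reference even when the chromosome names and lengths happen to match.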