What is in the fasta.fai file?
10.6 years ago
mad.cichlids ▴ 140

Hi, this may sound like a trivial question. I indexed my reference genome, and along with the index files there is also a file called .fasta.fai. My question is: what is in the fasta.fai? I opened this file and it does not have a header. Did I do anything wrong?

I then used the reference genome to call SNPs. I assume some sort of position information is used to tell me where each SNP lies in the reference genome. Which column is the position? Thanks.

Here is a subset of the fasta.fai file:

gi|394055774|gb|AGTA02000001.1|    5516    93    70    71
gi|394055773|gb|AGTA02000002.1|    2292    5781    70    71
gi|394055772|gb|AGTA02000003.1|    4668    8199    70    71
gi|394055771|gb|AGTA02000004.1|    1190    13027    70    71
bowtie • 38k views
10.6 years ago
Dan D 7.4k

The fasta.fai is the fasta index, and the one you posted looks legit.

For each row:

Column 1: The contig name. In your FASTA file, this is preceded by '>'

Column 2: The number of bases in the contig

Column 3: The byte offset in the file where the contig sequence begins. (Notice how it increases from one row to the next by roughly the amount in column 2?)

Column 4: bases per line in the FASTA file

Column 5: bytes per line in the FASTA file
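To make the column layout concrete, here is a minimal Python sketch that splits one row of the index into named fields. The names `FaiRecord` and `parse_fai_line` are made up for illustration and are not part of any tool. A real .fai file is tab-separated; splitting on any whitespace also handles the space-padded rows pasted in the question, on the assumption that the contig name itself contains no whitespace.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One row of a .fai index (hypothetical helper, named for illustration)."""
    name: str        # column 1: contig name
    length: int      # column 2: number of bases in the contig
    offset: int      # column 3: byte offset where the sequence begins
    line_bases: int  # column 4: bases per line
    line_bytes: int  # column 5: bytes per line, newline included


def parse_fai_line(line: str) -> FaiRecord:
    # Split on any whitespace and keep the five columns described above.
    name, length, offset, line_bases, line_bytes = line.split()[:5]
    return FaiRecord(name, int(length), int(offset), int(line_bases), int(line_bytes))


rec = parse_fai_line("gi|394055774|gb|AGTA02000001.1|    5516    93    70    71")
print(rec.name, rec.length, rec.offset, rec.line_bases, rec.line_bytes)
```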


Just out of curiosity, about the information in column 3,

say

gi|394055774|gb|AGTA02000001.1|    5516    93    70    71

has a length of 5516 and starts at 93.

Would I not expect the next contig to start at approximately 5516 + 93 = 5609? In reality it starts at 5781. Likewise, I would expect the contig after that to start at about 2292 + 5781 = 8073, but it actually starts at 8199. Why is this the case? This is related to the downstream analysis of SNPs: when I looked at the VCF file, each SNP is given a position column, say gi|394055774|gb|AGTA02000001.1| 277. Does that mean the variant site is the 277th base of the 5516? Thank you for your explanation.


So the contig names themselves of course take up bytes in the file, but take a look again at columns 4 and 5. Notice how the number in column 5 is one more than column 4? There are 70 bases per line, but 71 bytes. The newline character is one byte long, so that in combination with the contig name explains the apparent discrepancy you're seeing.
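The arithmetic can be checked against the rows posted in the question. This is a rough sketch, assuming every sequence line (including the last, shorter one) ends with a single newline byte; `sequence_block_end` is a name made up for this example.

```python
def sequence_block_end(length: int, offset: int, line_bases: int, line_bytes: int) -> int:
    """Byte offset just past a contig's sequence, counting one newline per line."""
    n_lines = -(-length // line_bases)                 # ceiling division
    newline_bytes = n_lines * (line_bytes - line_bases)
    return offset + length + newline_bytes


# First contig: 5516 bases starting at byte 93, 70 bases / 71 bytes per line.
end_1 = sequence_block_end(5516, 93, 70, 71)
header_2 = 5781 - end_1   # bytes left over for the second contig's '>' line

# Second contig: 2292 bases starting at byte 5781.
end_2 = sequence_block_end(2292, 5781, 70, 71)
header_3 = 8199 - end_2   # bytes left over for the third contig's '>' line

print(end_1, header_2)    # 5688 93
print(end_2, header_3)    # 8106 93
```

Under that assumption each header line accounts for 93 bytes, which matches the first contig's own offset of 93 (its header line fills bytes 0 to 92). Since 93 is more than the length of the name in column 1, the '>' lines presumably carry a description after the name as well.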

In your VCF file, the SNP position is the genomic position (the number of bases into the chromosome/contig), and has no direct association with the .fai data.
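The .fai only comes into play if you want to find that base in the FASTA file itself. A small sketch of the lookup, using the first row from the question and the position 277 mentioned above; `byte_offset_of_base` is a hypothetical helper name, and positions are taken to be 1-based as in VCF.

```python
def byte_offset_of_base(pos: int, offset: int, line_bases: int, line_bytes: int) -> int:
    """Byte offset in the FASTA file of the base at 1-based position `pos`."""
    zero_based = pos - 1
    full_lines = zero_based // line_bases   # complete lines before this base
    within_line = zero_based % line_bases   # column of the base inside its line
    return offset + full_lines * line_bytes + within_line


# Position 277 on the first contig (offset 93, 70 bases / 71 bytes per line).
print(byte_offset_of_base(277, 93, 70, 71))  # 372
```

So position 277 is simply the 277th base of that contig; the byte offset 372 is only a detail of how the file is laid out on disk.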


This helped a lot!

In other words, does the SNP position in the genome really depend on how the contigs are organized or lined up with other contigs in the genome?


Think of it like this:

Let's say we have a really simple genome represented in a FASTA file, with two contigs:

>contig1
AAAAAATTTTTT
>contig2
CCCCCCGGGGG

And you have a sequence you want to align: AGTTTTT

So you can align it like this to contig1, with a single mismatch:

AAAAAATTTTTT
    AGTTTTT
_____^

The SNP occurs at the sixth base of the 'contig1' sequence (the G above the caret). So the VCF file should give you a position value of 6 for that SNP. Make sense?
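If you want to check the position by hand, here is a small sketch that compares the read to contig1 base by base. `mismatch_positions` is a made-up helper, and `start` is the 1-based contig position where the read's first base lands. As drawn above, the read starts at the fifth base, which puts the mismatch at position 6; sliding the read one base to the right also gives a single mismatch, then at position 7. Either way, the value reported is a 1-based count into contig1 alone.

```python
def mismatch_positions(reference: str, read: str, start: int) -> list[int]:
    """1-based reference positions where `read`, placed at `start`, differs."""
    window = reference[start - 1:]
    return [start + i for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base]


contig1 = "AAAAAATTTTTT"
read = "AGTTTTT"

print(mismatch_positions(contig1, read, 5))  # [6]
print(mismatch_positions(contig1, read, 6))  # [7]
```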


Makes a lot of sense now. Thank you so much.
