How to use 0 and 1 coordinates in pyfaidx
6.3 years ago
evcon ▴ 10

I'm using pyfaidx to get genomic sequences, but I'm a little confused about 0- vs 1-based coordinates. The docs say: "Note that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing one_based_attributes=False to Fasta or Faidx." So does that mean that if I want base numbers 50-60 in the genome, it would be [50,61]?

pyfaidx python • 2.1k views
6.3 years ago

Hello evcon,

but I'm a little confused on the 0 vs 1 based coordinates

Welcome to bioinformatics :) You're not alone. genomax already posted a link to a brief introduction.

So does that mean if I want base numbers 50-60 in the genome it would be [50,61]?

Short version: No, you need [49:60].

Long version: Whenever we read a sequence interval in bioinformatics, we have to ask these two questions:

  1. Do we start counting our bases with 0 or 1?
  2. Are the boundaries in the given interval included or not?

The most common versions are:

1-based closed interval [1,1]:

We start counting with 1 and both boundaries are included. So if we have a 1-based interval [5,8], we get the 5th, 6th, 7th and 8th base of our sequence. A file format that uses this is VCF.

0-based half-open interval [0,0[:

We start counting with 0; the start boundary is included, the end boundary is not. When using 0-based counting I prefer talking about index rather than position. So if we have a 0-based interval [5,8[, we get the 6th (index 5), 7th (index 6) and 8th (index 7) base of the sequence. A file format that uses this is BED.
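Converting between the two conventions only ever shifts the start coordinate; the end stays the same number. A minimal sketch in Python (the function names are made up for illustration, they don't come from any library):

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start, end[."""
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Convert a 0-based half-open interval [start, end[ to 1-based closed [start, end]."""
    return start + 1, end


# The 1-based closed interval [5,8] (VCF style) as 0-based half-open (BED style)
print(one_based_closed_to_zero_based_half_open(5, 8))  # (4, 8)

# The 0-based half-open interval [5,8[ covers the 6th to 8th base
print(zero_based_half_open_to_one_based_closed(5, 8))  # (6, 8)
```

In both conventions the number of bases is easy to get: end - start for 0-based half-open, end - start + 1 for 1-based closed.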

0-based half-open intervals are also what Python uses when slicing lists or strings. pyfaidx uses slicing to fetch the sequence. This is why you have to write sequence['seq_id'][49:60] to get base numbers 50-60.
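You can convince yourself of this with plain Python, no FASTA file needed. The sketch below uses a toy list in which every element is its own 1-based position; slice_one_based is a hypothetical helper I made up for illustration, not part of pyfaidx:

```python
# Toy "chromosome": each element is labelled with its own 1-based position,
# so the output shows exactly which positions a slice returns.
positions = list(range(1, 101))  # 1-based positions 1..100

# The slice [49:60] takes indices 49..59, i.e. 1-based positions 50..60
selected = positions[49:60]
print(selected[0], selected[-1], len(selected))  # 50 60 11


def slice_one_based(seq, start, end):
    """Return the elements at 1-based positions start..end (both included)."""
    return seq[start - 1:end]


print(slice_one_based("ACGTACGT", 5, 8))  # ACGT
```

Since the slicing rule is the same, a wrapper like this should in principle work on any sliceable sequence object, but I have only shown it on a list and a string here.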

Note that start and end coordinates of Sequence objects are [1, 0]. This can be changed to [0, 0] by passing one_based_attributes=False to Fasta or Faidx.

This info in the docs goes on:

This argument only affects the Sequence .start/.end attributes, and has no effect on slicing coordinates.

And this is important! When you slice a sequence with pyfaidx, you always do it with 0-based half-open intervals. Besides the sequence, pyfaidx returns some more values, like the mentioned start and end attributes. And here pyfaidx introduces more confusion, as the positions seem to be a mixture of 1-based and 0-based.

Some time ago I moved from pyfaidx to pysam. Here you can simply choose between the two common versions:

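import pysam

# Open the FASTA file
seq = pysam.FastaFile("input.fa")

# 0-based half-open: start index 49 (included), end index 60 (excluded)
seq.fetch("seq_id", 49, 60)

# 1-based closed: positions 50 to 60, both included
seq.fetch(region="seq_id:50-60")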
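Both calls ask for the same 11 bases. To make the relation explicit, here is a small pure-Python sketch that turns a 1-based region string into the equivalent 0-based half-open coordinates. region_to_zero_based is a hypothetical helper, not a pysam function, and it only handles the simple "name:start-end" form:

```python
def region_to_zero_based(region):
    """Parse 'name:start-end' (1-based closed) into (name, start, end), 0-based half-open."""
    # Split at the last colon so that sequence names containing ':' still work
    name, _, span = region.rpartition(":")
    start, end = span.replace(",", "").split("-")
    return name, int(start) - 1, int(end)


print(region_to_zero_based("seq_id:50-60"))  # ('seq_id', 49, 60)
```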

fin swimmer
