Hello evcon,
but I'm a little confused on the 0 vs 1 based coordinates
Welcome to bioinformatics :) You're not alone. genomax already posted a link to a brief introduction.
So does that mean if I want base numbers 50-60 in the genome it would
be [50,61]?.
Short version: No, you need [49:60].
Long version: Whenever we read a sequence interval in bioinformatics, we have to ask these two questions:
- Do we start counting our bases with 0 or 1?
- Are the boundaries in the given interval included or not?
The most common versions are:
1-based closed interval [1,1]:
We start counting with 1 and the boundaries are included. So if we have a 1-based interval [5,8] we get the 5th, 6th, 7th and 8th base of our sequence. A file format where this is used is VCF.
0-based half-open interval [0,0[:
We start counting with 0, the start boundary is included, the end boundary is not. When using 0-based counting I prefer talking about index and not position. So if we have a 0-based interval [5,8[ we get the 6th (having index 5), 7th (having index 6) and 8th (having index 7) base of the sequence. A file format where this is used is BED.
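To make the relationship between the two conventions concrete, here is a small sketch in plain Python. The function names are my own and do not come from any library:

```python
def closed_to_half_open(start, end):
    """1-based closed [start, end] -> 0-based half-open [start-1, end[."""
    return start - 1, end


def half_open_to_closed(start, end):
    """0-based half-open [start, end[ -> 1-based closed [start+1, end]."""
    return start + 1, end


# 1-based closed [5,8]: the 5th to 8th base, i.e. indices 4, 5, 6, 7
print(closed_to_half_open(5, 8))  # (4, 8)

# 0-based half-open [5,8[: indices 5, 6, 7, i.e. the 6th to 8th base
print(half_open_to_closed(5, 8))  # (6, 8)
```

Only the start changes when converting; the end value stays the same. A nice side effect of the half-open form is that the interval length is simply end - start.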
0-based half-open intervals are also used by Python when slicing lists or strings. pyfaidx uses slicing to fetch the sequence. This is why you have to write sequence['seq_id'][49:60] to get the base numbers 50-60.
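As a quick sanity check, here is a toy example of my own in plain Python (no pyfaidx involved). Each list element is its own 1-based position, so you can read off directly which bases a slice returns:

```python
# Toy "genome": element i (0-based index) holds its 1-based position i + 1
positions = list(range(1, 101))

# Slicing is 0-based half-open: index 49 is included, index 60 is not
selected = positions[49:60]

print(selected[0], selected[-1])  # 50 60
print(len(selected))              # 11
```

Slicing with [50:61] instead would return the positions 51 to 61, which is off by one.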
Note that start and end coordinates of Sequence objects are [1, 0].
This can be changed to [0, 0] by passing one_based_attributes=False to
Fasta or Faidx.
The docs go on:
This argument only affects the Sequence .start/.end attributes, and
has no effect on slicing coordinates.
And this is important! When you slice a sequence with pyfaidx, you will always do it with 0-based half-open intervals. Besides the sequence, pyfaidx returns some more values, like the mentioned start and end attributes. And here pyfaidx introduces more confusion, as the positions seem to be a mixture of 1-based and 0-based.
Some time ago I moved from pyfaidx to pysam. Here you can simply choose between the two common versions:
import pysam

seq = pysam.FastaFile("input.fa")

# 0-based half-open: start and end given as separate arguments
seq.fetch("seq_id", 49, 60)

# 1-based closed: samtools-style region string
seq.fetch(region="seq_id:50-60")
fin swimmer
Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems