Blat Start And End Position Conventions
12.7 years ago
Tianyang Li ▴ 500

Hi,

I have a question about the conventions of start and end positions for BLAT's output.

For example, if I have q.start 0 and q.end 1, is that just 1 nucleotide, or is it 2 nucleotides?

Also, is the same convention used for the other output formats (blast, blast8, blast9)?

Thanks!

12.7 years ago
JC 13k

from: http://genome.ucsc.edu/goldenPath/help/blatSpec.html

In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
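
To make the half-open convention concrete, here is a minimal Python sketch. The function names are made up for illustration and are not part of BLAT. Under this convention the length of a range is simply end minus start, and converting to 1-based inclusive coordinates only shifts the start.

```python
def psl_length(start, end):
    """Number of bases in a zero-based, half-open range [start, end)."""
    return end - start


def psl_to_one_based(start, end):
    """Convert a zero-based half-open range to 1-based inclusive coordinates."""
    return start + 1, end


print(psl_length(0, 1))            # 1
print(psl_to_one_based(0, 100))    # (1, 100)
print(psl_to_one_based(100, 200))  # (101, 200)
```

So q.start 0 and q.end 1 covers exactly one nucleotide, the first base of the query.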

12.7 years ago
Vikas Bansal ★ 2.4k

Hi,

I think BLAT's psl format is “zero based half open.” This means the first nucleotide of a sequence is numbered 0. For example:

 12345    actual position (1-based)
 AGCTG    query sequence
 01234    coordinates for psl

So if the psl file says q.start is 0 and q.end is 1, only 1 nucleotide is matched, because the end coordinate is not included in the range (the last matched base is at q.end - 1). A good example is given here. I am just pasting the example and comments from the link.

In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.

Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).

0         1         2         3 tens position in query
0123456789012345678901234567890 ones position in query
            ++++          +++++ plus strand alignment on query
    --------    ----------      minus strand alignment on query
Plus strand:
     qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
     qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query. However, the qStart and qEnd are non-reversed. To get from one to the other:
     qStart = qSize - revQEnd
     qEnd = qSize - revQStart

Please note that there is a difference when the query matches on the reverse strand.
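
The minus-strand numbers in the example above can be checked with a short Python sketch. Two things here are my own assumptions: qSize is taken to be 31, which is what the ruler (positions 0 to 30) implies even though the text calls the query a 30-mer, and the function name is hypothetical rather than anything BLAT provides.

```python
def reversed_blocks_to_forward(q_size, block_sizes, q_starts):
    """Map minus-strand psl blocks (coordinates on the reverse-complemented
    query) to zero-based half-open ranges on the forward query."""
    forward = []
    for size, rev_start in zip(block_sizes, q_starts):
        rev_end = rev_start + size  # exclusive end on the reversed query
        # Same formulas as above: qStart = qSize - revQEnd, qEnd = qSize - revQStart
        forward.append((q_size - rev_end, q_size - rev_start))
    return sorted(forward)


Q_SIZE = 31  # assumption: inferred from the ruler, which runs from 0 to 30

blocks = reversed_blocks_to_forward(Q_SIZE, [10, 8], [5, 19])
print(blocks)  # [(4, 12), (16, 26)]

q_start = min(start for start, _ in blocks)
q_end = max(end for _, end in blocks)
print(q_start, q_end)  # 4 26
```

The last aligned minus-strand base sits at position 25, but qEnd is reported as 26 because the end coordinate is exclusive.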

Excuse me, could you please be more specific about how the qStarts are calculated on the reverse strand? (I am talking about the 5,19.) Also, how can the end of the minus-strand match be 26? Isn't it 25? Thank you very much in advance.

I find this post to always be helpful when I am dealing with this strange coordinate system (along with this great answer by Vikas!)
